SINN03 eProceedings

    Harvesting Webpages that contain Mathematical Information

    Winfried Neun
    Scientific Information Systems Department
    Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB)

    The aim of the Math-Net project (under the aegis of the International Mathematical Union) is to build a pool of high-quality information on mathematical research and mathematicians worldwide. Within this project, we at ZIB are harvesting pages with mathematical content from the Web. Besides plain text, these pages contain mathematical formulae or keywords. Formulae have traditionally been encoded in LaTeX, but with emerging standards such as MathML, OpenMath and OMDoc we expect to encounter more and more webpages that use the new encodings. Our goal is to retrieve as much semantic information as possible in a mechanized way, independent of the encoding style used for the formulae, by providing extensions to the Harvest software. Ultimately, we want to classify the mathematical information in a webpage based on the types of formulae it includes, complemented by mathematical keywords. In this talk we discuss some problems in the automatic detection of semantics that are caused by the encoding schemes. One well-known example is MathML, which defines two different encoding types serving different user needs (presentation markup and content markup) and also permits mixing the two. Some of our attempts to overcome these problems are based on heuristics.
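
    To illustrate the kind of heuristic the MathML case calls for, the following Python sketch guesses whether a MathML fragment uses presentation markup, content markup, or a mixture of both, by looking only at the element names that occur in it. It is a minimal sketch under stated assumptions and not the Math-Net or Harvest implementation: the function name classify_mathml is hypothetical, and the two element sets are small hand-picked subsets of the MathML vocabulary rather than complete lists.

```python
import xml.etree.ElementTree as ET

# Assumption: small illustrative subsets of MathML element names,
# not the complete vocabularies defined by the specification.
PRESENTATION = {"mi", "mn", "mo", "mrow", "mfrac", "msup", "msub", "msqrt", "mtext"}
CONTENT = {"apply", "ci", "cn", "csymbol", "bvar", "plus", "times", "eq", "power"}


def local_name(tag):
    """Strip an '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def classify_mathml(fragment):
    """Heuristically label a MathML fragment (hypothetical helper).

    Returns 'presentation', 'content', 'mixed', or 'unknown',
    depending on which kinds of element names appear in the fragment.
    """
    root = ET.fromstring(fragment)
    names = {local_name(el.tag) for el in root.iter()}
    has_presentation = bool(names & PRESENTATION)
    has_content = bool(names & CONTENT)
    if has_presentation and has_content:
        return "mixed"
    if has_content:
        return "content"
    if has_presentation:
        return "presentation"
    return "unknown"


if __name__ == "__main__":
    # x squared, once in each encoding
    pres = "<math><msup><mi>x</mi><mn>2</mn></msup></math>"
    cont = "<math><apply><power/><ci>x</ci><cn>2</cn></apply></math>"
    print(classify_mathml(pres))
    print(classify_mathml(cont))
```

    A name-based test like this is deliberately crude. It says nothing about what a formula means, only which encoding a harvester would have to interpret, and a real classifier would need the full element lists and a treatment of LaTeX, OpenMath and OMDoc sources as well.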



