SINN03 eProceedings

    Construction of Physics Vocabularies from OAI Data and its Application to Linkclassification for the PhysDoc Harvester

    Michael Schlenker
    Institute for Science Networking Oldenburg GmbH

    Using the metadata available via the OAI-PMH, a vocabulary of physics terms was created automatically by applying statistical and heuristic filtering and extraction methods to the descriptions of physics resources. The primary targets of the filtering and extraction process were physically relevant phrases, to be used later in query expansion and classification. One consumer of the resulting physics keyword and phrase lists is SHRIMPS, an HTTP robot built by Svend Age Biehs. The robot searches known servers listed in the PhysDep service and tries to find the pages on those sites where physics publications are listed. This complements the shallow crawling done by the harvest engine without SHRIMPS with a more in-depth look at specific pages deeper in the page hierarchy. SHRIMPS uses a mix of heuristic and hand-crafted rules, combined with pattern matching against the keyword and phrase lists, to determine whether a web page contains relevant data and is worth harvesting. This increases the number of documents in the index while limiting its pollution with irrelevant documents.
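
    The two steps described above, building a phrase vocabulary from resource descriptions and using it to judge whether a page is worth harvesting, can be illustrated with a minimal Python sketch. It is not the implementation used for PhysDoc or SHRIMPS: the stopword list, the n-gram heuristic, the document-frequency cutoff (min_docs) and the relevance threshold are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (assumption, not from the paper).
STOPWORDS = {"the", "of", "a", "an", "in", "and", "for", "to",
             "is", "are", "on", "with", "we", "by", "at", "this"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_phrases(text, max_len=3):
    """Yield word n-grams (2..max_len words) that contain no stopword.

    This stands in for the heuristic extraction step; the real filter
    is not specified in the abstract.
    """
    tokens = tokenize(text)
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if not any(tok in STOPWORDS for tok in gram):
                yield " ".join(gram)


def build_vocabulary(descriptions, min_docs=2):
    """Keep phrases that occur in at least min_docs descriptions.

    Document frequency stands in for the statistical filtering step.
    """
    doc_freq = Counter()
    for text in descriptions:
        # set() so each description counts a phrase at most once
        doc_freq.update(set(candidate_phrases(text)))
    return {phrase for phrase, count in doc_freq.items() if count >= min_docs}


def score_page(page_text, vocabulary):
    """Count the distinct vocabulary phrases that occur in the page text."""
    normalized = " " + " ".join(tokenize(page_text)) + " "
    return sum(1 for phrase in vocabulary if f" {phrase} " in normalized)


def is_relevant(page_text, vocabulary, threshold=2):
    """Treat a page as worth harvesting if enough phrases match (assumed rule)."""
    return score_page(page_text, vocabulary) >= threshold
```

    In such a setup, the descriptions would come from the description fields of harvested OAI-PMH records, and is_relevant would be one signal among the hand-crafted rules a robot applies before deciding to follow or index a page. Raising min_docs or threshold trades recall for a cleaner index.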


