- http://old.isn-oldenburg.de/projects/SINN/sinn03/proceedings/schlenker.html -

SINN03 eProceedings

Construction of Physics Vocabularies from OAI Data and its Application to Link Classification for the PhysDoc Harvester

Michael Schlenker
Institute for Science Networking Oldenburg GmbH

Using the metadata available via the OAI-PMH, a vocabulary of physics terms was created automatically by applying statistical and heuristic filtering and extraction methods to the descriptions of physics resources. The primary targets of the filtering and extraction process were phrases of physical relevance, to be used later in query expansion and classification. One consumer of the resulting physics keyword and phrase lists is the SHRIMPS HTTP robot built by Svend Age Biehs. The robot searches the known servers listed in the PhysDep service and tries to find pages on those sites that list physics publications. This complements the shallow crawling done by the harvest engine without SHRIMPS with a more in-depth look at specific pages deeper in the page hierarchy. SHRIMPS uses a mix of heuristic and hand-crafted rules, combined with pattern matching against the keyword and phrase lists, to determine whether a web page contains relevant data and is worth harvesting. This increases the number of documents in the index while limiting its pollution with irrelevant documents.
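The two steps described above, building a phrase list from resource descriptions and matching that list against a web page, can be illustrated with a minimal Python sketch. It is not the method used for PhysDoc or SHRIMPS, whose actual filters and rules are not given here. The stopword list, the restriction to two-word phrases, the document-frequency cutoff `min_count`, and the decision `threshold` are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import re
from collections import Counter

# Hypothetical stopword list; the source does not say which filters were used.
STOPWORDS = {"the", "of", "a", "an", "in", "at", "and", "to",
             "is", "for", "on", "with", "we", "are"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_phrases(descriptions, min_count=2):
    """Return two-word phrases without stopwords that occur in at least
    min_count descriptions (an assumed, very simple statistical filter)."""
    counts = Counter()
    for description in descriptions:
        tokens = tokenize(description)
        bigrams = {
            f"{first} {second}"
            for first, second in zip(tokens, tokens[1:])
            if first not in STOPWORDS and second not in STOPWORDS
        }
        # Using a set counts each phrase once per description.
        counts.update(bigrams)
    return {phrase for phrase, n in counts.items() if n >= min_count}


def relevance_score(page_text, phrases):
    """Count how many distinct vocabulary phrases appear in the page text."""
    padded = " " + " ".join(tokenize(page_text)) + " "
    return sum(1 for phrase in phrases if f" {phrase} " in padded)


def is_worth_harvesting(page_text, phrases, threshold=2):
    """Assumed decision rule: harvest if enough phrases match."""
    return relevance_score(page_text, phrases) >= threshold
```

In this sketch, descriptions taken from OAI records would be passed to `extract_phrases`, and the resulting set would then be handed to `is_worth_harvesting` for each candidate page. A real system would presumably combine such a score with further rules, as the abstract indicates.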



SINN03 was hosted and organized by the Institute for Science Networking Oldenburg GmbH.
The SINN project is supported by the German Research Network Organisation, with funds from the German Ministry of Education and Research (BMBF) and the Government of Lower Saxony.

last update: 18. Feb. 2008

© 2003,
Institute for Science Networking Oldenburg GmbH
Ammeländer Heerstr. 121, D-26129 Oldenburg
www.isn-oldenburg.de, info@isn-oldenburg.de