Michael Schlenker
Institute for Science Networking Oldenburg GmbH
Michael.Schlenker_AT_isn-oldenburg.de
Abstract:
A vocabulary of physics terms was created automatically from the metadata
available via the OAI-PMH by applying statistical and heuristic filtering and
extraction methods to the descriptions of physics resources. The primary targets
of the filtering and extraction process were phrases of physical relevance, to be
used later in query expansion and classification. One user of the physics
keyword and phrase lists gathered is the SHRIMPS HTTP robot built by Svend Age
Biehs. The robot searches known servers which are listed in the PhysDep service
and tries to find pages on those sites where physics publications are listed.
This complements the shallow crawling done by the harvest engine without
SHRIMPS with a more in-depth look at specific pages deeper in the page
hierarchy. SHRIMPS uses a mix of heuristic and hand-crafted rules, combined
with pattern matching on keyword and phrase lists, to determine whether a webpage
contains relevant data and is worth harvesting. This increases the number of
documents in the index while limiting its pollution with irrelevant documents.
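
The two steps described above, building a phrase list from resource descriptions and matching it against a webpage, can be illustrated with a minimal sketch. This is not the method used in this work or in SHRIMPS: plain bigram counting with a stopword filter stands in for the unspecified statistical and heuristic filters, and the names `STOPWORDS`, `extract_phrases`, `is_relevant`, `min_count` and `threshold` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "of", "a", "an", "in", "and", "for", "with",
             "we", "is", "are", "on", "to"}


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def extract_phrases(descriptions, min_count=2):
    """Return two-word phrases that contain no stopword and occur
    at least min_count times across all descriptions."""
    counts = Counter()
    for text in descriptions:
        tokens = tokenize(text)
        for first, second in zip(tokens, tokens[1:]):
            if first not in STOPWORDS and second not in STOPWORDS:
                counts[f"{first} {second}"] += 1
    return {phrase for phrase, n in counts.items() if n >= min_count}


def is_relevant(page_text, phrases, threshold=2):
    """Treat a page as relevant if at least `threshold` distinct
    phrases from the list occur in it as whole words."""
    normalized = " " + " ".join(tokenize(page_text)) + " "
    hits = sum(1 for phrase in phrases if f" {phrase} " in normalized)
    return hits >= threshold


# Made-up descriptions standing in for harvested metadata records.
descriptions = [
    "We study the quantum hall effect in graphene.",
    "The quantum hall effect is observed in thin films.",
    "Thin films of graphene are grown.",
]
phrases = extract_phrases(descriptions)
print(sorted(phrases))
print(is_relevant("Publications: Quantum Hall effect in thin films", phrases))
print(is_relevant("Campus parking and cafeteria opening hours", phrases))
```

Requiring several distinct phrase hits rather than one is a simple way to trade recall for precision, which matches the stated goal of enlarging the index without polluting it. A real robot would combine such a score with further rules.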