BACK TO THE PROCEEDINGS
Hungarian PhysNet activities and mnoGoSearch (an alternative to Harvest)
Kati Szalay, Jozsef Kadlecsik
Computer Networking Center
KFKI Research Institute for Particle and Nuclear Physics, Budapest, Hungary
email: firstname.lastname@example.org, email@example.com
As an important result of our joint work with ISN experts, the active Hungarian PhysNet mirror has been operating since 31 October 2002.
The mirror site has its own domain name. Its URL is www.physnet.hu.
We had some problems with the Harvest search engine, mainly with its stability and reliability. For example, it often crashed suddenly for no apparent reason. Another big problem is that it is very slow at indexing: during the weekly reindexing of the KFKI Web (1.2 GB), it often happens that the new indexing run cannot start because the previous one has not finished yet.
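The overlap between weekly indexing runs can be illustrated with a small guard. The following is only a minimal sketch in Python, not part of Harvest or of our setup: the names `IndexLock` and `run_reindex` and the lock file path are hypothetical, and the `reindex` callable stands in for whatever command starts the indexer.

```python
import os


class IndexLock:
    """Advisory lock based on exclusive creation of a lock file (hypothetical helper)."""

    def __init__(self, path):
        self.path = path
        self.held = False

    def acquire(self):
        try:
            # O_EXCL makes creation fail if the file already exists,
            # i.e. if a previous run has not finished (or left a stale lock).
            fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))
        self.held = True
        return True

    def release(self):
        if self.held:
            os.remove(self.path)
            self.held = False


def run_reindex(lock_path, reindex):
    """Run reindex() unless another run still holds the lock."""
    lock = IndexLock(lock_path)
    if not lock.acquire():
        return "skipped"
    try:
        reindex()
        return "done"
    finally:
        lock.release()
```

A guard like this only prevents two runs from colliding; it does not make a slow indexer faster, and a crashed run would leave a stale lock file that has to be removed by hand.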
Because of these problems we decided to use the mnoGoSearch software
to search KFKI's web and special parts of it, such as the physics pages, the so-called "fizikai_szemle" (a scientific physics journal) and "Chemonet", the Hungarian information system for chemistry.
Now only one Harvest gatherer is kept on our system, and it serves the PhysNet querying of the physics pages. Earlier we had four gatherers and four brokers for the topics mentioned above.
We are very glad to have seen in Mr Kang-Jin Lee's Todo-List for Harvest: "Add support for importing data from (larbin, webbase, aspseek), mnoGoSearch".
You can find everything about this free search engine on the mnoGoSearch homepage;
still, I would like to highlight some of its features.
Our next tasks regarding PhysNet follow the feature list.
- Full text indexing. Different priorities can be configured for the body,
title, keywords and description of a document.
- Supporting all widely used single- and multi-byte character sets,
including UTF8, as well as most of the popular East Asian languages.
- Automatic document character set and language guesser for about
70 charset/language combinations
- HTTP/1.0, HTTPS, FTP, NNTP and HTTP Proxy support
- Local file system indexing support (file: URL scheme)
- Supporting gzip, deflate, compress content encoding
- Support for different SQL databases: MySQL, PostgreSQL, miniSQL, Solid,
Virtuoso, InterBase, Oracle, Sybase, MS SQL, iODBC, unixODBC, etc.
Even multiple types of databases at the same time.
- Storage methods:
  - single - all words are stored in one table
  - multi - words are stored in 13 different tables depending on
    their length. Usually faster than 'single'.
  - crc - 32 bit integer word IDs (CRC32) are stored instead of
    words. Very fast.
  - crc-multi - crc with multi tables - for big search engines
  - cache - word index is stored on disk, URL information in SQL
- Search clusters: a possibility to distribute database between several
- Basic authorization support (to index password protected areas)
- Both HTML documents and plain text files can be indexed
- External parsers support for other file types (pdf, ps, doc etc.)
- Mirroring features
- Stopwords support
- "keywords" and "description" META tags support, user defined META
- Reentry capability. You can run several indexers and search
processes at the same time
- Continual indexing
- HTML templates to easily customize search results
- Boolean query support
- Fuzzy search: different word forms, synonyms, substrings
- C CGI, PHP3, Perl search frontends
- Search on subsection of database
- It is very flexible. You can configure mnoGoSearch to run in different modes,
including 'ftpsearch mode' (searching through URLs rather than their
content), 'link validation' (to check a site for bad references) and 'netminder'
(What's new since ...?).
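To make the 'multi' and 'crc' storage methods from the feature list more concrete, here is a minimal sketch in Python. It is not mnoGoSearch code. The table naming (`dict1` to `dict13`), the rule that longer words share the last table, the lowercasing of words and all function names are assumptions made for illustration; the list above only states that 'multi' uses 13 tables chosen by word length and that 'crc' stores 32 bit CRC32 word IDs instead of the words. The example URLs are placeholders.

```python
import zlib

NUM_TABLES = 13  # the feature list mentions 13 tables for 'multi' mode


def multi_table_for(word):
    """Pick a table by word length (hypothetical naming; long words share the last table)."""
    bucket = min(max(len(word), 1), NUM_TABLES)
    return f"dict{bucket}"


def crc_word_id(word):
    """Return a 32 bit integer ID for a word, to be stored instead of the word itself."""
    return zlib.crc32(word.lower().encode("utf-8")) & 0xFFFFFFFF


def build_crc_index(docs):
    """Map word IDs to the set of URLs containing that word."""
    index = {}
    for url, text in docs.items():
        for word in text.split():
            index.setdefault(crc_word_id(word), set()).add(url)
    return index


def search(index, word):
    """Look up a single word by its CRC32 ID."""
    return sorted(index.get(crc_word_id(word), set()))


docs = {
    "http://example.org/a.html": "physics journal",
    "http://example.org/b.html": "chemistry journal",
}
index = build_crc_index(docs)
hits = search(index, "journal")  # both placeholder URLs
```

Comparing fixed-size integers is cheaper than comparing strings, which is one plausible reason for the speed of 'crc' mode. The price is that two different words can, in rare cases, share the same CRC32 value and then become indistinguishable in the index.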
Our next tasks regarding PhysNet
We are going:
- to convince the system managers on our campus, and after that at other important places such as the physics departments of universities, to set up search engines for their pages
- to extend PhysNet searching to the systems of those institutions
- to ensure that our HTML pages are easily found, by means of metadata
- to update the PhysDep list
- to extend and develop the Hungarian chapter of PhysDoc.
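The metadata task ties in with the "keywords" and "description" META tag support listed among the mnoGoSearch features. As a rough illustration of how pages could be checked for these tags before indexing, the following Python sketch uses only the standard `html.parser` module. The class and function names are hypothetical, and the choice of title, keywords and description as the required fields is an assumption, not a PhysNet or mnoGoSearch rule.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the <title> text and the keywords/description META tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Return the title and any keywords/description META content found."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), **parser.meta}


def missing_metadata(html):
    """List which of the assumed required fields are absent or empty."""
    found = extract_metadata(html)
    return [k for k in ("title", "keywords", "description") if not found.get(k)]
```

Running `missing_metadata` over a set of pages would give a simple to-do list of pages that still need a title, keywords or a description.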