[TriLUG] intranet search engine recommendations
Fri, 10 May 2002 15:28:39 -0400
First, congratulations to the new board and thanks to those who served for the past year.
Second, I'd like to ask for the group's recommendations on an intranet search (engine|tool) which runs on Linux and is suitable for a small to
midsize intranet. I've been experimenting with htdig (distributed with Red Hat Linux) but have run into some apparent limitations:
1) Based on the most current information I could find, htdig cannot update an index for only modified files. For example, if 50 of 25000 files are modified in the course of a day, I'd like to be able to update the index for only the modified files. With htdig, I would have to reparse and reindex all 25000 files just to get the 50 updates.
2) htdig (and/or its external parsers) seems to have a very large memory footprint for xls, doc, and pdf files over a few MB in size. Setting max_doc_size to a small number (e.g. 500K) would cause most of our documents to be omitted from indexing.
Any recommendations? I'm especially interested in anything that allows indices to be updated on modified files without reindexing unchanged files. I've looked at Google's product, but it is quite costly.
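
To make the incremental-update requirement concrete, here is a minimal sketch of the change detection such a tool would need. It is not part of htdig or any particular search engine; the function name find_changed_files and the manifest layout (path mapped to modification time and size) are assumptions made up for illustration. It only works out which files need reindexing; passing them to an indexer is left out.

```python
import os


def find_changed_files(root, previous):
    """Compare the files under `root` against a manifest from the last run.

    `previous` maps path -> (mtime_ns, size) as recorded last time
    (hypothetical layout, chosen for this sketch). Returns a tuple
    (changed, removed, current), where `changed` lists new or modified
    files, `removed` lists files that have disappeared, and `current`
    is the manifest to save for the next run.
    """
    current = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                # File vanished or is unreadable; skip it for this run.
                continue
            current[path] = (st.st_mtime_ns, st.st_size)

    # A file needs reindexing if it is new or its signature differs.
    changed = sorted(p for p, sig in current.items() if previous.get(p) != sig)
    # A file needs removing from the index if it no longer exists.
    removed = sorted(p for p in previous if p not in current)
    return changed, removed, current
```

With something like this, a nightly job would reparse only the files in `changed` (the 50, not the 25000) and drop the entries in `removed`. Comparing modification time and size is cheap but can miss an edit that preserves both; hashing file contents would be more robust at the cost of reading every file.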