[TriLUG] code needed for html search, please

Aaron Joyner aaron at joyner.ws
Thu Feb 6 12:39:20 EST 2014


Having worked for a search company for most of a decade, I'm still
constantly surprised at how hard it really is to do search well.  Unless
this is for something you're willing to sink a lot of time into, you're
going to be hard-pressed to get what a modern internet user will consider
"reasonable" search done locally.  For a bit of the flavor of why, read
about 'stemming' and 'tokenizing'.  Here's a good start, relevant to the
Solr suggestion:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Specifically, consider that if you search for "route", you'll want to find
documents that match:
"routing": requires you to 'stem', so that "route" and "routing" reduce to
the same root, either at index time (usually necessary if you want it to
contribute to the scoring algo) or at query time, or both
"route-aware": requires you to tokenize at index time on dashes, not just
spaces, which may or may not improve your results, depending on the data
and the particular keyword
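
A minimal sketch of those two cases in Python, just to make them concrete.
The suffix list and the function names (tokenize, naive_stem, matches) are
made up for illustration, and are nowhere near what Solr's analyzers or a
real stemmer such as Porter's actually do:

```python
import re

# Hypothetical suffix list, for illustration only. Real stemmers use far
# more rules and handle many exceptions.
SUFFIXES = ("ing", "ed", "es", "e", "s")


def tokenize(text, split_on_dashes=True):
    """Lowercase the text and split it into tokens.

    With split_on_dashes=False, a dash is kept as part of the token.
    """
    pattern = r"[^a-z0-9]+" if split_on_dashes else r"[^a-z0-9-]+"
    return [t for t in re.split(pattern, text.lower()) if t]


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def matches(query, document, split_on_dashes=True):
    """True if any stemmed query token equals any stemmed document token."""
    doc_stems = {naive_stem(t) for t in tokenize(document, split_on_dashes)}
    return any(naive_stem(t) in doc_stems
               for t in tokenize(query, split_on_dashes))


if __name__ == "__main__":
    print(matches("route", "dynamic routing tables"))
    print(matches("route", "a route-aware proxy"))
    print(matches("route", "a route-aware proxy", split_on_dashes=False))
```

With dash splitting turned off, "route-aware" stays a single token and the
query for "route" misses it, which is the index-time trade-off described
above.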

What to tokenize on, how broadly to stem, having good sources for stemming
pairs... that's just one tiny corner of how complex this problem can get.
Spelling correction, metadata searching, good ranking... users have been
spoiled into expecting all of these things and more.  Yet bad full text
search implementations *abound*, through no fault of the authors of software
like Nutch and Solr, just from implementers failing to grasp the complexity
of the problem and to devote resources accordingly.
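
For what it's worth, the narrower question quoted below (which pages contain
"star.png" four times in a row) is a literal pattern match, not full text
search, so it may not need an engine at all.  A rough sketch, assuming the
pages are available as local .html/.htm files under some directory and that
the stars are adjacent <img> tags.  Both of those are my assumptions, and
the function names are made up for illustration:

```python
import re
from pathlib import Path

QUOTE = "[\"']"  # matches either quote character around the src value


def repeated_image_pattern(filename, count=4):
    """Regex for `count` adjacent <img> tags whose src ends in filename."""
    tag = (r"<img\b[^>]*\bsrc\s*=\s*" + QUOTE + r"[^\"'>]*"
           + re.escape(filename) + QUOTE + r"[^>]*>")
    return re.compile(r"(?:" + tag + r"\s*){" + str(count) + r"}",
                      re.IGNORECASE)


def find_pages(root, filename, count=4):
    """Return sorted paths of HTML files under root containing the run."""
    pattern = repeated_image_pattern(filename, count)
    hits = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in {".html", ".htm"}:
            # Assumes UTF-8; undecodable bytes are replaced, not fatal.
            text = path.read_text(encoding="utf-8", errors="replace")
            if pattern.search(text):
                hits.append(str(path))
    return sorted(hits)


if __name__ == "__main__":
    # "pretzels" is a placeholder for a local copy of the site directory.
    for page in find_pages("pretzels", "star.png"):
        print(page)
```

This only sees files on disk.  Searching a live site would need a crawl or a
local mirror first, and markup that differs from the assumed <img> layout
would need a different pattern.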

Good luck!
Aaron S. Joyner


On Thu, Feb 6, 2014 at 7:11 AM, Michael Peters <mpeters at plusthree.com> wrote:

> In terms of easiest to use and setup, I'd suggest swish-e. It's a bit long
> in the tooth and not as full featured as other full text search engines,
> but it's pretty simple to get going.
>
>
> On 02/06/2014 08:45 AM, M. R. wrote:
>
>> Assume:
>>      a domain, or directory thereof to be searched. E.g. mydomain.com,
>> or mydomain.com/pretzels
>>      a web browser to point at it
>>
>> Objective:
>>      names (url's) of all files/pages in which a specific string is found.
>>      In particular, a string like "star.png" repeated four times
>> adjacent to each other. This would show on browsing such a file "****"
>> or whatever other .png or .jpg one chose to search for.
>>
>> Thanks for actual code or links to possible sources thereof. And a tip
>> of the hat.
>>
>
> --
> Michael Peters
> Plus Three, LP
>
>
