[TriLUG] code needed for html search, please

Kristopher Kane kristopher.kane at gmail.com
Thu Feb 6 10:11:17 EST 2014


Apache Nutch for crawling the domain:  https://nutch.apache.org
Apache Solr for indexing and search API: http://lucene.apache.org/solr/

You can run both on a single computer, or scale out with a Hadoop
cluster.

What does this mean: "This would show on browsing such a file "****" or
whatever other .png or .jpg one chose to search for."?

You *probably* can get away with this without writing a single line of
Java.  Nutch is Solr-aware (they were all dropped off by the same ship),
so your crawling is all configuration settings, including telling it where
Solr is running.  On the search side, Solr has an HTTP-accessible API that
can return results in XML, JSON, text, fortune cookie and several other
choices.  Your search could be as simple as a curl statement.
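
If you would rather script it than use curl, here is a rough Python sketch
of the same kind of query against Solr's /select handler. The base URL
(http://localhost:8983/solr/collection1) and the field names "content" and
"url" are assumptions on my part; check them against your own Solr schema.
It also assumes the string you want actually ends up in the indexed field,
which depends on how the crawler parses your pages.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(base_url, phrase, field="content", rows=10):
    """Build a Solr /select URL that phrase-searches one field, JSON output.

    `field` and the returned "url" field are assumed names; adjust to
    match your schema.
    """
    # Escape backslashes and double quotes so the phrase stays one quoted term.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'{field}:"{escaped}"',  # exact phrase match in the field
        "fl": "url",                  # only return the url field
        "rows": rows,                 # maximum number of hits
        "wt": "json",                 # ask for a JSON response
    }
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


def extract_urls(response_text):
    """Pull the url field out of each document in a Solr JSON response."""
    data = json.loads(response_text)
    return [doc["url"] for doc in data["response"]["docs"] if "url" in doc]


def search(base_url, phrase):
    """Run the query against a live Solr instance and return matching URLs."""
    with urlopen(build_select_url(base_url, phrase)) as resp:
        return extract_urls(resp.read().decode("utf-8"))
```

Calling search("http://localhost:8983/solr/collection1", "star.png") would
then give you a list of page URLs, provided Solr is running at that address
and the crawl has been indexed.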

-Kris




On Thu, Feb 6, 2014 at 8:45 AM, M. R. <13miketele at bellsouth.net> wrote:

> Assume:
>     a domain, or directory thereof to be searched. E.g. mydomain.com, or
> mydomain.com/pretzels
>     a web browser to point at it
>
> Objective:
>     names (url's) of all files/pages in which a specific string is found.
>     In particular, a string like "star.png" repeated four times adjacent
> to each other. This would show on browsing such a file "****" or whatever
> other .png or .jpg one chose to search for.
>
> Thanks for actual code or links to possible sources thereof. And a tip of
> the hat.

