[TriLUG] Web Site Indexing

erik at underhanded.org
Tue Jan 4 20:17:46 EST 2005


On Tue, Jan 04, 2005 at 05:04:28PM -0500, Lance A. Brown wrote:

> From the requester: "The idea is that we would look at all the htm and 
> html files and grab the filename, title, keywords, and all the links" and 
> "... it would print out in export from something, looking like an excel 
> spread sheet."
> 
> I could write a tool to do this, but I don't really have the time.  There 
> must be tools available to crawl a website and generate these kinds of 
> reports, but I'm not finding them.  F/OSS is preferred, but I'm willing to 
> recommend a commercial solution if it'll do the job.
> 
> Can anyone offer a pointer?

Well, I'm not sure this is exactly what you want, but it may make for a
quicker job if you have to write it yourself (and know Perl).  I recently
had to write a script to pull all the various style tags out of a large
website, and it could probably be modified to pull text out of <title>
tags and such.  I'll attach it here; feel free to mangle it and do
whatever.
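
Here's an untested sketch of the sort of thing I mean, bent toward the
request above rather than style tags.  It assumes HTML::TokeParser (from
the HTML-Parser distribution on CPAN), and the CSV columns are just my
guess at what the requester wants:

#!/usr/bin/perl
#
# Rough sketch: walk a directory of .htm/.html files and print one
# CSV row per file -- filename, title, meta keywords, and all link
# targets -- suitable for opening in Excel.
#
use strict;
use warnings;
use File::Find;
use HTML::TokeParser;

my $root = shift @ARGV or die "usage: $0 <directory>\n";

print qq{"filename","title","keywords","links"\n};

find(sub {
    return unless /\.html?$/i;

    # find() chdirs into each directory, so $_ is the bare filename
    my $p = HTML::TokeParser->new($_) or return;
    my ($title, $keywords, @links) = ('', '');

    while (my $tag = $p->get_tag('title', 'meta', 'a')) {
        if ($tag->[0] eq 'title') {
            $title = $p->get_trimmed_text('/title');
        }
        elsif ($tag->[0] eq 'meta') {
            # only <meta name="keywords" content="..."> is interesting
            $keywords = $tag->[1]{content} || ''
                if lc($tag->[1]{name} || '') eq 'keywords';
        }
        else {
            # an <a> tag; remember where it points
            push @links, $tag->[1]{href} if defined $tag->[1]{href};
        }
    }

    # Double up embedded quotes so the CSV imports cleanly.
    my @row = map { my $v = $_; $v =~ s/"/""/g; qq{"$v"} }
              ($File::Find::name, $title, $keywords, join('; ', @links));
    print join(',', @row), "\n";
}, $root);

Point it at the top of the local copy, something like
"perl site-index.pl ./www.website.tld > report.csv" (site-index.pl being
whatever you call it), and Excel should open the result directly.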

If you don't have direct access to the file structure, the following may
help with getting a local copy:

wget -A htm,html -r -l 20 http://www.website.tld/
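
(-A htm,html restricts the download to files with those extensions, -r
turns on recursive retrieval, and -l 20 raises the recursion depth limit
from wget's default of 5.)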

Hope it helps. ;)