OT- Automated data extraction from websites

KRFinch at dstsystems.com
Thu May 8 17:14:04 CDT 2003


Hello all!

A little off-topic, but I figured that there might be someone in this crowd
that would have some ideas.

One thing I have been noticing is that there is a lot of information out
there on the web that would really be handy to have a portable copy of, in
one way or another.  Most of it is little more than lists of addresses, and
some of it is just reference information, but there are a great many things
that it would be nice to have a local copy of on my laptop hard drive so I
can get at it quickly when I'm not wired in.

The problem is that useful information is not often displayed on the web in
an easily downloadable format.  I think that this is sometimes intentional,
but most of the time it has probably never occurred to the information
providers that people would want information presented that way.  I have
found that such information can generally be found pretty easily in an
"unuseful" format, and I was wondering if anyone had ideas on how to
distill that down to a better format automatically.  Here's an example:

*******
I have a high-performance car that I only run High Octane Amoco gasoline
in.  (Call me superstitious, but I have my reasons.)  Since I generally
have a laptop with me wherever I go, and I have been known to travel with
this car, it would be very convenient to have an address list on the laptop
of every Amoco station in the country so I can easily find a place to fill
up when I am out of town.

Now, most gas stations are independent, so Amoco doesn't have that sort of
information at all on their website.  However, just about every Amoco
station has the word "Amoco" in its name, so I can easily do a search of
one of the on-line phonebooks for "Amoco" and get all of the hits in a
given locality.  That works fine for getting a few at a time, but it is
hardly a useful portable way to get the information.

*******

What I was thinking was that it should be possible to have a Bot go and hit
that phonebook over and over again with different zip codes and collect
that information into a directory.  If my Bot started with "00001" and
counted up to "99999", collecting as it went, I should (theoretically) get
a listing of every station in the country along with its address and phone
number.  THAT would be useful.  I think that there are a lot of little
lists like this that could make life easier for me, and none of them are
readily available as a list:

- All branches of my bank and the hours they are open
- All of the ATMs that I can use without paying a fee
- All of the post offices and their hours
- Every FedEx and UPS drop-off location

All of those tend to be listed only by proximity to a given zip code, with
just a couple of choices shown at a time.  These would also be nice:

- Every public library
- Every police station and hospital with an emergency room

But something tells me that those would be more than a simple extraction
from one website.
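
To make the zip code idea concrete, here is a rough Python sketch of the
loop I have in mind.  Everything in it is an assumption for illustration:
the names zip_codes, collect and fake_search are made up, and fake_search
is only a stand-in that returns invented sample data.  A real version would
have to fetch and parse whatever pages the chosen phonebook site actually
serves, which I have not worked out.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple

# One phonebook hit: (name, address, phone).
Listing = Tuple[str, str, str]


def zip_codes(start: int = 1, stop: int = 99999) -> Iterator[str]:
    """Yield zero-padded five-digit zip codes from start to stop, inclusive."""
    for n in range(start, stop + 1):
        yield f"{n:05d}"


def collect(search: Callable[[str, str], List[Listing]],
            term: str,
            zips: Iterable[str],
            delay: float = 1.0) -> List[Listing]:
    """Call search(term, zip_code) once per zip code and merge the results.

    Neighboring zip codes will often return the same business, so a
    listing is kept only the first time it is seen.
    """
    seen = set()
    results: List[Listing] = []
    for zip_code in zips:
        for listing in search(term, zip_code):
            # Normalize case and whitespace so trivial differences
            # do not produce duplicate entries.
            key = tuple(field.strip().lower() for field in listing)
            if key not in seen:
                seen.add(key)
                results.append(listing)
        time.sleep(delay)  # pause between queries so the site is not hammered
    return results


def fake_search(term: str, zip_code: str) -> List[Listing]:
    """Hypothetical stand-in for the real phonebook lookup.

    The data below is invented purely so the sketch can run offline.
    """
    station_1 = ("Example Amoco #1", "100 Main St", "555-0100")
    station_2 = ("Example Amoco #2", "200 Oak Ave", "555-0101")
    sample = {
        "64101": [station_1],
        "64102": [station_1, station_2],  # overlaps with 64101
    }
    return sample.get(zip_code, [])


if __name__ == "__main__":
    found = collect(fake_search, "Amoco", zip_codes(64101, 64103), delay=0)
    for name, address, phone in found:
        print(f"{name}, {address}, {phone}")
```

The search function is passed in as a parameter so that the counting and
de-duplicating part stays the same no matter which site ends up being
queried.  Only the stand-in would need to be swapped for real fetching and
parsing code.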

So, am I crazy?  Can something like this be done?  Would it be easier to
just do it manually?

Thanks!

- Kevin




More information about the Kclug mailing list