OT- Automated data extraction from websites

Carl Sappenfield CSAPPENFIELD at kc.rr.com
Thu May 8 23:33:00 CDT 2003


www.screenscraper.com has a nice utility.  You'll need to know some Java to
automate what you're talking about, but not much.  Last I looked it was free
(beta 0.8something).
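
If you'd rather skip the utility and roll your own, the Java really is
minimal.  Below is a rough sketch that pulls one results page from a made-up
phonebook URL and yanks the listings out with a regular expression; the URL,
the "zip" parameter, and the HTML pattern are all guesses, so you'd have to
look at the real site's markup to get them right.

// Minimal sketch, not the screen-scraper tool itself: fetch one results
// page from a hypothetical phonebook URL and pull listings out with a
// regex.  The URL, the "zip" parameter, and the pattern below are all
// assumptions; inspect the real site to find the right ones.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhonebookScrape {

    // Download the raw HTML of one zip code's search results.
    static String fetchPage(String zip) throws Exception {
        URL url = new URL(
            "http://phonebook.example.com/search?name=Amoco&zip=" + zip);
        BufferedReader in =
            new BufferedReader(new InputStreamReader(url.openStream()));
        StringBuffer page = new StringBuffer();
        String line;
        while ((line = in.readLine()) != null) {
            page.append(line).append('\n');
        }
        in.close();
        return page.toString();
    }

    // Pull "name | address | phone" rows out of the HTML.  This pattern is
    // invented; it has to match whatever table markup the site really uses.
    static List extractListings(String html) {
        Pattern row = Pattern.compile(
            "<td>([^<]+)</td>\\s*<td>([^<]+)</td>\\s*<td>([\\d\\-() ]+)</td>");
        Matcher m = row.matcher(html);
        List listings = new ArrayList();
        while (m.find()) {
            listings.add(m.group(1) + " | " + m.group(2) + " | " + m.group(3));
        }
        return listings;
    }

    public static void main(String[] args) throws Exception {
        // Quick test against a single Kansas City zip code.
        List hits = extractListings(fetchPage("64111"));
        for (int i = 0; i < hits.size(); i++) {
            System.out.println(hits.get(i));
        }
    }
}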

----- Original Message -----
From: <KRFinch at dstsystems.com>
To: <kclug at kclug.org>
Sent: Thursday, May 08, 2003 12:13 PM
Subject: OT- Automated data extraction from websites

> Hello all!
>
> A little off-topic, but I figured that there might be someone in this
> crowd that would have some ideas.
>
> One thing I have been noticing is that there is a lot of information out
> there on the web that would really be handy to have a portable copy of,
> in one way or another.  Most of it is little more than lists of
> addresses, and some of it is just reference information, but there are a
> great many things that it would be nice to have a local copy of on my
> laptop hard drive so I can get at it quickly when I'm not wired in.
>
> The problem is that useful information is not often displayed on the web
> in an easily downloadable format.  I think that this is sometimes
> intentional, but that most of the time it has simply never occurred to
> the information providers that people would want the information
> presented that way.  I have found that such information can generally be
> found pretty easily in an "unuseful" format, and I was wondering if
> anyone had ideas on how to distill that down to a better format
> automatically.  Here's an example:
>
> *******
> I have a high-performance car that I only run high-octane Amoco gasoline
> in.  (Call me superstitious, but I have my reasons.)  Since I generally
> have a laptop with me wherever I go, and I have been known to travel with
> this car, it would be very convenient to have an address list on the
> laptop of every Amoco station in the country so I can easily find a place
> to fill up when I am out of town.
>
> Now, most gas stations are independent, so Amoco doesn't have that sort of
> information at all on their website.  However, just about every Amoco
> station has the word "Amoco" in their name, so I can easily do a search of
> one of the on-line phonebooks for "Amoco" and get all of the hits in a
> given locality.  That works fine for getting a few at a time, but it is
> hardly a useful portable way to get the information.
>
> *******
>
> What I was thinking was that it should be possible to have a bot go and
> hit that phonebook over and over again with different zip codes and
> collect that information into a directory.  If my bot started with
> "00001" and counted up to "99999", collecting as it went, I should
> (theoretically) get a listing of every station in the country along with
> its address and phone number.  THAT would be useful.  I think that there
> are a lot of little lists like this that could make life easier for me,
> and none of them are readily available as a list:
>
> - All branches of my bank and the hours they are open
> - All of the ATMs that I can use without paying a fee
> - All of the post offices and their hours
> - Every FedEx and UPS drop-off location
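
The counting loop you describe is the easy part.  Something like the sketch
below (reusing the hypothetical fetchPage/extractListings from the snippet
above) walks every five-digit zip, appends whatever comes back to a text
file, and pauses between requests so it doesn't hammer the site.  Expect
duplicates, since nearby zips will return overlapping hits; a sort/uniq pass
over the output file at the end would clean that up.

// Sketch of the zip-code sweep, reusing the hypothetical
// PhonebookScrape.fetchPage() / extractListings() methods above.
import java.io.FileWriter;
import java.io.PrintWriter;
import java.text.DecimalFormat;
import java.util.List;

public class ZipSweep {
    public static void main(String[] args) throws Exception {
        DecimalFormat fiveDigits = new DecimalFormat("00000");
        PrintWriter out = new PrintWriter(new FileWriter("amoco-stations.txt"));
        for (int zip = 1; zip <= 99999; zip++) {
            String zipCode = fiveDigits.format(zip);   // e.g. 1 -> "00001"
            try {
                List hits = PhonebookScrape.extractListings(
                        PhonebookScrape.fetchPage(zipCode));
                for (int i = 0; i < hits.size(); i++) {
                    out.println(zipCode + " | " + hits.get(i));
                }
            } catch (Exception e) {
                // A bad page or dropped connection shouldn't kill a
                // 99,999-step run; note it and move on.
                System.err.println("skipping " + zipCode + ": " + e.getMessage());
            }
            out.flush();
            Thread.sleep(1000);   // be polite to the phonebook site
        }
        out.close();
    }
}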
>
> All of those locations tend to be listed only by proximity to a given
> zip code, with just a couple of choices shown at a time.  These would
> also be nice:
>
> - Every public library
> - Every police station and hospital with an emergency room
>
> But something tells me that those would be more than a simple extraction
> from one website.
>
> So, am I crazy?  Can something like this be done?  Would it be easier to
> just do it manually?
>
> Thanks!
>
> - Kevin