OT- Automated data extraction from websites
Carl Sappenfield
CSAPPENFIELD at kc.rr.com
Thu May 8 23:33:00 CDT 2003
www.screenscraper.com has a nice utility. You'll need to know some Java to
automate what you're talking about, but not much. Last I looked it was free
(beta 0.8something).
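
If you would rather roll your own, the zip-code loop described in the quoted message below could look roughly like this in Python. Treat it as a sketch under some loud assumptions: the URL passed to make_fetch is made up (phonebook.example is not a real service), the results are assumed to come back as a plain HTML table with one listing per row, and the names zip_codes, ListingParser, make_fetch and collect are invented for illustration. A real phonebook page will need its own parsing, and you should check the site's terms of use and keep the delay generous before hitting it tens of thousands of times.

```python
import csv
import sys
import time
import urllib.request
from html.parser import HTMLParser
from typing import Callable, Iterable, Iterator, List, Tuple


def zip_codes(start: int = 1, stop: int = 99999) -> Iterator[str]:
    """Yield zero-padded five-digit zip codes from start to stop, inclusive."""
    for n in range(start, stop + 1):
        yield f"{n:05d}"


class ListingParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row.

    ASSUMPTION: results arrive as an HTML table, one listing per row.
    """

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[Tuple[str, ...]] = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip header rows that contain no <td> cells
                self.rows.append(tuple(self._row))
            self._row = None


def make_fetch(url_template: str) -> Callable[[str], str]:
    """Build a fetcher from a template with a {zip} slot.

    ASSUMPTION: the template is hypothetical, for example
    "http://phonebook.example/search?name=Amoco&zip={zip}".
    """
    def fetch(code: str) -> str:
        url = url_template.format(zip=code)
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.read().decode("utf-8", errors="replace")
    return fetch


def collect(fetch: Callable[[str], str], codes: Iterable[str],
            delay: float = 0.0) -> List[Tuple[str, ...]]:
    """Fetch one page per zip code and return the unique rows, in order seen."""
    seen = set()
    results = []
    for code in codes:
        parser = ListingParser()
        parser.feed(fetch(code))
        parser.close()
        for row in parser.rows:
            if row not in seen:  # nearby zip codes return overlapping hits
                seen.add(row)
                results.append(row)
        if delay:
            time.sleep(delay)  # be polite to the remote server
    return results


if __name__ == "__main__":
    # Canned pages stand in for the network so the sketch runs offline.
    canned = {
        "00001": "<table><tr><td>Amoco A</td><td>1 Main St</td></tr></table>",
        "00002": "<table><tr><td>Amoco A</td><td>1 Main St</td></tr>"
                 "<tr><td>Amoco B</td><td>2 Oak Ave</td></tr></table>",
    }
    rows = collect(lambda z: canned.get(z, ""), zip_codes(1, 3))
    csv.writer(sys.stdout).writerows(rows)
```

To run it against a live site you would pass make_fetch(...) with a real URL template in place of the canned lookup, and a delay of at least a second or two. Most five-digit numbers are not assigned zip codes, so expect plenty of empty result pages, and since searches for neighboring zip codes overlap, the duplicate check in collect matters.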
----- Original Message -----
From: <KRFinch at dstsystems.com>
To: <kclug at kclug.org>
Sent: Thursday, May 08, 2003 12:13 PM
Subject: OT- Automated data extraction from websites
> Hello all!
>
> A little off-topic, but I figured that there might be someone in this
> crowd that would have some ideas.
>
> One thing I have been noticing is that there is a lot of information out
> there on the web that would really be handy to have a portable copy of, in
> one way or another. Most of it is little more than lists of addresses, and
> some of it is just reference information, but there are a great many
> things that it would be nice to have a local copy of on my laptop hard
> drive so I can get at it quickly when I'm not wired in.
>
> The problem is that useful information is not often displayed on the web
> in an easily downloadable format. I don't think that this is usually
> intentional; I think that most of the time it has never occurred to the
> information providers that people would want information presented that
> way. I have found that such information can generally be found pretty
> easily in an "unuseful" format, and I was wondering if anyone had ideas on
> how to distill that down to a better format automatically. Here's an
> example:
>
> *******
> I have a high-performance car that I only run High Octane Amoco gasoline
> in. (Call me superstitious, but I have my reasons.) Since I generally
> have a laptop with me wherever I go, and I have been known to travel with
> this car, it would be very convenient to have an address list on the
> laptop of every Amoco station in the country so I can easily find a place
> to fill up when I am out of town.
>
> Now, most gas stations are independent, so Amoco doesn't have that sort of
> information at all on their website. However, just about every Amoco
> station has the word "Amoco" in their name, so I can easily do a search of
> one of the on-line phonebooks for "Amoco" and get all of the hits in a
> given locality. That works fine for getting a few at a time, but it is
> hardly a useful portable way to get the information.
>
> *******
>
> What I was thinking was that it should be possible to have a Bot go and
> hit that phonebook over and over again with different zip codes and collect
> that information into a directory. If my Bot started with "00001" and
> that information into a directory. If my Bot started with "00001" and
> counted up to "99999", collecting as it went, I should (theoretically) get
> a listing of every station in the country along with its address and phone
> number. THAT would be useful. I think that there are a lot of little
> lists like this that could make life easier for me, and none of them are
> readily available as a list:
>
> - All branches of my bank and the hours they are open
> - All of the ATMs that I can use without paying a fee
> - All of the post offices and their hours
> - Every FedEx and UPS drop-off location
>
> All of those tend to be listed by closest to a given zip code, only giving
> a couple of choices. These would also be nice:
>
> - Every public library
> - Every police station and hospital with an emergency room
>
> But something tells me that those would be more than a simple extraction
> from one website.
>
> So, am I crazy? Can something like this be done? Would it be easier to
> just do it manually?
>
> Thanks!
>
> - Kevin