OT - Automated data extraction from websites

Dan Mesimer dmesimer at kc.rr.com
Thu May 8 18:35:52 CDT 2003


I suppose (depending on whether these web forms you're using are POST
or GET) you could write some sort of wget script to pull down the pages.

I was thinking:
if the URLs are something like:
http://www.somedomain.com/place?zip=ZIPCODE

You could just loop through every ZIP code and pull down a local copy of
each of these web pages.
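
Something along these lines would do it (just a sketch, untested; the
URL and the "zip" parameter are the made-up ones from above, so
substitute whatever the real form actually submits):

    #!/bin/sh
    # Fetch a local copy of the lookup page for every ZIP code.
    mkdir -p pages
    for zip in $(seq -w 1 99999); do
        wget -q -O "pages/$zip.html" \
            "http://www.somedomain.com/place?zip=$zip"
        sleep 1   # go easy on their server
    done

The seq -w gives you the zero-padded codes (00001 through 99999), and
the sleep keeps you from hammering their server too hard.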

Then maybe build yourself a small front-end page to access all of your
local (downloaded) web pages.
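
A few more lines of shell could generate a crude index (again just a
sketch, assuming the pages landed in the pages/ directory from the
loop above):

    #!/bin/sh
    # Build a single index.html linking to every downloaded page.
    echo "<html><body><ul>" > index.html
    for f in pages/*.html; do
        echo "<li><a href=\"$f\">$f</a></li>" >> index.html
    done
    echo "</ul></body></html>" >> index.html

Point your browser at index.html and you can get at any of the cached
lookups offline.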

Dan Mesimer

On Thu, 2003-05-08 at 12:13, KRFinch at dstsystems.com wrote:
> Hello all!
> 
> A little off-topic, but I figured that there might be someone in this crowd
> who would have some ideas.
> 
> One thing I have been noticing is that there is a lot of information out
> there on the web that would really be handy to have a portable copy of, in
> one way or another.  Most of it is little more than lists of addresses, and
> some of it is just reference information, but there are a great many things
> that it would be nice to have a local copy of on my laptop hard drive so I
> can get at it quickly when I'm not wired in.
> 
> The problem is that useful information is not often displayed on the web in
> an easily downloadable format.  I sometimes think this is intentional, but
> more often it has probably never occurred to the information providers
> that people would want it presented that way.  I have
> found that such information can generally be found pretty easily in an
> "unuseful" format, and I was wondering if anyone had ideas on how to
> distill that down to a better format automatically.  Here's an example:
> 
> *******
> I have a high-performance car that I only run high-octane Amoco gasoline
> in.  (Call me superstitious, but I have my reasons.)  Since I generally
> have a laptop with me wherever I go, and I have been known to travel with
> this car, it would be very convenient to have an address list on the laptop
> of every Amoco station in the country so I can easily find a place to fill
> up when I am out of town.
> 
> Now, most gas stations are independent, so Amoco doesn't have that sort of
> information at all on their website.  However, just about every Amoco
> station has the word "Amoco" in their name, so I can easily do a search of
> one of the on-line phonebooks for "Amoco" and get all of the hits in a
> given locality.  That works fine for getting a few at a time, but it is
> hardly a useful portable way to get the information.
> 
> *******
> 
> What I was thinking was that it should be possible to have a Bot go and hit
> that phonebook over and over again with different zip codes and collect
> that information into a directory.  If my Bot started with "00001" and
> counted up to "99999", collecting as it went, I should (theoretically) get
> a listing of every station in the country along with its address and phone
> number.  THAT would be useful.  I think that there are a lot of little
> lists like this that could make life easier for me, and none of them are
> readily available as a list:
> 
> - All branches of my bank and the hours they are open
> - All of the ATMs that I can use without paying a fee
> - All of the post offices and their hours
> - Every FedEx and UPS drop-off location
> 
> All of those tend to be listed by proximity to a given zip code, with only
> a couple of choices shown at a time.  These would also be nice:
> 
> - Every public library
> - Every police station and hospital with an emergency room
> 
> But something tells me that those would be more than a simple extraction
> from one website.
> 
> So, am I crazy?  Can something like this be done?  Would it be easier to
> just do it manually?
> 
> Thanks!
> 
> - Kevin