I have finally had it with paperwork. This last tax season did me in.
I've talked to a couple of people about using OCR to store documents digitally. I know that a few people on the list do this as well. I was wondering if anyone could give me some tips about what works and what doesn't work. Is it better to OCR things? Is it better to scan and save a PDF or some other portable document?
Again, TIA
Tim
--- bewkard bewkard@gmail.com wrote:
I've talked to a couple people about using OCR to store documents digitally. I know that a few people on the list do this as well. I was wondering if anyone could give me some tips about what works and what doesn't work. Is it better to OCR things? is it better to scan and save a PDF or some other portable document?
Unless the document is printed in an OCR-friendly font, you aren't going to have a great deal of success, even with modern OCR software. If all you need is to replace a print copy with a visual image, it is far better to scan the document as an image and store it in a graphics format.
If the documents need to be viewed by other people, then PDF is a good choice. But if you are going to be the primary viewer, all you really need to do is scan into a graphics format (like EPS, TIFF, or an application-specific but open format like the GIMP's XCF) and save those files as-is.
Hardcopy tends to outlast digital storage methods, so some companies are using OCR in reverse (printing documents now so they can be read back in with OCR much later) to store some of their more long-term information. They are making things very easy for the computer, though, by using print fonts that OCR applications read easily.
It costs a lot of money to get a computer to accurately extract information from a printed surface, as the scientists who extracted the earliest recording of the human voice (from a graphite-sheathed cylinder) discovered themselves:
"French folk song is 'world's earliest recording', beating Edison by 11 years" (March 27, 2008)
Their experience is not entirely unlike that of trying to use modern OCR software.
On Wed, Apr 2, 2008 at 9:22 PM, bewkard bewkard@gmail.com wrote:
I've talked to a couple people about using OCR to store documents digitally. I know that a few people on the list do this as well. I was wondering if anyone could give me some tips about what works and what doesn't work. Is it better to OCR things? is it better to scan and save a PDF or some other portable document?
Fix number zero. Sadly WinClosed and possibly hardware-restricted.
http://www.neatreceipts.com/getorganized?gclid=COmVy_q1v5ICFQEQlwoduljzbw
Mentioned here so any Linux folks wishing to do this can see an established player in the field.
Second is outsourcing:
Which also lists Linux compatibility.
There USED to be document handling services that took small clients and personal accounts, but they have been decimated almost to extinction. I used to work for one.
Let me give a short overview of the factors in archiving paper records and ever expecting to see them again. Indexing and retrieval plans should be frozen BEFORE capture planning. NEVER assume you can discard originals! That behooves you to use the capture phase to physically archive not only the originals but also "proof prints." Example:
Scan docs - save file - retrieve file - reprint doc - audit a few or even 100%. File the proofs as insurance. Risk assessment dictates how far you go.
I could end my comments by suggesting that the hardware image capture itself is your baseline pass/flunk. Detail lost in a bad scan is GONE. And if you trash the originals? I could bore you with a recollection/tutorial of micrographics camera-plus-scanner gear that made microfilm images and scans in one pass, but I will post that only if asked, or off list if that is wanted.
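The scan, save, retrieve, audit loop above can be partly automated. What follows is a minimal Python sketch, not anything from the original workflow: it assumes the scans live under a single directory, and the function names (`build_manifest`, `audit`) are made up for illustration. It records a SHA-256 checksum for each saved file and later re-checks all of them, or a random sample. A matching checksum only shows that a file has not changed since capture; it says nothing about whether the scan was legible in the first place, so it complements the proof prints rather than replacing them.

```python
import hashlib
import random
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(archive_dir: Path) -> dict:
    """Map each file's relative path to its checksum at capture time."""
    return {
        p.relative_to(archive_dir).as_posix(): sha256_of(p)
        for p in sorted(archive_dir.rglob("*"))
        if p.is_file()
    }


def audit(archive_dir: Path, manifest: dict, fraction: float = 1.0, seed=None) -> list:
    """Re-check a random fraction of the manifest (1.0 = audit everything).

    Returns the names of files that are missing or whose contents changed.
    """
    names = sorted(manifest)
    if not names:
        return []
    count = min(len(names), max(1, round(len(names) * fraction)))
    sample = random.Random(seed).sample(names, count)
    failures = []
    for name in sample:
        path = archive_dir / name
        if not path.is_file() or sha256_of(path) != manifest[name]:
            failures.append(name)
    return sorted(failures)
```

How large a `fraction` to audit is the same risk-assessment call described above: a few files for low-stakes paper, 100% for anything you cannot afford to lose.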
Oren Beck
816.729.3645
Meant to send this out last night, but apparently it got stuck in drafts...
OCR will never be perfect, and because of that you will *never know* for sure where it failed. Once something becomes paper, all it is is an image. I have never heard of OCR being a format of its own. It's usually used to 'convert' an image into text, which is then either stored as plain text or put into tags and stored with the image.
I have been storing all my tax and other documents electronically since 2004. I currently store scanned documents in PDF format. I would prefer a multipage image format like TIFF, but haven't found a good program to do that. PDF is massively more popular.
If I can get an electronic copy from the sender, I keep that and ditch the paper. Most banks and financial institutions now offer some form of electronic document delivery because it saves them money. This is usually PDF, sometimes HTML. I believe the fewer format transformations I do on it, the better, so I save it in whatever format I can get it in. If for ANY reason you think you need to print something out just to scan it in, don't. Use CupsPDF or PDF-Print, or something like it. It shows up as a printer in CUPS, and when you print to it, it saves a PDF of what you "printed".
If I have to scan paper, I currently use a program called gscan2pdf. It runs the scanner and can save a multipage PDF file. Before you save, you have the chance to re-arrange the page order, which is handy if your ADF (automatic document feeder) skips a page or jams. You can also rotate pages. My scanner is attached to the network, so if you remind me the day before, I can load it up and demo the program at the LUG meeting.
Kclug mailing list Kclug@kclug.org http://kclug.org/mailman/listinfo/kclug
On Fri, 2008-04-04 at 14:23 -0500, Billy Crook wrote:
OCR will never be perfect, and because of that you will *never know* for sure where it failed. Once something becomes paper, all it is is an image. I have never heard of OCR being a format of its own. It's usually used to 'convert' an image into text, which is then either stored as plain text or put into tags and stored with the image.
Actually, I do recall reading of someone who created a program that would back up files to paper (and later retrieve them). Of course, you couldn't store anything very large with it, but text isn't very large. Might that be a good solution here, or does it have to be human-readable?
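To make the paper-backup idea concrete, here is a rough Python sketch. It is not the program recalled above, whose details are not given here; the line layout and the names `encode_for_paper` and `decode_from_paper` are assumptions for illustration. Each printed line carries 16 bytes of data plus a CRC-32 of those bytes, encoded as base32, whose alphabet (A-Z and 2-7) leaves out the easily confused digits 0, 1 and 8. The per-line checksum addresses the "you will never know where it failed" problem: a misread character flags the line it sits on.

```python
import base64
import zlib


def encode_for_paper(data: bytes, chunk: int = 16) -> list:
    """Split data into chunks; return one printable base32 line per chunk.

    Each line encodes the chunk followed by its 4-byte CRC-32.
    """
    lines = []
    for offset in range(0, len(data), chunk):
        block = data[offset:offset + chunk]
        crc = zlib.crc32(block).to_bytes(4, "big")
        lines.append(base64.b32encode(block + crc).decode("ascii"))
    return lines


def decode_from_paper(lines) -> tuple:
    """Rebuild the data from OCR'd lines.

    Returns (data, bad_lines), where bad_lines holds the 1-based numbers of
    lines that failed to decode or failed their checksum. Bad lines are
    skipped, so the data is only complete when bad_lines is empty.
    """
    data = bytearray()
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            raw = base64.b32decode(line.strip())
        except ValueError:  # binascii.Error is a ValueError subclass
            bad.append(number)
            continue
        block, crc = raw[:-4], raw[-4:]
        if len(raw) < 4 or zlib.crc32(block).to_bytes(4, "big") != crc:
            bad.append(number)
        else:
            data.extend(block)
    return bytes(data), bad
```

This only detects damage; it does not repair it. A real tool would presumably add error-correcting codes so that a smudged line could be recovered rather than merely flagged.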
Anyone ever wonder why banks still use magnetic ink to print the characters on your checks? Because they print in a very specific font and don't rely on a computer analyzing the picture of a character to figure out what it is - magnetic ink is proven and reliable.
OCR is a long-running problem. I used to play with it way back in the day (like, '94 or '95-ish) on my old Packard Bell laptop. It was slow as sin, but it sort of worked. AFAIK, things have generally only gotten faster due to CPU speed, not really much better at actually deciphering text. If all your papers were printed in a very OCR-friendly font with strong contrast of ink to paper, your accuracy rates would be good, but they will never be 100%, of course. So, OCR is truly a lossy format.
Every OCR setup I've ever bothered to read about still needs a proofreader. If a person still has to read, understand and verify every page that is scanned in, you still have a load of man-hours to deal with just getting the stuff into the system. A good data entry clerk would be a fair match for a proofreader, I'd wager. ;)
Jon.
Jon Pruente wrote:
| Anyone ever wonder why banks still use magnetic ink to print the characters on your checks? Because they print in a very specific font and don't rely on a computer analyzing the picture of a character to figure out what it is - magnetic ink is proven and reliable.
Er...without naming names, I know a company that used the (very expensive) magnetic toner cartridges to laser-print checks. This same laser printer was also used for normal printing, requiring swapping out of the magnetic toner with regular toner.
Needless to say, this didn't always work perfectly (typical PEBKAC :), and the bean-counting accountants stopped buying the expensive magnetic toner when they noticed none of their errantly printed checks were ever returned.
Then again, the MICR font they use for check numbers is pretty sub-optimal for human reading, but highly optimized for machine/OCR interpretation. Just like the FedEx/UPS scanners have no issues reading the 1- and 2-dimensional barcodes printed on just about everything these days.
The real solution will come when you change the problem...instead of trying to get computers to deal with the messy, noisy, analog world of human communication, just augment the humans to be able to easily interface to the pristine, mathematical world of the machine. Hearing aids can begin to do this already. Wired directly into the brain, they allow people who have *NEVER* heard anything and have defective audio 'hardware' in their ears (due to genetics or whatever) to hear normally.
Similar feats for visual information are likely not far behind (at least in a long-term view of human history). I personally look forward to the day when I no longer have to lug around my limited resolution iPod, remembering to plug it in to charge, etc. It will be so much easier to just have it jacked straight into my brain...
--
Charles Steinkuehler charles@steinkuehler.net
I work at the school for the blind, so OCR is a regularly used technology on our campus. It is fairly good for allowing a visually impaired person to have reasonably accurate access to printed material, but even the most expensive setups aren't 100% accurate.
It's kind of like voice recognition - in ideal circumstances you can get about 98-99% accuracy. However, you will never have ideal circumstances in a real-life setting, so actual accuracy can vary from about 75% to 99%. I have seen OCR capture an entire page of 12pt Times text at 100% accuracy, and I've also seen voice recognition systems capture a sizable dictation at 100% accuracy, but it is uncommon.
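To put those percentages in perspective, here is a small Python sketch of one way to score OCR output against a hand-typed reference. The scoring method (Levenshtein edit distance divided by reference length) and the 3,000-characters-per-page figure are assumptions for illustration, not how any particular OCR package reports accuracy.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: inserts, deletes and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(truth: str, ocr: str) -> float:
    """Share of the reference text recovered correctly, floored at 0."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr) / len(truth))


def expected_errors(chars_on_page: int, accuracy: float) -> int:
    """Rough count of wrong characters on a page at a given accuracy."""
    return round(chars_on_page * (1.0 - accuracy))


# Two misread characters in a 20-character line is already only 90%:
print(char_accuracy("Total due: $1,048.00", "Total due: $l,O48.00"))
# A page assumed to hold 3000 characters:
print(expected_errors(3000, 0.98), expected_errors(3000, 0.75))
```

Under those assumptions, 98% accuracy still leaves about 60 wrong characters on a page and 75% leaves about 750, and on a financial document the misread characters may well be digits.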
One thing to know about OCR is that higher resolution and color depth do not make for higher accuracy. We get our best accuracy at around 150-300 DPI and in black-and-white (aka line-art) modes. Higher resolution won't necessarily make the accuracy worse, but it doesn't help, and it makes scan times longer and files much larger. Using grayscale or color usually does decrease the accuracy, in my experience.
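For anyone wondering what line-art mode amounts to and why the file sizes differ so much, here is a small Python sketch. The fixed threshold of 128 and the US Letter page size are assumptions for illustration; scanner drivers choose their own thresholds, and the sizes are uncompressed, so real files will be smaller.

```python
def to_bilevel(gray_rows, threshold=128):
    """Reduce 8-bit grayscale pixels (0 = black, 255 = white) to 1 bit.

    Returns rows of 1 (ink) and 0 (paper).
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray_rows]


def raw_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one scanned page, in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


print(to_bilevel([[0, 100, 200, 255]]))   # [[1, 1, 0, 0]]
print(raw_size_bytes(8.5, 11, 300, 1))    # 1051875  (line art, 300 DPI)
print(raw_size_bytes(8.5, 11, 300, 24))   # 25245000 (24-bit color, 300 DPI)
print(raw_size_bytes(8.5, 11, 600, 24))   # 100980000 (24-bit color, 600 DPI)
```

Going from 300 DPI line art to 600 DPI color multiplies the raw data by 96 in this example.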
--
~Bradley Hook
Education Systems Administrator
Kansas State School for the Blind
1100 State Avenue
Kansas City, KS 66102
Voice: (913) 281-3308 ext. 363
Mobile: (913) 645-9958
Facsimile: (913) 281-3104
http://www.kssb.net
On Mon, Apr 7, 2008 at 11:02 PM, Bradley Hook bhook@kssb.net wrote:
I work at the school for the blind, so OCR is a regularly used technology on our campus. It is fairly good for allowing a visually impaired person to have reasonably accurate access to printed material, but even the most expensive setups aren't 100% accurate.
What kinds of mistakes seem to be the most common?
Adrian