New improved scans

Mish Middelmann: REALITY BYTES

So there are all these cool ways of creating digital content, from word processors through voice recognition to digital cameras. But what about the stuff you get on plain old paper?

Scanners are the things that turn paper documents into digital form, and nowadays they go together with optical character recognition (OCR) software.

OCR looks at a scanned image and resolves the black and white squiggles into a’s and b’s, then analyses the jumble of letters into words and paragraphs. In other words, it turns a picture of a document into a word processor file that you can edit, locate using a search engine, and save in a fraction of the disk space taken by the scanned image file.
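To get a rough feel for that "fraction of the disk space", here is a back-of-envelope sketch. Every figure in it is an illustrative assumption (an A4 page, a 300 dpi black-and-white scan stored uncompressed, about 3 000 characters of text per page), not a measurement from any particular scanner. Real scan files are usually compressed, so the actual gap will be smaller.

```python
# Back-of-envelope comparison of a scanned page image against its plain text.
# All constants are illustrative assumptions, not measured values.
PAGE_WIDTH_IN = 8.27     # A4 width in inches
PAGE_HEIGHT_IN = 11.69   # A4 height in inches
DPI = 300                # assumed scan resolution
BITS_PER_PIXEL = 1       # assumed pure black-and-white scan
CHARS_PER_PAGE = 3000    # assumed characters on a full page of text


def image_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Size in bytes of an uncompressed bitmap of the page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


def text_bytes(chars):
    """Size in bytes of plain text, assuming one byte per character."""
    return chars


if __name__ == "__main__":
    img = image_bytes(PAGE_WIDTH_IN, PAGE_HEIGHT_IN, DPI, BITS_PER_PIXEL)
    txt = text_bytes(CHARS_PER_PAGE)
    print(f"image: {img} bytes, text: {txt} bytes, ratio about {img // txt}:1")
```

Under these assumptions the image runs to roughly a megabyte while the text is a few kilobytes, which is why the recognised text is so much cheaper to keep.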

All this stuff has been around for years and has always sounded great, but for most people it has been more of a pain than it is worth (except for, say, big parcel-express companies that scan waybills and signatures for receipt of parcels). The problems have been:

• the high cost of both scanners and extra-special PCs to drive them;

• the low quality of OCR – it takes almost as long to correct all the mistakes as it would to retype the document; and

• if you decided to store the actual scanned images of the pages, the cost of the disk space was prohibitive.

But now the technology has changed.

Scanners come in every shape and size, and good ones start in the less-than-R2 000 range, such as the smart new HP Scanjet 5100C I tried last week. They are also as easy to operate as a photocopier.

OCR software has come a long way. You can expect your new scanner to come with OCR software that recognises more than 95% of the words on a page, separates images from text, and scans the whole lot directly into popular software packages like Microsoft Word.

Unfortunately 100% accuracy is all but impossible. Badly photocopied documents, messy faxes, underlining and italics all play havoc with OCR, and the solution can’t be a better scanner since that will just be more precise about the unwanted blobs and scratches. As so often with computer technology, what we want is more intelligence in the software.

For example, I scanned in the phrase “then I went to the city” and the OCR software that came with the Scanjet 5100C decided the “I” was the digit “1”. You need the software to analyse the context and use some common sense to work out how to interpret it. This is why people pay for higher quality OCR software such as the professional versions of either Xerox Textbridge or Caere Omnipage, which will set you back nearly R4 000. These better products also highlight suspicious words or phrases to help your proof-reading.
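To make "analyse the context" a little more concrete, here is a toy sketch. It is not how Textbridge, Omnipage or the Scanjet's bundled software work (the column does not say); the rule, the word list and the function names are all made up for illustration. It turns a lone "1" into "I" only when the next word is one that commonly follows "I", and it flags words that mix letters and digits so a proof-reader can check them.

```python
import re

# Hypothetical, deliberately tiny list of words that often follow "I".
# Real OCR software would rely on far richer language knowledge.
WORDS_AFTER_I = {"am", "was", "went", "have", "had", "will", "can", "think", "tried"}


def fix_isolated_one(text):
    """Replace a standalone '1' with 'I' when the following word suggests a pronoun."""
    tokens = text.split()
    fixed = []
    for i, tok in enumerate(tokens):
        next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        if tok == "1" and next_tok in WORDS_AFTER_I:
            fixed.append("I")
        else:
            fixed.append(tok)
    return " ".join(fixed)


def flag_suspicious(text):
    """Return tokens that mix letters and digits, a common sign of an OCR misread."""
    return [
        tok for tok in text.split()
        if re.search(r"[A-Za-z]", tok) and re.search(r"\d", tok)
    ]


if __name__ == "__main__":
    print(fix_isolated_one("then 1 went to the city"))
    print(fix_isolated_one("see page 1 of the report"))
    print(flag_suspicious("the c1ty was b1g in 1999"))
```

The first phrase is corrected because "went" is on the list, while "page 1 of the report" is left alone. A rule this crude misses plenty of cases, which is the point: the better the software's grasp of context, the less proof-reading is left to you.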

If the limitations of OCR are a problem, nowadays you can afford to store quite a lot of scanned images. With the price of disk space plummeting (it is about 100 times cheaper than it was 10 years ago), it is feasible to store documents as images so that you can go back to the “perfect” original either to help correct OCR, or to print paper copies. Once you are dealing with large volumes of scanned documents, having the original image is a boon because “raw” OCR text will always have errors in it. Proof-reading large volumes is often impractical, so it is convenient to have a picture of the original at hand to clear up any ambiguities.

Just as computer networks provide the infrastructure for organisations to share all their self-generated electronic documents online, scanners coupled with OCR and massive amounts of cheap disk storage open up the possibility of adding all incoming documents to that same store. And intranets offer the ideal front end to make searching for such documents a more pleasant and productive activity. In the end, for large offices, scanners are likely to be a small part of a large document management system. But you can get started on a small scale for less than R2 000.

Mish Middelmann is a director of Praxis Computing
