Thursday, December 08, 2011

Document Capture and Tables/Tabular/Invoices (OCR)

One of the roles I fulfill is working heavily in Data, or Document, Capture.

This covers a wide range:

Document Capture (or Document/Content Management and Records Management, as the modern terms go) - Index a couple of fields so the image/document can be searched and retrieved later. The second part - storing, searching, and retrieving once the indexes have been captured - is for another time and not the focus here.

Data Capture - Collect information from paperwork for use by systems. The original image/document is not relevant after capture except as a reference. Usually unstructured or low-volume documents.

Forms Processing - Collect information from paperwork in a fast, repeatable process. The original image/document is not relevant after capture except as a reference. Forms Processing is an advanced form of Data Capture used when you have consistent forms (structured documents), where the data elements are always in the same location on the form and there is (practically) no variance in the forms or data locations.

Back to the topic at hand - Tabular Capture: being able to OCR and key information that is in table format on images that may have come from output systems, scanning, faxing, or other means, and turning it BACK into data.

How do we obtain information from tables on paper?

Forms Processing - one answer: zones. Forms Processing is designed to collect information from data points on the image/document where the data element is always in the same position. If the first column/first row of a table is always 5" from the top, 1.35" from the left side, with a width of 2" and a height of 1", you zone that area. By zoning, OCR knows exactly where to go for the information, and it can be tuned for how it reads each element (I only expect numeric values here, so there will be no lowercase L's, O's, uppercase I's, or Z's). Zoning also makes manual entry easy, since operators can look directly at that location on the image. And when exporting, hey, you already know the context of the data element because it came from a specific location - you know it is row 1/column 1, so you can put it in the right place in your export.

Phew.....lots of good stuff with Zones, sometimes called 'Zonal OCR'. And you don't even need OCR to use zones. Downside? Lots of time in setup and tuning. Lots of time. And you need the right tools in your capture suite to support it. But again, it doesn't even have to use OCR - just setting up zones for manual capture and your export is a gain.
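To make the zone idea concrete, here is a rough, hypothetical sketch of what a zone definition could look like. The coordinate conversion, zone names, DPI, and the numeric clean-up helper are my own invention for illustration, not taken from any particular capture suite:

# Hypothetical sketch of a zone definition for "zonal OCR" -- not tied to any
# particular capture product. Coordinates come from the form design in inches
# and are converted to pixels at the scan resolution.

DPI = 300  # assumed scan resolution

def zone_px(top_in, left_in, width_in, height_in, dpi=DPI):
    """Convert a zone defined in inches into pixel coordinates (left, top, right, bottom)."""
    left, top = int(left_in * dpi), int(top_in * dpi)
    return (left, top, left + int(width_in * dpi), top + int(height_in * dpi))

# First row / first column of the table: 5" from the top, 1.35" from the left,
# 2" wide, 1" tall -- the example geometry from the paragraph above.
zones = {
    "row1_col1": {"rect": zone_px(5.0, 1.35, 2.0, 1.0), "charset": "numeric"},
}

# Because the zone is declared numeric, common OCR confusions can be corrected.
NUMERIC_FIXUPS = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "Z": "2"})

def clean_numeric(raw_ocr_text):
    """Post-correct an OCR result from a zone we know should only contain digits."""
    return raw_ocr_text.translate(NUMERIC_FIXUPS)

print(zones["row1_col1"]["rect"])   # (405, 1500, 1005, 1800)
print(clean_numeric("1O5Z"))        # "1052"

The point is that everything downstream - OCR tuning, manual entry, export mapping - gets simpler once the geometry and expected content of each field are declared up front.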

So...what happens when the paperwork has tables, but the paperwork is sporadic, inconsistent, unstructured, and may have a high rate of change that you not only have no control over, but get no upfront notification of? Examples, you ask? Invoices are the biggest culprit, but there are many others out there.

Answer? Well....this is where some companies have innovative approaches to the problem, but from my point of view nothing has been great yet. The column locations are likely different between tables (i.e. the first column on one invoice is the product ID, on another it is the description, on yet another it is the quantity). Some approaches use regular expressions (regex for short) to detect the context of the data, but a unit price, calculated price, discount price, and total price all look the same, and again could be shuffled around column-wise depending on the invoice. Others make basic attempts at image analysis to do table detection and try to OCR the headers for column context - but they run into the problem that invoices use different column header names for the same semantic meaning, and some invoices print the headers in inverse coloring (white text on black background). Of these, this is probably the best automation approach, but it is very immature at the moment.
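As a tiny illustration of the regex problem, here is a minimal sketch (the currency pattern and the sample cells are made up by me, assuming simple US-style amounts): every money-looking cell matches the same pattern, so the pattern alone tells you nothing about which column is which.

import re

# A money pattern matches a unit price, a discount, and a line total equally well.
MONEY = re.compile(r"\$?\d{1,3}(?:,\d{3})*\.\d{2}")

cells = ["12.50", "112.50", "11.25", "101.25"]  # unit price? extended? discount? total?
for cell in cells:
    print(cell, bool(MONEY.search(cell)))
# Every cell matches; nothing in the pattern says which column is which,
# and the column order itself changes from one invoice layout to the next.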

All good attempts to automate the unstructured tabular capture problem, and maybe in controlled scenarios they work great. But in the real world, let's face it - a human being will need to help figure out how the table is structured and the context of the data elements so they can be captured appropriately (whether by OCR or manually again doesn't matter), and that help needs to be given in a way that is efficient and productive.
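For what that human-assisted approach might look like, here is a hypothetical sketch: an operator marks the column boundaries and labels once for a given layout, and OCR words (with their pixel positions) are bucketed into labeled rows and columns. The word list, column map, and bucket function are all made-up sample code of mine, not any vendor's API:

# (text, x, y) -- x/y would normally come from the OCR engine's word geometry
words = [
    ("ABC-100", 120, 900), ("Widget",  420, 900), ("2", 980, 900), ("19.98", 1150, 900),
    ("XYZ-200", 120, 960), ("Grommet", 420, 960), ("5", 980, 960), ("12.45", 1150, 960),
]

# Human-supplied context: column label -> (x_min, x_max) as drawn by an operator
columns = {"item_id": (100, 400), "description": (400, 950),
           "quantity": (950, 1100), "total": (1100, 1300)}

def bucket(words, columns, row_tolerance=20):
    """Group words into rows by y position, then into columns by the human-drawn x ranges."""
    rows = {}
    for text, x, y in words:
        row_key = round(y / row_tolerance)          # words on roughly the same line share a row
        label = next((name for name, (lo, hi) in columns.items() if lo <= x < hi), None)
        if label:
            rows.setdefault(row_key, {})[label] = text
    return [rows[k] for k in sorted(rows)]

for row in bucket(words, columns):
    print(row)
# {'item_id': 'ABC-100', 'description': 'Widget', 'quantity': '2', 'total': '19.98'}
# {'item_id': 'XYZ-200', 'description': 'Grommet', 'quantity': '5', 'total': '12.45'}

The automation does the tedious part (reading and bucketing), while the human supplies the one thing the machine keeps getting wrong: the structure and meaning of the columns.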

Posting here in case anyone has found anything. If not, and you stumbled onto this blog hoping to solve this specific problem -- at least you are not alone!

Sunday, December 04, 2011

JavaEE 6 app servers compared

Thank you Antonio!

Baseline/platform sizing of different JavaEE containers (disk, RAM, startup time).

http://agoncal.wordpress.com/2011/10/20/o-java-ee-6-application-servers-where-art-thou/

The more complex metrics of scalability (CPU/memory increase as you add more load), performance (first call as well as high concurrency), and cluster/HA require holding the OS/hardware/VM/JVM constant, which takes quite a bit more setup and time. At least the measurements above are relatively constant.