Thursday, December 08, 2011

Document Capture and Tables/Tabular/Invoices (ocr)

One of the roles I fulfill is working heavily in Data, or Document, Capture.

This covers a wide range:

Document Capture (or Document Content Management/Records Management as the modern term) - Index a couple of fields to be able to search/retrieve the image/document later. The second part is where you store, search, retrieve after the indexes have been captured, but that's for another time and not the focus here.

Data Capture - Collect information from paperwork for use by systems. The original image/document is not relevant after capture except as a reference. Usually unstructured documents or low volume documents.

Forms Processing - Collect information from paperwork in a fast, repeatable process. The original image/document is not relevant after capture except as a reference. Forms processing is an advanced form of Data Capture for consistent forms (structured documents), where the data elements are always in the same location on the form and there is (practically) no variance in the forms/data locations.

Back to the topic at hand - Tabular Capture, or being able to OCR and Key information that is in table format from images that may have come from output systems, scanning, faxing, or other means and trying to turn it BACK into data.

How do we obtain information from tables on paper?

Forms Processing - one answer, zones. Forms Processing is designed to collect information from data points on the image/document where the data element is always in the same position. If the first column/first row of a table is always 5" from the top, 1.35" from the left side, has a width of 2" and a height of 1", you zone that area. By zoning, OCR knows exactly where to go for the information, and can be tuned in how it reads the elements (I only expect numeric values here, so there will be no lowercase L's, O's, uppercase I's, or Z's). Zoning also makes manual entry easier, since the operator can look directly at the location. And then exporting, hey, you already know the context of the data element because it was in a specific location, so you already know it is row 1/column 1 and can put it in the right place in your export.
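To make the zone idea concrete, here is a minimal sketch in Python of the example zone above, converted from inches to a pixel box for a given scan resolution. The `Zone` class, the `clean_numeric` helper, and the look-alike character mapping are all my own illustrative names and assumptions, not part of any capture suite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """A fixed capture zone, measured in inches from the page's top-left corner."""
    name: str
    top_in: float
    left_in: float
    width_in: float
    height_in: float

    def to_pixels(self, dpi: int) -> tuple:
        """Return (left, top, right, bottom) in pixels for a scan at `dpi`."""
        left = round(self.left_in * dpi)
        top = round(self.top_in * dpi)
        right = round((self.left_in + self.width_in) * dpi)
        bottom = round((self.top_in + self.height_in) * dpi)
        return (left, top, right, bottom)


# The example zone from the text: row 1 / column 1 of the table.
row1_col1 = Zone("row1_col1", top_in=5.0, left_in=1.35, width_in=2.0, height_in=1.0)

# Letters that OCR tends to confuse with digits. This mapping is illustrative
# and only makes sense because the zone is declared numeric-only.
DIGIT_LOOKALIKES = str.maketrans({"l": "1", "I": "1", "O": "0", "o": "0", "Z": "2"})


def clean_numeric(raw: str) -> str:
    """Normalize raw OCR text that came from a numeric-only zone."""
    return raw.strip().translate(DIGIT_LOOKALIKES)
```

At 300 dpi, `row1_col1.to_pixels(300)` gives the crop box to hand to OCR or to highlight for a manual keyer, and the zone name already carries the row/column context needed for export.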

Phew.....lots of good stuff with Zones, sometimes called 'Zonal OCR'. And you don't even need OCR to use zones: just setting up zones for manual capture and your export is a gain. Downside? Lots of time in setup and tuning. Lots of time. And you need the right tools in your capture suite to support it.

So...what happens when the paperwork has tables but the paperwork is sporadic, inconsistent, unstructured, and may have a high rate of change that you not only have no control over, but also get no upfront notification of? Examples you ask -- Invoices are the biggest culprit, but there are many others out there.

Answer? Well....this is where some companies have innovative approaches to the problem, but from my point of view nothing has been great yet. The column locations are likely different between tables (i.e. the first column on one invoice is the product ID, on another it is the description, on yet another it is the quantity). Some approaches use regular expressions (regex in shorthand) to detect the context of the data, but a unit price, calculated price, discount price, and total price all look the same and again could be shuffled around column-wise depending on the invoice. Others make some basic attempts at image analysis to do table detection, and try to OCR the headers for context of the columns. These run into the problem that invoices have different column header names for the same semantic meaning, while in others the headers have inverse coloring (white text on black background). Of all of them, this is probably the best automation approach, but it is very immature at the moment.
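A small Python sketch of why regex alone falls short. The pattern and the two invoice rows are made up for illustration: the regex can tell you which cells are money-shaped, but nothing about which one is the unit price, the discount, or the total.

```python
import re

# A money-shaped token: optional "$", digits with optional commas, two decimals.
MONEY = re.compile(r"\$?\d[\d,]*\.\d{2}")


def money_columns(row):
    """Return the indexes of the cells in a row that look like a money amount."""
    return [i for i, cell in enumerate(row) if MONEY.fullmatch(cell.strip())]


# Two hypothetical invoice lines carrying the same data in a different column order.
invoice_a = ["AB-100", "Widget", "3", "10.00", "1.50", "25.50"]
invoice_b = ["Widget", "25.50", "AB-100", "10.00", "3", "1.50"]

cols_a = money_columns(invoice_a)  # three money-shaped cells
cols_b = money_columns(invoice_b)  # three money-shaped cells, different positions
```

Both rows come back with three candidate columns, and the pattern gives no way to label them. That context has to come from somewhere else, such as headers, arithmetic checks between the columns, or a person.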

All good attempts to automate the unstructured tabular capture problem, and maybe in controlled scenarios they work great. But in the real world, let's face it - a human being will need to help figure out how the table is structured and the context of the data elements so it can be captured appropriately (whether OCR or manual again doesn't matter), and that help needs to be given in a way that is efficient and productive.

Posting here in case anyone has found anything. If not, and you stumbled on this blog in the hope of solving this specific problem -- at least you are not alone!

Sunday, December 04, 2011

JavaEE 6 app servers compared

Thank you Antonio!

Baseline/platform sizing of different JavaEE containers (disk, ram, startup).

The more complex metrics of scalability (cpu/mem increase as you add more load), performance (first-call as well as high concurrency), and cluster/ha require constants on the OS/hardware/VM/JVM that take quite a bit more setup and time. At least the above are relatively constant.

Friday, June 03, 2011

hypervisor (vm) and jvm (java) and SLA and costs

I've been testing several approaches to optimize the platform that the applications run on. This blog post is just a brain dump without any clear direction other than current thoughts.

Most of the applications I work with would fall under the equivalent of the JavaEE6 web-profile (jpa/web or jpa/ejb/web), with a couple that have messaging that, in reality, could be modified to work with other async-style approaches (while messaging also supports distributed work efforts, most of the applications aren't reaching a critical mass where they need to distribute that work).

So, what are we talking about platform wise?

*jboss or tomcat (or, more appropriately, the new TomEE as an option)


*OS to run it on (preferably with iSCSI and similar large-disk-space mounting support).

*hypervisor to run multiple guest OS/vm/appcontainers.

Some of the general goals are to reduce diskspace/memory and maximize the number of applications that can run on a piece of hardware, while still protecting or segregating applications from each other so that if something goes wrong in our haste for 'time to market', an application will only hurt itself and not any others. Failover/disaster-recovery is also a consideration, with a minor emphasis on time-to-increase-capacity-and-associated-downtime, but that is not as critical.

App Container

Jboss has been doing some wonderful things with the new jboss7 AS stack. I haven't finished my memory review, but I hope they got the 'memory bloat' under control. Jboss 4.0.x series with one application can run in under 128MB in most cases, while Jboss 5.x and 6.x series for the SAME app need to double-to-triple to 256MB/364MB.

-jboss deployment bonus: The ability to deploy an application's 'configuration' beside it as a SAR in the same deployment directory as the application WITHOUT needing to modify the server itself is HUGE. I do not understand why people do not take more advantage of the SAR benefits. You create your application binary once, then vet/test with one SAR configuration, take the SAME binary to your staging/pre-deploy/uat/stress-testing/etc environments with different SAR configurations, then again move the SAME binary to production with a different SAR configuration. What you tested is what went live.

-And, once you set up the SAR configuration for the environment...leave it there and update the application binary with changes (assuming no additional configurations). The fewer variables to mess around with, the better!

TomEE is a new player and I haven't reviewed it yet.

Jonas unfortunately has never piqued my interest.

Geronimo & Glassfish are additional options, but also do not provide any significant reason to change from Jboss (which I have the most experience/skill in).

Tomcat/Jetty are decent web-only platforms, but are not considered part of the strategy because they cannot support the full stack that is needed.

Conclusion: Jboss still in the win, but if Memory is a constraint be wary of jboss5/6 versus the older jboss 4.0.x series. The new Jboss7AS is a significant rewrite and will hopefully address this, as well as additional scenarios.


This is where it gets interesting....

*jboss again comes out with the Boxgrinder project so that you can have predictable/repeatable platforms. This is kind of an outsider as it doesn't directly relate to any of the above areas, but is a way towards combining and using them in a cool (or more predictable...less variables) fashion.

*Azul has their new Zing JVM/OS combo-solution that will run on hypervisors (and is optimized for them). But it comes at a price of $5k-$6k per 'server', and I haven't touched, tested, or discussed whether a 'server' represents a single JVM that can run multiple appcontainers or not.

*Oracle has a not-very-discussed JVM/OS combo-solution that will also run on hypervisors called Maxine Virtual Edition:
-GPL licensed/forever open sourced.
-takes cues from openJDK, so will continue to keep updated with recent JDK updates.
-not 'production' ready...if this can get some more steam, this is definitely a good place to go.

Away from the cool stuff, and back to reality --

Just Enough Operating System (JEOS) continues to be a buzzword but with no real meat or applied solutions. The Boxgrinder project above does try to help with some pre-defined approaches to a JEOS for the different linux OS distributions. CentOS is still a popular choice for low-cost options, and the guys there are trying their best to get CentOS 6 out the door even while RHEL 6.1 gets released. If you want the faster turnaround, pay for RHEL and get the benefit of testing and security announcements; otherwise CentOS is free, but help them out.

hypervisor (virtualization)

Hypervisor battle is pretty hot right now, with no real clear winner yet.

With Xen and KVM as the current front-runners on the open-source server hypervisor segment (with others close behind), it's not really black and white which one to pick although Xen has a little bit of an edge with Citrix backing and Paravirtualization support.

VMWare, hyper-v (which announced CentOS support?!), and other commercial products also offer some competitive advantages over the open source alternatives (for a price).

Wednesday, May 04, 2011

Alfresco as an Image Archive Server (TIFF/fax/scan images)

Currently evaluating Alfresco CE 3.4.d for use as an Image Archive/Record Content Management Server. The requirement is to store multi-page TIFF images that have 2-6 custom attributes that must be searchable to retrieve the associated images.

The most common use case that doesn't involve company-specific attributes is storing incoming fax images, where you want to store attributes such as the number dialed to come in (enterprise w/ DID or similar fax setup), the date it came in, and the number it came from (if available). For the number dialed in, you could instead say 'Department'.

Anyway, this post isn't about the custom attributes piece, this is for the image piece.

Req 1, allow to store and view multipage TIFF images (preferably without requiring a TIFF plugin that will likely change on Office upgrades).
Alfresco by default does not handle multipage TIFF. In fact, in 3.4.d the supplied ImageMagick doesn't even support TIFF (run 'convert -list configure' from /alfresco/common/bin and check the DELEGATES line: you should see TIF and it isn't there). 3.4.e DOES support TIF, but only for Windows and 64-bit Linux, and only the *first* page of the TIF.
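The DELEGATES check can be scripted. The following is a rough Python sketch, assuming the `convert -list configure` output has a line starting with DELEGATES followed by space-separated delegate names (I accept either `tiff` or `tif` as the name since builds may differ). The function names are mine.

```python
import subprocess


def delegates_from_configure(configure_output: str) -> set:
    """Extract the delegate names from `convert -list configure` output."""
    for line in configure_output.splitlines():
        parts = line.split()
        if parts and parts[0].rstrip(":") == "DELEGATES":
            return {name.lower() for name in parts[1:]}
    return set()


def supports_tiff(configure_output: str) -> bool:
    """True if the build lists a TIFF delegate (assumed spelling: tiff or tif)."""
    return bool(delegates_from_configure(configure_output) & {"tiff", "tif"})


def read_configure(convert_path: str = "convert") -> str:
    """Run ImageMagick's convert and return its configure listing."""
    result = subprocess.run(
        [convert_path, "-list", "configure"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Usage would be `supports_tiff(read_configure("/alfresco/common/bin/convert"))`, with the path adjusted to wherever the bundled binary lives in your install.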

Luckily, this wonderful community member of the open source product Alfresco already had a solution:

With additional modifications to remove ImageMagick, OpenOffice, and other ancillary services that are not needed for something that is solely a TIFF-based image server, the result is a rather slim solution that works well with the default 'SHARE' interface. I do have 3.4.d working with this solution, and will be doing a more enterprise-oriented tomcat deploy as opposed to the installer approach. I feel quite confident in how the Alfresco team architected the product to support each company's unique needs.

Current problem: The FLASH previewer is good, but the challenge with multi-page TIFF is not the tiff2pdf conversion; it's the pdf2swf step, which is taking 1/4 to 1/2 a second per page.
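To put that per-page cost in perspective, here is a quick back-of-the-envelope estimate in Python. It assumes the time scales linearly with page count, which I have not verified, and uses the 99-page test file from the research notes below as the example.

```python
def preview_seconds(pages, per_page_low=0.25, per_page_high=0.5):
    """Estimated (best, worst) pdf2swf time in seconds, assuming linear scaling."""
    return (pages * per_page_low, pages * per_page_high)


# The 99-page test TIFF from the notes in this post.
best, worst = preview_seconds(99)
```

That works out to roughly 25 to 50 seconds before the previewer has anything to show for a 99-page fax, which is why this step matters more than the tiff2pdf step.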

Research notes on TIFF to PDF conversion, for those interested:

ImageMagick 6.5.4 seems to work, but has huge/escalating memory requirements as TIFFs grow for tiff2pdf:

Memory requirements of 600MB-3GB of system RAM (non-JVM heap) per image conversion (but fast, 1-4 seconds).
The 3GB case is a 7MB test file that seems to have some bad TIF encoding. However,
3GB is only where it was measured because the process moved into swap space at that point; the real requirement may be more.

instead, use a newer version:


sudo yum groupinstall "Development Tools"
sudo yum install rpmdevtools libtool-ltdl-devel
sudo yum install djvulibre-devel tcl-devel freetype-devel ghostscript-devel libwmf-devel jasper-devel lcms-devel bzip2-devel librsvg2 librsvg2-devel liblqr-1 liblqr-1-devel autotrace-devel

rpmbuild --nodeps --rebuild   ImageMagick-6.6.9-7.src.rpm

cd /home/dhartford/build/RPMS/i686
sudo rpm -ihv --force --nodeps ImageMagick-6.6.9-7.i686.rpm

In the end, same memory requirements (600MB-3GB).

Alternatives reviewed:
An intermediate format has been suggested, such as TIFF to GIF, then GIF to PDF:
${img.exe} ${source} gif:- | convert gif:- ${target}
Slightly better, but the edge case of 3GB RAM still occurs. Also increases diskspace with the additional format.

Switches to work around potential problem areas do not seem to matter:
${img.exe} -monochrome -compress Fax ${source} ${target}
No difference.

TIFF to PNG, may get more performance from GraphicsMagick:
--not tested

libtiff has a direct **tiff2pdf** that simply 'wraps' the image with PDF headers without doing dpi/sizing/re-rendering like the ImageMagick/GraphicsMagick approach (which, under the covers, uses libtiff to read the TIFF, then sends the resulting image through image processing for dpi/resolution modifications, and then sends it to Ghostscript to generate the resulting PDF).

BEST OPTION from testing: tiff2pdf. Testing of the tiff2pdf modification seems to come out around:
Memory requirements of 10MB-80MB of system RAM (non-JVM heap) per image conversion, and fast at ~1 second.
--some issues with bad TIF encoding: messages sent to stdout/stderr come with a non-zero exit status, which prevents completion in the Alfresco transformer.
Asking the mailing list if there is a quiet/silent mode that makes a best attempt at conversion without
causing the exit status.
There is no 3GB RAM issue (instead 80MB over ~10 sec for the 7MB TIFF/99 pages).
*NOTE: The 7MB example came back as 99 pages in the SWF previewer. Using separate system TIFF and PDF viewers, also 99 pages, so consistent.
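Until there is an answer on a quiet mode, one possible workaround for the exit status problem is a small wrapper that judges the conversion by its output rather than by the exit code. This is only a sketch of that idea in Python, not what the Alfresco transformer does. The function names are mine, and the only tiff2pdf option used is `-o` for the output file.

```python
import os
import subprocess


def best_effort_ok(target_path: str) -> bool:
    """Treat the conversion as usable if a non-empty output file was written."""
    return os.path.isfile(target_path) and os.path.getsize(target_path) > 0


def tiff_to_pdf(source: str, target: str, tiff2pdf: str = "tiff2pdf"):
    """Run tiff2pdf and return (usable, exit_status, stderr_text).

    A non-zero exit status is reported but not treated as fatal on its own,
    since warnings about bad TIF encoding can trigger it.
    """
    # Remove any stale output so an old file is never mistaken for success.
    if os.path.exists(target):
        os.remove(target)
    result = subprocess.run(
        [tiff2pdf, "-o", target, source],
        capture_output=True, text=True,
    )
    return best_effort_ok(target), result.returncode, result.stderr
```

The trade-off is that a truncated PDF would also pass this check, so comparing page counts afterwards (as in the 99-page note above) is still worth doing.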

Research notes on the PDF viewer(s) when used with TIFF 2 pdf conversion:

version 0.8.1 does not paginate tiff2pdf conversions, causing repeating cycle in the flash previewer.

NOTE: alternate viewer:

REVIEWED:, only has rpms up to 0.8.1, and there have been several releases since then.

TODO: 64-bit centos binary:

mkdir /opt/swftools
cd /opt/swftools

tar -xzvf swftools-0.9.1.tar.gz

yum install zlib-devel libjpeg-devel giflib-devel freetype-devel gcc gcc-c++ make

cd swftools-0.9.1
./configure --disable-lame  --prefix=/opt/swftools/swftools-0.9.1-bin/
make install

Diskspace footprint for /opt/swftools including source code, configure, make, and binary:

Tuesday, February 15, 2011

Javamelody performance & usage statistics

One of the hidden gems in the open source world is a project called Javamelody.

I've been using this since late 2009 to help refactor/modify design and code based on usage-based findings. It is not a profiler, not a click-n-fix, not a quickly-fix-your-problems tool. It is a tool to get you the information, over time, that you need to make Strategic decisions about design/code.

It gets all tiers of statistics within a single application -> the application's UI calls, business (ejb/facade/spring) calls, and sql calls.

Recently I finally submitted a patch for GWT-RPC detailed statistics I've been using for a while to help, again from a strategic point of view, refine some products.


Monday, January 24, 2011

Web UI upgradability

One of the areas that has been an issue over time is taking an application, say deployed to jboss 3.0 or 3.2.3, and trying to upgrade it to jboss 4.0.5. Or tomcat 4 to tomcat 5. Or any upgrade at all.

In my real-world experience, struts (1.0/1.1)/JSP sites have several challenges in upgrading. Whether the problems are container-based or implementation-based, I have never had success with 'easy' upgrades.

JSF appears to have similar issues. I myself kept running into performance issues every time I attempted a JSF implementation, so I am relying on this post to confirm similar issues:

Now, onto one known savior - GWT is upgrade compatible. I have successfully upgraded 1.3 to 1.5, 1.5 to 2.0, 2.0 to 2.1, and 1.3 to 2.0 (I haven't tried direct to 2.1). The only upgrade issues were 1.3 to higher versions dealing with RPC changes.

Monday, January 17, 2011

Testing GWT-RPC, and why to be careful about jumping to RequestFactory

Just a copy of what I posted on StackOverflow:

The only caveat I would put in is that RequestFactory uses the binary data transport (deRPC maybe?) and not the normal GWT-RPC.

This only matters if you are doing heavy testing with SyncProxy, Jmeter, Fiddler, or any similar tool that reads/evaluates the contents of the HTTP request/response. That works with GWT-RPC, but would be more challenging with deRPC or RequestFactory.