Teaching Your Application to Read

Geeky salesmen have spent decades selling us a pipe dream of the end of “tree-based communication”. To date not much has changed, and I’m not holding my breath. In the meantime, what do we do with all this paper? If, like me, you are dabbling with ways to get your code to convert the physical to the digital, I’ve got news for you: smarter people have already solved the problem for you, and it’s simply a matter of plug and play.

There is a part of the world that we computer nerds often tend to hate: hard copy. A physical information format you can actually hold in your non-Neo physical hands: paper. I haven’t just gone out and become a print/paper guy overnight, but I have been playing with a project that takes paper in and gives digital goodness out because, whether you subscribe to reality or not, the plain and simple fact is that this medium isn’t going anywhere soon.

Let me get my glasses

Building apps that behave like humans is hard. Building ones that see the world and interpret it like we do is even harder – it’s because of this that developers at Google, Microsoft and many other large companies have invested so heavily in projects like reCAPTCHA so that they can try and teach computers to read the way humans do.

But what if you happen to not have millions in research funding, computer scientists filling your development teams, and still want to add the ability for your app to read documents and structured handwriting?

This is where OCR software comes in and, more directly for us developers, OCR software that offers an easy-to-use API for plugging your app into its 20/20 vision. A great tool for achieving this is the ABBYY FlexiCapture SDK, which allows you to quickly add both document OCR and, even more interestingly, document recognition and categorization to your apps. This means your app not only quickly gains the ability to read documents, invoices and printed handwriting, but also the ability to act like a mail clerk and categorize any of your documents for later storage and review.

Standing on the Shoulders of Giants

Whenever you plug a third-party API into your project you get a pretty good view of how much they’ve thought about how we lowly code monkeys will consume their product/service. To put a point on it, less is more: I want to get up and running with a product with as little friction as possible. The chaps at ABBYY have ticked the box here, and you can get a feel for their API by taking a look at what is involved in converting the following image into machine-readable XML. All I have to do is scan the image once to create a Document Definition with the parts of the document I’d like to capture, and it’s off to the races.

 

image

[DllImport("FCEngine.dll", CharSet = CharSet.Unicode), PreserveSig]
private static extern int InitializeEngine(string devSN, string reserved1, string reserved2, out IEngine engine);
[DllImport("FCEngine.dll", CharSet = CharSet.Unicode), PreserveSig]
private static extern int DeinitializeEngine();

void ProcessImage()
{
    IEngine engine = null;
    InitializeEngine("SWAT1000000000000000", null, null, out engine);

    var processor = engine.CreateFlexiCaptureProcessor();

    // Add our definition file
    processor.AddDocumentDefinitionFile("LicenceseDefinition.fcdot");

    // Add an image to scan
    processor.AddImageFile("license.jpg");

    var document = processor.RecognizeNextDocument();

    // Output a file "LicenseOutput.xml" with the results of the scan
    processor.ExportDocumentEx(document, "MyOutputFolder", "LicenseOutput.xml", null);

    DeinitializeEngine();
}

Output:

<?xml version="1.0" encoding="UTF-8"?>
<form:Documents xmlns:form="http://www.abbyy.com/FlexiCapture/Schemas/Export/FormData.xsd"
                xmlns:addData="http://www.abbyy.com/FlexiCapture/Schemas/Export/AdditionalFormData.xsd">
    <_License:_License xmlns:_License="http://www.abbyy.com/FlexiCapture/Schemas/Export/License.xsd">
        <_License>
            <_Name>McLOVIN</_Name>
            <_DriversLicenseNumber>01-47-87441</_DriversLicenseNumber>
            <_AddressLine1>892 MOMONAST</_AddressLine1>
            <_AddressLine2>HONOLULU, HI 96820</_AddressLine2>
            <_IssueDate>06/18/1998</_IssueDate>
        </_License>
    </_License:_License>
</form:Documents>
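
Once the XML is on disk, getting the captured fields back into your own code takes nothing more than the standard System.Xml.Linq classes. The following is a rough sketch rather than anything from the ABBYY samples: the LicenseXml class and ReadFields method are names I made up for illustration, and it assumes the export always has the shape shown above (an inner, un-namespaced _License element holding one child element per field).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper (not part of the ABBYY SDK): turns the exported XML into name/value pairs.
static class LicenseXml
{
    public static Dictionary<string, string> ReadFields(string xml)
    {
        XDocument doc = XDocument.Parse(xml);

        // The outer <_License:_License> element is namespaced; the inner <_License>
        // that actually holds the fields is not, so a plain name matches only the inner one.
        XElement license = doc.Descendants("_License").First();

        // Strip the leading underscore from each field name: "_Name" becomes "Name".
        return license.Elements().ToDictionary(
            field => field.Name.LocalName.TrimStart('_'),
            field => field.Value);
    }

    static void Main()
    {
        // Trimmed copy of the export shown above, inlined so the sketch runs on its own.
        const string xml = @"<form:Documents xmlns:form=""http://www.abbyy.com/FlexiCapture/Schemas/Export/FormData.xsd"">
  <_License:_License xmlns:_License=""http://www.abbyy.com/FlexiCapture/Schemas/Export/License.xsd"">
    <_License>
      <_Name>McLOVIN</_Name>
      <_IssueDate>06/18/1998</_IssueDate>
    </_License>
  </_License:_License>
</form:Documents>";

        foreach (var pair in ReadFields(xml))
            Console.WriteLine($"{pair.Key}: {pair.Value}");
    }
}
```

To read the real export instead of the inlined string, XDocument.Load pointed at the LicenseOutput.xml file in MyOutputFolder should do the same job.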

As a web developer by trade I don’t spend much of my time close to the bare metal, so the calls to unmanaged code shown above make me feel a little dirty. Still, the fact that I only need such a tiny amount of code to get to a working solution makes it well worth my while.

And before you go thinking I’ve spent hours and hours with the product: luckily for me, ABBYY has been nice enough to provide a huge bunch of working code samples in their SDK pack. Code samples are the commercial equivalent of providing access to their unit tests, so you can see how to actually build stuff. Less than 10 minutes in, and I have my app scanning fake Hawaiian driver’s licenses – awesome!

image

Although I won’t be covering it in this blog post, one of the more exciting features that comes along with the FlexiCapture SDK is the ability to self-learn the documents that you feed it. As if all those warnings Sarah Connor gave us had fallen on deaf ears…

Input/Output

You’ll note that in my code above I load a JPEG image into the FlexiCapture SDK as a source, but whether you’re scanning documents, screenshots, video frames, or even a blurry smartphone photo, the FlexiCapture SDK can take in a feast of formats in all sorts of quality and DPI:

  • PDF
  • Bitmap
  • PCX
  • JPEG Black and white
  • JPEG 2000 (12 subformats)
  • TIFF (10 subformats)
  • GIF (because we all take high-res photos/scans as GIF these days)
  • PNG
  • DjVu
  • JBIG2
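
If your images arrive in a drop folder from mixed sources, a small pre-filter can decide which files are worth handing to the engine at all. This is only a sketch under my own assumptions: the InputFilter class is hypothetical, and the extension list is my guess at the usual file extensions for the formats listed above, not something taken from ABBYY's documentation.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper (not part of the ABBYY SDK).
static class InputFilter
{
    // Assumed extensions for the formats in the list above; adjust to match the SDK documentation.
    static readonly HashSet<string> Extensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".pdf", ".bmp", ".pcx", ".jpg", ".jpeg", ".jp2",
            ".tif", ".tiff", ".gif", ".png", ".djvu", ".jb2"
        };

    // True when the file name ends in one of the assumed extensions (case-insensitive).
    public static bool LooksSupported(string path) =>
        Extensions.Contains(Path.GetExtension(path));

    // Lazily lists the files in a folder that pass the extension check.
    public static IEnumerable<string> FindCandidates(string folder) =>
        Directory.EnumerateFiles(folder).Where(LooksSupported);

    static void Main()
    {
        foreach (var name in new[] { "license.JPG", "invoice.tiff", "notes.txt" })
            Console.WriteLine($"{name}: {LooksSupported(name)}");
    }
}
```

Each path that FindCandidates returns could then be passed to processor.AddImageFile, the same call used in the sample earlier.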

This means you can easily write applications that take all sorts of image or scanner data in and give you easy-to-read data out – you can see how receiving faxed, emailed or uploaded images from disparate sources becomes a lot easier for your application to make sense of. Once it’s taken a peek at your incoming data sources, it can then pump data out in a whole bunch of other formats that support structured data. Take a look at my McLovin output in a bunch of formats:

CSV Text File

image

Excel File

image

A full dBase Database (thanks to DBF Viewer Plus for allowing me to even see this!)

image

It also allows export to XML (shown in my initial classy sample), and a whole heap of image formats as well – although since you’ve just imported an image, this might not come in overly handy except for archival.

You do get the picture though – ABBYY FlexiCapture isn’t just a one-trick pony, but more a Swiss Army knife for character recognition and image-to-text scanning.

What makes this even more powerful is that it does all of the above recognition input/output trickery in 198 languages.

That’s right, I said:

198 languages!!!!

This is priceless when you consider that 91% of the world doesn’t speak (and therefore write) in English, so I'm blown away by this number.

Included Tooling

We really should have listened to Sarah Connor, as one of the coolest features of the ABBYY FlexiCapture SDK is that it auto-learns documents for later classification. If you are scanning large numbers of documents, it can learn to read and classify new types of documents simply by you feeding them through the API – handy for times when a supplier or client changes the design of their invoices.

For everything else there is a great set of included tooling, with apps for classifying documents, defining new document definitions and, for reviewers, correcting captured text that has a low confidence level. They even have a form designer in case you would like to create new forms for later scanning.

image

The review console

image

The document definition studio for defining fields

image

Form designer in action

Summary

If you’re working on a .NET project that needs to consume paper data sources such as forms, invoices, or emailed ID information for later review, the ABBYY FlexiCapture SDK is definitely worth taking a look at for your next OCR project. I know that the speed to get up and running really did make my life easier, and all of the tooling that comes along with it made the product feel like a very complete solution. As a C# guy, the only negative I would mention about the whole experience is that the C# API requires the use of non-native code, although when I consider the amount of image processing and heavy grunt work that the API must be doing in the background to pull off its magic, I understand why this decision was made.

Now we just need to figure out a way to stop Judgement Day.

 


I received the product mentioned above for free in the hope that I would mention it on my blog. Regardless, I only recommend products or services I use personally and believe my readers will enjoy. I am disclosing this in accordance with the Federal Trade Commission's 16 CFR, Part 255: "Guides Concerning the Use of Endorsements and Testimonials in Advertising."