Zonal OCR, Rubberbanding and Machine Learning

Hello all.

This post is a thought exercise about a challenge I haven’t yet worked out how to approach: automatic classification, splitting and renaming of PDFs.

I assume MBS’s OCR and DynaPDF plug-ins already can handle some of what I’m asking, provided I code everything from scratch. I’m asking for ideas on implementation in case there are other approaches or specific tips and tricks you can think of. Or perhaps other tools I can use as helpers that may be easier to implement.

Of course, if there’s ready-made software that already does what I want in a local installation then that’s also a great choice. I’d rather not reinvent the wheel, but most of what I’ve seen is server-side enterprise solutions or cloud-based ones I’m forbidden from using.

This is for the company I work for, so developing something is only one of the options and we could happily go with a third party tool (developed in Xojo or not), in case you know of any.

What we need is a way to scan a stack of files into a single PDF and have a program do something with the PDF depending on content.

This stack of papers is a contract package that includes several pages produced by us (so I can include a barcode to identify them), documents included by the customer (which I can identify by specific text in them) and some documents like identity cards or powers of attorney (which would need shape recognition).

1.-Search for specific barcodes within the PDF pages (barcode recognition)
2.-Search for specific text (OCR)
3.-Try to identify specific shapes and formats (for example, an ID card or legal seals for powers of attorney)
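
One way to picture the three checks above is as a cascade that tries the cheapest, most reliable detector first (barcode), then text, then shape. This is only a minimal sketch: the `Finding` record, `classify_page` and `classify_stack` are names made up for illustration, and the detectors are plain callables you would back with whatever barcode/OCR/shape tool you end up using.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

@dataclass
class Finding:
    page: int     # 0-based index of the page in the scanned stack
    method: str   # name of the detector that matched, or "none"
    label: str    # what that detector thinks the page is

# A detector receives a page object and returns a label, or None if nothing matched.
Detector = Callable[[object], Optional[str]]

def classify_page(page_index: int, page: object,
                  detectors: Sequence[Tuple[str, Detector]]) -> Finding:
    """Run the detectors in order and stop at the first one that matches."""
    for method, detect in detectors:
        label = detect(page)
        if label is not None:
            return Finding(page_index, method, label)
    return Finding(page_index, "none", "unidentified")

def classify_stack(pages: Sequence[object],
                   detectors: Sequence[Tuple[str, Detector]]) -> list:
    """Classify every page of the stack, keeping scan order."""
    return [classify_page(i, page, detectors) for i, page in enumerate(pages)]

# Stand-in detectors working on dicts; real ones would call a barcode/OCR library.
demo_detectors = [
    ("barcode", lambda p: p.get("barcode")),
    ("text", lambda p: "invoice" if "INVOICE" in p.get("text", "") else None),
]
demo_pages = [{"barcode": "CONTRACT"}, {"text": "INVOICE No. 12"}, {"text": "hello"}]
demo_findings = classify_stack(demo_pages, demo_detectors)
```

Pages that no detector recognizes still get a `Finding`, so they can be shown to the user for manual review instead of being silently dropped.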

What the program should do is split the PDF stack of pages into three parts:
1.-Documents originated by our systems (all of which have a barcode identifying type of document, contract ID and page)
2.-Documents included by the customer contractually required (typically their ID and powers of attorney)
3.-Invoice

Each part should then be a separate PDF, with the documents it contains in a specific order and with a Table of Contents showing which page holds which document. For example, the PDF for the second part could have three pages, and the TOC would specify which one is the ID, which one is the Power of Attorney and which one couldn’t be identified.
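
To make the TOC idea concrete, here is a small sketch that turns a list of per-page labels (already in final order) into TOC entries. `build_toc` is a hypothetical helper, and it assumes that consecutive pages with the same label belong to one document, which would merge two different documents of the same type if they sit next to each other.

```python
def build_toc(labels):
    """labels: one label per page of the output PDF, in final order.

    Returns (label, first_page, last_page) tuples with 1-based page numbers.
    Consecutive pages sharing a label are collapsed into one entry.
    """
    toc = []
    for page_number, label in enumerate(labels, start=1):
        if toc and toc[-1][0] == label:
            # Same document continues: extend the last entry's page range.
            toc[-1] = (label, toc[-1][1], page_number)
        else:
            toc.append((label, page_number, page_number))
    return toc

demo_toc = build_toc(["ID", "Power of Attorney", "Power of Attorney", "Unidentified"])
```

Each tuple could then be written as a bookmark/outline entry by whichever PDF library does the actual splitting.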

I had thought that the best way to do this would be to analyze the PDF and make a list of all that’s encountered that fits the bill, and use that list to come up with the ordering, splitting and naming.
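
As a rough sketch of that "make a list first, decide later" approach: the code below takes a list of findings and works out which scanned pages go into which of the three parts, and in what order. Everything specific here is an assumption for illustration: the `TYPE|CONTRACT_ID|PAGE` barcode payload, the document types in `OWN_DOC_ORDER`, and the `(scan_index, kind, detail)` tuple shape.

```python
# Hypothetical ordering of our own document types inside part 1.
OWN_DOC_ORDER = ["CONTRACT", "ANNEX", "RECEIPT"]

def parse_barcode(payload):
    """Split an assumed 'TYPE|CONTRACT_ID|PAGE' payload into its fields."""
    doc_type, contract_id, page = payload.split("|")
    return doc_type, contract_id, int(page)

def plan_parts(findings):
    """findings: list of (scan_index, kind, detail) tuples in scan order.

    Returns a dict mapping each part name to the scan indexes it should
    contain, already in output order.
    """
    own, customer, invoice = [], [], []
    for scan_index, kind, detail in findings:
        if kind == "barcode":
            doc_type, _contract_id, page = parse_barcode(detail)
            # Unknown document types sort after all known ones.
            rank = (OWN_DOC_ORDER.index(doc_type)
                    if doc_type in OWN_DOC_ORDER else len(OWN_DOC_ORDER))
            own.append((rank, page, scan_index))
        elif kind == "text" and detail == "invoice":
            invoice.append(scan_index)
        else:
            # IDs, powers of attorney and anything unidentified.
            customer.append(scan_index)
    return {
        "own": [scan_index for _rank, _page, scan_index in sorted(own)],
        "customer": customer,
        "invoice": invoice,
    }

demo_plan = plan_parts([
    (0, "barcode", "ANNEX|C-1|1"),
    (1, "shape", "id_card"),
    (2, "barcode", "CONTRACT|C-1|2"),
    (3, "text", "invoice"),
    (4, "barcode", "CONTRACT|C-1|1"),
    (5, "none", ""),
])
```

Because our own pages carry the document type and page number in the barcode, they can be put back in the right order even if the stack was scanned shuffled; customer pages simply keep their scan order.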

Rubberbanding comes up in the interactive part. Sometimes recognition may not be accurate but you could manually select a part of a document and have it be recognized and/or “learned”, so in the future it uses the same zone for matching.

ML/AI was something I was thinking about: since every time you validate an automated selection or manually rubberband a template you’re essentially “teaching” the app, it could use that feedback to make better predictions.
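
Short of real machine learning, the "teaching" could start as simple bookkeeping: store each rubberbanded zone per document type and count how often it was confirmed or rejected. The sketch below is only that counting heuristic, not an ML model, and `Zone` and `ZoneStore` are hypothetical names. Zones are stored as fractions of the page size, on the assumption that scans may come in at different resolutions.

```python
from dataclasses import dataclass

@dataclass
class Zone:
    # Rectangle as fractions of page width/height (0.0 to 1.0).
    left: float
    top: float
    width: float
    height: float
    confirmed: int = 0
    rejected: int = 0

    def score(self):
        # Smoothed success rate: a zone with no feedback yet scores 0.5.
        return (self.confirmed + 1) / (self.confirmed + self.rejected + 2)

    def to_pixels(self, page_width, page_height):
        """Convert the relative rectangle to (x, y, w, h) in pixels."""
        return (round(self.left * page_width), round(self.top * page_height),
                round(self.width * page_width), round(self.height * page_height))

class ZoneStore:
    def __init__(self):
        self.zones = {}  # document label -> list of Zone

    def learn(self, label, zone):
        """Remember a zone the user rubberbanded for this document type."""
        self.zones.setdefault(label, []).append(zone)

    def feedback(self, zone, ok):
        """Record whether matching with this zone was validated by the user."""
        if ok:
            zone.confirmed += 1
        else:
            zone.rejected += 1

    def best(self, label):
        """Return the zone with the best track record, or None if unknown."""
        return max(self.zones.get(label, []), key=Zone.score, default=None)

store = ZoneStore()
top_left = Zone(0.25, 0.5, 0.5, 0.125)
footer = Zone(0.0, 0.875, 1.0, 0.125)
store.learn("power_of_attorney", top_left)
store.learn("power_of_attorney", footer)
store.feedback(top_left, ok=False)
store.feedback(footer, ok=True)
```

After the feedback above, `store.best("power_of_attorney")` prefers the footer zone, so the next scan tries that region first.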

Here are some sample document management systems that already do some of what I want, but are either absurdly expensive (as they’re designed for huge multinationals with thousands upon thousands of documents per day, like insurance claims management) or can’t be used because they’re cloud-based:

OpenKM, Zonal OCR: https://www.openkm.com/old/en/openkm-zone-ocr.html
ScanToPDF, Barcode Recognition: https://www.scantopdf.com/product/barcode-recognition
SimpleIndex, closest to what I want: https://www.simpleindex.com/Barcode_Recognition/PDF_barcode_recognition.asp
Parascript: https://www.parascript.com/form-automation/
Zbar, Open Source Barcode Recognition in images: https://en.wikipedia.org/wiki/ZBar
Rossum, AI-based document form data extraction: https://rossum.ai
DocParser, PDF Document data extraction: https://docparser.com

I missed Chronoscan: https://www.chronoscan.org/features_zonal_ocr.asp

I would suggest looking at:
https://cloud.google.com/vision/docs/ocr?refresh=1

and especially this: https://aws.amazon.com/textract/?nc2=h_ql_prod_ml_text
… which is a new service that I have not had time to try out but looks very promising.

[quote=463983:@Jim Meyer]I would suggest looking at:
https://cloud.google.com/vision/docs/ocr?refresh=1

and especially this: https://aws.amazon.com/textract/?nc2=h_ql_prod_ml_text
… which is a new service that I have not had time to try out but looks very promising.[/quote]

I just realized I never replied to this :)

This looked very interesting, but for what I wanted to do, Internet access would generally not be available, and even if it were, the information is sensitive enough that sending it to a cloud service would not be desirable.