Need advice on parsing data out of a PDF

Clifford_Antrim · July 2, 2024, 6:53pm

I know very little about the structure of PDF files, and having poked around for a couple of days I think I need some advice/guidance on how to get started.

I need to be able to parse out some data from various PDF files. The files are all quotes from several different vendors, and the one thing they have in common is that the data I need (part numbers, descriptions, costs, etc.) is all in some form of “table”. I’m not sure if the “tables” would be actual PDF tables (I think that’s a thing) or just text objects more or less lined up in rows and columns.

I have been exploring my sample PDFs with the MBS DynaPDF plugin and I can get the page coordinates of the various text objects. So one approach would be to organize the text by coords, look for some keywords that would be in the column headers, and read the row data until I’m no longer in the “table”.
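
A rough sketch of that approach in Python, assuming the text objects have already been extracted as (x, y, text) tuples with y increasing up the page. The sample `fragments`, the `HEADER_KEYWORDS` set, the tolerance, and the column count are all made up for illustration, and nothing here uses the MBS plugin's API.

```python
# Hypothetical extracted text objects: (x, y, text).
# In default PDF user space, a larger y means higher on the page.
fragments = [
    (72, 700, "Quote #1001"),
    (72, 650, "Part No."), (200, 650, "Description"), (400, 650, "Cost"),
    (72, 630, "A-100"), (200, 630.4, "Widget"), (400, 630, "12.50"),
    (72, 610, "B-200"), (200, 610, "Gadget"), (400, 609.8, "8.75"),
    (72, 560, "Terms: net 30"),
]

HEADER_KEYWORDS = {"part no.", "description", "cost"}  # assumed header text


def group_rows(frags, y_tol=2.0):
    """Group fragments whose y values lie within y_tol into rows, top to bottom."""
    rows = []
    for x, y, text in sorted(frags, key=lambda f: -f[1]):
        if rows and abs(rows[-1]["y"] - y) <= y_tol:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    for row in rows:
        row["cells"].sort()  # left to right
    return rows


def read_table(frags, n_cols=3):
    """Return the rows below the header row, stopping at the first
    row that does not have n_cols cells."""
    table, in_table = [], False
    for row in group_rows(frags):
        texts = [t for _, t in row["cells"]]
        if not in_table:
            # Keep skipping rows until one contains every header keyword.
            in_table = {t.lower() for t in texts} >= HEADER_KEYWORDS
            continue
        if len(texts) != n_cols:
            break  # no longer in the table
        table.append(texts)
    return table


table = read_table(fragments)
```

Real quotes will need more than this (wrapped descriptions, empty cells, tables that continue on the next page), so treat the stop condition as a placeholder.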

Is there a better way?

Jean-Yves_Pochez · July 2, 2024, 7:20pm

This open-source Python code could be a good start for understanding how to parse tables inside a PDF.

Clifford_Antrim · July 2, 2024, 8:03pm

thanks! I’ll check it out…

Clifford_Antrim · July 3, 2024, 8:55pm

For posterity:

The DynaPDFMBS plugin has a table object it can write into a PDF, but when writing it converts everything to text and lines and doesn’t actually create a PDF table object. And there is no way for the plugin to read a PDF table object; it just reads it as text objects and lines.

I also did some inspection of the sample files I was given and it seems that all the data for the tables is encoded in a stream object, so I can’t just read the PDF in as a text file and regex my way to finding a table.
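
To illustrate why a regex over the raw file finds nothing: page content streams are very often stored with /Filter /FlateDecode, which is zlib compression. The sketch below uses a made-up content stream and only the Python standard library. A real file would also need the stream located between `stream` and `endstream`, and may use other filters.

```python
import zlib

# A made-up page content stream: PDF text-showing operators as plain text.
content = b"BT /F1 10 Tf 72 630 Td (A-100) Tj 128 0 Td (Widget) Tj ET"

# What the file typically holds when the stream dictionary says
# /Filter /FlateDecode: the zlib-compressed bytes.
stored = zlib.compress(content)

found_raw = b"Widget" in stored        # searching the bytes as stored in the file
decoded = zlib.decompress(stored)
found_decoded = b"Widget" in decoded   # searching after decoding the stream
```

After decoding, the text operators are readable again, which is presumably the step the plugin performs when it extracts text.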

Right now it seems that extracting the text (which the plugin can do) and then analyzing each text object’s coordinates to determine rows and columns is my best shot at getting the data out.

TimStreater · July 3, 2024, 9:07pm

Then presumably you might also have something that looks like a paragraph of text but is in fact an image.

Clifford_Antrim · July 3, 2024, 10:04pm

Well, the MBS plugin can decode the streams and extract the text I’m interested in. I have no idea why they would be encoded into a stream in the first place, but they are.

Eric_Williams · July 4, 2024, 2:39am

“stream” is just PDF-speak for “a sequence of bytes” or “a sequence of objects”.

Clifford_Antrim · July 4, 2024, 1:13pm

Yes. I guess the point I was trying to make was that the table data was an encoded binary stream and I couldn’t just read it as text. But the MBS plugin can, so it’s not really a problem.

Matthew_Combatti1 · July 4, 2024, 4:39pm

You could use a vision API to extract all the desired entities in whatever format you want, or even run OCR and then have an AI-based API extract the entities into whatever schema you want. We use this method for our automated mortgage processing and underwriting software and our medical transcription software, with less error than manual entry (less than 0.01% error, compared to 1-3% for human entry or incorrect parsing). It can even parse and read handwritten text.

Clifford_Antrim · July 4, 2024, 6:39pm

that’s a thought. thank you…

John_Balestrieri · July 4, 2024, 8:50pm

Unfortunately, PDF doesn’t have the concept of tables. Or multi-line text, either. It’s loose string graphics, and the string is broken up into sub-objects if there’s non-standard spacing within a line.
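
As an illustration of how one visible line ends up in pieces: the TJ operator shows an array of strings with numeric position adjustments between them (in thousandths of a text-space unit; a negative value moves the next string further right). The sketch below rejoins such an array in Python. The stream is made up, the -200 threshold is an arbitrary guess, and the regex assumes simple literal strings with no escapes or nested parentheses.

```python
import re

# Made-up content-stream fragment with one TJ array.
stream = "BT /F1 10 Tf 72 630 Td [(Wid) 15 (get) -300 (blue)] TJ ET"

TJ_ARRAY = re.compile(r"\[(.*?)\]\s*TJ", re.S)
# Either a simple literal string "(...)" or a number.
ITEM = re.compile(r"\(([^()\\]*)\)|(-?\d+(?:\.\d+)?)")


def join_tj(stream, gap_threshold=-200):
    """Rejoin the strings of each TJ array, inserting a space where the
    adjustment is at or below gap_threshold (a large rightward shift)."""
    lines = []
    for array in TJ_ARRAY.findall(stream):
        out = []
        for text, number in ITEM.findall(array):
            if number:
                if float(number) <= gap_threshold:
                    out.append(" ")
            else:
                out.append(text)
        lines.append("".join(out))
    return lines


joined = join_tj(stream)
```

Small adjustments are usually kerning inside a word, while large ones often stand in for a space or a jump to the next column, so the threshold needs tuning per document.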

Eric_Williams · July 4, 2024, 10:22pm

I think you’re on the right track by extracting the text and analyzing the coordinates. With some careful grouping you should end up with contiguous text runs that construct a table.
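
One possible form of that grouping, sketched in Python: fragments on the same row are merged into a single run when the horizontal gap between them is small. This assumes each fragment comes with a width, as (x, width, text). The sample row and the `max_gap` value are invented for the example.

```python
def merge_runs(row, max_gap=3.0):
    """Merge (x, width, text) fragments on one row into contiguous runs.

    A fragment that starts within max_gap of the end of the previous run
    is appended to it; a wider gap starts a new run (a new cell).
    """
    runs = []
    for x, width, text in sorted(row):
        if runs and x - (runs[-1][0] + runs[-1][1]) <= max_gap:
            rx, _, rtext = runs[-1]
            runs[-1] = (rx, x + width - rx, rtext + text)
        else:
            runs.append((x, width, text))
    return runs


# Hypothetical row where "A-100" and "Widget" were each split in two.
row = [(72, 20, "A-1"), (92.5, 12, "00"),
       (200, 30, "Wid"), (230, 18, "get"),
       (400, 25, "12.50")]

cells = [text for _, _, text in merge_runs(row)]
```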

Clifford_Antrim · July 5, 2024, 1:33pm

@John_Balestrieri - I was pretty sure I read somewhere that it does and the syntax is a lot like an HTML table. I exported a table from a Pages document and Apple puts the TABLE, TR, TD keywords/names in there:

15 0 obj
<< /Type /StructElem /S /Table /P 10 0 R /K [ 16 0 R ] >>
endobj
16 0 obj
<< /Type /StructElem /S /TBody /P 15 0 R /K [ 17 0 R 18 0 R 19 0 R 20 0 R 21 0 R ] >>
endobj
17 0 obj
<< /Type /StructElem /S /TR /P 16 0 R /K [ 22 0 R 23 0 R 24 0 R 25 0 R ] >>
endobj
22 0 obj
<< /Type /StructElem /S /TD /P 17 0 R /K [ 26 0 R ] >>
endobj

But maybe that’s just for accessibility?
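
For context, those /StructElem objects belong to the file's logical structure tree (tagged PDF), which is optional and sits alongside the drawn page content rather than replacing it, so a file can contain a table with no such tags at all. When the tags are present, the nesting can be read back. The Python sketch below parses only the excerpt quoted above with regular expressions; it is not a general PDF parser, and objects not shown in the excerpt are reported as missing.

```python
import re

# The objects quoted in the post above.
excerpt = """\
15 0 obj
<< /Type /StructElem /S /Table /P 10 0 R /K [ 16 0 R ] >>
endobj
16 0 obj
<< /Type /StructElem /S /TBody /P 15 0 R /K [ 17 0 R 18 0 R 19 0 R 20 0 R 21 0 R ] >>
endobj
17 0 obj
<< /Type /StructElem /S /TR /P 16 0 R /K [ 22 0 R 23 0 R 24 0 R 25 0 R ] >>
endobj
22 0 obj
<< /Type /StructElem /S /TD /P 17 0 R /K [ 26 0 R ] >>
endobj
"""

OBJ = re.compile(r"(\d+) 0 obj\s*<<(.*?)>>\s*endobj", re.S)


def parse_struct_elems(text):
    """Map object number -> structure type (/S) and child references (/K)."""
    elems = {}
    for num, body in OBJ.findall(text):
        tag = re.search(r"/S\s*/(\w+)", body)
        kids = re.search(r"/K\s*\[(.*?)\]", body)
        elems[int(num)] = {
            "tag": tag.group(1) if tag else None,
            "kids": [int(n) for n in re.findall(r"(\d+) 0 R", kids.group(1))] if kids else [],
        }
    return elems


def outline(elems, num, depth=0):
    """Return indented tag names for the subtree rooted at object num."""
    if num not in elems:
        return ["  " * depth + f"(object {num} not in excerpt)"]
    lines = ["  " * depth + (elems[num]["tag"] or "?")]
    for kid in elems[num]["kids"]:
        lines.extend(outline(elems, kid, depth + 1))
    return lines


elems = parse_struct_elems(excerpt)
tree = outline(elems, 15)
```

Printing `"\n".join(tree)` shows the Table, TBody, TR, TD nesting, which does look a lot like an HTML table.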

Eric_Pousse · July 5, 2024, 1:59pm

I work with DynaPDFMBS to extract vector objects from PDF files in my app RealCADD, with good results.
I have run many tests with different PDF files, and I can say that PDF files are not all written as they should be.
For example, the PDF language supports dotted lines.
But some apps break them up into many little lines and spaces when they draw them.
So when you read the PDF file, you get many little lines rather than one dotted line…
The next step for me in RealCADD will be to convert them back into one dotted line, but the same thing happens with circles, rectangles, polygons…
So even if the PDF language did have a table concept, there would be apps that don’t use it to draw tables.
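
A minimal Python sketch of the dotted-line case, to show the kind of cleanup a reader has to do. It is not RealCADD's method. It assumes horizontal segments only, given as (x0, x1, y), and the gap and tolerance values are invented.

```python
def merge_dashes(segments, max_gap=4.0, y_tol=0.1):
    """Merge horizontal segments (x0, x1, y) separated by small gaps.

    Returns (x0, x1, y, piece_count) tuples. A run built from several
    pieces was probably drawn as a dashed or dotted line.
    """
    runs = []
    # Sort by y (rounded, so near-equal values stay together), then by x0.
    for x0, x1, y in sorted(segments, key=lambda s: (round(s[2], 1), s[0])):
        if runs and abs(runs[-1][2] - y) <= y_tol and x0 - runs[-1][1] <= max_gap:
            rx0, rx1, ry, count = runs[-1]
            runs[-1] = (rx0, max(rx1, x1), ry, count + 1)
        else:
            runs.append((x0, x1, y, 1))
    return runs


# Hypothetical input: three short dashes, one long line on the same y,
# and one solid line lower on the page.
segments = [(100, 103, 500), (106, 109, 500), (112, 115, 500),
            (300, 350, 500), (100, 200, 480)]

runs = merge_dashes(segments)
```

The same tolerance-based merging applies to table rules that were drawn as many short strokes.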

Clifford_Antrim · July 5, 2024, 2:03pm

Well, that’s the truth!

Jean-Yves_Pochez · July 5, 2024, 5:20pm

yes the parser has to be very robust …