Extracting Text from a PDF

arron_lee · March 20, 2014, 3:49am

Here is some code I have used to extract text from PDF files:

[code]Imports Yiigo.PDF
Imports Yiigo.Image

' Returns the bounding boxes for a run of text on one page
' (index and count presumably select the character range).
Public Function ObtainTextBBox(doc As PDFDoc, pageNum As Integer, index As Integer, count As Integer) As Quadrilateral()
    Dim page As PDFPage = DirectCast(doc.ObtainDocPage(pageNum), PDFPage)
    Return page.ObtainBoxes(index, count)
End Function

' Searches one page for txt, starting at position index.
Public Sub Search(doc As PDFDocument, pageNum As Integer, index As Integer, txt As String)
    Dim p As PDFPage = DirectCast(doc.ObtainDocPage(pageNum), PDFPage)
    Dim res As PDFSearchResult = p.Search(index, txt)
    For Each idxAndCntPair As Integer() In res.Results
        ' Going by its name, each pair holds the index and count of one match;
        ' process it here (for example, pass it to ObtainTextBBox).
    Next
End Sub[/code]
And here are the relevant tutorials about how to extract text from PDF files. I hope it helps. Good luck.

Best regards,
Arron

Norman_P · March 20, 2014, 4:36am

On OS X there are a few free options already installed:

an Automator action you can run at the command line
mdimport (oddly enough, it spits this out)
see http://hintsforums.macworld.com/showthread.php?t=126913

Grant_Singleton · January 2, 2015, 1:29am

I currently use the CLI utility PDFtk successfully on Windows to extract text from a text-only PDF. It works for me… but I’ll soon need a Mac solution to start work on a new cross-platform project.

Cheers
Grant

Bob_Coleman · January 2, 2015, 1:37am

PDFtk Server runs on Windows, OS X and Linux: https://www.pdflabs.com/tools/pdftk-server/

Norman_P · January 2, 2015, 3:49am

mdimport - just google for “mdimport extract text pdf” and you’ll find a few links

They may be adequate

Grant_Singleton · January 2, 2015, 5:19am

My particular purpose requires the PDF to be uncompressed first, as most are encoded with the FlateDecode and ASCII85Decode filters. I then read it directly, because I need to know where on the page each piece of text appears in order to import it correctly. Once the PDF is uncompressed into pure ASCII text, I read the whole file into an array and create custom class objects for the PDF text, each with x & y integers for the page position.

Obviously not what everyone wants or needs, but works well for me.
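
For anyone curious what that kind of parsing can look like, here is a rough Python sketch, not the code actually used above. It assumes the page content stream is already uncompressed (for instance with PDFtk's uncompress operation) and has been read into a string. The names TextItem and extract_text_items are made up for illustration. It only tracks the Td, TD and Tm positioning operators and the Tj and TJ show operators; it ignores fonts, scaling, the graphics-state matrix, hex strings and nested parentheses, so treat the coordinates as approximate.

```python
import re
from dataclasses import dataclass


@dataclass
class TextItem:
    """Hypothetical container: one shown string and its rounded page position."""
    x: int
    y: int
    text: str


# A literal PDF string such as (Hello) or (a \( b); nested parentheses are not handled.
STRING = r"\((?:\\.|[^\\()])*\)"

# One token is a literal string, a TJ array, a /Name, a number, or an operator.
TOKEN = re.compile(
    STRING
    + r"|\[[^\]]*\]"           # [array], as used by TJ (breaks if a string contains ']')
    + r"|/[^\s/\[\]()<>]+"     # /Name, e.g. a font resource name
    + r"|[-+]?\d*\.?\d+"       # number
    + r"|[A-Za-z'\"*]+"        # operator such as BT, Td, Tm, Tj, T*
)


def _unescape(literal: str) -> str:
    """Strip the outer parentheses and undo the \\( \\) \\\\ escapes only."""
    return re.sub(r"\\([()\\])", r"\1", literal[1:-1])


def extract_text_items(content: str) -> list:
    """Collect shown strings together with the current text-line origin."""
    items = []
    operands = []
    x = y = 0.0
    for match in TOKEN.finditer(content):
        tok = match.group(0)
        if tok[0] in "([/+-.0123456789":
            operands.append(tok)          # operand: keep it until an operator arrives
            continue
        if tok == "BT":                   # a new text object resets the position
            x = y = 0.0
        elif tok in ("Td", "TD") and len(operands) >= 2:
            x += float(operands[-2])      # offset relative to the current line start
            y += float(operands[-1])
        elif tok == "Tm" and len(operands) >= 6:
            x = float(operands[-2])       # the last two matrix entries are the translation
            y = float(operands[-1])
        elif tok in ("Tj", "TJ") and operands:
            parts = re.findall(STRING, operands[-1])
            text = "".join(_unescape(p) for p in parts)
            items.append(TextItem(round(x), round(y), text))
        operands.clear()                  # every operator consumes its operands
    return items


if __name__ == "__main__":
    sample = "BT /F1 12 Tf 72 700 Td (Hello) Tj 0 -14 Td [(Wor) -20 (ld)] TJ ET"
    for item in extract_text_items(sample):
        print(item.x, item.y, item.text)
```

Each item records where its text line starts, not where each glyph ends, so several strings shown on one line without a repositioning operator all get the same x. Getting real horizontal positions would need the font widths as well.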

Cheers
Grant

Christian_Schmitz · January 2, 2015, 9:17am

If you want the best, how about looking at our DynaPDF Pro plugin?