Extracting Text from a PDF

Here is some code I have used to extract text from PDF files:

[code]Imports Yiigo.PDF
Imports Yiigo.Image

' Get the bounding boxes for a range of text on a page.
Public Function ObtainTextBBox(doc As PDFDoc, pageNum As Integer, index As Integer, count As Integer) As Quadrilateral()
    Dim page As PDFPage = DirectCast(doc.ObtainDocPage(pageNum), PDFPage)
    Dim tmp As Quadrilateral() = page.ObtainBoxes(index, count)
    Return tmp
End Function

' Search a page for txt, starting at index.
Public Sub Search(doc As PDFDocument, pageNum As Integer, index As Integer, txt As String)
    Dim p As PDFPage = DirectCast(doc.ObtainDocPage(pageNum), PDFPage)
    Dim res As PDFSearchResult = p.Search(index, txt)
    For Each idxAndCntPair As Integer() In res.Results
        ' Judging by the name, each entry is an (index, count) pair for one match; handle it here.
    Next
End Sub[/code]
And here are the relevant tutorials on how to extract text from PDF files. I hope this helps. Good luck.

Best regards,
Arron

On OS X there are a few free options already installed.

I currently use the CLI utility PDFTK successfully on Windows to extract text from a text-only PDF. It works for me, but I’ll soon need a Mac solution to start work on a new cross-platform project.

Cheers
Grant

PDFtk Server is available for Windows, OS X and Linux: https://www.pdflabs.com/tools/pdftk-server/
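
For anyone who wants to script it: PDFtk has an `uncompress` output option that rewrites the page streams without compression, so the text operators become readable in a plain text editor. A minimal Python wrapper might look like the sketch below. It assumes `pdftk` is on the PATH; the function names and file names are made up for illustration.

```python
import subprocess


def build_uncompress_cmd(src: str, dst: str) -> list[str]:
    """Build the argument list for: pdftk <src> output <dst> uncompress"""
    return ["pdftk", src, "output", dst, "uncompress"]


def uncompress_pdf(src: str, dst: str) -> None:
    """Run pdftk (assumed to be installed) and raise if it exits with an error."""
    subprocess.run(build_uncompress_cmd(src, dst), check=True)
```

Calling `uncompress_pdf("in.pdf", "out.pdf")` should then leave an uncompressed copy in `out.pdf`, on any of the three platforms.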

mdimport: just google “mdimport extract text pdf” and you’ll find a few links.

They may be adequate.

My particular purpose requires the PDF to be uncompressed first, since most are encoded with FlateDecode and ASCII85, and then read, because I need to know where on the page the text appears in order to import it correctly. Once it is uncompressed into plain ASCII text, I read the whole PDF into an array and create custom class objects for the PDF text, with x and y integers for the page position.

Obviously not what everyone wants or needs, but it works well for me.
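
For readers who want to try the same idea, here is a rough Python sketch of it. It is not my production code, just an illustration under some loud assumptions: the stream uses the filter chain ASCII85 then Flate, every text run is positioned with `Td` or `Tm` and shown with a simple `(…) Tj`, and the page has no extra scaling or rotation. `TextRun`, `decode_stream` and `extract_runs` are hypothetical names.

```python
import base64
import re
import zlib
from dataclasses import dataclass


@dataclass
class TextRun:
    """A piece of text plus the page position where it starts (hypothetical class)."""
    x: int
    y: int
    text: str


# A literal string "(...)" with backslash escapes, or any run of non-space characters.
TOKEN = re.compile(rb"\((?:\\.|[^\\()])*\)|[^\s()]+")
NUMBER = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)")


def decode_stream(raw: bytes) -> bytes:
    """Undo ASCII85 first, then inflate the Flate (zlib) data."""
    data = raw.strip()
    if data.startswith(b"<~"):   # optional start marker
        data = data[2:]
    if data.endswith(b"~>"):     # PDF end-of-data marker
        data = data[:-2]
    return zlib.decompress(base64.a85decode(data))


def _num(tok: bytes) -> float:
    return float(tok.decode("ascii"))


def _is_operand(tok: bytes) -> bool:
    return tok[:1] in (b"(", b"/") or NUMBER.fullmatch(tok) is not None


def extract_runs(content: bytes) -> list[TextRun]:
    """Collect (x, y, text) for each Tj in an uncompressed content stream.

    Simplifications: only Td, Tm and Tj are understood, and the position is
    the start of the current line, not advanced by the width of shown text.
    """
    runs: list[TextRun] = []
    operands: list[bytes] = []
    x = y = 0.0
    for match in TOKEN.finditer(content):
        tok = match.group()
        if _is_operand(tok):
            operands.append(tok)
            continue
        if tok == b"BT":                               # new text object: reset position
            x = y = 0.0
        elif tok == b"Td" and len(operands) >= 2:      # move relative to line start
            x += _num(operands[-2])
            y += _num(operands[-1])
        elif tok == b"Tm" and len(operands) >= 6:      # absolute position (e, f of the matrix)
            x, y = _num(operands[-2]), _num(operands[-1])
        elif tok == b"Tj" and operands and operands[-1][:1] == b"(":
            body = re.sub(rb"\\([()\\])", rb"\1", operands[-1][1:-1])
            runs.append(TextRun(round(x), round(y), body.decode("latin-1")))
        operands.clear()                               # operands belong to one operator only
    return runs
```

Usage is `extract_runs(decode_stream(raw_stream_bytes))`, which gives a list of `TextRun` objects to import from. Real PDFs also use `TJ` arrays, hex strings, octal escapes and font encodings, so treat this only as a starting point.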

Cheers
Grant

If you want the best, how about looking at our DynaPDF Pro plugin?