Extracting Text from a PDF

Is there a (free) way to extract text from a PDF from a Xojo app?

Cross-platform would be great, but my immediate need is for Windows.

A piece of software generates PDF reports from sample data acquired by an instrument, and it can be 30-100 files at a time… but the file names are not meaningful… I have to open each file manually to see what sample it is from and rename it with the sample name before I can send out the reports. Tedious to say the least.

If these were text files it would be trivial to write such a utility… but looking at the PDFs in TextWrangler it looks like the text is encoded somehow.

Thanks,

  • Karen

Is there an MBS plugin for this?

I’d start here… http://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file

Yes. http://www.monkeybreadsoftware.net/dynapdf-dynapdfmbs-method3.shtml

[quote]DynaPDFMBS.ExtractPageText as String
method, DynaPDF, MBS DynaPDF Plugin (dynapdf3), class DynaPDFMBS,
Plugin version: 12.5, Mac: Yes, Win: Yes, Linux: Yes, Console & Web: Yes, Feedback.

Function: Extracts text of current open page.
Notes:
This is a convenience function so you don’t need to use DynaPDFStackMBS class yourself.
Returns the text of the page. Use EditPage() to open a page and then EndPage() to close it.

If you have problems with Asian characters, please make sure you use SetCMapDir and load the CMAPs.[/quote]

DynaPDF will not just give you the raw text commands in the PDF. It supports a wide range of encoding and compression options, and it can translate font glyphs back to actual characters using CMap files.
We also have PDFKit functions on Mac to extract text, but they sometimes fail with Asian characters. DynaPDF handles those.
And you have someone you can send problem PDFs to, so that bugs in the library get fixed.

Karen,

If Acrobat Reader is present on the client machine, you could do that with inter-application communication (instantiating a PDF object, loading the file and copying the text).
I did that years ago (with Delphi that is). This site is THE site for everything PDF related and has many code examples in VB.

To get you started, read this document about Acrobat and IAC and this document about the IAC API.

Thanks all.

I don’t have the DynaPDF plugin, and for a utility just for my own use I can’t justify the cost. I also don’t have time to figure out from scratch how to extract text from a PDF and code the whole thing in Xojo.

I’ll take a look at the Adobe IAC approach and see if I CAN figure out how to use it in Xojo…

But if anyone knows of a free DLL that I can use from Xojo to extract the text, I would appreciate it.

Thanks,

  • Karen

You can get a CLI called pdftohtml that will convert a PDF to an HTML or XML file. You’d run it through a Shell, then use regular Xojo code to extract the information you need. A quick search revealed this:

http://www.addictivetips.com/windows-tips/convert-pdf-files-to-html-format-in-windows-mac-and-linux/

I have used this on my Mac, having installed it through MacPorts, but you can probably include it in your app’s folders so you always know where to find it. I just don’t know what, if any, dependencies it has.
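As a rough illustration of the "regular Xojo code" step: once the converted text is in a String, picking out one labelled value is ordinary string handling. This is only a sketch in classic (pre-API 2.0) Xojo. The function name, the parameter names and the "Sample ID:" prefix in the usage lines are all made up for illustration, since the thread does not say how the sample is labelled in the reports.

[code]Function FindLabelledValue(pageText As String, prefix As String) As String
  // Hypothetical helper: returns whatever follows the first line that
  // starts with prefix, or "" if no such line exists.
  // Normalise line endings first so Split behaves the same on every platform.
  Dim rows() As String = Split(ReplaceLineEndings(pageText, EndOfLine.UNIX), EndOfLine.UNIX)

  For Each row As String In rows
    Dim t As String = Trim(row)
    // Xojo's = on Strings is case-insensitive, so "sample id:" matches too
    If Left(t, Len(prefix)) = prefix Then
      Return Trim(Mid(t, Len(prefix) + 1))
    End If
  Next

  Return ""
End Function

// Usage (reportText and pdfFile are assumed to exist already):
// Dim sample As String = FindLabelledValue(reportText, "Sample ID:")
// If sample <> "" Then pdfFile.Name = sample + ".pdf"
[/code]

If the sample name can contain characters that are not legal in a file name, it would need cleaning up before the rename.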

[quote=35468:@Kem Tekinay]You can get a CLI called pdftohtml that will convert a PDF to an HTML or XML file. You’d run it through a Shell, then use regular Xojo code to extract the information you need. A quick search revealed this:

http://www.addictivetips.com/windows-tips/convert-pdf-files-to-html-format-in-windows-mac-and-linux/
[/quote]

Thanks Kem,

That got me on the right track and I found pdftotext CLI executables for Mac and Windows (Linux as well, but I don’t do Linux) that I can easily use through the shell… the component code is GPL but I won’t be distributing this so it does not matter…

But if I did find a use for these that I did want to distribute, what does GPL require?

-Karen

[quote=35492:@Karen Atkocius]Thanks Kem,

That got me on the right track and I found pdftotext CLI executables for Mac and Windows (Linux as well, but I don’t do Linux) that I can easily use through the shell… the component code is GPL but I won’t be distributing this so it does not matter…

But if I did find a use for these that I did want to distribute, what does GPL require?

-Karen[/quote]

Usually people just make sure there is a license file for the program in question with a link to the source online. That takes care of the requirement of making the source available.

GPL pieces in commercial software can lead to someone later suing you to release the source code of your app under the GPL, too.
That’s the reason we try to avoid GPL software when building commercial apps.

True, but my understanding is that Karen is talking about executing a separate utility. That could be replaced by any tool that could do the same thing.

I am executing it in a shell… The only thing in my code that would be specific to pdftotext would be the actual command line call syntax as the parameters for another package would be different.
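A minimal sketch of what such a shell call could look like, with the tool-specific part confined to a single line. It is written against the classic (pre-API 2.0) framework. The function name and the toolPath parameter are hypothetical, and FolderItem.ShellPath is assumed to give a path that is safe to put on the command line.

[code]Function PDFToText(pdfFile As FolderItem, toolPath As String) As String
  // toolPath is a hypothetical parameter: the path to the pdftotext
  // executable, already quoted or escaped if it contains spaces.
  If pdfFile = Nil Or Not pdfFile.Exists Then Return ""

  Dim sh As New Shell
  sh.TimeOut = -1 // on Windows, wait for the tool instead of timing out

  // The only tool-specific line: "-enc UTF-8" selects the output encoding,
  // and "-" as the output file name sends the text to stdout.
  sh.Execute(toolPath + " -enc UTF-8 " + pdfFile.ShellPath + " -")

  If sh.ErrorCode <> 0 Then Return ""

  // Shell output carries no encoding information, so declare it as UTF-8
  Return DefineEncoding(sh.Result, Encodings.UTF8)
End Function
[/code]

Swapping in a different converter would then only mean changing the Execute line.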

I wrote a nifty little app that extracts a single page or a whole document from a PDF. It also turns any recordset into a PDF and I have code to turn any ListBox into a PDF. Have fun!

Thanks to MBS…

@David: that’s without source code :frowning:

Do these solutions work with encrypted/compressed PDFs as well? Extracting text from an uncompressed PDF is quite easy.

Do you mean the text?

I use the command line tool pdftotext on Mac and Windows, called from Xojo code, for that now. It’s GPL, but for use in the shell I don’t think it’s an issue.

If you are doing it with Xojo code I would love to see it! :wink:

This I can do with pure Xojo code with the help of Asher Dunn’s open source classes.

Some links that might be useful:

Xpdf: http://www.foolabs.com/xpdf/about.html
PDFtk: http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
Ghostscript: http://www.ghostscript.com/

Yes. Apple even provides an Automator script to extract every line of text from a PDF into a text file. I use DynaPDF to count the pages, show a slider and allow the user to slide to a single page, view the extracted text in a TextArea and copy/paste or save it.

I cannot provide all the code since it holds serial numbers, but this is what I use to extract a single page of text (or the whole lot) from a PDF.

[code]Function getPDFtoTextWAD(pdf As myDynaPDFMBS, PDFFile As FolderItem, pageNumber As Integer) As String
  'Returns the text of one page, or of every page when pageNumber < 1
  Dim tempInt, tempInt2, ImportFlags As Integer
  Dim tempReturn As String

  if pdf = nil then Return ""
  if PDFFile = nil or not PDFFile.Exists then Return ""

  'Import the existing file into a new in-memory PDF
  Call pdf.CreateNewPDF(nil)
  ImportFlags = BitwiseOr(pdf.kifContentOnly, pdf.kifImportAsPage)
  Call pdf.SetImportFlags(ImportFlags)

  if pdf.OpenImportFile(PDFFile, pdf.kptOpen, "") <> 0 then Return ""

  Call pdf.ImportPDFFile(1, 1.0, 1.0)
  if pageNumber > pdf.GetPageCount then Return ""

  select case pageNumber
  case Is < 1 'extract all pages
    tempInt2 = pdf.GetPageCount
    for tempInt = 1 to tempInt2
      if pdf.EditPage(tempInt) then
        tempReturn = tempReturn + pdf.ExtractPageText
        'PDFToTextInsertFormFeeds is a Boolean defined elsewhere in my project
        if PDFToTextInsertFormFeeds and tempInt <> tempInt2 then tempReturn = tempReturn + Chr(12) 'add a Form Feed character (ASCII 12), if not the final page
        Call pdf.EndPage
      end if
    next
  case else 'extract a single page
    if pdf.EditPage(pageNumber) then
      tempReturn = pdf.ExtractPageText
      Call pdf.EndPage
    else
      Return ""
    end if
  end select

  Return tempReturn

Exception err
  if not commonWAD.getHandleExceptionWAD(err, "Method: " + CurrentMethodName) then Raise err

End Function
[/code]
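For anyone trying the function above, a call might look roughly like this. It is only a sketch: myDynaPDFMBS is the poster's own subclass, any license setup it needs is not shown, and the variable names are made up for illustration.

[code]Dim pdf As New myDynaPDFMBS
Dim f As FolderItem = GetOpenFolderItem("") 'let the user pick a PDF
If f <> Nil Then
  'A pageNumber below 1 asks for the text of every page
  Dim allText As String = getPDFtoTextWAD(pdf, f, 0)
  MsgBox(allText)
End If
[/code]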