Text Search with DynaPDF

Daniel_Wilson · January 18, 2018, 9:06am

Hi All,

I am trying to get the coordinates of particular strings in a document. I thought this would be simple, as most PDF viewers offer a search feature that places overlays over every occurrence of the string you searched for. I am essentially trying to implement the same feature using DynaPDF.

I have been digging through the manuals and examples for some time but haven’t been able to find a solution. The closest I got was DYNAPDFSTACKMBS. As I understand it, this is some sort of cursor, but I think it can only select a page at a time.

I’m pretty stuck with this one, any help would be appreciated!

Christian_Schmitz · January 18, 2018, 9:16am

The “Extract text” project shows text extraction in chunks as they are on the page.
You can search in that text and use the coordinates to highlight it.

Brandon_Warlick · January 18, 2018, 8:11pm

Daniel, I just finished some code recently that does the same thing. My goal is to extract the text in a specific area along with the coordinates of the found text. The text extraction works well and after trying multiple methods, I settled on DynaPDFMBS.ExtractText. It does just what I want, including only returning part of a piece of text if not all of it is within the defined area.

However, getting the coordinates of the found text is more of a challenge. I ended up writing my own method by modifying one of Christian’s examples that is using the DynaPDFStackMBS class. I had a hard time figuring out exactly how the example code worked and how it determines which “chunks” of text go together. I understand that it is not an exact science, but whatever code they came up with for the ExtractText function seems to work well.

What I ended up with works but is rough. The text whose coordinates it picks up is not exactly the same as the text that the ExtractText function finds. It is very close, though, and seems to be OK. There are lots of problems with this approach, because it uses two slightly different methods: one for extracting the text and another for finding the coordinates. It would be much, much better if Christian or DynaPDF could add a parameter that lets the ExtractText function return the coordinates of the text it finds. That would make this very easy and consistent.

Christian_Schmitz · January 18, 2018, 8:23pm

Well, if you want to know where text is displayed, you need to use the parse interface, multiply all the matrices while going through the document, keep a stack of state, and then compare text portions to your search strings on the fly.

Maybe I’ll find time to make an example for that someday.
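
For readers following along, the idea of multiplying matrices while keeping a stack of state can be sketched roughly as follows. This is plain Python, not Xojo and not the DynaPDF parse interface; the names GraphicsState, concat and transform are made up for illustration. It assumes the usual PDF convention, where a matrix [a b c d e f] maps (x, y) to (a*x + c*y + e, b*x + d*y + f) and a newly applied matrix is combined in front of the current one.

```python
from typing import List, Tuple

# Hypothetical helper types for illustration; not part of DynaPDF or the MBS plugin.
Matrix = Tuple[float, float, float, float, float, float]

IDENTITY: Matrix = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def concat(m: Matrix, n: Matrix) -> Matrix:
    """Return the matrix that applies m first and n second."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def transform(m: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Map a point from local space to page space."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


class GraphicsState:
    """Tracks the current transformation matrix with save/restore."""

    def __init__(self) -> None:
        self.ctm: Matrix = IDENTITY
        self._stack: List[Matrix] = []

    def save(self) -> None:
        # Corresponds to a "save graphics state" event while parsing.
        self._stack.append(self.ctm)

    def restore(self) -> None:
        # Corresponds to a "restore graphics state" event.
        self.ctm = self._stack.pop()

    def apply(self, m: Matrix) -> None:
        # A new matrix acts before the existing one.
        self.ctm = concat(m, self.ctm)


state = GraphicsState()
state.apply((1, 0, 0, 1, 100, 200))    # translate by (100, 200)
state.save()
state.apply((2, 0, 0, 2, 0, 0))        # scale by 2 inside the saved state
inner = transform(state.ctm, 10, 5)    # position of a text origin at (10, 5)
state.restore()
outer = transform(state.ctm, 10, 5)
```

With the values above, the same local point lands in different places on the page depending on which state is active, which is why every text chunk has to be transformed with the matrix that is current at the moment it is drawn.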

Brandon_Warlick · January 18, 2018, 9:21pm

I did that but could not replicate the results of the ExtractText function.

Since the ExtractText function can find text within coordinates, it must know the coordinates of the text. I don’t see why it could not return them. I am requesting that you consider improving the ExtractText function to optionally return the coordinates of the text.

Christian_Schmitz · January 18, 2018, 9:45pm

Okay, I got a new example for you to try:

https://www.dropbox.com/sh/8tffcwsnjkhxhzo/AAA7S9T7-yv1xJ4u9g2sY27Oa?dl=0

Let me know if it works for you.

Daniel_Wilson · January 18, 2018, 10:38pm

Thanks Brandon! Appreciate you sharing your knowledge with me!

Christian, thanks for the example! I’ll give it a shot.

Daniel_Wilson · January 19, 2018, 1:20am

As far as I understand from the example, I need to go through all the DynaPDFTextRecordWMBS in kerningW of each PDFtext, find whether any of these (or a combination of them) contains my search string, and use DynaPDFTextRecordWMBS.width to find the text’s location inside the PDFtext, which has a known location.

Seems complicated enough! Is this the way? Or am I missing a simple solution?

The DynaPDFTextRecordWMBS records seem quite irregular in length. The string you are searching for might sit completely within a single TextRecord, which would make calculating the exact coordinates impossible. Is this right?

Christian_Schmitz · January 19, 2018, 8:52am

You may need to search over several records.
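
A minimal sketch of what searching over several records could look like, written in Python rather than Xojo. The Record type and find_spans function are hypothetical and only mirror the idea of a record with text and a width; they are not the DynaPDFTextRecordWMBS API. When a match starts or ends inside a record, the sketch spreads the record’s width evenly over its characters, which is only an approximation; exact positions would need per-glyph widths from the font.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Record:
    """Hypothetical stand-in for one text record: its text and advance width."""
    text: str
    width: float


def find_spans(records: List[Record], start_x: float, query: str) -> List[Tuple[float, float]]:
    """Return an (x_start, x_end) pair for each occurrence of query.

    The records are joined into one string so that a match may span
    several records. Character positions inside a record are estimated
    by dividing the record's width evenly among its characters.
    """
    joined = ""
    starts: List[float] = []
    ends: List[float] = []
    x = start_x
    for rec in records:
        if rec.text:
            step = rec.width / len(rec.text)
            for i in range(len(rec.text)):
                starts.append(x + i * step)
                ends.append(x + (i + 1) * step)
            joined += rec.text
        x += rec.width  # advance even for records without text

    hits: List[Tuple[float, float]] = []
    if not query:
        return hits
    pos = joined.find(query)
    while pos != -1:
        hits.append((starts[pos], ends[pos + len(query) - 1]))
        pos = joined.find(query, pos + 1)
    return hits


records = [Record("Hel", 30.0), Record("lo Wo", 50.0), Record("rld", 30.0)]
spans = find_spans(records, 100.0, "lo W")
```

The resulting x range is in the same units as the record widths and is relative to the text’s own baseline, so it still has to be mapped to page coordinates with the matrix that was active for that text.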

Christian_Schmitz · January 19, 2018, 9:17am

I fixed a few more issues.
The final one is in the dropbox and will be included in the next prerelease.

Daniel_Wilson · January 21, 2018, 11:56pm

Thanks! Could you please repost the link?

I’m getting a 404:

That file isn’t here anymore. Someone might’ve deleted the file or disabled the link.

Christian_Schmitz · January 22, 2018, 6:58am

It should be included for 18.0:

https://www.monkeybreadsoftware.com/xojo/download/plugin/Prerelease/

It is in the latest beta now and will soon be in the final release.