Using DynaPDF parser to find characters

Please check out the DynaPDFParserMBS class in the MBS Xojo DynaPDF Plugin. This class allows you to:

  • Parse a page
  • Extract text
  • Find text
  • Replace text
  • Find characters
  • Delete text
  • Write changes back to page

You can limit the search to a part of the page or search the whole page, and you can set options such as whether the text search is case insensitive.

Today we want to show you how you can identify the exact position of any character in a PDF. The example draws a box around every character, even for mirrored or rotated text.

Let us show the code for this. You may review the example project Text Positions with parser and see where we load the PDF. Once it is loaded, we initialize the DynaPDFParserMBS object. We pass kstMatchAlways here, so the parser does not look for a particular text but reports the position of every character:

	// find every character on page 1 and remember its position
	// p is the PDF we loaded earlier
	Dim parser As New DynaPDFParserMBS(p)
	Dim area As DynaPDFRectMBS = Nil // whole page
	Dim SearchType As Integer = DynaPDFParserMBS.kstMatchAlways
	Dim ContentParsingFlags As Integer = DynaPDFParserMBS.kcpfEnableTextSelection
	
	If parser.ParsePage(1, ContentParsingFlags) Then
		Dim index As Integer = 0
		
		// start the search; with kstMatchAlways we pass no search text
		Dim found As Boolean = parser.FindText(area, SearchType, "")
		While found
			
			// store text, rectangle and corner points of the current selection
			Dim r As DynaPDFRectMBS = parser.SelBBox
			Dim t As New PDFText
			t.Text = parser.SelText
			t.rect = r
			t.index = index
			t.points = parser.SelBBox2
			
			texts.Append t
			index = index + 1
			
			// True = continue the search after the last match
			found = parser.FindText(area, SearchType, "", True)
		Wend
	End If

The loop runs while we have more text. For each character, we get the selected text (SelText), the bounding rectangle (SelBBox) and the bounding box as an array of four points (SelBBox2). You can of course just use the rectangle, but that won’t handle rotated text. We continue the loop by calling FindText again and passing True to continue the search.

In the paint event of the window, we draw the PDF page first. Then we loop over the found text pieces and show each character surrounded with the box drawn from the points we got:

	g.ForeColor = &c00FF00
	For Each t As PDFText In texts
		Dim points() As DynaPDFPointMBS = t.points
		
		// connect the four corners: 0-1, 1-2, 2-3 and 3-0
		For i As Integer = 0 To 3
			Dim a As DynaPDFPointMBS = points(i)
			Dim b As DynaPDFPointMBS = points((i + 1) Mod 4)
			g.DrawLine a.X * factor, a.Y * factor, b.X * factor, b.Y * factor
		Next
	Next

As shown, you can find out exactly where each character is. You may use the DeleteText function to precisely cut text and remove individual characters from the PDF page. Or you can annotate the PDF page, for example by adding WebLinks to specific words once you know the surrounding rectangle.
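
To get such a surrounding rectangle for a word, you could combine the corner points of its characters. The following is only a minimal sketch: BoundsOfTexts is a hypothetical helper we made up for illustration, not part of the plugin or the example project. It assumes the PDFText class from the example with its points property and returns the values in the same coordinate system the parser reported:

	// Hypothetical helper: smallest axis-aligned rectangle around
	// the corner points of items(first) to items(last).
	// Returns False if the range contains no points.
	Function BoundsOfTexts(items() As PDFText, first As Integer, last As Integer, ByRef minX As Double, ByRef minY As Double, ByRef maxX As Double, ByRef maxY As Double) As Boolean
		Dim hasPoint As Boolean = False
		
		For i As Integer = first To last
			Dim points() As DynaPDFPointMBS = items(i).points
			
			For Each pt As DynaPDFPointMBS In points
				If hasPoint Then
					// grow the rectangle to include this corner
					minX = Min(minX, pt.X)
					minY = Min(minY, pt.Y)
					maxX = Max(maxX, pt.X)
					maxY = Max(maxY, pt.Y)
				Else
					// first corner initializes the rectangle
					minX = pt.X
					minY = pt.Y
					maxX = pt.X
					maxY = pt.Y
					hasPoint = True
				End If
			Next
		Next
		
		Return hasPoint
	End Function

For rotated text the axis-aligned rectangle is larger than the text itself, so for such words you may prefer to keep working with the individual corner points.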

Please try the example project and let us know what questions you have. The SelBBox2 and SelText properties were added in v24.1 because customers asked for them.

Hello Christian,

Is it also possible to parse vector objects?

Yes, you can use the DynaPDFParseInterfaceMBS class to get all kinds of drawing commands as events.