Sample PDF Code - # Of Pages

Tom_Dixon · February 14, 2014, 5:52pm

Does anyone here have any sample code that would allow me to retrieve a count of the number of pages in a PDF file?

Or, if anyone has any classes available that will return basic info on a PDF, such as page size, number of pages, color space, etc., I would be very appreciative. I’m working on a simple PDF workflow automation app for internal use only, and I would prefer to be able to retrieve this info occasionally without having to purchase plugins that are designed mainly to create PDFs.

DaveS · February 14, 2014, 6:17pm

Page count … look for /Count, which had better match the number of object references in /Kids (at least when the page tree is flat, as it is here)

4 0 obj
<<
   /Type /Pages
   /Kids [6 0 R]
   /Count 1
   /Rotate 0
>>
endobj

Page size in POINTS (divide by 72 to get inches; the 612 x 792 MediaBox below is 8.5 x 11 in)

6 0 obj
<<
   /Type /Page /Parent 4 0 R
   /MediaBox [0 0 612 792]
   /CropBox [36 36 576 756]
   /Resources 5 0 R
   /Contents 7 0 R
>>
endobj

Color space (for a specific JPEG image)

7 0 obj
<< /Length 3113 >>
stream
q
1 0 0 1 0 0 cm
16 0 0 16 136 656 cm
BI
/Width 16
/Height 16
/ColorSpace /DeviceRGB
/BitsPerComponent 8
/Filter /DCTDecode

and while we are at it

1 0 obj
<<
   /Author (Not Provided)
   /Title (Not Provided)
   /Subject(My Application.debug)
   /CreationDate (D:20131228144141-00'00')
   /ModDate (D:20131228144141-00'00')
   /Producer (SimplePDF for XOJO)
   /Creator (My Application.debug)
>>
endobj

NOTE: the number pairs (e.g. 4 0 obj) are specific to a particular PDF and may or may not be the same in any other PDF document.
Also, the author may have chosen to obfuscate things by placing each parameter in a child object, making parsing more difficult.
And lastly… the entire document may be encrypted, leaving little if any of this information in clear text.

Tom_Dixon · February 14, 2014, 7:04pm

Thanks Dave. I appreciate the help. I guess I should have been a little clearer. I don’t have a clue how to use what you just provided. I seldom have a need to crack open a file other than a text file so I was hoping for an example on what I do from the point I get the PDF as a folder item to read this data using the format you provided.

DaveS · February 14, 2014, 7:20pm

You need to be able to parse the contents of the PDF file and find items like those I showed above…
It is sometimes a trivial task… but 90% of the time it is near impossible without some complex coding on your part.

The examples I provided are from clear text PDF files with everything in a simple logical configuration. But most times you won’t find that… You will find encrypted PDF contents, and/or those items spread all over the internals of the document. And without an intimate knowledge of how PDF works, you will never be able to unravel it.

I spent months studying the PDF whitepapers in order to create my SimplePDF class… and that class MAKES PDF files, it does not decipher them.

Russ_Tyndall · February 14, 2014, 8:15pm

[quote=65336:@Tom Dixon]Does anyone here have any sample code that would allow me to retrieve a count of the number of pages in a PDF file?
[/quote]

What platform?

Tom_Dixon · February 14, 2014, 8:16pm

OK. That helps a LOT! I have the luxury that all these PDFs are created by the same application internally. I tried opening a couple of examples in WordPad and found <</Count 2/Kids[13 0 R 14 0 R]/Type/Pages>> in one file that had 2 pages and <</Count 4/Kids[5 0 R 6 0 R 7 0 R 8 0 R]/Type/Pages>> in another file that had 4 pages. For my purposes at present, I presume I can open them in Xojo as a text file, read until I find /Count in the current line, and capture the count that follows. Since they should be consistent, do you see any glaring problems with that?

Tom_Dixon · February 14, 2014, 8:17pm

Russ this will be on Windows.

Russ_Tyndall · February 14, 2014, 8:29pm

I have a class that uses XPDF to pull out some info about PDFs (size, page count, etc…) but I have only used it on OS X. It looks doable on Windows.

XPDF is GPL, so you have to respect that licensing structure.

Frankly, I think Dave’s approach is better simply because it has less reliance on third parties.

Christian_Schmitz · February 14, 2014, 8:56pm

If you need a full-featured professional PDF library, we have the MBS Xojo DynaPDF Plugin.

Tom_Dixon · February 14, 2014, 9:45pm

This seems to work for my purposes

Function GetPageCount(pdfFile As FolderItem) As String
  Dim rowFromFile, strAry(), countAry() As String
  Dim FilePattern As RegEx
  Dim FilePatternMatch As RegExMatch
  Dim t As TextInputStream

  // Match only lines that begin with "<</Count", as seen in these files
  FilePattern = New RegEx
  FilePattern.SearchPattern = "^<</Count.*$"

  Try
    t = TextInputStream.Open(pdfFile)
    t.Encoding = Encodings.UTF8
    While Not t.EOF
      rowFromFile = t.ReadLine
      FilePatternMatch = FilePattern.Search(rowFromFile)
      If Not (FilePatternMatch = Nil) Then
        // "<</Count 2/Kids[...]" -> strAry(1) = "Count 2" -> countAry(1) = "2"
        strAry = Split(rowFromFile, "/")
        countAry = Split(strAry(1), " ")
        t.Close
        Return countAry(1)
      End If
    Wend
    t.Close
  Catch e As IOException
    // t is still Nil if Open itself failed
    If t <> Nil Then t.Close
    MsgBox("Error accessing file.")
  End Try
End Function

DaveS · February 14, 2014, 10:51pm

You are relying on the fact the COUNT will always be the first item in the PAGES dictionary entry, and that the whitespace preceding the entry will be constant. This may or may not be true…
Even the same software may place those items in different locations based on how other items affect the PDF internal structure.
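
One way to depend less on /Count being the first entry is to search the whole file for /Count anywhere and keep the largest value, since the root /Pages node counts every page beneath it. The following is only a rough sketch, not a real PDF parser: the function name PageCountFromPDF is made up for illustration, and it assumes the PDF is unencrypted, that the page tree is stored as plain text rather than inside a compressed object stream, and that the file has no bookmarks (outline dictionaries carry their own, unrelated /Count entries).

```xojo
Function PageCountFromPDF(pdfFile As FolderItem) As Integer
  // Hypothetical helper (not from this thread).
  // Returns 0 if the file cannot be read or no /Count entry is found.
  Dim data As String
  Try
    // Read the file as raw bytes; ISOLatin1 maps every byte to one character
    Dim bs As BinaryStream = BinaryStream.Open(pdfFile, False)
    data = bs.Read(bs.Length, Encodings.ISOLatin1)
    bs.Close
  Catch e As IOException
    Return 0
  End Try

  // Find "/Count <digits>" anywhere, regardless of position or spacing
  Dim re As New RegEx
  re.Options.CaseSensitive = True   // PDF names are case sensitive
  re.SearchPattern = "/Count\s*(\d+)"

  // Assumption: the largest /Count belongs to the root /Pages node
  Dim best As Integer = 0
  Dim m As RegExMatch = re.Search(data)
  While m <> Nil
    Dim n As Integer = Val(m.SubExpressionString(1))
    If n > best Then best = n
    m = re.Search   // continue searching after the previous match
  Wend
  Return best
End Function
```

Because the files in question are known to have between 2 and 4 pages, the caller can treat any result outside that range (including 0) as "could not determine" rather than trusting it.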

Tom_Dixon · February 14, 2014, 10:59pm

[quote=65394:@Dave S]You are relying on the fact the COUNT will always be the first item in the PAGES dictionary entry, and that the whitespace preceding the entry will be constant. This may or may not be true…
Even the same software may place those items in different locations based on how other items affect the PDF internal structure.[/quote]
Yes, I’m aware of that. I’m also not returning anything from the method if nothing is found. This was quick and dirty to see if it worked. Knowing how these PDFs are created, and that they are never less than 2 pages or more than 4, I can likely rely on the dictionary entry being consistent. I’ll give it some more thought, but if you have any suggestions I’m open to hearing them.
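
For the page-size part of the original question, a similar text search can be pointed at the /MediaBox entry shown earlier in the thread. This is a sketch under the same assumptions (unencrypted file, page objects stored as plain text); the function name FirstMediaBoxInches is hypothetical, and it only reports the first /MediaBox it finds, which is enough only if every page in the file is the same size. It also ignores /CropBox and /Rotate.

```xojo
Function FirstMediaBoxInches(pdfFile As FolderItem, ByRef widthIn As Double, ByRef heightIn As Double) As Boolean
  // Hypothetical helper (not from this thread).
  // Returns False if the file cannot be read or no /MediaBox is found.
  Dim data As String
  Try
    Dim bs As BinaryStream = BinaryStream.Open(pdfFile, False)
    data = bs.Read(bs.Length, Encodings.ISOLatin1)
    bs.Close
  Catch e As IOException
    Return False
  End Try

  // /MediaBox [llx lly urx ury], four numbers separated by whitespace
  Dim re As New RegEx
  re.Options.CaseSensitive = True
  re.SearchPattern = "/MediaBox\s*\[\s*([-+\d.]+)\s+([-+\d.]+)\s+([-+\d.]+)\s+([-+\d.]+)\s*\]"

  Dim m As RegExMatch = re.Search(data)
  If m = Nil Then Return False

  // Values are in points; 72 points = 1 inch
  widthIn = Abs(Val(m.SubExpressionString(3)) - Val(m.SubExpressionString(1))) / 72
  heightIn = Abs(Val(m.SubExpressionString(4)) - Val(m.SubExpressionString(2))) / 72
  Return True
End Function
```

For the /MediaBox [0 0 612 792] example above, this would give a width of 8.5 and a height of 11.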