Does anyone here have any sample code that would allow me to retrieve a count of the number of pages in a PDF file?
Or, if anyone has any classes available that will return basic info on a PDF, such as page size, number of pages, color space, etc., I would be very appreciative. I’m working on a simple PDF workflow automation app for internal use only, and being able to retrieve this info occasionally without having to purchase plugins designed to create PDFs would be preferable.
Page count … look for /Count, which had better match the number of object references in /Kids (at least when the page tree is flat)
4 0 obj
/Kids [6 0 R]
Page size in POINTS (divide by 72 to get inches)
6 0 obj
/Type /Page /Parent 4 0 R
/MediaBox [0 0 612 792]
/CropBox [36 36 576 756]
/Resources 5 0 R
/Contents 7 0 R
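As a rough illustration of the page-size arithmetic, here is a minimal Xojo sketch that pulls the four MediaBox numbers out of a line like the one above and converts them to inches. The function name MediaBoxInches is made up for this example, and it assumes the box is written as four plain numbers in clear text, which many real PDFs will not guarantee.

```xojo
Function MediaBoxInches(line As String) As String
  // Hypothetical helper: expects text such as "/MediaBox [0 0 612 792]".
  // Returns "" if no MediaBox with four plain numbers is found.
  Dim re As New RegEx
  re.SearchPattern = "/MediaBox\s*\[\s*([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)\s*\]"
  Dim m As RegExMatch = re.Search(line)
  If m = Nil Then Return ""

  // MediaBox is [left bottom right top] in points; 72 points = 1 inch.
  Dim widthIn As Double = (Val(m.SubExpressionString(3)) - Val(m.SubExpressionString(1))) / 72
  Dim heightIn As Double = (Val(m.SubExpressionString(4)) - Val(m.SubExpressionString(2))) / 72
  Return Str(widthIn) + " x " + Str(heightIn)
End Function
```

For the MediaBox shown above, 612 / 72 and 792 / 72 work out to 8.5 by 11 inches, i.e. US Letter.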
Color space (for a specific JPEG image)
7 0 obj
<< /Length 3113 >>
1 0 0 1 0 0 cm
16 0 0 16 136 656 cm
and while we are at it
1 0 obj
/Author (Not Provided)
/Title (Not Provided)
/Producer (SimplePDF for XOJO)
/Creator (My Application.debug)
NOTE: the number pairs (i.e. 4 0 obj etc.) are relative to a particular PDF and may or may not be the same as in any other PDF document.
Also, the author may have chosen to obfuscate things by placing each parameter in a child object, making parsing more difficult
And lastly… the entire document may be encrypted leaving little if any of this information in clear text
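One cheap sanity check before attempting any text scraping is to see whether the file mentions an /Encrypt entry at all, since encrypted PDFs reference their encryption dictionary that way. The sketch below is only a heuristic under that assumption: LooksEncrypted is a made-up name, and a plain substring search can give false positives if the text "/Encrypt" happens to appear elsewhere in the file.

```xojo
Function LooksEncrypted(pdfFile As FolderItem) As Boolean
  // Hypothetical helper: True if the raw file text mentions /Encrypt.
  Dim t As TextInputStream = TextInputStream.Open(pdfFile)
  t.Encoding = Encodings.ISOLatin1 // treat every byte as one character
  Dim data As String = t.ReadAll
  t.Close

  // InStr returns 0 when the text is not found (the search ignores case).
  Return InStr(data, "/Encrypt") > 0
End Function
```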
Thanks Dave. I appreciate the help. I guess I should have been a little clearer. I don’t have a clue how to use what you just provided. I seldom have a need to crack open a file other than a text file so I was hoping for an example on what I do from the point I get the PDF as a folder item to read this data using the format you provided.
You need to be able to parse the contents of the PDF file and find items like those I showed above…
It is sometimes a trivial task… but 90% of the time it is near impossible without some complex coding on your part.
The examples I provided are from clear text PDF files with everything in a simple logical configuration. But most times you won’t find that… You will find encrypted PDF contents, and/or those items spread all over the internals of the document. And without an intimate knowledge of how PDF works, you will never be able to unravel it.
I spent months studying the PDF whitepapers in order to create my SimplePDF class… and that class MAKES PDF files, it doesn't decipher them
[quote=65336:@Tom Dixon]Does anyone here have any sample code that would allow me to retrieve a count of the number of pages in a PDF file?[/quote]
OK. That helps a LOT! I have the luxury that all these PDFs are created internally by the same application. I tried opening a couple of examples in WordPad and found
<</Count 2/Kids[13 0 R 14 0 R]/Type/Pages>> in one file that had 2 pages and
<</Count 4/Kids[5 0 R 6 0 R 7 0 R 8 0 R]/Type/Pages>> in another file that had 4 pages. For my purposes at present, I presume I can open them in Xojo as a text file, read until I find /Count in the current line, and capture the number that follows. Since the files should be consistent, do you see any glaring problems with that?
Russ, this will be on Windows.
I have a class that uses XPDF to pull out some info about PDFs (size, page count, etc…) but I have only used it on OS X. It looks doable on Windows.
XPDF is GPL, so you have to respect that licensing structure.
Frankly, I think Dave’s approach is better simply because it has less reliance on third parties.
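For anyone who does want to try the XPDF route mentioned above, one possible shape (not necessarily how the class described here works) is to run Xpdf's pdfinfo command-line tool through Xojo's Shell class and read its "Pages:" line. This sketch assumes pdfinfo is installed and on the PATH; PageCountFromPdfInfo is a made-up name.

```xojo
Function PageCountFromPdfInfo(pdfFile As FolderItem) As Integer
  // Hypothetical helper: runs Xpdf's pdfinfo and parses its "Pages:" line.
  // Returns -1 if the command fails or no "Pages:" line is found.
  Dim sh As New Shell
  sh.Execute("pdfinfo " + pdfFile.ShellPath)
  If sh.ErrorCode <> 0 Then Return -1

  Dim re As New RegEx
  re.SearchPattern = "Pages:\s+(\d+)"
  Dim m As RegExMatch = re.Search(sh.Result)
  If m = Nil Then Return -1

  Dim n As Integer = Val(m.SubExpressionString(1))
  Return n
End Function
```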
If you need a full-featured professional PDF library, well, we have the MBS Xojo DynaPDF Plugin.
This seems to work for my purposes
Function GetPageCount(pdfFile As FolderItem) As String
  // Returns the page count as text, or "" if no matching line is found.
  Dim rowFromFile, strAry(), countAry() As String
  Dim FilePattern As RegEx
  Dim FilePatternMatch As RegExMatch
  Dim t As TextInputStream
  FilePattern = New RegEx
  FilePattern.SearchPattern = "^<</Count.*$"
  Try
    t = TextInputStream.Open(pdfFile)
    t.Encoding = Encodings.UTF8
    While Not t.EOF
      rowFromFile = t.ReadLine
      FilePatternMatch = FilePattern.Search(rowFromFile)
      If FilePatternMatch <> Nil Then
        // "<</Count 2/Kids[...]" splits on "/" into "<<", "Count 2", "Kids[...]", ...
        strAry = Split(rowFromFile, "/")
        // "Count 2" splits on " " into "Count", "2"
        countAry = Split(strAry(1), " ")
        t.Close
        Return countAry(1)
      End If
    Wend
    t.Close
  Catch e As IOException
    MsgBox("Error accessing file.")
  End Try
  Return ""
End Function
You are relying on the fact the COUNT will always be the first item in the PAGES dictionary entry, and that the whitespace preceding the entry will be constant. This may or may not be true…
Even the same software may place those items in different locations based on how other items affect the PDF internal structure.
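To make that concrete, here is one way the lookup could be loosened so it no longer depends on /Count being first or on exact whitespace. It is a sketch under assumptions, not a general PDF parser: it assumes the page-tree dictionary is in clear text (not encrypted or packed into a compressed object stream) and contains no nested << >> dictionary. GetPageCountLoose is a made-up name.

```xojo
Function GetPageCountLoose(pdfFile As FolderItem) As Integer
  // Hypothetical variant: returns -1 if no page-tree dictionary is found.
  Dim t As TextInputStream = TextInputStream.Open(pdfFile)
  t.Encoding = Encodings.ISOLatin1 // one character per byte
  Dim data As String = t.ReadAll
  t.Close

  // Matches each << ... >> run that has no nested dictionary inside it.
  Dim dictRe As New RegEx
  dictRe.SearchPattern = "<<[^<>]*>>"
  // PDF names are case sensitive, so match /Type /Pages and /Count exactly.
  Dim typeRe As New RegEx
  typeRe.Options.CaseSensitive = True
  typeRe.SearchPattern = "/Type\s*/Pages\b"
  Dim countRe As New RegEx
  countRe.Options.CaseSensitive = True
  countRe.SearchPattern = "/Count\s*(\d+)"

  Dim best As Integer = -1
  Dim m As RegExMatch = dictRe.Search(data)
  While m <> Nil
    Dim dict As String = m.SubExpressionString(0)
    If typeRe.Search(dict) <> Nil Then
      Dim c As RegExMatch = countRe.Search(dict)
      If c <> Nil Then
        Dim n As Integer = Val(c.SubExpressionString(1))
        // With nested page trees, the root node carries the largest /Count.
        If n > best Then best = n
      End If
    End If
    m = dictRe.Search // continue after the previous match
  Wend
  Return best
End Function
```

The order of the keys inside the dictionary no longer matters here, so both <</Count 2/Kids[...]/Type/Pages>> and a dictionary that lists /Type /Pages first would be handled the same way.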
[quote=65394:@Dave S]You are relying on the fact the COUNT will always be the first item in the PAGES dictionary entry, and that the whitespace preceding the entry will be constant. This may or may not be true…
Even the same software may place those items in different locations based on how other items affect the PDF internal structure.[/quote]
Yes, I’m aware of that. I’m also not returning anything from the method if nothing is found. This was quick and dirty to see if it worked. Knowing how these PDFs are created, and that they never have fewer than 2 pages or more than 4, I can likely rely on the dictionary entry being consistent. I’ll give it some more thought, but if you have any suggestions I’m open to hearing them.