How to validate PDF

Lee_Miller · January 3, 2016, 5:42pm

I’m going to allow users with a desktop app (Windows and OSX) to upload a PDF file to a web server.
The question is: I would like to validate that the file is indeed a PDF file and not just rely on checking to see if the extension says “PDF”. Can this be done totally within Xojo?

Please point me in the right direction

Thanks

Alexander_van_der_Linden · January 3, 2016, 5:53pm

%PDF-1.5 %µµµµ 1 0 obj <</Type/Catalog/Pages 2 0 R/Lang(nl-NL) /StructTreeRoot 24 0 R/MarkInfo<</Marked true>>>> endobj 2 0 obj <</Type/Pages/Count 1/Kids[ 3 0 R] >> endobj 3 0 obj <</Type/Page/Parent 2 0 R/Resources<</ExtGState<</GS5 5 0 R/GS9 9 0 R>>/XObject<</Meta6 6 0

This is a copy of the first bytes of a PDF file. If you read the first 8 bytes (%PDF-1.5), they tell you whether the file is a PDF and which PDF version is used, in this case 1.5.

Yes, you can do this with Xojo. There are many ways to accomplish this; you could use the BinaryStream or even the TextInputStream object.

Lee_Miller · January 3, 2016, 6:03pm

If it only takes 8 bytes to find out, it would seem that the BinaryStream would be the first choice? Would you concur?

Alexander_van_der_Linden · January 3, 2016, 6:08pm

Dim readFile As FolderItem = GetOpenFolderItem("text/plain")
If readFile <> Nil Then
  // Open the file read-only (second argument False)
  Dim ReadStream As BinaryStream = BinaryStream.Open(readFile, False)
  ReadStream.LittleEndian = True
  // Show the first 256 bytes in the text area
  Textarea1.Text = ReadStream.Read(256, Encodings.MacRoman)
  ReadStream.Close
End If

This example code reads 256 bytes from a BinaryStream. To be sure, you could read bytes until you encounter the next “%”; it seems that the PDF header info sits between these delimiters.
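For the upload check itself, reading only the first five bytes and comparing them with “%PDF-” may be enough. The following is a minimal sketch, not a definitive implementation; the function name LooksLikePDF is hypothetical, and it assumes the classic framework's BinaryStream and StrComp.

```xojo
// Hypothetical helper: True if the file begins with the five bytes "%PDF-".
Function LooksLikePDF(f As FolderItem) As Boolean
  If f Is Nil Then Return False
  If Not f.Exists Then Return False
  If f.Directory Then Return False

  Try
    Dim bs As BinaryStream = BinaryStream.Open(f, False) // False = read-only
    Dim header As String = bs.Read(5) // raw bytes, no encoding applied
    bs.Close
    // Mode 0 makes StrComp compare byte for byte (case-sensitive)
    Return StrComp(header, "%PDF-", 0) = 0
  Catch e As IOException
    Return False // the file could not be opened or read
  End Try
End Function
```

A False result means the file is not a PDF. A True result only means the file claims to be one.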

Lee_Miller · January 3, 2016, 6:18pm

Thanks - Now it’s time to experiment!

Tim_Hare · January 3, 2016, 7:54pm

I would use a TextInputStream. It feels a little more lightweight and is just as capable of getting an arbitrary number of bytes from the beginning of a file. Don’t be fooled by the “Text” designation. It isn’t limited.
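A minimal sketch of the TextInputStream route, under the assumption that TextInputStream.Read(count) returns the first count bytes of the file; the function name HeaderViaTextStream is hypothetical.

```xojo
// Hypothetical helper: returns the first 8 bytes of the file, or "" on failure.
Function HeaderViaTextStream(f As FolderItem) As String
  If f Is Nil Then Return ""
  If Not f.Exists Then Return ""

  Try
    Dim tis As TextInputStream = TextInputStream.Open(f)
    Dim header As String = tis.Read(8) // e.g. "%PDF-1.5"
    tis.Close
    Return header
  Catch e As IOException
    Return "" // the file could not be opened or read
  End Try
End Function
```

The caller can then test the result with something like StrComp(LeftB(header, 5), "%PDF-", 0) = 0.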

Lee_Miller · January 3, 2016, 10:32pm

So you think that it would be just as simple with TextInputStream as with BinaryStream? I will try a test with both then. Thanks

DaveS · January 4, 2016, 2:22am

Remember, the first FIVE bytes will be “%PDF-”, followed by a 3-character version number “#.#” (could be 1.0 to 1.7),
then usually followed by a CR, LF or CRLF (so it could be 1 or 2 bytes),
then another “%” followed by at least 4 (maybe more) NON-ASCII bytes (i.e. greater than 0x7F), but they can be ANYTHING.

So that’s a minimum of 14 or 15 bytes, depending on the line-ending sequence.
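The first eight bytes of that layout can be checked, and the version pulled out, with a few byte comparisons. This is a sketch only; the function name PDFVersionFromHeader is hypothetical, and it accepts any digit, dot, digit pattern rather than a fixed list of versions.

```xojo
// Hypothetical helper: returns the version such as "1.5",
// or "" if the header does not have the form "%PDF-#.#".
Function PDFVersionFromHeader(header As String) As String
  If LenB(header) < 8 Then Return ""
  If StrComp(LeftB(header, 5), "%PDF-", 0) <> 0 Then Return ""

  // Byte values of the three characters after "%PDF-"
  Dim major As Integer = AscB(MidB(header, 6, 1))
  Dim sep As Integer = AscB(MidB(header, 7, 1))
  Dim minor As Integer = AscB(MidB(header, 8, 1))

  // ASCII: "0" to "9" are 48 to 57, "." is 46
  If major < 48 Or major > 57 Then Return ""
  If sep <> 46 Then Return ""
  If minor < 48 Or minor > 57 Then Return ""

  Return Chr(major) + "." + Chr(minor)
End Function
```

The header argument is expected to hold at least the first 8 bytes read from the start of the file.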

Lee_Miller · January 4, 2016, 4:07am

It would seem that to confirm that this file is a PDF, I would only need the first 4 bytes, checking the last 3. Is it possible for this information to move further in and that is the reason I need to read in so many bytes?

DaveS · January 4, 2016, 6:14am

Looking even at the first 14 or 15 bytes and assuming they match the criteria set forth by the PDF specification does not by any means guarantee the file is a valid PDF file …
If the file does NOT start with this sequence it is NOT a PDF, but if it does, it only MIGHT be a PDF

Christian_Schmitz · January 4, 2016, 8:31am

With DynaPDF I can create PDFs in over a dozen different versions, including 2.0, where the file header starts with “%PDF-2.0”.

Lee_Miller · January 4, 2016, 12:59pm

Dave, when you said “might be a PDF”, did you mean as a possibility that someone could take a file that is not a PDF, go in and change the header by placing “%PDF-” there, and then attempt to pass it off as a PDF?

Christian_Schmitz · January 4, 2016, 2:00pm

Only reading the PDF structure can make sure it’s a valid PDF,
and not a half-downloaded file where the end is missing.
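A cheap partial test for a truncated file is to look for the “%%EOF” marker that ends a PDF file. The sketch below is only a heuristic and does not replace reading the structure; the function name HasEOFMarker and the 1024-byte search window are assumptions.

```xojo
// Hypothetical helper: True if "%%EOF" appears in the last 1024 bytes of the file.
Function HasEOFMarker(f As FolderItem) As Boolean
  If f Is Nil Then Return False
  If Not f.Exists Then Return False

  Try
    Dim bs As BinaryStream = BinaryStream.Open(f, False) // read-only
    Dim tailSize As UInt64 = 1024 // assumption: size of the search window
    If bs.Length > tailSize Then
      bs.Position = bs.Length - tailSize // jump close to the end of the file
    End If
    Dim tail As String = bs.Read(1024)
    bs.Close
    // InStrB searches byte-wise and returns 0 when the marker is absent
    Return InStrB(tail, "%%EOF") > 0
  Catch e As IOException
    Return False // the file could not be opened or read
  End Try
End Function
```

A file that passes both the header test and this test can still be malformed in the middle.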

DaveS · January 4, 2016, 3:36pm

That is exactly what I meant… not that it would be malicious, just that it does not guarantee a valid PDF

Michel_Bujardet · January 4, 2016, 6:25pm

That, or a PDF file that did not complete a download, or an ill-formed one (generated by a buggy program). I doubt many people will create fake PDF files.

Lee_Miller · January 4, 2016, 6:40pm

Christian, I thought that reading the opening segment of the file was the way to determine if it was in fact a PDF file. It sounds like you are saying something more. Might I also need to look at the end to confirm its structure?

Norman_P · January 4, 2016, 6:43pm

To validate that it’s indeed a conformant PDF you’d have to open & read it
Otherwise it could be malformed in the middle

This is not unique to PDFs
For many file formats, the only way to know whether a file really IS what its first few bytes say it is, is to read the entire thing

Tim_Hare · January 4, 2016, 6:47pm

There are 3 possibilities:

1. Not a PDF
2. Valid, well-formed PDF
3. Corrupted PDF

The absence of the header will identify #1. You have to decide if you care about #3. Your statement about “validate PDF”, together with your apparent lack of knowledge, makes that ambiguous. If all you want to know is if the file is supposed to be a PDF, as opposed to a word processing document or spreadsheet, then checking the header is sufficient. The onus is on the user to provide a non-corrupted PDF file. If you want to know if the file is in fact a “valid” PDF - legal pdf structure and not corrupted in any way - then the only way to do so is to read the entire thing and analyze its contents, which is a much larger task.

Alexander_van_der_Linden · January 4, 2016, 7:09pm

In this PDF you can read all about valid PDF files and how to test for corrupted PDFs. Norm is right: if well-formedness is important to you, check the whole thing.

You could, however, take a shortcut: use the Adobe PDF reader to test the file for you. In this PDF you can read how to control the Reader from code. I did this years ago (with Delphi, that is), but you can do this with Xojo.

Norman_P · January 4, 2016, 7:27pm

Only on Windows