I’m going to allow users with a desktop app (Windows and OSX) to upload a PDF file to a web server.
The question is: I would like to validate that the file is indeed a PDF file and not just rely on checking to see if the extension says “PDF”. Can this be done totally within Xojo?
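Dim readFile As FolderItem = GetOpenFolderItem("text/plain") // filter from the docs example; adjust it to allow PDFs
If readFile <> Nil Then
  Dim ReadStream As BinaryStream = BinaryStream.Open(readFile, False)
  ReadStream.LittleEndian = True
  Dim header As String = ReadStream.Read(256) // first 256 bytes of the file
  ReadStream.Close
End If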
This example code reads 256 bytes from a BinaryStream. To be sure, you could instead read N bytes until you encounter the next “%”; it seems that the PDF info sits between these delimiters.
I would use a TextInputStream. It feels a little more lightweight and is just as capable of getting an arbitrary number of bytes from the beginning of a file. Don’t be fooled by the “Text” designation. It isn’t limited.
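A minimal sketch of that idea, assuming the classic framework's `TextInputStream.Open` and `Read(count)`. The function name `StartsWithPDFMarker` is hypothetical and only for illustration; it is not something from the thread or from Xojo itself.

```xojo
// Hypothetical helper: True if the file begins with the bytes "%PDF-".
Function StartsWithPDFMarker(f As FolderItem) As Boolean
  If f Is Nil Then Return False
  If Not f.Exists Or f.Directory Then Return False

  Dim header As String
  Try
    Dim tis As TextInputStream = TextInputStream.Open(f)
    header = tis.Read(5) // Read takes a byte count, so this is the first 5 bytes
    tis.Close
  Catch e As IOException
    Return False // unreadable file: treat it as "not a PDF"
  End Try

  // Mode 0 makes StrComp compare the raw bytes (case-sensitive).
  Return StrComp(header, "%PDF-", 0) = 0
End Function
```

A BinaryStream would work the same way here; the choice between the two is mostly a matter of taste for a read this small.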
remember the first FIVE bytes will be “%PDF-” followed by a 3 character number “#.#” (could be 1.0 to 1.7)
then usually followed by a CR, LF or CRLF (so it could be 1 or 2 bytes)
then another “%” followed by at least 4 (maybe more) NON-ASCII bytes (i.e. greater than 0x7F); beyond that they can be ANYTHING
so that's a minimum of 14 or 15 bytes, depending on the line-ending sequence
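The layout described above could be checked roughly like this. It is only a sketch: `PDFHeaderVersion` is a made-up name, and it deliberately does not test the second “%” line with the non-ASCII bytes, since that line is described as usual rather than guaranteed.

```xojo
// Hypothetical helper: returns the version text (for example "1.4") if the
// first bytes look like "%PDF-#.#" followed by CR, LF or CRLF; otherwise "".
Function PDFHeaderVersion(f As FolderItem) As String
  If f Is Nil Then Return ""
  If Not f.Exists Or f.Directory Then Return ""

  Dim head As String
  Try
    Dim bs As BinaryStream = BinaryStream.Open(f, False)
    head = bs.Read(15) // 5 + 3 + (1 or 2) + 1 + 4 = 14 or 15 bytes
    bs.Close
  Catch e As IOException
    Return ""
  End Try

  // Need at least the marker, the version and one line-ending byte.
  If LenB(head) < 9 Then Return ""

  // Bytes 1-5: the literal marker, compared byte for byte.
  If StrComp(LeftB(head, 5), "%PDF-", 0) <> 0 Then Return ""

  // Bytes 6-8: digit, period, digit (ASCII 48-57 are "0" to "9", 46 is ".").
  Dim major As Integer = AscB(MidB(head, 6, 1))
  Dim dot As Integer = AscB(MidB(head, 7, 1))
  Dim minor As Integer = AscB(MidB(head, 8, 1))
  If major < 48 Or major > 57 Then Return ""
  If dot <> 46 Then Return ""
  If minor < 48 Or minor > 57 Then Return ""

  // Byte 9: CR (13) or LF (10); a CRLF pair also starts with CR.
  Dim eol As Integer = AscB(MidB(head, 9, 1))
  If eol <> 13 And eol <> 10 Then Return ""

  Return DefineEncoding(MidB(head, 6, 3), Encodings.ASCII)
End Function
```

An empty result means the file is definitely not a PDF by this test; a non-empty result only means the header looks right.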
It would seem that to confirm that this file is a PDF, I would only need the first 4 bytes, checking the last 3. Is it possible for this information to move further in and that is the reason I need to read in so many bytes?
Even looking at the first 14 or 15 bytes, and assuming they match the criteria set forth by the PDF specification, does not by any means guarantee the file is a valid PDF file …
If the file does NOT start with this sequence it is NOT a PDF, but if it does, it only MIGHT be a PDF
Dave, when you said “might be a PDF”, did you mean as a possibility that someone could take a file that is not a PDF, go in and change the header by placing “%PDF-” there, and then attempt to pass it off as a PDF?
Christian, I thought that reading the opening segment of the file was the way to determine if it was in fact a PDF file. It sounds like you are saying something more. Might I also need to look at the end to confirm its structure?
The absence of the header will identify #1. You have to decide if you care about #3. Your statement about “validate PDF”, together with your apparent lack of knowledge, makes that ambiguous. If all you want to know is if the file is supposed to be a PDF, as opposed to a word processing document or spreadsheet, then checking the header is sufficient. The onus is on the user to provide a non-corrupted PDF file. If you want to know if the file is in fact a “valid” PDF - legal pdf structure and not corrupted in any way - then the only way to do so is to read the entire thing and analyze its contents, which is a much larger task.
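On the earlier question about looking at the end of the file: a PDF is expected to finish with a `%%EOF` line, so a cheap extra check is to look for that marker near the end. The sketch below is an assumption-laden heuristic, not validation: the name `HasEOFMarkerNearEnd` and the 1024-byte search window are choices made for this example, and a file that passes can still be corrupted.

```xojo
// Hypothetical helper: True if "%%EOF" appears in the last kTailSize bytes.
Function HasEOFMarkerNearEnd(f As FolderItem) As Boolean
  If f Is Nil Then Return False
  If Not f.Exists Or f.Directory Then Return False

  Const kTailSize = 1024 // assumption: how far back from the end to search

  Dim tail As String
  Try
    Dim bs As BinaryStream = BinaryStream.Open(f, False)
    // Skip ahead only if the file is longer than the search window.
    If bs.Length > kTailSize Then bs.Position = bs.Length - kTailSize
    tail = bs.Read(kTailSize)
    bs.Close
  Catch e As IOException
    Return False
  End Try

  // InStrB does a byte-wise search and returns 0 when nothing is found.
  Return InStrB(tail, "%%EOF") > 0
End Function
```

Combined with a header check, this catches truncated uploads as well as files of the wrong type, but anything stricter still means parsing the whole document.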
In this PDF you can read all about valid PDF files and how to test for corrupted PDFs. Norm is right: if well-formedness is important to you, check the whole thing.
You could, however, take a shortcut: use the Adobe PDF Reader to test the file for you. In this PDF you can read how to control the Reader from code. I did this years ago (with Delphi, that is), but you can do it with Xojo.