Common Document Validation

I work in a Windows network environment. We’re interested in purchasing or building a system that validates documents across our servers. We need a system that walks our servers and file systems, logs the directories of documents, hashes each file, and stores that information. If a file’s hash differs from the previously stored value, that should trigger a document validation process. We want to make sure each file is a valid instance of the format it claims to be. Examples: Is the file MyFile.PDF a valid PDF file? Is the file MyFile.xlsx a valid Excel file?

Common file types would be:

  • PDF
  • TIFF
  • DOC and DOCX
  • XLS and XLSX
  • JPG
  • GIF
  • PNG
  • … etc.

There are some PDF validation tools such as XPDF, PDFInfo, and ImageMagick, but is there something more general? Is there a system we could purchase that does what I’ve described above? I can build the system (with Xojo) if need be, but purchasing is preferable. Failing that, a command-line validation routine would be very helpful.

Thanks in advance for any advice.

Here’s something you might be able to adapt. LibreOffice includes a command-line tool, soffice (it is also available on Windows and Linux, possibly under a different name; I don’t remember). LibreOffice can open the documents you mentioned and convert them to, say, PDF or text. soffice can do the same from the command line, and if it can’t figure out the type of document, it will generate an error.

So my idea is to use soffice on the command line to convert each document to a throwaway PDF, check whether the conversion reports an error, and then delete the PDF.
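A minimal Python sketch of that idea, under some assumptions: soffice is on the PATH under that name, and the `--headless`, `--convert-to` and `--outdir` options behave as documented for LibreOffice. The function names are hypothetical. Because I would not rely on the exit code alone, the sketch also checks that a non-empty PDF was actually produced.

```python
import subprocess
import tempfile
from pathlib import Path


def build_convert_command(source, out_dir, soffice="soffice"):
    """Build the soffice command line for a headless conversion to PDF."""
    return [soffice, "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(source)]


def converts_cleanly(source, soffice="soffice", timeout=120):
    """Return True if soffice turns `source` into a non-empty PDF.

    The PDF is written to a temporary directory that is deleted afterwards.
    A missing soffice binary raises FileNotFoundError rather than being
    reported as an invalid document.
    """
    source = Path(source)
    with tempfile.TemporaryDirectory() as out_dir:
        cmd = build_convert_command(source, out_dir, soffice)
        try:
            result = subprocess.run(cmd, capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            # Assumption: a conversion that hangs is treated as a failure.
            return False
        produced = Path(out_dir) / (source.stem + ".pdf")
        return (result.returncode == 0
                and produced.is_file()
                and produced.stat().st_size > 0)
```

Starting soffice once per file is likely to be slow across a whole server, so it probably makes sense to run this only on files whose hash has changed.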

As an aside, we use soffice to great effect. Clients can create templates in MS Word and our app will copy those templates, replace placeholders with actual data from a range of records, then use soffice to generate PDFs that get attached to those records.

Thanks Kem! That’s a great solution for office documents.

I’ve also found a tool within ImageMagick called “identify” that reports metadata for some of the graphic file formats (JPG, GIF, PNG, etc.), which I can use for that set of files. It may also handle PDF and TIFF files, but I need to test that.

http://www.imagemagick.org/script/identify.php
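As a rough sketch of how identify could be scripted, the Python below asks it for the format name with `-format "%m"` and compares that against the file extension. It assumes the tool is on the PATH as `identify` (newer ImageMagick installs may expose it as `magick identify`), and the format names in the table are what I believe identify reports; both should be verified. All function names are hypothetical.

```python
import subprocess
from pathlib import Path

# Assumed mapping from file extension to the format name identify reports.
EXPECTED_FORMAT = {
    ".jpg": "JPEG", ".jpeg": "JPEG",
    ".gif": "GIF",
    ".png": "PNG",
    ".tif": "TIFF", ".tiff": "TIFF",
}


def first_format(identify_output):
    """Return the first reported format; multi-frame files print one per frame."""
    lines = identify_output.strip().splitlines()
    return lines[0].strip() if lines else None


def format_matches_extension(path, reported_format):
    """True if the reported format is the one expected for the extension."""
    expected = EXPECTED_FORMAT.get(Path(path).suffix.lower())
    return expected is not None and reported_format == expected


def identify_format(path, identify="identify"):
    """Run identify and return the format name, or None if it reports an error."""
    result = subprocess.run([identify, "-format", "%m\n", str(path)],
                            capture_output=True, text=True, timeout=60)
    if result.returncode != 0:
        return None
    return first_format(result.stdout)
```

A file would then pass when `format_matches_extension(path, identify_format(path))` is true.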

soffice is not limited to Office documents; it will read (and convert) image formats too.

Handy. I’ll do some performance testing in the next couple of days and let you know how it goes. Thanks for the tip.

Hmm… Isn’t that possible with some virus scanners?

I would think that a virus scanner would tell you if a file had a virus or malware of some kind. I’m not sure if it would tell you whether the JPG was a valid image file or not. Do you have an example that might apply?

All those file types should have typical beginning bytes that you can check.

Everything else depends on how deep you want to go. The ultimate test would be opening or displaying each file as the type it is supposed to be. What do you expect to find? Wrong file types? Something more sinister?
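The beginning-bytes check could look roughly like the Python sketch below. The signatures are the well-known ones for the file types listed in the original question; the function name and the choice to return None for unknown extensions are my own assumptions. Note that DOC and XLS share the OLE2 container signature, and DOCX and XLSX are ZIP archives, so this cannot tell those pairs apart.

```python
from pathlib import Path

OLE2 = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy Office container
ZIP = b"PK\x03\x04"                          # Office Open XML is a ZIP archive

# Leading bytes expected for each extension.
SIGNATURES = {
    ".pdf": [b"%PDF-"],
    ".tif": [b"II*\x00", b"MM\x00*"],
    ".tiff": [b"II*\x00", b"MM\x00*"],
    ".doc": [OLE2],
    ".xls": [OLE2],
    ".docx": [ZIP],
    ".xlsx": [ZIP],
    ".jpg": [b"\xff\xd8\xff"],
    ".jpeg": [b"\xff\xd8\xff"],
    ".gif": [b"GIF87a", b"GIF89a"],
    ".png": [b"\x89PNG\r\n\x1a\n"],
}


def has_expected_signature(path):
    """True/False if the extension is known, None if it is not in the table."""
    path = Path(path)
    expected = SIGNATURES.get(path.suffix.lower())
    if expected is None:
        return None
    with open(path, "rb") as handle:
        head = handle.read(8)  # the longest signature above is 8 bytes
    return any(head.startswith(signature) for signature in expected)
```

This is cheap enough to run on every changed file, but it only catches wrong or overwritten types. A truncated file still has a valid header, so it would need one of the deeper checks.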

Everything that can go wrong, will go wrong. We want to track valid changes as well as inadvertent changes and sinister goings-on.

  • We’d like to track the dates that files change for valid reasons.
  • We’d like to know when something goes wrong. Perhaps an application overwrites one file with another. Bad bad bad.
  • What if a file has partially been saved, and is truncated?
  • What if we get a ransomware virus on a workstation, and it runs rampant rewriting files? It’d be good to know sooner rather than later. Yes, we have up-to-date virus and malware definitions, but we’re looking for a “second opinion”.
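For the hash-and-compare part, here is a minimal Python sketch, assuming SHA-256 is an acceptable hash and that the previous scan is kept as a simple path-to-hash mapping (for example, saved as JSON somewhere outside the scanned tree). The function names are hypothetical.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large documents are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan_tree(root):
    """Return {full_path: sha256} for every readable file under root."""
    hashes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                hashes[full] = sha256_of(full)
            except OSError:
                # Locked or unreadable files are skipped in this sketch,
                # so they will show up as "removed" in the comparison.
                continue
    return hashes


def compare(previous, current):
    """List the paths that were added, removed, or changed between scans."""
    shared = previous.keys() & current.keys()
    return {
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
        "changed": sorted(p for p in shared if previous[p] != current[p]),
    }
```

Only the paths under "added" and "changed" would need to go through validation. An unusually large "changed" list from a single scan might also serve as the early ransomware warning, though the threshold for "unusually large" would have to be tuned for each site.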