Identify File type by Content

I need to insure (as much as possible) that a given file is NOT an SQLite database (either plain or encrypted)

Basically I am looking for files that are known NOT to fit the SQLite format, and since an encrypted file looks like gibberish, I am finding those I know, and rejecting them, anything else I let SQLite tell me with an 11/26 error when attempting to connect

It looks at the first 16 bytes of each file, and looks for known file signatures.
right now it can identify PDF GIF HTM XML JPG PNG TIF

Does anyone have other signature data that I should include?

Public Function IdentifyFile(f as FolderItem) as boolean
  Dim i As Integer
  Dim bs As BinaryStream
  Dim s As String
  Dim hx As String
  bs = BinaryStream.Open(f, False)
  s=bs.read(16,Encodings.utf8)
  bs.close
  hx=""
  //
  // PDF GIF HTM XML JPG PNG TIF
  //
  If Left(s,15)="SQlite format 3" Then hx="SQL"
  If Left(s,4) ="%PDF"   Then hx="PDF"
  If Left(s,6) ="GIF89a" Then hx="GIF"
  If InStr(s,"HTML")>0   Then hx="HTM"
  If InStr(s,"XML")>0    Then hx="XML"
  //
  If hx="" Then 
    For i=1 To Len(s)
      hx=hx+Right("00"+Hex(AscB(Mid(s,i,1))),2)
    Next i
  End If
  //
  If Left(hx, 8)="89504E47" Then hx="PNG"
  If Left(hx,18)="FFD8FFE000104A4649" Then hx="JPG"
  If Left(hx, 8)="4D4D002A" Or Left(hx,8)="49492A00" Then hx="TIF"
  //
  // at this point if hx<>3 char long then 
  // "maybe" its an Encrypted SQL file... so assume it is
  If Len(hx)>3 Then hx="SQL"
  //
  Select Case hx
  Case "PDF"
    hx="PDF document file."
  Case "JPG","GIF","TIF","PNG"
    If hx="TIF" Then hx="TIFF"
    hx=hx+" graphics file."
  Case "HTM"
    hx="HTML web file."
  Case "XML"
    hx=hx+" file."
  End Select
  //
  If hx<>"SQL" Then 
    Call newMsgBox("Invalid SQLite Database", _
    "This file looks like it is "+hx+" and not an SQLite Database", _
    customMsgBox.msgOkOnly)
  End If
  //
  Return (hx="SQL")
End Function

Take first 1000 chars and count how many are letters or digits. For an encrypted file you have rate below 50%.
And if you only count 2 or 3 letter runs it is much lower.

So just by this probability check you would sort out PNG, JPEG, Tiff and others just buy the text they include.

Uh… not really… an Unencrypted SQLite file has maybe 30 or 40 “letters” in the first 1k
while an ENCRYPTED file actually has many more 0x20-0x7F characters in the first 1k
So that doesn’t help.
PNG, GIF etc. also have a load of non-ASCII values but they have a specific signature allowing me to reject them as a possible SQLite database. An un-encrypted one has a signature as well…
It is the files where the contents are “binary” like data, that “might” be an encrypted SQLite database, but if I had more signatures of known file formats, I could reject them more gracefully… Otherwise Sqlite keeps asking for a password, and leaves it up to the user to wonder “why won’t this open for me”

If the need is to be able to determine whether file is a SQLite database, then the SQLite database specification provides the definitive answer:

[quote]1.2.1. Magic Header String

Every valid SQLite database file begins with the following 16 bytes (in hex): 53 51 4c 69 74 65 20 66 6f 72 6d 61 74 20 33 00. This byte sequence corresponds to the UTF-8 string “SQLite format 3” including the nul terminator character at the end.[/quote]

As soon as the file does not have the header above, it is not an SQLite database. This approach will not however tell what kind of file the “not SQLite” file actually is. Your approach provides much richer information, but it is going to be quite a lot of work to identify everything that is’nt an SQLite database.

Not a true statement.

that statement SHOULD read

If you read my post, and the attached code, you will see that I look for that header…
that is not the issue…
the issue is a file that contains “gibberish”… this file MIGHT be an Encrypted SQLite file (I know it won’t be an un-encrypted one)… but it MIGHT be something else… it is those “something else” I want to weed out, so I have a higher probablilty that the file IS an SQLIte file (either with the header, or not identified as something else)

and I’m not looking for “every possible” non-SQLITE format , just the ones that MOST people might have

You may want to add signatures of various compressed files, such as *.zip or *.gz etc

That is exactly what I was asking for… if people had signatures that they knew of…
Yes I can research them myself… but I might not think of all of the reasonable ones, and if others already had the information it would save time

https://en.wikipedia.org/wiki/List_of_file_signatures might help

actually, that helps ALOT

yes of course… (hand slaps forehead) Need to go back for more caffeine.

or just let a person select whatever file and when you try to connect (with or without a password) you’ll get an error which definitively tells you whether SQLite recognizes it

That is what I DON’T want to do…

you have to open every file you’ve identified just to find out that it is / is not a sqlite file buy checking for a signature
hows that any different than just asking sqlite to try & open it as a db ?

ask sqlite to open it and it will tell you

  1. its openable as an unencrypted DB
  2. its openable but require a password (so now you can re-prompt for one)
  3. its not a db

which is what youre trying to do on your own

To save the user extra work…
If it is in fact an unencrypted SQLite database… then no problem… it breezes right on by.
If It failes to open with either an error 11 or an error 26… then it MIGHT be an encrpted database (and it might not be)
The only way to tell is to inform the user that it MIGHT be, and “Hey do you know the password?”
If I know that the folderitem they chose CANNOT be an SQLite database, then I say Hey its a PNG (or whatever), try again
Is it the same number of dialogs? Yeah… But if the user accidently chose an invalid file and did not realize it (yeah it can happen), the get all p*ssed off when the password they KNOW to be correct isn’t working.

And since it is interogating one file, the overhead is a split fraction of a second

So… #1 would be easy… the file opens with no errors
but if it DOES (11 or 26) then you still don’t know

if its 11 or 26 you can say “I cannot open this”

Or you’re going to assume that there will never be a valid encrypted db that has one of those handful of bytes as the first 3 or 4 ?
That might be a bold assumption unless you have docs to the contrary that say sqlite will not do such a thing

One of the tools I have and use wont open dbs that aren’t named with a specific extension (annoying as hell) since “they arent databases”. Uh huh. I drop into sqlite3 in terminal and voila it opens sans extension. And it opens if I drag & drop the db on to the app icon. Just not in its file > open dialogs etc. And it doesnt put these in the recently used list - ah yeah they’re not databases (despite it having it open)
Another has other similar “oh but I know better than you” safe guards and, you guessed it, drop into sqlite3 and there’s the db. Open & ready to work with

Theres obviously a balance between too safe and too restrictive
Personally I’d be ok with “Hey this might be an encrypted db whats the password ?” and if I supply it and then it says “well this wont open so its really not a db” I’m ok with it

I’d have to agree with Norman on this one.

If someone tries to open Tadpole.exe with tadpole they shouldn’t get upset if it don’t open and keeps asking for a password. If you do continue this route you could have the message say ‘This appears to be…’ but still give the user the choice to enter a password. No point in preventing the user from trying to open a non-SQLite file.

+1

I think that Norman’s approach is more practical and Dave’s approach is a very polite and helpful one for his customers.