Detect Base-64

I recently discovered that a data file I have, collected from various sources over time, has some lines of text that are “clear text” and some that are base-64 encoded.

Is there a “simple” way to take a string and determine if it IS base-64 data? I need to convert all the data to “clear” text, and while I could do it one record at a time using “Eye-Ball Reader v1.0”, that would be very time consuming.

If I remember correctly, there are no headers/footers in base64 (but I'm not 100% sure, because there are probably multiple implementations).

I would just look at some basic Base64 rules like:

  • The string must only contain a-z, A-Z and 0-9 (Err… that’s 62 so there must be 2 more.)
  • The string’s length is a multiple of 4. If the data does not fill the last group of 4, it is padded with trailing ='s
  • and maybe some more.

and then try to decode it.

After that, run your EBR v1.0.
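The basic rules above can be sketched as a structural check. This is only an illustration in Python rather than Xojo, and the name `has_base64_shape` is made up for the example. It assumes the standard alphabet (the two extra characters are `+` and `/`) and unwrapped, single-line input:

```python
import re

# Standard alphabet: A-Z, a-z, 0-9, '+' and '/', in groups of four,
# with the last group optionally padded by one or two '=' characters.
_BASE64_RE = re.compile(
    r"(?:[A-Za-z0-9+/]{4})*"
    r"(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
)


def has_base64_shape(s: str) -> bool:
    """Return True if s is non-empty and has the shape of unwrapped Base64."""
    return bool(s) and _BASE64_RE.fullmatch(s) is not None
```

A check like this only says the string could be Base64. Plenty of short clear-text strings with no spaces or punctuation will pass it too.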

from C code:

                       // 00000000001111111111222222
                       // 01234567890123456789012345
static const char* cvt = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                       // 22223333333333444444444455
                       // 67890123456789012345678901
                         "abcdefghijklmnopqrstuvwxyz"
                       // 555555556666
                       // 234567890123
                         "0123456789+/";

There is a high probability of spaces in clear text. There are none in base-64.

“high” is not good enough… I need to be 100% sure

And it seems that the DECODEBASE64 function has zero internal error checking… it returns a string of gibberish if the incoming string is not Base64…

The rules seem to say

  • must be multiple of 4 in length
  • must only contain [a-z] [A-Z] and [0-9] with possible trailing “=” to pad length

but then “12345678” meets those criteria, yet decodes to gibberish.
For my situation I’m not overly concerned with b64 data that is all numeric, but…

maybe check the data after decoding?
Maybe you have a checksum inside or a header?

hard part is YOU recognize it as gibberish when you view it
but writing code to recognize it’s gibberish is much harder

[quote=219933:@Norman Palardy]hard part is YOU recognize it as gibberish when you view it
but writing code to recognize it’s gibberish is much harder[/quote]
exactly… although the gibberish did have non encoded characters (diamond ?)

As far as checksum etc… this is existing data… so I have to deal with what is coming to me.

DecodeBase64 “should” issue an exception… and it can be done…
I was just hoping to not have to create a complicated RegEx wrapper …

Then to complicate matters… I have since discovered that some of the Base64 data, once properly decoded then contains either RTF or HTML code… and I need the end result to be nothing but Text and Linefeeds…

The parts that are encoded, are they properly encoded? What I mean is, proper Base64 should start the line and be wrapped at … 72 characters, I think, but would have to check. The last line should end with CR+LF if there is going to be something after it.

Or is it mixed in with other text?

(It might help if you posted a sample.)

[quote=219937:@Dave S]DecodeBase64 “should” issue an exception… and it can be done…
[/quote]

No error. Same thing if the string is gibberish and not Base64.

[quote=219948:@Kem Tekinay]The parts that are encoded, are they properly encoded? What I mean is, proper Base64 should start the line and be wrapped at … 72 characters, I think, but would have to check. The last line should end with CR+LF if there is going to be something after it.

Or is it mixed in with other text?

(It might help if you posted a sample.)[/quote]

Base64 can be wrapped or not.

again… I have a mix of “clear” text, and base-64 encoded strings
the Base64 strings ARE properly encoded, and are NOT wrapped (most are not more than a single line as it is)

if IsBase64(s) then  s=DecodeBase64(s)
.... do something with "s"

You can’t.

There are two more: + and /. And the ='s are limited to a maximum of 2.

Since there is no header/footer to indicate the start and stop, you’ll end up with probability.
12345678 is too short to have a high probability. As in language guessing/detection. The more text you have, the more likely you guessed correctly.

[quote=219924:@Dave S]@Tim Hare There is a high probability of spaces in clear text. There are none in base-64.
“high” is not good enough… I need to be 100% sure[/quote]

You can make 100% sure with your own eyes because you understand English and recognize other human languages. Unless you build a huge AI program to do the same, you cannot use the same method.

There are rules about the way English is constructed that you can use, though.

Tim is right, the space is one major difference between Base64 and clear text. Others include the fact that sentences in clear text usually start with an uppercase letter followed by lowercase, and end with punctuation. There are combinations of consonants that do not exist in English; there are combinations of vowels that do not exist in English. Words usually do not exceed a certain length. We do not routinely employ “Honorificabilitudinitatibus”, and even less so complex chemical names.

I do not think a computer can get to 100%, but it can get pretty darn close to a statistical certainty.
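A few of these rules of thumb can be turned into a rough score. The sketch below is Python, not Xojo, and everything in it is an assumption made for illustration: the name `clear_text_score`, the weights, and the 20-character word limit are arbitrary and would need tuning against real data.

```python
def clear_text_score(s: str) -> float:
    """Rough score from 0 to 1; higher means 'more like ordinary prose'."""
    if not s:
        return 0.0
    words = s.split()
    score = 0.0
    # Prose contains spaces; unwrapped Base64 does not.
    if len(words) > 1:
        score += 0.5
    # Ordinary words rarely exceed a certain length (20 is an arbitrary cutoff).
    if words and max(len(w) for w in words) <= 20:
        score += 0.25
    # Sentences tend to start with an uppercase letter and end with punctuation.
    if s[0].isupper() and s[-1] in ".!?":
        score += 0.25
    return score
```

For example, `clear_text_score("The quick brown fox jumps.")` scores 1.0, while the Base64 form of the same sentence, `"VGhlIHF1aWNrIGJyb3duIGZveCBqdW1wcy4="`, scores 0.0. A short single word lands in between, which is exactly the uncertain zone discussed above.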

Could this be used for your purposes?

[code]
Private Function IsBase64(s As String) As Boolean
Dim result As Boolean

result = (s = EncodeBase64(DecodeBase64(s)))

return result
End Function[/code]

[quote=219973:@Alwyn Bester]Could this be used for your purposes?

[code]
Private Function IsBase64(s As String) As Boolean
Dim result As Boolean

result = (s = EncodeBase64(DecodeBase64(s)))

return result
End Function[/code][/quote]

Brilliant :)

except it won’t work

s="12345678"
result = (s = EncodeBase64(DecodeBase64(s)))
the re-encoded string and s are equal, so result is True

but the interim Decode is wrong… it is encoding improperly decoded data

There is no way for the computer to know if “12345678” is clear text or base64. I don’t know what you mean by “improperly decoded data”. “12345678” is perfectly valid base64 data. The decoded string is binary data that is meaningless to a human, but perfectly reasonable to a machine.

Alwyn’s function is useful in that it will detect a string that is not valid base64 without resorting to a regex (one could debate which approach is better, but the result would be about the same). DecodeBase64 seems to just ignore invalid characters in the string. To me, that does feel like a bug.
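For comparison, here is the same round-trip idea sketched in Python (not Xojo; `round_trips` is a hypothetical name). Python's `base64.b64decode` behaves in a similar way by default: it discards characters outside the alphabet rather than raising, so comparing the re-encoded result with the input is what exposes them.

```python
import base64
import binascii


def round_trips(s: str) -> bool:
    """True if decoding s and re-encoding the result gives back s exactly."""
    try:
        # With the default validate=False, characters outside the Base64
        # alphabet are silently discarded instead of causing an error.
        decoded = base64.b64decode(s)
    except (binascii.Error, ValueError):
        return False  # e.g. incorrect padding, or non-ASCII input
    return base64.b64encode(decoded).decode("ascii") == s
```

As in the Xojo version, `round_trips("12345678")` is `True`, because that string is valid Base64, while `round_trips("ab cd")` is `False`, because the space is lost in the round trip.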

ok… improper as in it is not human readable text… which all the data I’m involved with would be

it decodes to “D7 6D F8 E7 AE FC”

which doesn’t match any normal encoding
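The bytes quoted above can be reproduced, and tested against one "normal" encoding, with a short Python sketch. The helper names `decode_to_hex` and `is_utf8` are made up for the example, and UTF-8 is an assumption about what the clear text would be stored as:

```python
import base64


def decode_to_hex(s: str) -> str:
    """Strictly decode Base64 and show the bytes as spaced uppercase hex."""
    raw = base64.b64decode(s, validate=True)
    return raw.hex(" ").upper()


def is_utf8(raw: bytes) -> bool:
    """True if the bytes form valid UTF-8 text."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(decode_to_hex("12345678"))                          # D7 6D F8 E7 AE FC
print(is_utf8(base64.b64decode("12345678")))              # False
```

The byte D7 announces a two-byte UTF-8 sequence, but 6D is not a continuation byte, so the decode fails. A viewer that substitutes a replacement character for such bytes would probably explain the "diamond ?" seen earlier.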

I think the Base64 functions are lacking a bit, as I mentioned the ObjC/Swift version DOES catch this as an error,
but I need to catch this in XOJO…

So a robust function is going to need more… as Alwyn’s, while a great idea, has too many failure cases

s="aHomelyBrownCow2"

case in point… It DOES catch strings that contain characters not on the approved Base64 list, but not a string like this one.

the issue here is that “validly encoded as base 64” ≠ human readable
teach the program how to recognize human readable text & you’re good
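Putting both conditions together might look like the sketch below. It is Python rather than Xojo, and `decode_if_base64_text` is a hypothetical name. It assumes the encoded records hold UTF-8 text, and it uses "decodes to valid, printable UTF-8" as a crude stand-in for "human readable":

```python
import base64
import binascii


def decode_if_base64_text(s: str) -> str:
    """Return the decoded text if s is strict Base64 that decodes to
    printable UTF-8; otherwise return s unchanged."""
    try:
        # validate=True rejects any character outside the Base64 alphabet.
        raw = base64.b64decode(s, validate=True)
        text = raw.decode("utf-8")
    except (binascii.Error, ValueError):
        return s  # not Base64, or the decoded bytes are not UTF-8 text
    # Accept only printable characters plus common line/tab whitespace.
    if text and all(ch.isprintable() or ch in "\r\n\t" for ch in text):
        return text
    return s
```

With this, `"SGVsbG8gd29ybGQ="` becomes `"Hello world"`, while `"12345678"` and `"aHomelyBrownCow2"` come back unchanged because their decoded bytes are not valid UTF-8. It is still a probability game: a short clear-text string can decode to printable characters by chance, so this narrows the pile for the Eye-Ball Reader rather than replacing it.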