I recently discovered I have a data file that has been collected from various sources over time, and found that it has some lines of text that are “clear text” and some that are base-64 encoded.
Is there a “simple” way to take a string and determine if it IS base-64 data? I need to convert all the data to “clear” text, and while I could do it one record at a time using “Eye-Ball Reader v1.0”, that would be very time consuming.
[quote=219933:@Norman Palardy]hard part is YOU recognize it as gibberish when you view it
but writing code to recognize it’s gibberish is much harder[/quote]
exactly… although the gibberish did have non encoded characters (diamond ?)
As far as checksum etc… this is existing data… so I have to deal with what is coming to me.
DecodeBase64 “should” issue an exception… and it can be done…
I was just hoping to not have to create a complicated RegEx wrapper …
Then to complicate matters… I have since discovered that some of the Base64 data, once properly decoded then contains either RTF or HTML code… and I need the end result to be nothing but Text and Linefeeds…
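For the HTML half of that problem, here is a minimal sketch in Python (not Xojo) of reducing markup to text and linefeeds with the standard library's `html.parser`. The names `html_to_text` and `_TextExtractor` are made up for illustration, the choice of which tags become linefeeds is an assumption, and RTF is not handled at all.

```python
from html.parser import HTMLParser

# Tags treated as line breaks; this set is an assumption, adjust to the data.
_BREAK_TAGS = {"br", "p", "div"}

class _TextExtractor(HTMLParser):
    """Collects text nodes and turns a few block tags into linefeeds."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # "&amp;" arrives as "&"
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in _BREAK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Note: text inside <script> or <style> would also land here.
        self.parts.append(data)

def html_to_text(html: str) -> str:
    """Return the text content of an HTML fragment, with linefeeds only."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

Something like `html_to_text("<p>Hello <b>world</b></p>")` would then give back just the words, with the markup dropped.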
The parts that are encoded, are they properly encoded? What I mean is, proper Base64 should start the line and be wrapped at … 72 characters, I think, but would have to check. The last line should end with CR+LF if there is going to be something after it.
[quote=219937:@Dave S]DecodeBase64 “should” issue an exception… and it can be done…
[/quote]
No error. Same thing if the string is gibberish and not Base64.
[quote=219948:@Kem Tekinay]The parts that are encoded, are they properly encoded? What I mean is, proper Base64 should start the line and be wrapped at … 72 characters, I think, but would have to check. The last line should end with CR+LF if there is going to be something after it.[/quote]
again… I have a mix of “clear” text, and base-64 encoded strings
the Base64 strings ARE properly encoded, and are NOT wrapped (most are not more than a single line as it is)
if IsBase64(s) then s=DecodeBase64(s)
.... do something with "s"
There are two more. + and / or something. And I think the ='s are limited to a maximum of 2 or 3.
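To pin the rules down: in the standard alphabet (RFC 4648) the characters are A-Z, a-z, 0-9, + and /, the length is a multiple of 4, and there are at most two trailing ='s. A minimal sketch of that shape check in Python (not Xojo), assuming unwrapped single-line strings as described above; `is_base64_shaped` is a made-up name:

```python
import re

# Groups of 4 alphabet characters, optionally ending in a padded group
# ("xx==" or "xxx="). No whitespace or line wrapping is allowed.
_B64_SHAPE = re.compile(
    r"(?:[A-Za-z0-9+/]{4})*"
    r"(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
)

def is_base64_shaped(s: str) -> bool:
    """True if s is non-empty and syntactically valid, unwrapped Base64."""
    return bool(s) and _B64_SHAPE.fullmatch(s) is not None
```

This only answers "could this be Base64?". A string such as "12345678" passes the check, so it rules clear text out far more reliably than it rules Base64 in.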
Since there is no header/footer to indicate the start and stop, you’ll end up with probability.
12345678 is too short to have a high probability. As in language guessing/detection. The more text you have, the more likely you guessed correctly.
[quote=219924:@Dave S]@Tim Hare There is a high probability of spaces in clear text. There are none in base-64.
“high” is not good enough… I need to be 100% sure[/quote]
You can make 100% sure with your own eyes because you understand English and recognize other human languages. Unless you build a huge AI program to do the same, you cannot use the same method.
There are rules about the way English is constructed that you can use, though.
Tim is right, the space is one major difference between Base64 and clear text. Others include the fact that phrases in clear text usually start with an uppercase letter followed by lowercase, and end with a comma. There are combinations of consonants that do not exist in English; there are combinations of vowels that do not exist in English. Words usually do not exceed a certain length. We do not routinely employ “Honorificabilitudinitatibus” and even less complex chemical things.
I do not think a computer can get to 100%, but it can get pretty darn close to a statistical certainty.
There is no way for the computer to know if “12345678” is clear text or base64. I don’t know what you mean by “improperly decoded data”. “12345678” is perfectly valid base64 data. The decoded string is binary data that is meaningless to a human, but perfectly reasonable to a machine.
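One way to narrow the ambiguity, sketched here in Python (not Xojo): decode the candidate, then ask whether the result looks like text. This assumes the original clear text was UTF-8, which the thread does not confirm, and `decode_if_text` is a made-up name. It is a heuristic, not a proof.

```python
import base64

def decode_if_text(s: str):
    """Return the decoded text if s is Base64 that decodes to printable
    UTF-8 (an assumption about the data); otherwise return None."""
    try:
        raw = base64.b64decode(s, validate=True)  # reject invalid characters
        text = raw.decode("utf-8")                # reject non-UTF-8 bytes
    except ValueError:  # covers binascii.Error and UnicodeDecodeError
        return None
    # Reject empty results and control characters other than CR, LF and tab.
    if not text or any(not (ch.isprintable() or ch in "\r\n\t") for ch in text):
        return None
    return text

print(base64.b64decode("12345678").hex())  # d76df8e7aefc
print(decode_if_text("12345678"))          # None
print(decode_if_text("SGVsbG8gd29ybGQ="))  # Hello world
```

"12345678" decodes to six bytes that are not valid UTF-8, so under this assumption it stays as clear text. A short clear-text string could still decode to valid text by coincidence, and longer strings make that less likely.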
Alwyn’s function is useful in that it will detect a string that is not valid base64 without resorting to a regex (one could debate which approach is better, but the result would be about the same). DecodeBase64 seems to just ignore invalid characters in the string. To me, that does feel like a bug.
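For comparison, Python's standard decoder has the same lenient default and an opt-in strict mode. The sketch below is Python rather than Xojo and says nothing about how DecodeBase64 is implemented; `strict_decode` is a made-up name.

```python
import base64

def strict_decode(s: str):
    """Return the decoded bytes, or None if s is not valid Base64."""
    try:
        # validate=True raises instead of discarding invalid characters.
        return base64.b64decode(s, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return None

# Default mode discards characters outside the alphabet before decoding,
# so a stray space goes unnoticed:
lenient = base64.b64decode("SGVs bG8=")

# Strict mode reports the same input as invalid:
strict = strict_decode("SGVs bG8=")
```

Here `lenient` comes back as the bytes for "Hello" while `strict` is `None`, which is the distinction a validity check needs.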