Text, strings, and conversion from base64

Since text seems to be the way of the future, I’m trying to do as much as possible as text. I read character strings from a socket, and since I’m told by the sender what MIME charset is in use, I can concatenate these into a body, and then do something like:

[code]body = DecodeBase64 (body, tenc).ToText[/code]

Here, body is a Text, and tenc is the appropriate TextEncoding, chosen from what the remote end told me.

For the most part, this works. Sometimes, however, the remote end lies: it tells me it’s sending us-ascii when in fact it has sent Windows-1252, or some other mismatch. So the data is not trustworthy, but I still want to get as much good data as possible out of what is sent. The best I seem to be able to do with the above is catch the exception and return an empty string.

In PHP I can do thus:

[code]$retstr = @iconv ($charset, 'UTF-8//IGNORE', $intstr);[/code]

which asks for the string in the given charset to be converted to UTF-8, ignoring any bad chars. This is not optimal because nothing in the resulting text indicates that bad chars have been skipped.

Can I do better than just getting an empty result when the remote end is dishonest?

Being able to optionally and explicitly “recover” from invalid data during decoding sounds like a great feature request.

Interestingly, when I did some tests in a test project and split the statement up as:

[code]Dim intext, outtext As Text, mystr As String, tenc As TextEncoding

intext = "SGVyZSBpcyBzb21lIHTDqXh0"
tenc = GetInternetTextEncoding ("US-ASCII")
mystr = DecodeBase64 (intext, tenc)
outtext = mystr.ToText
[/code]

then the DecodeBase64 statement ran OK, even with bad data (the payload actually contains one UTF-8 character). It was the final statement, the conversion to Text, which failed with an exception.
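For comparison, here is a minimal sketch of the same two-step split in Python rather than Xojo (none of the names below are Xojo API). It assumes the same payload and the same declared charset, and shows that the base64 step only yields raw bytes, while the failure appears when those bytes are interpreted as text:

```python
import base64

payload = "SGVyZSBpcyBzb21lIHTDqXh0"

# Step 1: base64 decoding only produces raw bytes; no charset is checked here.
raw = base64.b64decode(payload)

# Step 2: interpreting the bytes under the declared charset is where it fails.
try:
    text = raw.decode("ascii")
    failed = False
except UnicodeDecodeError as err:
    failed = True
    bad_offset = err.start  # index of the first byte that is not valid ASCII
```

The exception carries the offset of the first offending byte, which at least tells you how much of the body was good before the sender's claim stopped being true.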

I bet I could write a function to check UTF-8 encoding and replace all bad characters with a placeholder.

[quote=241672:@Tim Streater]Interestingly, when I did some tests in a test project and split the statement up as:

[code]dim intext, outtext As Text, mystr as String, tenc as TextEncoding

intext = "SGVyZSBpcyBzb21lIHTDqXh0"
tenc = GetInternetTextEncoding ("US-ASCII")
mystr = DecodeBase64 (intext, tenc)
outtext = mystr.ToText
[/code]

then the first statement ran OK, even with bad (actually one UTF-8 char) data. It was the final statement which failed with an exception.[/quote]

String is an unintelligent bag of bytes tied to a user-specified encoding. Text, on the other hand, actually enforces some invariants and prevents a lot of common mistakes.

Perhaps DecodeBase64 could replace unknown characters with the Unicode Replacement Character (U+FFFD), as an alternative to raising an exception. The exception is handy during debugging, but unless I can catch it in some useful way, it is less helpful in production.
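As an illustration of what such an option might look like, here is a rough sketch in Python, not Xojo. The function name `decode_base64_lossy` is hypothetical; the only real calls are Python's `base64.b64decode` and `bytes.decode` with `errors="replace"`, which substitutes U+FFFD for each byte that cannot be decoded. Returning a count addresses the complaint about `//IGNORE` leaving no trace of skipped data:

```python
import base64
from typing import Tuple

REPLACEMENT = "\ufffd"  # Unicode Replacement Character

def decode_base64_lossy(payload: str, charset: str) -> Tuple[str, int]:
    """Hypothetical helper: decode base64, then decode the bytes under the
    declared charset, marking undecodable bytes instead of raising."""
    raw = base64.b64decode(payload)
    text = raw.decode(charset, errors="replace")
    # Number of replacement characters in the result. This also counts any
    # U+FFFD that was genuinely present in the sender's text.
    return text, text.count(REPLACEMENT)

text, replaced = decode_base64_lossy("SGVyZSBpcyBzb21lIHTDqXh0", "ascii")
```

With the payload from the earlier test and a declared charset of US-ASCII, the two bytes of the UTF-8 "é" each become one replacement character, so the rest of the text survives and the caller can see that something was lost.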

I looked at the result of decoding a string containing numerous accented characters with the wrong encoding: every one of them shows up as nothing but a lozenge.

The best I can think of is to try encodings (UTF8, WindowsANSI, WindowsLatin1, whatever) one after another, each in its own Try-Catch block, until you get to the right one.
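The try-each-encoding idea could be sketched like this, again in Python as an illustration rather than Xojo. The function name `decode_with_fallbacks` and the default fallback list are assumptions made for the example, not anything from the framework:

```python
import base64

def decode_with_fallbacks(payload, declared, fallbacks=("utf-8", "cp1252")):
    """Hypothetical helper: try the declared charset first, then each
    fallback in order. Returns (text, name of the charset that worked),
    or (lossy text, None) if every strict attempt failed."""
    raw = base64.b64decode(payload)
    for name in (declared, *fallbacks):
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Invalid bytes for this charset, or a charset name we don't know.
            continue
    # Last resort: keep what we can and mark the rest with U+FFFD.
    return raw.decode("utf-8", errors="replace"), None

result = decode_with_fallbacks("SGVyZSBpcyBzb21lIHTDqXh0", "us-ascii")
```

The order of the fallbacks matters. UTF-8 is strict, so a successful UTF-8 decode is a strong hint, whereas single-byte encodings such as Windows-1252 accept almost any byte sequence and will "succeed" on data that is really something else. That is why the permissive ones belong at the end of the list.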