Can I improve on this?

In converting my existing application to Xojo, one small part of what I am doing is this. I have some data in a Text variable which is Base64-encoded. To go with that, I have an encoding to apply to the decoded data. The difficulty is that the encoding may be wrong, or the decoded data may contain bytes that are not valid in the supplied encoding. I want, however, to keep as much good data as possible.

After reading a lot of the documentation, I discovered today that the TextEncoding.ConvertDataToText method allows for lossy conversion. So I’ve put together this small example:

  Dim myblk As Xojo.Core.MemoryBlock, tmptxt As MemoryBlock, encoding As Xojo.Core.TextEncoding
  Dim i As Integer, charset, intxt, fintxt As Text, mybytes() As Byte
  
  charset = "ASCII"                                 // I'm told the data is ASCII (it's actually UTF-8)
  intxt   = "SGVyZSBpcyBzb21lIHTDqXh0"              // This is the Base64-encoded data
  tmptxt  = DecodeBase64(intxt)                     // Base64-decode the data into a classic MemoryBlock
  
  For i = 0 To tmptxt.Size - 1
    mybytes.Append(tmptxt.Byte(i))                  // Copy the classic MemoryBlock into an array of bytes
  Next
  
  Try
    encoding = Xojo.Core.TextEncoding.FromIANAName(charset)
  Catch                                             // Unknown charset name: fall back to Windows-1252
    encoding = Xojo.Core.TextEncoding.FromIANAName("Windows-1252")
  End Try
  
  myblk  = New Xojo.Core.MemoryBlock(mybytes)       // Make a Xojo.Core.MemoryBlock from the array of bytes (another copy)
  fintxt = encoding.ConvertDataToText(myblk, True)  // Finally convert to Text, allowing bad characters to be dropped
  
  MsgBox(fintxt)

To effect this, I have to go through several copying steps, and I wondered whether what I have here can be improved upon.

I don’t think I have the answer on your ‘lossy’ requirement but maybe this helps:

In a perfect world, you want something like this:

  Dim intxt As String = "SGVyZSBpcyBzb21lIHTDqXh0"
  MsgBox(DecodeBase64(intxt, Encodings.UTF8))

However, this only works if you know what the encoding is (text comes from a webpage, it was created on the local machine, you created it… etc.).
If the text comes from different places and it wasn’t specified what encoding it is in, prepare for a nightmare. There is no 100% reliable way to detect the encoding of a text file.
The best way I could come up with to get close is to go through a lot of steps, mainly striking out what it cannot be.

First, I do some testing on the text file itself: does it have a BOM, is it UTF-16, is it pure ASCII, and so on. Those cases can be detected fairly easily.
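
For the easy cases, a first pass might look roughly like this (a quick sketch, with a made-up helper name; it returns Nil when nothing obvious is found):

  Function GuessObviousEncoding(data As String) As TextEncoding
    // Hypothetical helper: handles only the easy cases (BOMs, pure ASCII)
    // and returns Nil when it cannot decide.
    
    // UTF-8 BOM: EF BB BF
    If LenB(data) >= 3 And AscB(MidB(data, 1, 1)) = &hEF And AscB(MidB(data, 2, 1)) = &hBB And AscB(MidB(data, 3, 1)) = &hBF Then
      Return Encodings.UTF8
    End If
    
    // UTF-16 BOMs: FF FE (little-endian) or FE FF (big-endian)
    If LenB(data) >= 2 Then
      Dim b1 As Integer = AscB(MidB(data, 1, 1))
      Dim b2 As Integer = AscB(MidB(data, 2, 1))
      If b1 = &hFF And b2 = &hFE Then Return Encodings.UTF16LE
      If b1 = &hFE And b2 = &hFF Then Return Encodings.UTF16BE
    End If
    
    // Pure ASCII: every byte below 128
    Dim pureASCII As Boolean = True
    For i As Integer = 1 To LenB(data)
      If AscB(MidB(data, i, 1)) > 127 Then
        pureASCII = False
        Exit
      End If
    Next
    If pureASCII Then Return Encodings.ASCII
    
    Return Nil  // undecided; fall through to the IsValidData testing below
  End Function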

If I’m still unsure, the best way I found is to check the text with ‘IsValidData’. I made a function that returns a dictionary of the encodings that are still possible:

  ....
  If Encodings.MacChineseTrad.IsValidData(txt) Then
    ValidEncodings.Value("MacChineseTrad") = Encodings.MacChineseTrad.internetName
  End If
  
  If Encodings.MacCroatian.IsValidData(txt) Then
    ValidEncodings.Value("MacCroatian") = Encodings.MacCroatian.internetName
  End If
  
  If Encodings.MacCyrillic.IsValidData(txt) Then
    ValidEncodings.Value("MacCyrillic") = "Mac OS Cyrillic"
  End If
  .... etc.

(Since I never found a way to use the encoding names/types as variables, I had to specify them all and go through them one at a time.)
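
That said, one thing that might avoid spelling out a separate If block for each encoding is to put the encoding objects themselves into an array and test them in a loop, since IsValidData can be called on any TextEncoding value. A minimal sketch, with a made-up function name:

  Function PossibleEncodings(txt As String) As Dictionary
    // Hypothetical variant of the function above: same idea, but the
    // candidate encodings are collected in an array and tested in a loop.
    Dim candidates() As TextEncoding
    candidates.Append(Encodings.ASCII)
    candidates.Append(Encodings.UTF8)
    candidates.Append(Encodings.WindowsLatin1)
    candidates.Append(Encodings.MacCyrillic)
    // ... add whichever encodings you want to consider
    
    Dim validEncodings As New Dictionary
    For Each enc As TextEncoding In candidates
      If enc.IsValidData(txt) Then
        validEncodings.Value(enc.internetName) = enc
      End If
    Next
    Return validEncodings
  End Function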

After that, all I can do is present the text to the user with a popup menu that has the remaining/possible encodings. In my case, I do some additional things, like guessing the language and presenting the most logical encoding (for that language) as the default.
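
Filling the popup from such a dictionary is then just a loop, something like this (assuming a PopupMenu called EncodingPopup):

  // Assuming a PopupMenu named EncodingPopup and the ValidEncodings dictionary from above
  EncodingPopup.DeleteAllRows
  Dim keys() As Variant
  keys = ValidEncodings.Keys
  For Each key As Variant In keys
    EncodingPopup.AddRow(key.StringValue)
  Next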

I do the switching between encodings in a function like this:

  (txt As String, encoding As String)
  
  If encoding = "ASCII" Then
    Return DefineEncoding(txt, Encodings.ASCII)
    
  ElseIf encoding = "DOSArabic" Then
    Return DefineEncoding(txt, Encodings.DOSArabic)
    
  ElseIf encoding =  "DOSBalticRim" Then
    Return DefineEncoding(txt, Encodings.DOSBalticRim)
    
  ElseIf encoding =  "DOSCanadianFrench" Then
    Return DefineEncoding(txt, Encodings.DOSCanadianFrench)
   .... etc.
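
As an aside: if the names you receive are IANA/internet names rather than the Xojo constant names used above, I believe the classic framework’s GetInternetTextEncoding can do the lookup directly (the internetName values stored in the dictionary earlier should be exactly what it expects), which would shrink the ladder to something like this sketch (the fallback choice is mine):

  Function EncodingFromName(txt As String, encodingName As String) As String
    // GetInternetTextEncoding maps an internet/IANA name (e.g. "windows-1252")
    // to a TextEncoding; as far as I know it returns Nil for names it does not recognise.
    Dim enc As TextEncoding = GetInternetTextEncoding(encodingName)
    If enc Is Nil Then enc = Encodings.WindowsLatin1   // my own fallback choice
    Return DefineEncoding(txt, enc)
  End Function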

As far as the ‘lossy’ requirement goes, I think that’s even more guessing. In my case, I get a lot of garbage, and the above works in most cases. But if (for example) Arabic text was converted to ISO-8859-1 and then saved as UTF-8, there’s not much you can do.

(By the way, be aware that when converting encodings, there are issues with 64-bit compiled apps.)

Oh, I’m told what the encoding is supposed to be. I’m told via something like this:

Content-Type: text/plain; charset=utf8

But that doesn’t guarantee that the supplied text will all be legal UTF-8. At the same time, the user would like to get as much out of it as possible. The best I can do is believe what I’m told, and then decode as much as possible without getting a runtime exception.
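
For what it’s worth, pulling the charset parameter out of such a header is simple enough; a rough sketch (the helper name is mine, and it is deliberately tolerant of the sloppy variants I see):

  Function CharsetFromContentType(header As String) As String
    // Hypothetical helper: extract the charset parameter from a Content-Type value.
    Dim pos As Integer = InStr(header, "charset=")     // InStr is case-insensitive
    If pos = 0 Then Return ""                          // no charset parameter given
    Dim value As String = Mid(header, pos + Len("charset="))
    Dim semi As Integer = InStr(value, ";")
    If semi > 0 Then value = Left(value, semi - 1)     // cut at the next parameter, if any
    value = ReplaceAll(value, """", "")                // strip optional quotes
    Return Trim(value)
  End Function

The result then goes into the FromIANAName lookup shown in my code above, with the Windows-1252 fallback catching anything it doesn’t recognise.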

Mostly the problems are caused by buggy software at the remote end (all the text I’m handling comes over the network). One case I saw recently was from a camera website I follow. Someone had posted there and added an emoticon (a GIF) at the end of their post. The resulting update to me was all legal UTF-8 except where the emoticon would have been: half a dozen illegal characters instead.

Unfortunately, this is the fate of text (or is it data?) from the internet: you think it is text, but it can have control characters (invisible to the eye) embedded in it!

Xojo talks about ‘gremlin’ characters.
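
If you just want to drop those invisible characters, something along these lines might do (a small helper of my own, not from the framework):

  Function StripControls(s As String) As String
    // Rough sketch: remove the C0 control characters except tab, LF and CR.
    Dim out As String = s
    For code As Integer = 0 To 31
      If code <> 9 And code <> 10 And code <> 13 Then
        out = ReplaceAll(out, Chr(code), "")
      End If
    Next
    Return out
  End Function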

I have methods in my M_String package that might help, such as choosing the encoding by analysis and converting to valid UTF-8 by dropping invalid bytes (M_Encoding.ByAnalysis and M_String.MakeValidUTF8). You can find the package here:

http://www.mactechnologies.com/index.php?page=downloads#m_string

[quote=250800:@Tim Streater]Oh, I’m told what the encoding is supposed to be. I’m told via such as:
But that doesn’t guarantee that the supplied text will all be legal utf8. At the same time the user would like to get as much out of it as possible. The best I can do is to believe what I’m told, and then be able to decode as much as possible, without getting a runtime exception.
[/quote]
In that case, I would try doing a DefineEncoding on the text to whatever the encoding should be, then a ConvertEncoding to UTF-8, and then feed it to Kem’s MakeValidUTF8.
To avoid runtime exceptions, IsValidData should help.
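
Put together, that suggestion might look roughly like this (the exact call form of Kem’s MakeValidUTF8 may differ from what I show here; treat this as a sketch):

  Function CleanToUTF8(raw As String, claimed As TextEncoding) As String
    // Trust the declared encoding first; IsValidData avoids exceptions later.
    Dim s As String = DefineEncoding(raw, claimed)
    If Not claimed.IsValidData(raw) Then
      s = DefineEncoding(raw, Encodings.WindowsLatin1)  // my own fallback choice
    End If
    s = ConvertEncoding(s, Encodings.UTF8)              // convert to UTF-8
    // Kem's M_String method drops any remaining invalid bytes
    // (call it as s.MakeValidUTF8 if the package exposes it as an extension method)
    Return MakeValidUTF8(s)
  End Function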