Encodings.UTF32BE.IsValidData crashes

Marcus_Kuba · February 14, 2015, 3:21pm

We have software that, for example, doesn’t recognize the " text delimiters when importing CSV into a database and simply imports them, resulting in strings that begin and end with ". Also, the source delimiter is “,” instead of “;”. So I began writing a more flexible CSV-to-CSV converter which changes the field delimiters and removes the text delimiters from the fields (if a field value contains a field delimiter, it is replaced so that the number of fields stays correct).
The problem now is that I have to know the encoding of the input file before I can analyze it, so I included Kem’s wonderful M_String modules and tried it the following way (this is the method that loads the source file and splits its lines into an array for further analysis):

  Dim Quellfile As TextInputStream
  ReDim QuellZeilen(-1)
  Dim s As String
  If f <> NIL And f.Exists then
    Quellfile = TextInputStream.Open(f)
    s = Quellfile.ReadAll
    s = ReplaceLineEndings(ConvertEncoding(DefineEncoding(s,M_Encoding.ByAnalysis(s,false)),Encodings.UTF8),EndOfLine)
    QuellZeilen = Split(s,EndOfLine)
    Quellfile.Close
    MsgBox(Str(QuellZeilen.Ubound+1)+" Zeile(n) gelesen")
  else
    MsgBox("Fehler beim Lesen der Quelle")
  end if

This works with a CSV file which I got from an iPhone app and which has a BOM.
But the original file I got from our customer comes from a Windows machine and has no BOM, and when I try to load that one, the app crashes. In the debugger I tracked it down to the point where the ByAnalysis method calls “Encodings.UTF32BE.IsValidData(src)”.
I then inserted the block

    if Encodings.UTF32BE.IsValidData(s) then
      MsgBox "OK"
    else
      MsgBox "KO"
    end if

after the Quellfile.ReadAll, and now it crashes at the first “if…” line. I tried it with different files, and it seems that UTF32BE.IsValidData crashes on every single file that is not valid instead of just returning false. I’m developing under Windows 7 Pro 64-bit and I have tried with Xojo 2014 r2.1 and r3.2.

Joe_Ranieri · February 14, 2015, 3:31pm

Please file a bug report.

Marcus_Kuba · February 14, 2015, 4:23pm

#38189

As a workaround I have now changed the corresponding line in the ByAnalysis method so that the result is always false (at the moment I can’t imagine that I will ever have to handle a UTF32BE file), but now the next problem arises:
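Instead of hard-coding false, a manual check could stand in for the crashing call. The following is only a sketch: `LooksLikeUTF32BE` is a hypothetical helper, not part of M_Encoding or the Xojo framework, and whether it avoids the crash described above is untested. It relies on two facts about UTF-32: the byte length must be a multiple of 4, and every 4-byte unit must be a code point no larger than &h10FFFF that is not a surrogate.

```xojo
// Hypothetical replacement for Encodings.UTF32BE.IsValidData (assumption, untested).
Function LooksLikeUTF32BE(raw As String) As Boolean
  Dim n As Integer = LenB(raw)
  // UTF-32 always uses 4 bytes per character; empty input is treated as "not UTF-32".
  If n = 0 Or n Mod 4 <> 0 Then Return False

  // Copy the bytes into a MemoryBlock and read them as big-endian 32-bit values.
  Dim mb As MemoryBlock = raw
  mb.LittleEndian = False

  For i As Integer = 0 To n - 4 Step 4
    Dim cp As UInt32 = mb.UInt32Value(i)
    If cp > &h10FFFF Then Return False                   // beyond the Unicode range
    If cp >= &hD800 And cp <= &hDFFF Then Return False   // surrogates are not valid in UTF-32
  Next

  Return True
End Function
```

For a typical single-byte text file this returns false almost immediately, because four consecutive text bytes read as one big-endian value are far larger than &h10FFFF.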

ByAnalysis now returns UTF16BE as the encoding, which is definitely not correct and results in output that looks Japanese or Chinese. In fact, the file is DOS/ASCII, but Encodings.DOSLatin1.IsValidData(string) also returns true for UTF16LE… so what can I do?

Beatrix_Willius · February 14, 2015, 5:14pm

That’s one of the problems with checking the encoding. You only weed out the grossly wrong encodings. Either you allow the user to change the encoding or you first try ASCII before trying UTF16LE.
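One way to sketch that ordering in code is shown here. `GuessEncoding` and its `fallback` parameter are hypothetical names, not part of M_Encoding, and the NUL-byte test is a heuristic swapped in for illustration: text in a single-byte encoding or in UTF-8 normally contains no zero bytes, whereas UTF-16 text with Latin characters contains many. UTF-32 is deliberately left out, so a UTF-32LE BOM would be misread as UTF-16LE.

```xojo
// Hypothetical helper (assumption): BOM first, then strict checks, then a fallback.
// Returns Nil when the data is undecided and the user should be asked.
Function GuessEncoding(raw As String, fallback As TextEncoding) As TextEncoding
  Dim n As Integer = LenB(raw)
  Dim b1 As Integer = -1
  Dim b2 As Integer = -1
  Dim b3 As Integer = -1
  If n >= 1 Then b1 = AscB(MidB(raw, 1, 1))
  If n >= 2 Then b2 = AscB(MidB(raw, 2, 1))
  If n >= 3 Then b3 = AscB(MidB(raw, 3, 1))

  // 1. A byte order mark is the only reliable signal, so it wins.
  If b1 = &hEF And b2 = &hBB And b3 = &hBF Then Return Encodings.UTF8
  If b1 = &hFE And b2 = &hFF Then Return Encodings.UTF16BE
  If b1 = &hFF And b2 = &hFE Then Return Encodings.UTF16LE

  // 2. No zero bytes: treat it as 8-bit text. Plain ASCII is also valid UTF-8;
  //    anything else falls back to the single-byte encoding chosen by the caller.
  If InStrB(raw, ChrB(0)) = 0 Then
    If Encodings.UTF8.IsValidData(raw) Then Return Encodings.UTF8
    Return fallback
  End If

  // 3. Zero bytes without a BOM: probably UTF-16, but the byte order is a guess.
  Return Nil
End Function
```

In the loading method from the first post it could be called as `Dim enc As TextEncoding = GuessEncoding(s, Encodings.DOSLatin1)`, asking the user to pick an encoding when the result is Nil. The fallback has to come from the caller because single-byte encodings such as DOSLatin1 and WindowsANSI cannot be told apart by validity checks alone.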