TextEncoding.IsValidData not working properly... alternatives ?

Tobias_Eichner · December 1, 2013, 3:45pm

Hi,

for data import (text data), I’m using the following piece of code to automatically determine the type of import file:

[code]If Encodings.UTF8.IsValidData(SampleLines) Then 'SampleLines As String; contains the first 100 lines of the file in question.
  StreamEncoding = "UTF8" 'StreamEncoding As String; used for internal processing.
ElseIf Encodings.UTF16LE.IsValidData(SampleLines) Then
  StreamEncoding = "UTF16LE"
ElseIf Encodings.UTF16BE.IsValidData(SampleLines) Then
  StreamEncoding = "UTF16BE"
ElseIf Encodings.UTF32LE.IsValidData(SampleLines) Then
  StreamEncoding = "UTF32LE"
ElseIf Encodings.UTF32BE.IsValidData(SampleLines) Then
  StreamEncoding = "UTF32BE"
ElseIf Encodings.ISOLatin9.IsValidData(SampleLines) Then
  StreamEncoding = "ISOLatin9"
ElseIf Encodings.ISOLatin1.IsValidData(SampleLines) Then
  StreamEncoding = "ISOLatin1"
ElseIf Encodings.ISOLatin2.IsValidData(SampleLines) Then
  StreamEncoding = "ISOLatin2"
ElseIf Encodings.ISOLatin3.IsValidData(SampleLines) Then
  StreamEncoding = "ISOLatin3"
ElseIf Encodings.ISOLatin4.IsValidData(SampleLines) Then
  StreamEncoding = "ISOLatin4"
ElseIf Encodings.ISOLatin5.IsValidData(SampleLines) Then
  StreamEncoding = "ISOLatin5"
ElseIf Encodings.ISOLatin6.IsValidData(SampleLines) Then
  StreamEncoding = "ISOLatin6"
ElseIf Encodings.ISOLatin7.IsValidData(SampleLines) Then
  StreamEncoding = "ISOLatin7"
ElseIf Encodings.ISOLatin8.IsValidData(SampleLines) Then
  StreamEncoding = "ISOLatin8"
Else
  // Do something else with the file (removed, since it does not belong to the problem).
End If[/code]

The order of the If/ElseIf branches above is deliberate: I found that Unicode text files also return TRUE when tested for ISOLatin9 (although they really contain characters from the Unicode table). ISOLatin9-encoded files return FALSE when tested for Unicode, as they should. Therefore, I have to check for Unicode first, then ISOLatin9 and so on.

Well… is there a more reliable way of detecting the encoding of a text file? On Unix/Linux (and also OS X, of course), the “file” command is quite reliable and much better than IsValidData. Unfortunately, I also need a Windows solution.

Do you have any ideas ? Maybe my code is wrong or there are other ways to manage this (please, only true RS/Xojo solutions, no third party stuff).

Thank you very much.
Tobias.

Tobias_Eichner · December 1, 2013, 4:04pm

Sorry, this was probably my fault.

After trying to read the BOM manually to at least get a reliable Unicode detection, I found that I did something stupid: I read the file in through a TextInputStream, not considering that this automatically converts to UTF-8… I redid the same thing using a BinaryStream and the detection was reliable.
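
A minimal sketch of such a BOM check on raw bytes, written in Python rather than Xojo purely for illustration. The helper names `detect_bom` and `detect_bom_in_file` are hypothetical, and the returned labels simply reuse the StreamEncoding strings from the first post. The point it shows is that the file must be opened in binary mode so that nothing is converted before the check.

```python
import codecs

# BOM signatures, longest first: the UTF-32LE BOM (FF FE 00 00) begins with
# the UTF-16LE BOM (FF FE), so UTF-32 has to be tested before UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF32LE"),
    (codecs.BOM_UTF32_BE, "UTF32BE"),
    (codecs.BOM_UTF8, "UTF8"),
    (codecs.BOM_UTF16_LE, "UTF16LE"),
    (codecs.BOM_UTF16_BE, "UTF16BE"),
]

def detect_bom(raw: bytes):
    """Return the encoding name implied by a leading BOM, or None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None

def detect_bom_in_file(path: str):
    # Binary mode: the bytes arrive exactly as stored on disk.
    with open(path, "rb") as f:
        return detect_bom(f.read(4))
```

A file without a BOM returns None here, so this only settles the Unicode cases that actually carry a BOM; everything else still needs a validity check or a guess.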

Have a nice weekend.
Tobias.

Kem_Tekinay · December 1, 2013, 4:57pm

Just FYI, the M_String module on my web site also includes M_Encoding, a variety of encoding-related functions, including analysis of a string to determine its best encoding with optional BOM detection.

Norman_P · December 1, 2013, 5:48pm

Files are just runs of bytes not “characters”, just bytes, and EVERY byte will be legal in EVERY single byte encoding - because they all allow from &h00 to &hFF (any single byte value)

Now you CAN determine if the encoding is NOT legal UTF8 etc since there are byte sequences that are illegal in certain encodings like UTF-8. UTF16 etc may also have values that are illegal.

And for certain bytes you simply cannot tell, because UTF-8, ASCII and most of the single-byte encodings all overlap from &h00 to &h7F, so every one of those checks will say “true”, this is UTF-8 or Latin-1 etc.
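
The same point can be sketched outside Xojo. The Python snippet below uses Python's own strict decoders as a stand-in for IsValidData (an assumption: the two implementations need not agree in every case), and `is_valid` is a hypothetical helper name.

```python
all_bytes = bytes(range(256))   # every possible byte value, &h00 to &hFF
ascii_only = bytes(range(128))  # only &h00 to &h7F

def is_valid(data: bytes, encoding: str) -> bool:
    """True if data decodes without error in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

print(is_valid(all_bytes, "latin-1"))   # True: Latin-1 assigns all 256 byte values
print(is_valid(all_bytes, "utf-8"))     # False: &h80 cannot start a UTF-8 sequence
print(is_valid(ascii_only, "utf-8"))    # True: pure ASCII is valid UTF-8
print(is_valid(ascii_only, "latin-1"))  # True: and valid Latin-1 as well
```

So a validity check can rule UTF-8 out, but for ASCII-only data it cannot pick a winner.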

For instance this simple bit of code in the open event of a listbox shows why this is a “guess”
This comes back as legal in UTF16BE, ISOLatin 9,1,2,3,4, and 5 (not UTF8 as I knew this would fail there)

[code]dim mb as new memoryblock( 256 )

for i as integer = 0 to 255
  mb.byte(i) = i
next

If Encodings.UTF8.IsValidData(mb) Then
  me.addrow "UTF8"
end if
If Encodings.UTF16LE.IsValidData(mb) Then
  me.addrow "UTF16LE"
end if
If Encodings.UTF16BE.IsValidData(mb) Then
  me.addrow "UTF16BE"
end if
If Encodings.UTF32LE.IsValidData(mb) Then
  me.addrow "UTF32LE"
end if
If Encodings.UTF32BE.IsValidData(mb) Then
  me.addrow "UTF32BE"
end if
If Encodings.ISOLatin9.IsValidData(mb) Then
  me.addrow "ISOLatin9"
end if
If Encodings.ISOLatin1.IsValidData(mb) Then
  me.addrow "ISOLatin1"
end if
If Encodings.ISOLatin2.IsValidData(mb) Then
  me.addrow "ISOLatin2"
end if
If Encodings.ISOLatin3.IsValidData(mb) Then
  me.addrow "ISOLatin3"
end if
If Encodings.ISOLatin4.IsValidData(mb) Then
  me.addrow "ISOLatin4"
end if
If Encodings.ISOLatin5.IsValidData(mb) Then
  me.addrow "ISOLatin5"
end if
If Encodings.ISOLatin6.IsValidData(mb) Then
  me.addrow "ISOLatin6"
end if
If Encodings.ISOLatin7.IsValidData(mb) Then
  me.addrow "ISOLatin7"
end if
If Encodings.ISOLatin8.IsValidData(mb) Then
  me.addrow "ISOLatin8"
end if[/code]

If you narrow the loop to only do the first 128 chars from &h00 to &h7F you get
UTF8, UTF16LE, UTF16BE, Latin 9,1,2,3,4,5

As long as you understand that what you THINK it is is JUST an educated guess then you should be OK

Tobias_Eichner · December 1, 2013, 8:15pm

@Norman: Sure, I understand. Does TextEncoding.IsValidData use an algorithm similar to the “file” command? I ask because I find the latter quite reliable.

Norman_P · December 1, 2013, 10:43pm

I’m not sure what you mean by the “File” command ?

J_Andrew_Lipscomb · December 2, 2013, 4:05am

Wait a minute. That test data is not valid UTF-16. It contains unpaired surrogates. If it’s reporting as valid in your test program, that’s a bug.
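
To make the surrogate argument concrete, here is a small Python sketch (not Xojo; `has_unpaired_surrogate` is a hypothetical helper) that walks the 16-bit code units of the 0 to 255 test data and looks for a surrogate without its partner. In big-endian order the data contains the units &hD8D9 and &hDADB back to back, which are two high surrogates in a row.

```python
def has_unpaired_surrogate(data: bytes, byteorder: str) -> bool:
    """Scan 16-bit code units and report any surrogate without its partner.

    byteorder is "big" for UTF-16BE or "little" for UTF-16LE.
    A trailing odd byte, if any, is ignored.
    """
    units = [int.from_bytes(data[i:i + 2], byteorder)
             for i in range(0, len(data) - 1, 2)]
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:          # high surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2                     # valid high/low pair
                continue
            return True                    # high surrogate without a low one
        if 0xDC00 <= u <= 0xDFFF:          # low surrogate with no high before it
            return True
        i += 1
    return False

test_data = bytes(range(256))              # same bytes as the MemoryBlock test
print(has_unpaired_surrogate(test_data, "big"))     # True
print(has_unpaired_surrogate(test_data, "little"))  # True
```

By this check the test data is not well-formed UTF-16 in either byte order.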

Norman_P · December 2, 2013, 4:35am

Copy & paste the code I posted into the default window open event of a brand new project and see for yourself
That may indeed be a bug

Tobias_Eichner · December 2, 2013, 2:15pm

@Norman: Regarding “file” command:

[code]FILE(1) BSD General Commands Manual FILE(1)

NAME
 file -- determine file type

SYNOPSIS
 file [-bcdDhiIkLnNprsvz] [--mime-type] [--mime-encoding] [-f namefile]
      [-m magicfiles] [-M magicfiles] file
 file -C [-m magicfiles]
 file [--help]

DESCRIPTION
 This manual page documents version 5.04 of the file command.

 file tests each argument in an attempt to classify it.  There are three
 sets of tests, performed in this order: filesystem tests, magic tests,
 and language tests.  The first test that succeeds causes the file type to
 be printed.

…
[/code]

I have found this to be quite reliable… so if RS/Xojo were able to implement something like it in a cross-platform way, it would be a great addition to the language.

Norman_P · December 2, 2013, 4:13pm

AH THAT command
Here I thought you meant in the Xojo language

Tobias_Eichner · December 3, 2013, 8:15pm

@Norman: I’m curious. How exactly does IsValidData work? For example, can it distinguish between Latin1 (ISO-8859-1) and Latin9 (ISO-8859-15), which are more or less identical (apart from the euro char and a few special chars)?

Norman_P · December 3, 2013, 9:48pm

There are only BYTES - characters are a consequence of knowing what encoding to apply to those bytes.
The character is a particular encoding applied to certain bytes that say “this byte value should map to this character”
That’s it
The wikipedia page here explains it very clearly
http://en.wikipedia.org/wiki/ISO/IEC_8859-15

&hA4 in 8859-1 is the generic currency sign (¤)
&hA4 in 8859-15 is the Euro character (€)
Same byte - different encodings different characters as a result
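
A two-line Python illustration of that (not Xojo; the variable names are just for the example): the byte is identical, only the decoding table changes.

```python
raw = b"\xa4"                           # the single byte &hA4

as_latin1 = raw.decode("iso-8859-1")    # U+00A4, generic currency sign
as_latin9 = raw.decode("iso-8859-15")   # U+20AC, euro sign

print(as_latin1)  # ¤
print(as_latin9)  # €
```

Both decodes succeed, so a validity check alone has nothing to distinguish the two encodings by.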

The upshot is “how can it know what you intend” ?

Tim_Hare · December 3, 2013, 10:09pm

If the FILE command appears to be doing a better job, it is because it is making some assumptions which may trend towards being true (eg., users of this environment tend to use Latin-1 as opposed to some other Latin encoding), but which are no more accurate than any other guess. Any sequence of bytes is valid in any single-byte encoding. There is absolutely no way to tell the difference. The best you can ever do is identify a byte sequence that isn’t valid in a particular multi-byte encoding.
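
A rough Python sketch of that strategy (not Xojo, and not how the file command or IsValidData is actually implemented): rule out what can be ruled out, then fall back to an assumption. `guess_encoding` and its `fallback` parameter are hypothetical names, and the default of Latin-1 is exactly the kind of environment assumption described above.

```python
def guess_encoding(raw: bytes, fallback: str = "ISOLatin1") -> str:
    """Best-effort guess: test the one encoding that can fail, else assume."""
    try:
        raw.decode("utf-8")   # strict decode: raises on any illegal byte sequence
        return "UTF8"
    except UnicodeDecodeError:
        # Every byte sequence is acceptable to a single-byte encoding such as
        # Latin-1, so this branch is an assumption, not a detection.
        return fallback

print(guess_encoding("Grüße".encode("utf-8")))    # UTF8
print(guess_encoding("Grüße".encode("latin-1")))  # ISOLatin1
```

Note that pure ASCII input is reported as UTF8, which is harmless because ASCII bytes mean the same thing in all the encodings discussed here.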

Norman_P · December 3, 2013, 10:16pm

I could swear I said that earlier