Thanks to Kem, I will forever associate him with encodings!
With that said, here is my question. When I load a text file into a TextInputStream, is there a way to determine what encoding it is? The data I am working with is a mix of UTF8 and UTF16. I am trying to normalize the process of reading the input and wasn't sure of the best way to detect the native encoding.
IsValidData only tells you whether or not the data can be represented with such an encoding. It doesn’t tell you which encoding was intended. For example:
[code]Var TestString As String = "Hello World"
TestString = TestString.ConvertEncoding(Encodings.UTF16LE).DefineEncoding(Nil)
Var IsValidUTF8 As Boolean = Encodings.UTF8.IsValidData(TestString)
Var IsValidUTF16LE As Boolean = Encodings.UTF16LE.IsValidData(TestString)
Var IsValidUTF16BE As Boolean = Encodings.UTF16BE.IsValidData(TestString)
Var IsValidUTF32LE As Boolean = Encodings.UTF32LE.IsValidData(TestString)
Var IsValidUTF32BE As Boolean = Encodings.UTF32BE.IsValidData(TestString)[/code]
IsValidUTF8, IsValidUTF16LE, and IsValidUTF16BE will all be True.
If I recall correctly, for UTF16, the first byte of each 16 bit character code will be a zero for the majority of characters. So a simple statistical test should be able to tell the difference between UTF8 and UTF16. I haven’t tested this, but you could give it a try:
[code]Public Function UTF8or16(t As TextInputStream) As String
  Dim txt As String = t.Read(100, Encodings.ASCII) ' read as bytes
  t.PositionB = 0 ' reset file pointer to beginning of file
  Dim nullRatio As Double = (CountFields(txt, Chr(0)) - 1) / LenB(txt)
  ' nullRatio will be close to zero for UTF8, and close to 0.5 for UTF16.
  If nullRatio > 0.25 Then
    Return "UTF16"
  Else
    Return "UTF8"
  End If
End Function[/code]
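A function like this could be called just before reading the file, so the stream is given an encoding up front. The following is an untested sketch: ReadTextFile is a hypothetical helper name, and it assumes that Encodings.UTF16 matches the byte order actually used in the file.

[code]Public Function ReadTextFile(f As FolderItem) As String
  ' Hypothetical helper: open the file, guess its encoding, then read it all.
  Dim t As TextInputStream = TextInputStream.Open(f)
  If UTF8or16(t) = "UTF16" Then
    ' Assumption: the file's byte order is what Encodings.UTF16 expects.
    t.Encoding = Encodings.UTF16
  Else
    t.Encoding = Encodings.UTF8
  End If
  Dim contents As String = t.ReadAll
  t.Close
  Return contents
End Function[/code]

Because UTF8or16 resets the file pointer before returning, ReadAll still starts from the beginning of the file.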
[quote=475670:@Robert Weaver]If I recall correctly, for UTF16, the first byte of each 16 bit character code will be a zero for the majority of characters. So a simple statistical test should be able to tell the difference between UTF8 and UTF16. I haven’t tested this, but you could give it a try:
[code]Public Function UTF8or16(t As TextInputStream) As String
  Dim txt As String = t.Read(100, Encodings.ASCII) ' read as bytes
  t.PositionB = 0 ' reset file pointer to beginning of file
  Dim nullRatio As Double = (CountFields(txt, Chr(0)) - 1) / LenB(txt)
  ' nullRatio will be close to zero for UTF8, and close to 0.5 for UTF16.
  If nullRatio > 0.25 Then
    Return "UTF16"
  Else
    Return "UTF8"
  End If
End Function[/code]
[/quote]
The byte order depends entirely on endianness… by definition.
Byte order won’t matter in the routine I posted. It just counts zero bytes wherever they occur.
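If the byte order ever does need to be known, the same counting idea could be extended by tracking where the zero bytes fall. For mostly Latin text, UTF16LE puts the zero in the second byte of each pair (odd offsets) and UTF16BE puts it in the first (even offsets). This is an untested sketch; GuessUTF16ByteOrder is a hypothetical name, and it assumes the raw bytes start on a character boundary.

[code]Public Function GuessUTF16ByteOrder(raw As String) As String
  ' Hypothetical helper: compare zero-byte counts at even and odd offsets.
  If LenB(raw) = 0 Then Return "unknown"
  Dim mb As MemoryBlock = raw ' String converts to MemoryBlock implicitly
  Dim evenZeros, oddZeros As Integer
  For i As Integer = 0 To mb.Size - 1
    If mb.Byte(i) = 0 Then
      If i Mod 2 = 0 Then
        evenZeros = evenZeros + 1
      Else
        oddZeros = oddZeros + 1
      End If
    End If
  Next
  If oddZeros > evenZeros Then
    Return "UTF16LE" ' zero is the second byte of each pair
  ElseIf evenZeros > oddZeros Then
    Return "UTF16BE" ' zero is the first byte of each pair
  Else
    Return "unknown"
  End If
End Function[/code]

Text that is mostly non-Latin will have few zero bytes in either position, so the result is only a hint.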
@Kem Tekinay I interpreted the original post to mean that he receives two different types of files, either UTF8 or UTF16, not a mix of encodings in a single file.
I should have clarified… the source text data I am working with consists of files that are each either UTF8 or UTF16. I am going to use the advice above to work out the logic for handling each file.