encodings = don't worry be happy

Thanks to Kem, I will forever associate him with encodings!

With that said, here is my question. When I load a text file into a TextInputStream, is there a way to determine what encoding it is? The data I am working with is a mix of UTF8 and UTF16. I am trying to normalize the process of reading the input and wasn't sure of the best way to detect the native encoding.

Any help will be appreciated.

From here: determining a text files character set. - General - Xojo Programming Forum

There is a method, IsValidData, that you can call on Encodings.UTF8 (or UTF16). IsValidData will tell you whether your string can be UTF8 or UTF16.

https://documentation.xojo.com/api/text/encoding_text/textencoding.html#textencoding-isvaliddata

IsValidData only tells you whether or not the data can be represented with such an encoding. It doesn’t tell you which encoding was intended. For example:

[code]Var TestString As String = "Hello World"
TestString = TestString.ConvertEncoding(Encodings.UTF16LE).DefineEncoding(Nil)

Var IsValidUTF8 As Boolean = Encodings.UTF8.IsValidData(TestString)
Var IsValidUTF16LE As Boolean = Encodings.UTF16LE.IsValidData(TestString)
Var IsValidUTF16BE As Boolean = Encodings.UTF16BE.IsValidData(TestString)
Var IsValidUTF32LE As Boolean = Encodings.UTF32LE.IsValidData(TestString)
Var IsValidUTF32BE As Boolean = Encodings.UTF32BE.IsValidData(TestString)[/code]

IsValidUTF8, IsValidUTF16LE, and IsValidUTF16BE will all be True.

If I recall correctly, for UTF16, the first byte of each 16 bit character code will be a zero for the majority of characters. So a simple statistical test should be able to tell the difference between UTF8 and UTF16. I haven’t tested this, but you could give it a try:

Public Function UTF8or16(t As TextInputStream) as String
  dim txt As string = t.Read(100,Encodings.ASCII) 'read as bytes
  t.PositionB=0 'reset file pointer to beginning of file
  dim nullRatio as double = (CountFields(txt,chr(0))-1)/lenb(txt)
  'nullRatio will be close to zero for UTF8, and close to 0.5 for UTF16.
  if nullRatio > 0.25 then
    return "UTF16"
  Else
    return "UTF8"
  end if
End Function

[quote=475670:@Robert Weaver]If I recall correctly, for UTF16, the first byte of each 16 bit character code will be a zero for the majority of characters. So a simple statistical test should be able to tell the difference between UTF8 and UTF16. I haven’t tested this, but you could give it a try:

Public Function UTF8or16(t As TextInputStream) as String
  dim txt As string = t.Read(100,Encodings.ASCII) 'read as bytes
  t.PositionB=0 'reset file pointer to beginning of file
  dim nullRatio as double = (CountFields(txt,chr(0))-1)/lenb(txt)
  'nullRatio will be close to zero for UTF8, and close to 0.5 for UTF16.
  if nullRatio > 0.25 then
    return "UTF16"
  Else
    return "UTF8"
  end if
End Function [/quote]

The byte order depends entirely on endianness… by definition.

If the file mixes encodings, I’d call that a binary file and get the structure from its author.

Byte order won’t matter in the routine I posted. It just counts zero bytes wherever they occur.
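
If the byte order is needed later on (for example, to choose between Encodings.UTF16LE and Encodings.UTF16BE), the same zero-counting idea can be split by byte position. What follows is a rough, untested sketch: GuessUTF16ByteOrder is a made-up name, not part of Xojo or of the routine above, and it assumes mostly Latin text, where one byte of each 16 bit code unit is zero.

[code]Public Function GuessUTF16ByteOrder(data As String) As String
  ' Hypothetical helper: counts zero bytes at even and odd offsets.
  ' Assumes mostly Latin text, so one byte of each code unit is zero.
  Var mb As MemoryBlock = data ' look at the raw bytes, ignoring the encoding
  Var evenZeros, oddZeros As Integer
  For i As Integer = 0 To mb.Size - 1
    If mb.Byte(i) = 0 Then
      If i Mod 2 = 0 Then
        evenZeros = evenZeros + 1
      Else
        oddZeros = oddZeros + 1
      End If
    End If
  Next
  If evenZeros > oddZeros Then
    Return "UTF16BE" ' big-endian: the zero high byte comes first
  ElseIf oddZeros > evenZeros Then
    Return "UTF16LE" ' little-endian: the zero high byte comes second
  Else
    Return "unknown" ' no clear majority, e.g. UTF8 or non-Latin text
  End If
End Function[/code]

Passing in the first hundred or so bytes of the file should be enough for typical text. A file that is mostly non-Latin characters would need a different test.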

@Kem Tekinay I interpreted the original post to mean that he receives two different types of files, either UTF8 or UTF16, not a mix of encodings in a single file.

Thank you all!

I should have clarified… the source text data I am working with consists of files that are each either UTF8 or UTF16. I am going to use the above advice to work out the logic for handling each file.
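
One way that logic could look, as a sketch only: ReadAsUTF8 is a hypothetical name, it relies on the UTF8or16 function posted above, and it assumes the UTF16 files are little-endian (swap in Encodings.UTF16BE if they are not).

[code]Public Function ReadAsUTF8(f As FolderItem) As String
  ' Hypothetical wrapper around the UTF8or16 routine from this thread.
  Var t As TextInputStream = TextInputStream.Open(f)
  Var kind As String = UTF8or16(t) ' resets the stream to the start itself
  Var result As String
  If kind = "UTF16" Then
    ' Assumption: little-endian UTF16. Tag the bytes, then convert them.
    result = t.ReadAll(Encodings.UTF16LE).ConvertEncoding(Encodings.UTF8)
  Else
    ' Already UTF8: tag the bytes so later string operations behave.
    result = t.ReadAll(Encodings.UTF8)
  End If
  t.Close
  Return result
End Function[/code]

Everything downstream can then treat the text as UTF8. A byte order mark at the start of a file, if there is one, is not stripped here and would need handling separately.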

My M_String package has a function that will help you.