encodings = don't worry be happy

  1. 7 weeks ago

    Rich H

    Feb 16 Pre-Release Testers, Xojo Pro

    Thanks to Kem, I will forever associate him with encodings!

    With that said, here is my question. When I load a text file into a TextInputStream, is there a way to determine what encoding it is? The data I am working with is a mix of UTF8 and UTF16. I am trying to normalize the process of reading the input and wasn't sure of the best way to detect the native encoding.

    Any help will be appreciated.

  2. Jean-Yves P

    Feb 16 Pre-Release Testers, Xojo Pro Europe (France, Besançon)
    Edited 7 weeks ago

    From here: https://forum.xojo.com/51984-determining-a-text-files-character-set/0

    There is a method, IsValidData, that you can call on Encodings.UTF8 (or UTF16). It will tell you whether your string can be UTF8 or UTF16.

    https://docs.xojo.com/TextEncoding.IsValidData

  3. Thom M

    Feb 16 Pre-Release Testers Greater Hartford Area, CT

    IsValidData only tells you whether or not the data can be represented with such an encoding. It doesn't tell you which encoding was intended. For example:

    Var TestString As String = "Hello World"
    TestString = TestString.ConvertEncoding(Encodings.UTF16LE).DefineEncoding(Nil)
    
    Var IsValidUTF8 As Boolean = Encodings.UTF8.IsValidData(TestString)
    Var IsValidUTF16LE As Boolean = Encodings.UTF16LE.IsValidData(TestString)
    Var IsValidUTF16BE As Boolean = Encodings.UTF16BE.IsValidData(TestString)
    Var IsValidUTF32LE As Boolean = Encodings.UTF32LE.IsValidData(TestString)
    Var IsValidUTF32BE As Boolean = Encodings.UTF32BE.IsValidData(TestString)

    IsValidUTF8, IsValidUTF16LE, and IsValidUTF16BE will all be True.
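
    A rough sketch of why: the UTF16LE form of "Hello World" should be 22 bytes (11 characters at 2 bytes each), alternating an ASCII byte with a zero byte. Each of those bytes is also legal UTF8, and read as big-endian pairs they are still legal UTF16 code units, but 22 is not a multiple of 4, so the data cannot be UTF32. The snippet below only logs the byte count; the variable name Raw is made up for illustration.

    ```xojo
    Var TestString As String = "Hello World"
    Var Raw As String = TestString.ConvertEncoding(Encodings.UTF16LE)

    // 11 characters x 2 bytes each, assuming no byte order mark is added:
    // 48 00 65 00 6C 00 6C 00 6F 00 20 00 ...
    System.DebugLog(Raw.Bytes.ToString)
    ```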

  4. Robert W

    Feb 16 Western Canada

    If I recall correctly, for UTF16, the first byte of each 16 bit character code will be a zero for the majority of characters. So a simple statistical test should be able to tell the difference between UTF8 and UTF16. I haven't tested this, but you could give it a try:

    Public Function UTF8or16(t As TextInputStream) as String
      dim txt As string = t.Read(100,Encodings.ASCII) 'read as bytes
      t.PositionB=0 'reset file pointer to beginning of file
      dim nullRatio as double = (CountFields(txt,chr(0))-1)/lenb(txt)
      'nullRatio will be close to zero for UTF8, and close to 0.5 for UTF16.
      if nullRatio > 0.25 then
        return "UTF16"
      Else
        return "UTF8"
      end if
    End Function

  5. Thom M

    Feb 17 Pre-Release Testers Greater Hartford Area, CT

    @Robert W If I recall correctly, for UTF16, the first byte of each 16 bit character code will be a zero for the majority of characters.

    The byte order depends entirely on endianness... by definition.

  6. Kem T

    Feb 17 Pre-Release Testers, Xojo Pro, XDC Speakers, MVP Connecticut

    If the file mixes encodings, I'd call that a binary file and get the structure from its author.

  7. Robert W

    Feb 17 Western Canada
    Edited 7 weeks ago

    @Thom M The byte order depends entirely on endianness... by definition.

    Byte order won't matter in the routine I posted. It just counts zero bytes wherever they occur.

    @Kem T I interpreted the original post to mean that he receives two different types of files, either UTF8 or UTF16, not a mix of encodings in a single file.
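
    On the byte-order point: the same zero-byte count can also hint at endianness, since big-endian UTF16 puts the (usually zero) high byte at even offsets and little-endian puts it at odd offsets. A minimal, untested sketch, assuming mostly Latin text and ignoring any byte order mark; GuessEncoding is a hypothetical helper, not part of Xojo or M_String:

    ```xojo
    Public Function GuessEncoding(sample As String) As TextEncoding
      // Hypothetical helper: guesses from zero-byte positions only.
      Var mb As MemoryBlock = sample
      If mb.Size = 0 Then Return Encodings.UTF8

      Var evenNulls, oddNulls As Integer
      For i As Integer = 0 To mb.Size - 1
        If mb.Byte(i) = 0 Then
          If i Mod 2 = 0 Then
            evenNulls = evenNulls + 1
          Else
            oddNulls = oddNulls + 1
          End If
        End If
      Next

      // Same 0.25 threshold as the routine posted earlier.
      If (evenNulls + oddNulls) / mb.Size <= 0.25 Then Return Encodings.UTF8

      // Big-endian stores the high byte first, so its zeros sit at even offsets.
      If evenNulls > oddNulls Then Return Encodings.UTF16BE
      Return Encodings.UTF16LE
    End Function
    ```

    Like the original routine, this is a statistical guess: text that is mostly non-Latin will have few zero bytes in UTF16 and would be reported as UTF8.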

  8. Rich H

    Feb 17 Pre-Release Testers, Xojo Pro

    Thank you all!

    I should have clarified: the source text data I am working with consists of files that are either UTF8 or UTF16. I am going to take the above advice and work on logic to handle each file.

  9. Kem T

    Feb 17 Pre-Release Testers, Xojo Pro, XDC Speakers, MVP Connecticut

    My M_String package has a function that will help you.
