Detect UTF16 (LE) without BOM

Hello,

I’ve written a method that looks for byte order marks in text files. If no BOM is found, I want to fall back to a second method that tries to detect the file’s encoding from its content. I’d like to explain my approach using a UTF-16 (LE) file without BOM as an example.

My plan is to read the first 1024 bytes of the file with Xojo.IO.BinaryStream and, for this specific example, search those bytes for line breaks encoded in UTF-16 (LE), for instance CR LF, which is encoded in UTF-16 (LE) as 0D 00 0A 00. My question: how exactly can I find this 4-byte sequence? Does anyone have an example of this, please?

Thank you very much!

I would use BinaryStream, read the bytes into a string, and use InStrB() with Chr(13)+Chr(0)+Chr(10)+Chr(0) to find it.
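
For illustration, here is a minimal sketch of that suggestion. It assumes the classic framework (classic FolderItem, BinaryStream and String rather than Xojo.IO), and FindUTF16LECrLf is a hypothetical name chosen for this example. It uses ChrB instead of Chr so that the pattern is built from raw bytes.

[code]' Sketch only: classic framework; FindUTF16LECrLf is a hypothetical name.
Function FindUTF16LECrLf(f As FolderItem) As Integer
  ' Read up to the first 1024 bytes as a raw byte string.
  Dim b As BinaryStream = BinaryStream.Open(f, False)
  Dim data As String = b.Read(1024)
  b.Close

  ' CR LF in UTF-16 (LE) is the byte sequence 0D 00 0A 00.
  Dim pattern As String = ChrB(13) + ChrB(0) + ChrB(10) + ChrB(0)

  ' InStrB compares bytes and returns the 1-based byte position
  ' of the first match, or 0 if the pattern is not found.
  Return InStrB(data, pattern)
End Function[/code]

Because InStrB positions are 1-based, a match that starts on a 16-bit character boundary has an odd position. A match at an even position straddles two characters and should probably not be counted.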

I had the same idea. Please look at my code. The If…Then condition in DetectEncoding doesn’t work correctly. Why? So far I have only added the conditions for the Unix and Windows EOL delimiters.

[code]Using Xojo.Core
Using Xojo.IO

Private Function HasDelimiter(Extends value As Text, delimiter As Text) As Boolean
  Return value.IndexOf(delimiter) > -1
End Function

Private Function CreateByteMap(data As MemoryBlock) As Text
  Dim result() As Text
  Dim count As Integer = data.Size - 1
  For i As Integer = 0 To count
    Dim currentByte As UInt8 = data.UInt8Value(i)
    result.Append(currentByte.ToHex(2))
  Next
  Return Text.Join(result, "")
End Function

Function DetectEncoding(f As FolderItem) As TextEncoding
  ' open file
  Dim b As BinaryStream = BinaryStream.Open(f, BinaryStream.LockModes.Read)
  ' read 1024 bytes
  Dim data As MemoryBlock = b.Read(If(b.Length < 1024, b.Length, 1024))
  ' close file
  b.Close

  ' create byte map
  Dim byteMap As Text = CreateByteMap(data)
  Dim enc As TextEncoding

  ' check for specific EOL delimiter
  ' this condition doesn't work correctly. Why?
  If byteMap.HasDelimiter("000D000A") Or byteMap.HasDelimiter("000A") Then
    enc = TextEncoding.UTF16BigEndian
  Elseif byteMap.HasDelimiter("0D000A00") Or byteMap.HasDelimiter("0A00") Then
    enc = TextEncoding.UTF16LittleEndian
  Else
    enc = TextEncoding.UTF8
  End If

  Return enc
End Function[/code]

Did you check in the debugger what the byte map looks like?
Also, is HasDelimiter working, or should you maybe just use InStr?

[quote=381981:@Christian Schmitz]Did you check in the debugger what the byte map looks like?
Also, is HasDelimiter working, or should you maybe just use InStr?[/quote]
Yes I did, byteMap looks exactly like the content of the HexViewer. I added the HasDelimiter Function in the code above!

So you see the 000D000A and InStr doesn’t find it?

Yes, I see it. Weird. I have two test files with the same content:

Line 1
Line 2

The hex dump of the UTF-16 (BE) file looks like this:

004C 0069 006E 0065 0020 0031 000A 004C 0069 006E 0065 0020 0032

The hex code of the UTF-16 (LE) looks like this:

4C00 6900 6E00 6500 2000 3100 0A00 4C00 6900 6E00 6500 2000 3200

The LE version is detected as TextEncoding.UTF16BigEndian. Why?

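Because the byte map often contains 000A00. In the LE file, the 00 that ends the previous character is followed by 0A00, so a search for 000A also matches there, across a character boundary.

So you need to find 0A00 at an even byte position to make it 16LE and 000A at an even byte position for 16BE.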
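
The alignment check could also be done directly on the bytes, without a hex string. The following is only a sketch under stated assumptions: it uses the same Xojo.Core types as the code above, GuessUTF16ByteOrder is a hypothetical helper name, and it looks only for a line feed, which covers both Unix and Windows line endings.

[code]' Sketch only: GuessUTF16ByteOrder is a hypothetical helper name.
' Returns Nil when no aligned line feed is found in the data.
Function GuessUTF16ByteOrder(data As Xojo.Core.MemoryBlock) As Xojo.Core.TextEncoding
  Dim size As Integer = data.Size

  ' Visit whole 16-bit code units only (byte offsets 0, 2, 4, ...),
  ' so a match can never straddle two characters.
  For i As Integer = 0 To size - 2 Step 2
    Dim first As UInt8 = data.UInt8Value(i)
    Dim second As UInt8 = data.UInt8Value(i + 1)

    If (first = &h0A) And (second = &h00) Then
      ' 0A 00: line feed with the low byte first
      Return Xojo.Core.TextEncoding.UTF16LittleEndian
    End If

    If (first = &h00) And (second = &h0A) Then
      ' 00 0A: line feed with the high byte first
      Return Xojo.Core.TextEncoding.UTF16BigEndian
    End If
  Next

  Return Nil
End Function[/code]

A caller would treat Nil as "no UTF-16 line break found" and fall back to another guess such as UTF-8. This remains a heuristic: a file without any line break in the first 1024 bytes is not recognized.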

[quote=381991:@Christian Schmitz]Because it’s often 000A00 in the code.

So you need to find 0A00 at an even position to make it 16LE and 000A at an even position for 16BE[/quote]
OK, it sounds like I have to modify the HasDelimiter function? Is Modulo the right direction? I hate maths :smiley: Or did you mean another part of my code?

You could simply change your byte map function to add an extra space after every 2 bytes.

Then search for “0A00” to find it. Because of the spaces, a match can no longer straddle two characters.

That’s a good tip. Looks like it’s working flawlessly. Thank you very much.
In the CreateByteMap function I only had to insert an additional line.

[code]Private Function CreateByteMap(data As MemoryBlock) As Text
  Dim result() As Text
  Dim count As Integer = data.Size - 1
  For i As Integer = 0 To count
    Dim currentByte As UInt8 = data.UInt8Value(i)
    result.Append(currentByte.ToHex(2))
    ' add extra space after each 2 bytes
    If i Mod 2 <> 0 Then result.Append(" ")
  Next
  Return Text.Join(result, "")
End Function[/code]