I've written a method that looks for byte order marks (BOMs) in text files. If no BOM is found, I want to fall back to a second method that tries to detect the file's encoding from its content. I'll use a UTF-16 (LE) file without a BOM as an example to explain my approach.
First I would read, say, the first 1024 bytes of the file using Xojo.IO.BinaryStream. My idea, for this specific example only, is to search those bytes for line breaks encoded in UTF-16 (LE), for example CR LF line breaks, which are encoded in UTF-16 (LE) as 0D 00 0A 00. So my question is: how exactly can I find this 4-byte sequence? Does anyone have an example, please?
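One way to look for a fixed byte sequence such as 0D 00 0A 00 is to walk the MemoryBlock and compare byte by byte. The following is only a minimal sketch: the function name FindByteSequence and its parameters are made up for illustration, and it assumes the bytes are already in a MemoryBlock as described above.

```xojo
Private Function FindByteSequence(data As MemoryBlock, pattern() As Integer) As Integer
  ' Hypothetical helper: returns the byte offset of the first match, or -1.
  Dim patternLength As Integer = pattern.Ubound + 1
  If patternLength = 0 Or data.Size < patternLength Then Return -1

  Dim last As Integer = data.Size - patternLength
  For offset As Integer = 0 To last
    Dim matched As Boolean = True
    For j As Integer = 0 To pattern.Ubound
      If data.UInt8Value(offset + j) <> pattern(j) Then
        matched = False
        Exit ' leave the inner loop on the first mismatch
      End If
    Next
    If matched Then Return offset
  Next

  Return -1
End Function
```

To search for CR LF in UTF-16 (LE), the pattern array would hold &h0D, &h00, &h0A, &h00 (for example, added one by one with Append). A result of -1 means the sequence was not found in the bytes that were read.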
I had the same idea. Please look at my code. The If…Then condition in DetectEncoding doesn't work well. Why? So far I have only added the conditions for the Unix and Windows EOL delimiters.
Private Function HasDelimiter(Extends value As Text, delimiter As Text) As Boolean
  ' True if the delimiter occurs anywhere in the text
  Return value.IndexOf(delimiter) > -1
End Function
Private Function CreateByteMap(data As MemoryBlock) As Text
  ' build a hex string with two digits per byte
  Dim result() As Text
  Dim count As Integer = data.Size - 1
  For i As Integer = 0 To count
    Dim currentByte As UInt8 = data.UInt8Value(i)
    result.Append(currentByte.ToHex(2))
  Next
  Return Text.Join(result, "")
End Function
Function DetectEncoding(f As FolderItem) As TextEncoding
  ' open file
  Dim b As BinaryStream = BinaryStream.Open(f, BinaryStream.LockModes.Read)
  ' read at most 1024 bytes
  Dim data As MemoryBlock = b.Read(If(b.Length < 1024, b.Length, 1024))
  ' close file
  b.Close
  ' create byte map
  Dim byteMap As Text = CreateByteMap(data)
  Dim enc As TextEncoding
  ' check for specific EOL delimiters
  ' this condition doesn't work correctly. Why?
  If byteMap.HasDelimiter("000D000A") Or byteMap.HasDelimiter("000A") Then
    enc = TextEncoding.UTF16BigEndian
  ElseIf byteMap.HasDelimiter("0D000A00") Or byteMap.HasDelimiter("0A00") Then
    enc = TextEncoding.UTF16LittleEndian
  Else
    enc = TextEncoding.UTF8
  End If
  Return enc
End Function
[quote=381981:@Christian Schmitz]Did you check in the debugger what the byte map looks like?
Also, is HasDelimiter working, or should you maybe just use InStr?[/quote]
Yes, I did. byteMap looks exactly like the content in the hex viewer. I added the HasDelimiter function to the code above!
[quote=381991:@Christian Schmitz]Because it's often 000A00 in the code.
So you need to find 0A00 at an even position to make it 16LE and 000A at an even position for 16BE[/quote]
OK, sounds like I have to modify the HasDelimiter function? Is Modulo the right direction? I hate maths. Or did you mean another part of my code?
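The "even position" idea can also be checked directly on the bytes instead of on the hex string. The sketch below is only an illustration under that assumption: CountLineFeeds is a hypothetical name, not something from the code above. Stepping through the buffer two bytes at a time has the same effect as testing the offset with Mod 2 = 0, so a sequence like 00 0A 00 is only ever read as one aligned pair.

```xojo
Private Function CountLineFeeds(data As MemoryBlock, littleEndian As Boolean) As Integer
  ' Hypothetical helper: counts LF code units that start at an even byte offset.
  If data.Size < 2 Then Return 0

  Dim hits As Integer = 0
  Dim last As Integer = data.Size - 2
  For i As Integer = 0 To last Step 2
    Dim first As UInt8 = data.UInt8Value(i)
    Dim second As UInt8 = data.UInt8Value(i + 1)
    If littleEndian Then
      ' LF in UTF-16 LE is 0A 00
      If first = &h0A And second = &h00 Then hits = hits + 1
    Else
      ' LF in UTF-16 BE is 00 0A
      If first = &h00 And second = &h0A Then hits = hits + 1
    End If
  Next

  Return hits
End Function
```

Calling it once with littleEndian set to True and once with False gives two counts to compare. A clearly higher count for one byte order is a hint, not proof, that the file uses that encoding.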
That's a good tip. It looks like it's working flawlessly. Thank you very much.
I only had to insert one additional line in the CreateByteMap function.
Private Function CreateByteMap(data As MemoryBlock) As Text
  Dim result() As Text
  Dim count As Integer = data.Size - 1
  For i As Integer = 0 To count
    Dim currentByte As UInt8 = data.UInt8Value(i)
    result.Append(currentByte.ToHex(2))
    ' add an extra space after every 2 bytes
    If i Mod 2 <> 0 Then result.Append(" ")
  Next
  Return Text.Join(result, "")
End Function
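With this grouping the byte map looks like "0D00 0A00 ", so a 4-digit pattern such as "0A00" can only match one whole, aligned group. One thing to keep in mind: the 8-digit patterns without a space ("0D000A00", "000D000A") can no longer match and would need a space in the middle. A rough sketch of the adjusted check follows. GuessUTF16ByteOrder is a hypothetical helper, and it returns a plain label instead of a TextEncoding only to keep the example small.

```xojo
Private Function GuessUTF16ByteOrder(groupedMap As Text) As Text
  ' Hypothetical helper; expects the space-grouped output of CreateByteMap,
  ' for example "0D00 0A00 ".
  ' Each group is one aligned 2-byte unit, so a 4-digit pattern cannot match
  ' the end of one group plus the start of the next.
  If groupedMap.IndexOf("000D 000A") > -1 Or groupedMap.IndexOf("000A") > -1 Then
    Return "UTF-16 BE"
  ElseIf groupedMap.IndexOf("0D00 0A00") > -1 Or groupedMap.IndexOf("0A00") > -1 Then
    Return "UTF-16 LE"
  End If
  ' no aligned line break found
  Return ""
End Function
```

An empty result means no aligned line break was found in the bytes that were read, so the caller would fall back to another guess, such as UTF-8 in the code above.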