Create a UTF-8 file with BOM

How can I use TextOutputStream to create a UTF-8 text file with a BOM?

In Delphi, I would do:

st := TStreamWriter.Create(filepath, false, TEncoding.UTF8);
st.WriteLine(...);
st.Close;

How can I do the same in Xojo? The TextOutputStream does not have an encodings parameter.

By default, text in a TextField / TextArea is UTF-8.

You can also force your text to UTF-8 encoding and save that.

Search the LR (Language Reference) for Encoding/Encodings for the syntax.

Well you can convert any string to UTF-8 before writing it to the file (UTF8 is the default encoding anyway!)

dim utf8string as String = ConvertEncoding( myString, Encodings.UTF8 )

Prefix the BOM manually to the file before writing your UTF-8 string. Xojo won’t do that for you.

I have BOM tools in M_String, btw.

I think you can just do a write as the first thing after creating file:

t.Write encodings.UTF8.Chr(&hFEFF)

[quote=441856:@Christian Schmitz]I think you can just do a write as the first thing after creating file:

t.Write encodings.UTF8.Chr(&hFEFF)[/quote]

I already tried that, and the first line is then not UTF-8 but Unicode, with lots of null characters. Starting with line #2, the text is OK again.

You need to write the BOM if you want it.
Then for all other write calls, always use ConvertEncoding to make sure it’s a UTF-8 string.

Just writing some string may not give the right encoding.

[quote=441862:@Christian Schmitz]You need to write the BOM if you want it.
Then for all other write calls, always use ConvertEncoding to make sure it’s a UTF-8 string.

Just writing some string may not give the right encoding.[/quote]

Isn’t the UTF-8 BOM 0xEF 0xBB 0xBF and not 0xFE 0xFF?

Isn’t UTF-8 part of Unicode?

Yes

FEFF is one of the UTF-16 BOMs (UTF-16 big endian).

FEFF is the BOM for UTF-16; the BOM for UTF-8 is EFBBBF. How do I write those 3 bytes to the TextOutputStream?

Well, it’s &hFEFF for the magic character. Depending on the encoding, it’ll be FE FF for UTF-16 and EF BB BF for UTF-8.
But you don’t need to know those details of the byte representations.

Try it:

Dim s As String = encodings.UTF8.Chr(&hFEFF)
MsgBox EncodeHex(s)

Protected Function BOMUTF8() as String
  static r as string = DefineEncoding( ChrB( &hEF ) + ChrB( &hBB ) + ChrB( &hBF ), nil ) // If you define it as the encoding, you can't properly add it to a string
  return r
End Function
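
A minimal sketch of how those three bytes could be written ahead of the text, assuming the raw (nil-encoding) bytes pass through TextOutputStream.Write unchanged, as this thread suggests. The file name and location are placeholders, and the BOM is built inline here with the same bytes the function above returns:

```xojo
// Sketch only: the file name and location are placeholders.
Dim f As FolderItem = SpecialFolder.Desktop.Child("utf8-with-bom.txt")
Dim t As TextOutputStream = TextOutputStream.Create(f)

// Write the three raw BOM bytes (EF BB BF) before anything else.
// ChrB builds single bytes, so no encoding conversion is involved.
t.Write(ChrB(&hEF) + ChrB(&hBB) + ChrB(&hBF))

// Convert every later string to UTF-8 before writing it.
Dim s As String = "First line of text"
t.WriteLine(ConvertEncoding(s, Encodings.UTF8))
t.Close
```

Opening the resulting file in a hex editor is a quick way to check that it really starts with EF BB BF.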

[quote=441870:@Christian Schmitz]Well, it’s &hFEFF for the magic character. Depending on the encoding, it’ll be FE FF for UTF-16 and EF BB BF for UTF-8.
But you don’t need to know those details of the byte representations.

Try it:

Dim s As String = encodings.UTF8.Chr(&hFEFF)
MsgBox EncodeHex(s)[/quote]

Except that you get the same BOM for UTF16, UTF16LE and UTF16BE, which is wrong:

Dim utfbom As String 

utfbom = encodings.UTF8.Chr(&hFEFF)
textarea1.appendtext "UTF8 - " + EncodeHex(utfbom) + EndOfLine

utfbom = encodings.UTF16.Chr(&hFEFF)
textarea1.appendtext "UTF16 - " + EncodeHex(utfbom) + EndOfLine

utfbom = encodings.UTF16BE.Chr(&hFEFF)
textarea1.appendtext "UTF16BE - " + EncodeHex(utfbom) + EndOfLine

utfbom = encodings.UTF16LE.Chr(&hFEFF)
textarea1.appendtext "UTF16LE - " + EncodeHex(utfbom) + EndOfLine

Norman, that’s not a bug as BE/LE is handled later when writing data.

Xojo even gives 4100 back for encodings.UTF16BE.Chr(65)

Seems like the Xojo string is always LE.

So it seems it is not easily possible to just write ordinary UTF-8 files with the BOM EF BB BF in Xojo?

What happened when you saved text from a TextArea (with UTF-8 encoding)?

Did you try that?

No, as you can see here:

https://www.utf8-chartable.de/unicode-utf8-table.pl

On the contrary, it’s quite easy, you just have to do it yourself. Write the BOM (I supplied the code above), then write your text. When reading back, strip the BOM, then define the encoding of the rest as UTF-8.
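
The read-back half could look roughly like this. It is a sketch, not a tested implementation: it assumes the whole file fits in memory, and the file name is a placeholder:

```xojo
// Sketch only: the file name and location are placeholders.
Dim f As FolderItem = SpecialFolder.Desktop.Child("utf8-with-bom.txt")
Dim t As TextInputStream = TextInputStream.Open(f)
Dim raw As String = t.ReadAll
t.Close

// Compare the first three bytes against the UTF-8 BOM (EF BB BF).
If EncodeHex(LeftB(raw, 3)) = "EFBBBF" Then
  raw = MidB(raw, 4) // drop the three BOM bytes; MidB is 1-based
End If

// DefineEncoding only labels the bytes as UTF-8; it does not convert them.
Dim content As String = DefineEncoding(raw, Encodings.UTF8)
```

LeftB and MidB work on bytes rather than characters, which is why they are used for the BOM check instead of Left and Mid.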

Usually, I read from the official owner, never elsewhere (elsewhere can be wrong: read what some say about the ASCII table, about the character values from 129 through 256)…

UTF-8 - Wikipedia (not the owner, and can be wrong too):

UTF-8 is a variable width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes. The encoding is defined by the Unicode Standard, and was originally designed by Ken Thompson and Rob Pike. The name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.