Create a UTF-8 file with BOM

Guenter_Kraemer · June 18, 2019, 1:00pm

How can I use TextOutputStream to create a UTF-8 text file with a BOM?

In Delphi, I would do:

st := TStreamWriter.Create(filepath, false, TEncoding.UTF8);
st.WriteLine(...);
st.Close;

How can I do the same in Xojo? The TextOutputStream does not have an encoding parameter.

Emile_Schwarz · June 18, 2019, 1:12pm

By default, text in a TextField / TextArea is UTF-8.

Now, you can force your text to UTF-8 encoding and save that.

Search the LR (Language Reference) for Encoding/Encodings for the syntax.

Stéphane_Mons · June 18, 2019, 1:13pm

Well, you can convert any string to UTF-8 before writing it to the file (UTF-8 is the default encoding anyway!)

dim utf8string as String = ConvertEncoding( myString, Encodings.UTF8 )

Kem_Tekinay · June 18, 2019, 1:19pm

Prefix the BOM manually to the file before writing your UTF-8 string. Xojo won’t do that for you.

I have BOM tools in M_String, btw.

Christian_Schmitz · June 18, 2019, 1:53pm

I think you can just do a write as the first thing after creating file:

t.Write encodings.UTF8.Chr(&hFEFF)

Guenter_Kraemer · June 18, 2019, 2:25pm

[quote=441856:@Christian Schmitz]I think you can just do a write as the first thing after creating file:

t.Write encodings.UTF8.Chr(&hFEFF)[/quote]

I already tried that, and the first line is then not UTF-8 but Unicode, with lots of null characters. Starting with line #2, the text is OK again.

Christian_Schmitz · June 18, 2019, 2:26pm

You need to write the BOM if you want it.
Then for all other write calls, always use ConvertEncoding to make sure it’s a UTF-8 string.

Just writing some string may not give the right encoding.

Guenter_Kraemer · June 18, 2019, 2:28pm

[quote=441862:@Christian Schmitz]You need to write the BOM if you want it.
Then for all other write calls, always use ConvertEncoding to make sure it’s a UTF-8 string.

Just writing some string may not give the right encoding.[/quote]

Isn’t the UTF-8 BOM 0xEF 0xBB 0xBF and not 0xFE 0xFF?

Emile_Schwarz · June 18, 2019, 2:31pm

Isn’t UTF-8 part of Unicode?

Norman_Palardy · June 18, 2019, 2:35pm

Yes

FEFF is one of the UTF-16 BOMs (UTF-16 big endian)

Guenter_Kraemer · June 18, 2019, 2:40pm

FEFF is the BOM for UTF-16; the BOM for UTF-8 is EFBBBF. How do I write those 3 bytes to the TextOutputStream?

Christian_Schmitz · June 18, 2019, 2:41pm

Well, it’s &hFEFF for the magic character. Depending on the encoding, it’ll be FE FF for UTF-16 and EF BB BF for UTF-8.
But you don’t need to know those details of the byte representations.

Try it:

Dim s As String = encodings.UTF8.Chr(&hFEFF)
MsgBox EncodeHex(s)

Kem_Tekinay · June 18, 2019, 2:45pm

Protected Function BOMUTF8() as String
  static r as string = DefineEncoding( ChrB( &hEF ) + ChrB( &hBB ) + ChrB( &hBF ), nil ) // If you define it as the encoding, you can't properly add it to a string
  return r
End Function
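
A minimal sketch of how the BOM bytes could be used with a TextOutputStream, built inline the same way as the function above. The file name and location are hypothetical, and it assumes that Write passes a string with no encoding through byte for byte:

```xojo
// Sketch only: the file name and folder are placeholders.
Dim f As FolderItem = SpecialFolder.Desktop.Child("with-bom.txt")

// The three UTF-8 BOM bytes, deliberately left without an encoding
// so they are treated as raw bytes (same approach as BOMUTF8 above).
Dim bom As String = DefineEncoding(ChrB(&hEF) + ChrB(&hBB) + ChrB(&hBF), Nil)

Dim t As TextOutputStream = TextOutputStream.Create(f)

// Write the BOM on its own, before any text, and do not concatenate
// it with the text.
t.Write bom

// Convert every following string to UTF-8 before writing it.
t.WriteLine ConvertEncoding("First line", Encodings.UTF8)
t.WriteLine ConvertEncoding("Second line", Encodings.UTF8)
t.Close
```

Opening the result in a hex editor should show whether the file really starts with EF BB BF.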

Norman_Palardy · June 18, 2019, 2:48pm

[quote=441870:@Christian Schmitz]Well, it’s &hFEFF for the magic character. Depending on the encoding, it’ll be FE FF for UTF-16 and EF BB BF for UTF-8.
But you don’t need to know those details of the byte representations.

Try it:

Dim s As String = encodings.UTF8.Chr(&hFEFF)
MsgBox EncodeHex(s)[/quote]

Except that you get the same BOM for UTF16, UTF16LE and UTF16BE, which is wrong:

Dim utfbom As String 

utfbom = encodings.UTF8.Chr(&hFEFF)
textarea1.appendtext "UTF8 - " + EncodeHex(utfbom) + EndOfLine

utfbom = encodings.UTF16.Chr(&hFEFF)
textarea1.appendtext "UTF16 - " + EncodeHex(utfbom) + EndOfLine

utfbom = encodings.UTF16BE.Chr(&hFEFF)
textarea1.appendtext "UTF16BE - " + EncodeHex(utfbom) + EndOfLine

utfbom = encodings.UTF16LE.Chr(&hFEFF)
textarea1.appendtext "UTF16LE - " + EncodeHex(utfbom) + EndOfLine

Christian_Schmitz · June 18, 2019, 3:06pm

Norman, that’s not a bug as BE/LE is handled later when writing data.

Xojo even gives 4100 back for encodings.UTF16BE.Chr(65)

Seems like the Xojo string is always LE.

Guenter_Kraemer · June 18, 2019, 4:09pm

So it seems it is not easily possible to just write ordinary UTF-8 files with the BOM EF BB BF in Xojo?

Emile_Schwarz · June 18, 2019, 4:12pm

What happened when you saved text from a TextArea (with UTF-8 encoding)?

Did you try that?

TimStreater · June 18, 2019, 4:30pm

No, as you can see here:

https://www.utf8-chartable.de/unicode-utf8-table.pl

Kem_Tekinay · June 18, 2019, 5:01pm

On the contrary, it’s quite easy, you just have to do it yourself. Write the BOM (I supplied the code above), then write your text. When reading back, strip the BOM, then define the encoding of the rest as UTF-8.
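
For the reading side, a rough sketch under the same caveats: the file name is hypothetical, and the comparison is done on bytes so that whatever encoding the stream attaches to the string does not matter:

```xojo
// Sketch only: the file name and folder are placeholders.
Dim f As FolderItem = SpecialFolder.Desktop.Child("with-bom.txt")

Dim input As TextInputStream = TextInputStream.Open(f)
Dim raw As String = input.ReadAll
input.Close

// The three UTF-8 BOM bytes to look for at the start of the data.
Dim bom As String = ChrB(&hEF) + ChrB(&hBB) + ChrB(&hBF)

// StrComp with mode 0 compares the raw bytes, not the characters.
If StrComp(LeftB(raw, 3), bom, 0) = 0 Then
  // Drop the first three bytes; MidB positions are 1-based.
  raw = MidB(raw, 4)
End If

// Tell Xojo that the remaining bytes are UTF-8 text.
Dim s As String = DefineEncoding(raw, Encodings.UTF8)
```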

Emile_Schwarz · June 18, 2019, 5:37pm

Usually, I read from the official owner, never elsewhere (elsewhere can be wrong: read what some say about the ASCII table, about the character values from 129 through 256).

UTF-8 - Wikipedia (not the owner, and can be wrong too):

UTF-8 is a variable width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes. The encoding is defined by the Unicode Standard, and was originally designed by Ken Thompson and Rob Pike. The name is derived from Unicode (or Universal Coded Character Set) Transformation Format 8-bit.