Mac Encodings with Classic MemoryBlock

Eugene_Dakin · August 21, 2017, 12:19am

Hello,

I have had a rough time with encodings, or at least that is the problem I believe I am having. A screen grab of the same program is shown running on Windows and Mac. The Mac version has many question marks, which seem to be a result of improper encoding when working with binary and text files. This works well on Windows, and I would appreciate some help from Mac users on how to understand encoding on Mac. I have tried many different combinations of ConvertEncoding and DefineEncoding over the last few days and have not found a solution.

Here is clean code without encoding information.

[code]Sub Action() Handles Action
  //WString
  #If TargetWindows //Windows OS
    Dim WideString as New MemoryBlock(24)
  #Else //Mac and Linux
    Dim WideString as New MemoryBlock(48)
  #EndIf

  WideString.WString(0) = "Hello World"
  TextArea1.Text = "Constant WideString size: " + WideString.Size.ToText + ", Content: " + WideString.WString(0) + EndOfLine

  //Cstring
  #If TargetWindows //Windows OS
    Dim CStyleString as New MemoryBlock(12)
  #Else //Mac and Linux
    Dim CStyleString as New MemoryBlock(24)
  #EndIf

  CStyleString.CString(0) = "Hello World"
  TextArea1.Text = TextArea1.Text + "Constant CStyleString size: " + CStyleString.Size.ToText + ", Content: " + CStyleString.CString(0) + EndOfLine

  //String
  Dim VarString as MemoryBlock
  VarString = "Hello World"
  Dim s as String = VarString
  TextArea1.Text = TextArea1.Text + "Variable VarString size: " + VarString.Size.ToText + ", Content: " + s

End Sub
[/code]

Here is a download link for a Xojo File with the above code:
Example 1-9

Any help is appreciated

Norman_P · August 21, 2017, 2:09am

TextArea1.Text = "Constant WideString size: " + WideString.Size.ToText + ", Content: " + WideString.WString(0) + EndOfLine

This WideString is UTF32-LE - 4 bytes PER character
But since the MemoryBlock is 48 bytes long and the “string” only uses 44 (11 characters × 4 bytes), you get the last 4 bytes as a run of NULs

TextArea1.Text = TextArea1.Text + "Constant CStyleString size: " + CStyleString.Size.ToText + ", Content: " + CStyleString.CString(0) + EndOfLine

This CString has a NIL encoding
As does the last

NIL encodings will get you funky ? in the display (see http://blog.xojo.com/2013/08/20/why-are-there-diamonds-in-my-user-interface/)

try just

 //WString
  #If TargetWindows //Windows OS
    Dim WideString as New MemoryBlock(24)
  #Else //Mac and Linux
    Dim WideString as New MemoryBlock(48)
  #EndIf
   
  WideString.WString(0) = "Hello World"
  TextArea1.Text = "Constant WideString size: " + WideString.Size.ToText + ", Content: " + WideString.WString(0) + EndOfLine

and note the absence of ?
why ? because the wstring you appended has a well defined (non-nil) encoding

NIL encodings are “this is a pile of bytes”
And trying to display them in a textarea which expects a known encoding gives you this effect
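
A minimal sketch of how one might guard against this before displaying anything. It assumes the classic-framework global functions Encoding and DefineEncoding; the helper name EnsureEncoding and the choice of UTF-8 as the fallback are illustrative assumptions, not part of the code above:

[code]
// Hypothetical helper: returns a string that carries an encoding label.
Function EnsureEncoding(raw As String) As String
  // Encoding() returns Nil when the string is just a "pile of bytes"
  If Encoding(raw) Is Nil Then
    // Assumption: the bytes really are UTF-8. DefineEncoding only labels
    // the bytes; it does not check or convert them.
    Return DefineEncoding(raw, Encodings.UTF8)
  End If
  Return raw
End Function
[/code]

It could then be used as TextArea1.Text = TextArea1.Text + EnsureEncoding(CStyleString.CString(0)) + EndOfLine. If the bytes are not actually UTF-8, the label is wrong and the display will still be garbled.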

Eugene_Dakin · August 21, 2017, 2:52am

Hi Norman,

I am trying to wrap my head around this, and please be patient if I am not explaining it correctly

From reading the response, it sounds like the operating system can’t switch between different Encodings for EndOfLine and when using a Unicode wide string, all of the content should be written in text format as Unicode and not blend Unicode with a C-Style String, right?

I am going to rest for the night. Thanks for your help Norman!

Eli_Ott · August 21, 2017, 6:53am

[code]// CString
Dim CStyleString As New MemoryBlock(24)

// Any string created in source code is a string encoded in UTF8
CStyleString.CString(0) = "Hello World ?" // Notice the UTF8 multi-byte character

// You cannot append anything after the CString end, which is at position 11 for "Hello World":
// 012345678901234567890123
// Hello World0000000000000
// Hello World0 is what CStyleString.CString(0) reads
// Hello World is what is assigned when assigning to a variable (see below)

// Nil encoding, so no appending possible after CString \0 end marker
TextArea1.Text = "Content: " + CStyleString.CString(0) + EndOfLine // EndOfLine is not appended
TextArea1.Text = TextArea1.Text + EndOfLine

// Explicitly set CStyleString.CString(0) to be interpreted as UTF8
TextArea1.Text = TextArea1.Text + "Content: " + CStyleString.CString(0).DefineEncoding(Encodings.UTF8) + EndOfLine

// Automatic interpretation as UTF8 due to assignment to a pointer of type CString
Dim cs As CString = CStyleString.CString(0) // CString \0 end marker is (obviously) automatically stripped
TextArea1.Text = TextArea1.Text + "Content: " + cs + EndOfLine

// Automatic interpretation as UTF8 due to assignment to a variable of type String
Dim s As String = CStyleString.CString(0) // CString \0 end marker is automatically stripped
TextArea1.Text = TextArea1.Text + "Content: " + s.DefineEncoding(Encodings.UTF8) + EndOfLine[/code]

Norman_P · August 21, 2017, 2:54pm

Well what it has an issue with is the NIL encoding
The link I posted gives you details about how that is handled
NIL encoded strings are the issue here
From that post, this is the behaviour you’re getting:
“If the string had no encoding or it was otherwise invalid, the framework treats the string as ASCII and replaces anything non-printable with the Unicode replacement character.”

Eugene_Dakin · August 22, 2017, 12:55am

Both @Norman Palardy and @Eli Ott have been VERY helpful, and I am almost there.

My first mistake was using different C-string sizes on Windows and Mac; the C-string is the same length (12 bytes) on both.

Hello.World. //WString Mac (48 bytes)
H.e.l.l.oW.o.r.l.d //WString Windows (24 bytes)
Hello.World. //CString Windows and Mac (12 bytes)
Hello.World //String Windows and Mac (11 bytes)

Eli, thanks for explaining the nil encoding after the \0 end marker on Mac; this makes sense now. The good news is that using the UTF8 encoding makes the output look good on Mac. Unfortunately, the WString on Windows now does not recognize the EndOfLine.

Windows Version 1.1
Constant WideString size: 24, Content: HConstant CStyleString size: 12, Content: Hello World
Variable VarString size: 11, Content: Hello World

EndofLine after the WString does not seem to be recognized on Windows.

Mac Version 1.1
Constant WideString size: 48, Content: Hello World
Constant CStyleString size: 12, Content: Hello World
Variable VarString size: 11, Content: Hello World

Here is the code for this example:

[code]//Version 1.1
Sub Action() Handles Action
  //WString
  #If TargetWindows //Windows OS
    Dim WideString as New MemoryBlock(24)
  #Else //Mac and Linux
    Dim WideString as New MemoryBlock(48)
  #EndIf

  Dim s as String
  WideString.WString(0) = "Hello World"
  s = WideString.WString(0).DefineEncoding(Encodings.UTF8)
  TextArea1.Text = "Constant WideString size: " + WideString.Size.ToText + ", Content: " + s + EndOfLine

  //Cstring
  //Windows OS and Mac
  Dim CStyleString as New MemoryBlock(12)
  CStyleString.CString(0) = "Hello World"
  s = CStyleString.CString(0).DefineEncoding(Encodings.UTF8)
  TextArea1.Text = TextArea1.Text + "Constant CStyleString size: " + CStyleString.Size.ToText + ", Content: " + s + EndOfLine

  //String
  Dim VarString as MemoryBlock
  VarString = "Hello World"
  s = VarString //copy the MemoryBlock's bytes into s before tagging them
  s = s.DefineEncoding(Encodings.UTF8)
  TextArea1.Text = TextArea1.Text + "Variable VarString size: " + VarString.Size.ToText + ", Content: " + s
End Sub
[/code]

It looks like calling DefineEncoding with UTF8 on the WString on Windows is causing the EndOfLine to not be recognized. When that DefineEncoding call is removed, the EndOfLine formatting seems to be correct, and the following code runs as expected on both OSes:

[code]//Version 1.2
Sub Action() Handles Action
  //WString
  #If TargetWindows //Windows OS
    Dim WideString as New MemoryBlock(24)
  #Else //Mac and Linux
    Dim WideString as New MemoryBlock(48)
  #EndIf

  Dim s as String
  WideString.WString(0) = "Hello World"
  s = WideString.WString(0) //.DefineEncoding(Encodings.UTF8) <-commented out
  TextArea1.Text = "Constant WideString size: " + WideString.Size.ToText + ", Content: " + s + EndOfLine

  //Cstring
  //Windows OS and Mac
  Dim CStyleString as New MemoryBlock(12)
  CStyleString.CString(0) = "Hello World"
  s = CStyleString.CString(0).DefineEncoding(Encodings.UTF8)
  TextArea1.Text = TextArea1.Text + "Constant CStyleString size: " + CStyleString.Size.ToText + ", Content: " + s + EndOfLine

  //String
  Dim VarString as MemoryBlock
  VarString = "Hello World"
  s = VarString //copy the MemoryBlock's bytes into s before tagging them
  s = s.DefineEncoding(Encodings.UTF8)
  TextArea1.Text = TextArea1.Text + "Variable VarString size: " + VarString.Size.ToText + ", Content: " + s
End Sub
[/code]

Mac and Windows Version 1.2
Constant WideString size: 48/24, Content: Hello World //(48 Mac/24 Win)
Constant CStyleString size: 12, Content: Hello World
Variable VarString size: 11, Content: Hello World

EndofLine encoding is an issue with Mac as the two of you mentioned.

Just curious as to why encoding to UTF8 on WString with Windows causes the EndOfLine command to be ignored? Is this a similar situation with Nil and Null differences?

Edit: Added bytes for both Mac and Windows in Version 1.2 screen output

Norman_P · August 22, 2017, 1:29am

Probably because you’re fibbing to the Xojo runtime by doing

  Dim s as String
  WideString.WString(0) = "Hello World"
  s = WideString.WString(0).DefineEncoding(Encodings.UTF8)

WString is more likely UTF16 on Windows and may be UTF32 on macOS
DEFINE ENCODING doesn’t VALIDATE what you tell it
It just sticks that label on a “string” and says “interpret the contents as if it IS this encoding”

When you get UTF16 data and say “interpret it as UTF-8” I’m more surprised you don’t get a pile of ? in the first one, since there’s an extraneous NUL byte between every character
And one at the end, so that could easily affect how the concat works
I suspect this is the case since splitting

  TextArea1.Text = "Constant WideString size: " + WideString.Size.ToText + ", Content: " + s + EndOfLine

into

TextArea1.Text = "Constant WideString size: " + WideString.Size.ToText + ", Content: " + s 
TextArea1.Text = TextArea1.Text + EndOfLine

gives the right result. This makes sense because the data has been put into the text area and pulled out getting translated into UTF-8 in the process and so the concat works as expected

In fact I’m fairly sure this is what causes it because if you do

  s = s + endOfLine
  TextArea1.Text = "Constant WideString size: " + WideString.Size.ToText + ", Content: " + s

the result is really busted because of the extraneous NUL characters in the data - because the encoding is set wrong

Guess I’m still not clear on what it is you’re trying to achieve ?

Norman_P · August 22, 2017, 1:37am

WString on Windows should be UTF-16 - and SHOULD be preset if you pull a WString out of a MemoryBlock like your sample code.
On macOS I believe WString comes out as UTF-32LE (don’t know if this varies by OS version but I’d be surprised)

CString has a nil encoding because C-strings can be “bags of bytes” but IF you want to use it as textual data you should maybe set it as ASCII encoding and beware of any byte with a value of 128 or higher, as then you kind of HAVE to know what single-byte encoding it really is

Most single-byte encodings would “work” for “DefineEncoding” but they may not be correct according to what the original data & intent were, and there’s no way to know that just by looking at the bytes.

This is why so many internet protocols and even file formats have markers, BOMs and such, that say “this data is this encoding”
So the receiver can know, since it’s really tough to guess the right one from just a pile of bytes.
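
To make the label-versus-convert distinction concrete, here is a rough sketch. It assumes the encodings stated above (UTF-16 on Windows, UTF-32LE on macOS and Linux), which should be verified on your own system. DefineEncoding only labels the bytes with the encoding they already have; ConvertEncoding then re-encodes them to UTF-8:

[code]
Dim wide As New MemoryBlock(48) // large enough for either platform
wide.WString(0) = "Hello World"

Dim s As String = wide.WString(0)

// Step 1: if the runtime did not already tag the string, label it truthfully.
If Encoding(s) Is Nil Then
  #If TargetWindows
    s = DefineEncoding(s, Encodings.UTF16)   // assumed: 11 chars x 2 bytes
  #Else
    s = DefineEncoding(s, Encodings.UTF32LE) // assumed: 11 chars x 4 bytes
  #EndIf
End If

// Step 2: re-encode the correctly labelled bytes as UTF-8.
s = ConvertEncoding(s, Encodings.UTF8)

TextArea1.Text = "Content: " + s + EndOfLine
[/code]

Labelling the same bytes as UTF-8 in step 1 would not convert anything; it would leave the NUL bytes in place, which is the situation described above.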

Eugene_Dakin · August 22, 2017, 1:53am

This is exactly what I was looking for - sorry for my ambiguity. I was confused when attempting to match the current coding (eg UTF32LE) with the conversion of the encoding (to a UTF8) by passing the encoded text to a string. This was even more confusing when certain encodings worked on one OS and not on the other - it was easy to be confused. I have run into many issues when there are mixed encodings and kept having formatting problems with EndofLines not working and question marks (?) appearing. This Hello World snippet of code has helped me understand the encoding and decoding of text, which includes the endofline suffix (that was a tough one).

Thank you @Norman Palardy and @Eli Ott , as this was very helpful.

Here is the final code for those who may find this useful in the future:

[code]Sub Action() Handles Action
  //WString
  #If TargetWindows //Windows OS
    Dim WideString as New MemoryBlock(24)
  #Else //Mac and Linux
    Dim WideString as New MemoryBlock(48)
  #EndIf

  Dim s as String
  WideString.WString(0) = "Hello World"
  #If TargetWindows //Windows
    s = WideString.WString(0).DefineEncoding(Encodings.UTF16)
  #Else //Mac and Linux
    s = WideString.WString(0).DefineEncoding(Encodings.UTF32LE)
  #Endif
  TextArea1.Text = "Constant WideString size: " + WideString.Size.ToText + ", Content: " + s + EndOfLine

  //Cstring
  //Windows OS and Mac
  Dim CStyleString as New MemoryBlock(12)
  CStyleString.CString(0) = "Hello World"
  s = CStyleString.CString(0).DefineEncoding(Encodings.UTF8)
  TextArea1.Text = TextArea1.Text + "Constant CStyleString size: " + CStyleString.Size.ToText + ", Content: " + s + EndOfLine

  //String
  Dim VarString as MemoryBlock
  VarString = "Hello World"
  s = VarString //copy the MemoryBlock's bytes into s before tagging them
  s = s.DefineEncoding(Encodings.UTF8)
  TextArea1.Text = TextArea1.Text + "Variable VarString size: " + VarString.Size.ToText + ", Content: " + s
End Sub
[/code]

Norman_P · August 22, 2017, 2:03am

NIL encoding is really the bane of most people’s code since you can get it by:

reading from a file (binary or text stream if the text stream hasn’t had an encoding set for it)
reading from a database
reading from a serial port
reading from a socket
And it’s the one that almost always gets you the ? in the output (that link I posted)

Basically ANYTHING NOT generated wholly in Xojo and coming from an external source needs a DefineEncoding applied to it.

Controls like TextAreas & TextFields are ok because they are part of the framework and so prefer, like most other things in Xojo, to be set to & provide UTF-8.
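
As an illustration of the file case, a minimal sketch using the classic TextInputStream. The file name sample.txt and the assumption that it was saved as UTF-8 are hypothetical; substitute whatever encoding the file really uses:

[code]
// Hypothetical file; replace with a real FolderItem.
Dim f As FolderItem = SpecialFolder.Documents.Child("sample.txt")
If f <> Nil And f.Exists Then
  Dim t As TextInputStream = TextInputStream.Open(f)
  // Assumption: the file was saved as UTF-8. Setting the encoding here
  // means the string returned by ReadAll carries that label.
  t.Encoding = Encodings.UTF8
  Dim contents As String = t.ReadAll
  t.Close
  TextArea1.Text = contents
End If
[/code]

For sockets, serial ports and database columns the same idea applies, except the label is usually attached afterwards with DefineEncoding, because those sources hand back raw bytes.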