Strange UTF16BE/UTF16LE string umlaut encoding behaviour under OSX

Hi Guys,

I’m pretty new to Xojo, so for a tryout I converted some code from one of my Java programs over to it, in order to get familiar with the RealBasic language it uses, the IDE and so on. The program I wrote so far is an OSX-based console application, which when run from inside the IDE is given some command line arguments (file paths and search/replace strings) as options. Since the program alters files in a specific binary format, it works internally with binary buffers; in Xojo I used MemoryBlocks for this. - I’ll try to describe the Xojo r1.1 problem I encountered under OSX; interestingly, the same project code runs/works just fine under Xojo r1.1 for Windows.

I use some methods like the following here …

[code]Public Function GetBytesUTF16BE(str As String) As MemoryBlock
  Dim s As String = str
  s = ConvertEncoding(s, Encodings.UTF16BE)
  Dim m As MemoryBlock = s
  Return m
End Function[/code]

… for converting the command line string args passed over from the OSX console into UTF16BE/LE MemoryBlocks for the app. On OSX the strings converted via ConvertEncoding(s, Encodings.UTF16BE) always end up with wrong Umlaut byte representations inside the binary blocks. So under OSX it seems that Umlauts in UTF-8 strings are only converted to "?" question mark bytes instead of their correct UTF-16 byte representation. - I partly saw this too in the Xojo r1.1 OSX debugger, which BTW also seems to have problems displaying UTF-16 bytes with Umlauts.

The strange thing is that under Windows in Xojo the conversion works fine and the Umlauts are handled correctly.

Does anybody have an idea what might be going wrong here on OSX, or how to work around it?

Why use UTF-16 for a European language? UTF-16 is only necessary for Asian languages. UTF-8 supports Umlauts perfectly.

UTF-8 is perfect for all languages.
UTF-16 is needed if the other app reading/writing the file only does UTF-16.
e.g. FileMaker up to version 15 exports only UTF-16.

The Adobe binary file format I alter with that program contains specific UTF-16 big-endian and little-endian string sections. And since I parse and change predefined binary format sections here, there is no free choice of encoding!

The question for me is actually more why this here …

Dim s As String = ""
s = ConvertEncoding(s, Encodings.UTF16BE)

… produces ??? under OSX instead of the correct byte representations for the umlaut characters? Under Windows the same code works fine!

It looks to me like there might be a bug in the Xojo OSX encoding conversion routines in r1.1?

If you’re reading this data in from a binary file, do you ever DEFINE the encoding?
If not, then converting a string with a Nil encoding won’t have the effect you think it should.

Your sample code works because a string literal in Xojo is, by definition, UTF-8.
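To illustrate that point, here is a minimal sketch (the `readstream` variable and the assumption that the external bytes are really UTF-8 are hypothetical): DefineEncoding only labels the bytes a string already holds, while ConvertEncoding transcodes them, which gives undefined results (e.g. "?") when the source encoding is Nil.

[code]' Sketch only: bytes read from a binary stream carry a Nil encoding.
Dim raw As String = readstream.Read(10)    ' Encoding(raw) is Nil here
' First declare what the bytes already are (no byte change, just a label),
' then convert; only now is the conversion well-defined.
raw = DefineEncoding(raw, Encodings.UTF8)
Dim be As String = ConvertEncoding(raw, Encodings.UTF16BE)[/code]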

Besides having the encoding defined as Norman stated, you could try text-to-MemoryBlock conversion with the new framework.
Or convert to plain UTF-16 and do the byte swap yourself.
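If you went the manual byte-swap route, a sketch could look like this (hypothetical helper, not from the original program; assumes the buffer holds UTF-16 code units without a BOM):

[code]Function SwapUTF16Bytes(m As MemoryBlock) As MemoryBlock
  ' Swap each 2-byte pair to flip UTF-16 endianness (LE <-> BE).
  Dim result As New MemoryBlock(m.Size)
  For i As Integer = 0 To m.Size - 2 Step 2
    result.Byte(i) = m.Byte(i + 1)
    result.Byte(i + 1) = m.Byte(i)
  Next
  Return result
End Function[/code]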

No, I thought Xojo uses/assumes UTF-8 by default here. Actually I just read the strings of interest (which have to be converted into the corresponding UTF-16 formats) from the Xojo IDE’s command line arguments panel (passed over under the IDE’s shared debug settings). AFAIK those are UTF-8 search and replace strings passed over as command line args via the OSX Terminal. I then convert those strings into the needed UTF-16 encoding in order to place them in MemoryBlocks (as shown above in my first post).

The main Adobe binary file I then read in as a binary stream …

Dim f As FolderItem
Dim readstream As BinaryStream
f = New FolderItem(inputFilename, FolderItem.PathTypeShell)
If f <> Nil Then
  If f.Exists Then
    readstream = BinaryStream.Open(f, False)
  End If
End If

Dim mbuffer As MemoryBlock = readstream.Read(readstream.Length)
Print "Read input file (" + Str(mbuffer.Size) + " bytes)"

… which I placed into a memoryblock.

Then I search that file MemoryBlock for certain big/little-endian byte string occurrences and preserve where they start, their indexes, offsets, lengths, end indexes etc. Then I make a clone of that MemoryBlock, which I use to assemble a new binary file with all the changed (replaced string) parts and new length information.
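For illustration, such a search over the buffer could be sketched as a plain linear scan (hypothetical helper, not the original program’s code):

[code]Function FindBytes(buffer As MemoryBlock, pattern As MemoryBlock) As Integer
  ' Return the offset of the first occurrence of pattern in buffer, or -1.
  For i As Integer = 0 To buffer.Size - pattern.Size
    Dim match As Boolean = True
    For j As Integer = 0 To pattern.Size - 1
      If buffer.Byte(i + j) <> pattern.Byte(j) Then
        match = False
        Exit For
      End If
    Next
    If match Then Return i
  Next
  Return -1
End Function[/code]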

Ok, in the meantime I tried out what Norman suggested …

[code]Public Function GetBytesUTF16BE(str As String) As MemoryBlock
  Dim s As String = str
  s = DefineEncoding(s, Encodings.UTF8)
  s = ConvertEncoding(s, Encodings.UTF16BE)
  Dim m As MemoryBlock = s
  Return m
End Function[/code]

… and it now works correctly under OSX too!

Thanks for the help Norman and Christian!

Now that the console stuff works, it’s time to get my feet wet with the next steps, aka making a GUI (desktop) app out of it. :wink:

[quote=336410:@Valentino Kyriakides]@Norman
No, I thought Xojo as default uses/assumes UTF-8 here.[/quote]
Not when you read it from a binary file (or any other external source like a database, network connection, etc.)

Well, on Windows it worked for Umlauts in my tryouts even without defining the encoding first, while on OSX it did not. However, good to know for the future!