When to define encodings?

Richard_Summers · June 22, 2014, 6:26pm

Hi,
When do I need to define text encodings in my app?

Do I need to define the encoding type when I am saving an entry to a database or a text file, or should I define the encoding type when the data gets read by the database or text file?

In other words - define at the time of WRITING, or define at the time of READING?

Hope that made sense.

Thank you all in advance.

Roger_Clary · June 22, 2014, 6:36pm

If you are working only with text created by and read by your app, you should not need to define the encodings at all. Xojo creates UTF-8 text. You should define encodings if the text comes from an outside source. Define it when your app reads the source so it will (hopefully) be read correctly.

Richard_Summers · June 22, 2014, 6:44pm

Roger,
Thanks. My app writes to a database file, and a text file - and then later reads them both back in.

So if it needs to be defined when / if parsing external text, can I set the encoding type to the actual text area which displays the text, so that any text read / typed into it automatically becomes encoded, OR, do I need to define the encoding to the text, and THEN read it in?

I know what I am saying - just hope you do

Kem_Tekinay · June 22, 2014, 7:09pm

Forgive me if I’m wrong, but your question suggests that you still don’t have a firm grasp on the concept of encodings.

A string is just a series of bytes. The encoding tells the system how to interpret those bytes to convert them into characters.

When you create and manipulate strings entirely within Xojo code, each string will be UTF-8, the Xojo default, so each byte or series of bytes will be interpreted accordingly.

When you store a string to a database, you are storing the bytes as they exist, and you read back those bytes as they exist, so it’s up to you to make sure that the encoding of what you’ve read matches what was written. If it doesn’t, the bytes may be misinterpreted and may not look right to the end users.
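As a minimal sketch of "same bytes, different interpretation" (the variable names are made up for illustration, and it assumes the classic `ChrB`, `DefineEncoding`, `Len` and `LenB` functions):

```xojo
// Two raw bytes, &hC3 &hA9, built byte by byte.
Dim raw As String = ChrB(&hC3) + ChrB(&hA9)

// Label the same bytes two different ways. Nothing is converted here.
Dim asUTF8 As String = DefineEncoding(raw, Encodings.UTF8)        // reads as one character: é
Dim asANSI As String = DefineEncoding(raw, Encodings.WindowsANSI) // reads as two characters: Ã©

// The byte count is identical; only the character count differs.
Dim utf8Bytes As Integer = LenB(asUTF8) // 2
Dim ansiBytes As Integer = LenB(asANSI) // 2
Dim utf8Chars As Integer = Len(asUTF8)  // 1
Dim ansiChars As Integer = Len(asANSI)  // 2
```

The "Ã©" result is what a user sees when UTF-8 bytes are read back with the wrong encoding defined.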

Does that clear things up?

Roger_Clary · June 22, 2014, 7:18pm

To relate what Kem said to your situation, if your app WRITES the data and your app READS the data, you don’t need to worry about encodings. They will all be UTF-8.

Eli_Ott · June 22, 2014, 7:39pm

Generally you should use ConvertEncoding for outgoing strings and DefineEncoding for incoming data. This could be from/to databases, tcp sockets, files, etc.

I find the names of the two methods a bit misleading:

Dim aString As String = aRecordSet.IdxField(1).StringValue.DefineEncoding(Encodings.WindowsANSI)

I think the term ConvertFromEncoding for DefineEncoding (and ConvertToEncoding for ConvertEncoding) would be more accurate.

Richard_Summers · June 22, 2014, 7:45pm

Sorry - but I sometimes have trouble explaining what I mean.

I understand that text created in Xojo will be UTF-8.
I also understand that if my Xojo app reads that same data back in - it will know it is UTF-8.

What I was trying to say is:
If my Xojo app were to read a text file created elsewhere, for example on a Windows PC, is it possible to tell the text area that ANY text it displays should be displayed using UTF-8? OR do I have to read the external data in, then convert the encoding to UTF-8, and THEN display it in the text area?

Hopefully, I don’t sound as dumb this time.

Richard_Summers · June 22, 2014, 7:46pm

I agree Eli - the names seem to be reversed (to me at least).

Kem_Tekinay · June 22, 2014, 7:51pm

No, you cannot define the TextArea. If you are reading from a text file, you can define the TextInputStream, but you have to know what the encoding is.
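A rough sketch of defining the encoding on the stream, assuming the file is known to be Windows ANSI and that the window has a control named `TextArea1` (both are assumptions for illustration):

```xojo
// Let the user pick a file; GetOpenFolderItem returns Nil if they cancel.
Dim f As FolderItem = GetOpenFolderItem("")
If f <> Nil Then
  Dim tis As TextInputStream = TextInputStream.Open(f)

  // Tell the stream how to interpret the bytes it reads.
  // This must match how the file was actually written.
  tis.Encoding = Encodings.WindowsANSI

  // The string returned by ReadAll carries that encoding,
  // so the TextArea can display it correctly.
  TextArea1.Text = tis.ReadAll
  tis.Close
End If
```

In real code you would probably also wrap the `Open` call in a `Try ... Catch` block for `IOException`.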

Michael_Hußmann · June 22, 2014, 8:17pm

But DefineEncoding doesn’t perform any conversion; it only defines what the encoding is. After you have defined the encoding, you may use ConvertEncoding to convert it to a different encoding (say, to UTF-8, if that is what you want to standardise on), but you don’t have to. Xojo has no difficulty dealing with strings of different encodings and will always do the right thing, provided the encodings are (correctly) defined.
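To make the difference concrete, here is a small sketch (variable names are illustrative only) using the single byte &hE9, which is "é" in Windows ANSI:

```xojo
// One raw byte, &hE9.
Dim raw As String = ChrB(&hE9)

// DefineEncoding: the byte is untouched, it just gets a label.
Dim labeled As String = DefineEncoding(raw, Encodings.WindowsANSI)
Dim labeledBytes As Integer = LenB(labeled) // still 1 byte

// ConvertEncoding: the bytes are rewritten for the target encoding.
// "é" in UTF-8 is the two bytes &hC3 &hA9.
Dim converted As String = ConvertEncoding(labeled, Encodings.UTF8)
Dim convertedBytes As Integer = LenB(converted) // now 2 bytes
```

Both `labeled` and `converted` display as the same character; only the underlying bytes differ.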

Richard_Summers · June 22, 2014, 8:20pm

Thanks Kem - I just thought maybe you could set the text area to always display as UTF-8.
I now understand that I need to define the TextInputStream and then display it.

Thank you for clearing that up.
I didn’t think it was possible - It was more of a hope

Richard_Summers · June 22, 2014, 8:31pm

So,

1. If I know the encoding of a text file, but my app doesn’t - I use DefineEncoding.
2. If I know the encoding of a text file, and want to convert it to another encoding - I use ConvertEncoding.
3. What do I use if I do not know the encoding of an external text file???

Eli_Ott · June 22, 2014, 8:43pm

I didn’t know that a variable, property or parameter declared as String could have any encoding - I was under the impression that all Strings are utf8. Either I had a lot of luck the last six years or all incoming data in my two projects was utf8 anyway. Thank you!

Michael_Hußmann · June 22, 2014, 8:45pm

Yes, but you need to call DefineEncoding first; ConvertEncoding wouldn’t know how to perform the conversion if it didn’t know which encoding the string is in to begin with.

You have to find out what it is by looking for a BOM, for example, or an explicit declaration of the encoding (as in HTML or XML files). And you can use TextEncoding.IsValidData to check whether your assumption of a specific encoding might be correct.
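One possible way to combine those checks is sketched below. `GuessEncoding` is a hypothetical helper, not a built-in, and it only covers the UTF-8 and UTF-16 byte order marks plus a UTF-8 validity test; it returns Nil when it cannot tell.

```xojo
Function GuessEncoding(data As String) As TextEncoding
  // Look at the first bytes for a byte order mark (BOM).
  Dim b1 As Integer = AscB(MidB(data, 1, 1))
  Dim b2 As Integer = AscB(MidB(data, 2, 1))
  Dim b3 As Integer = AscB(MidB(data, 3, 1))

  // UTF-8 BOM: EF BB BF
  If LenB(data) >= 3 And b1 = &hEF And b2 = &hBB And b3 = &hBF Then
    Return Encodings.UTF8
  End If

  // UTF-16 BOMs: FE FF (big endian) and FF FE (little endian)
  If LenB(data) >= 2 And b1 = &hFE And b2 = &hFF Then Return Encodings.UTF16BE
  If LenB(data) >= 2 And b1 = &hFF And b2 = &hFE Then Return Encodings.UTF16LE

  // No BOM: test the assumption that the bytes are valid UTF-8.
  If Encodings.UTF8.IsValidData(data) Then
    Return Encodings.UTF8
  End If

  // Unknown. The caller has to decide on a fallback.
  Return Nil
End Function
```

A caller might fall back to `Encodings.WindowsANSI` when this returns Nil, but that is only a guess. Valid UTF-8 is also not proof: plain ASCII passes the same test.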

Kem_Tekinay · June 22, 2014, 8:55pm

My M_String package has a method to try to determine the encoding by analyzing the contents of the string. It’s not perfect, but an option if you simply don’t know.

http://www.mactechnologies.com/downloads

Richard_Summers · June 22, 2014, 8:56pm

Thanks.
My only problem now is question 3.

I will look into that and also Kem’s M_String

Tim_Hare · June 22, 2014, 9:07pm

Just to be clear, this is incorrect. You must still define the encoding when you read it back.
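For example, a sketch of reading a value back from a database, assuming `db` is an already connected Database, the table `notes` and column `body` are hypothetical, and your app wrote the text as UTF-8:

```xojo
Dim rs As RecordSet = db.SQLSelect("SELECT body FROM notes")
If rs <> Nil Then
  If Not rs.EOF Then
    // The bytes come back exactly as they were stored.
    Dim body As String = rs.Field("body").StringValue

    // Tell Xojo how to interpret them. Since the app wrote
    // UTF-8, define UTF-8 on the way back in.
    body = DefineEncoding(body, Encodings.UTF8)

    TextArea1.Text = body // TextArea1 is an assumed control name
  End If
  rs.Close
End If
```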

Richard_Summers · June 22, 2014, 9:26pm

OK,
final question on this subject:

If I define the encoding, and then convert the encoding, do I then need to define the new encoding? Or does ConvertEncoding convert and define?

Kem_Tekinay · June 22, 2014, 9:29pm

Converts and defines.
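A quick way to check this yourself is to inspect the encoding at each step. This is only a sketch with made-up variable names, and it assumes a string built with `ChrB` starts out with no encoding:

```xojo
Dim raw As String = ChrB(&hE9)

// Expected to be True: nothing has defined an encoding yet.
Dim hasNoEncoding As Boolean = (Encoding(raw) Is Nil)

// After DefineEncoding the string reports Windows ANSI.
Dim labeled As String = DefineEncoding(raw, Encodings.WindowsANSI)
Dim isANSI As Boolean = Encoding(labeled).Equals(Encodings.WindowsANSI)

// After ConvertEncoding the result already reports UTF-8,
// so no further DefineEncoding call is needed.
Dim converted As String = ConvertEncoding(labeled, Encodings.UTF8)
Dim isUTF8 As Boolean = Encoding(converted).Equals(Encodings.UTF8)
```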

Richard_Summers · June 22, 2014, 9:30pm

Phew - glad that’s over