Databases and TextEncoding

I have only worked with SQLite and know it uses UTF-8 for text encoding, so that makes strings straightforward to use in Xojo, but I want to write code that generalizes to all DBs that Xojo can use…

Googling around, it looks like not all DBs use UTF-8, and some even use different encodings for different column types that hold textual information, or the encoding can be defined in the DB per column…

My question is: can I depend on the Xojo API to deal with all of that…

If so, for a DB (or column of a DB) that uses an encoding other than UTF-8, will Xojo auto-convert it to UTF-8, or define the string's encoding as the one the DB uses?

If neither, is there a way for me to find out what the encoding is through the Xojo API, preferably without knowing which DB the RecordSet came from?

Thanks,

  • karen

With some SQL commands you can find out what encoding your database is coded in,
then use ConvertEncoding where needed to get the UTF-8 you want.
These commands are all different for each DB engine;
Stack Overflow is a good friend for finding them.

What database?
For MySQL, Xojo just passes bytes, so encoding is up to you.
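A minimal sketch of what that means in practice (assuming the connection and table are actually storing UTF-8; mDatabase and the query are placeholders):

// MySQL via Xojo returns raw bytes with no encoding set,
// so tag them yourself once you know the actual encoding
Dim rs As RecordSet = mDatabase.SQLSelect("SELECT name FROM some_table")
If rs <> Nil And Not rs.EOF Then
  Dim s As String = rs.IdxField(1).StringValue
  s = DefineEncoding(s, Encodings.UTF8) // tags the bytes; no conversion happens
End If

DefineEncoding only labels the bytes; use ConvertEncoding when the stored encoding differs from the one you want.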

With MBS Xojo SQL Plugin, we use unicode APIs for databases and handle text encoding for you.

SQL Server has two kinds of text storage types: ASCII / ANSI (varchar, char & text) and Unicode supported (nvarchar, nchar & ntext).

Reading (SELECT) either kind into a UTF-8 Xojo String value is not an issue. Going in the opposite direction to save (INSERT / UPDATE), though, if your UTF-8 String contains any characters outside the ANSI range and you save it to a varchar, char or text column, SQL Server will silently store corrupt characters in those columns (it won't throw an error or complain about it).

I haven’t played at this end of Xojo much yet, but I’m pretty sure there is no mechanism that will alert you to this kind of saving issue, because a SQLite text column is not the same thing as a SQL Server text column.

Note: If saving a UTF-8 String that only contains characters in the ASCII / ANSI range, then varchar, char & text columns will store the values properly.
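One conservative way to guard against this (a sketch, not part of Xojo's API — IsASCIIOnly is a name I made up) is to check a String for characters outside the 7-bit ASCII range before saving it to a varchar, char or text column:

Function IsASCIIOnly(s As String) As Boolean
  // Returns True when every character is in the 7-bit ASCII range,
  // which is always safe for SQL Server's non-Unicode text types
  For i As Integer = 1 To s.Len
    If Asc(s.Mid(i, 1)) > 127 Then Return False
  Next
  Return True
End Function

This is stricter than the full ANSI range (codepoints 128–255 depend on the server's code page), but pure ASCII is safe everywhere.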

To get around the potential saving issue, you’ll need to know in advance whether the text type of a column is Unicode-capable, either by being familiar with the db design already, or by querying the database, column by column, as @Jean-Yves Pochez suggested.

In SQL Server, one example of a query to find the datatype of a known column is the following. I hope this helps.

SELECT TYPE_NAME(c.user_type_id) -- returns datatype name, e.g., varchar or nvarchar
FROM sys.columns c
JOIN sys.tables t ON c.object_id = t.object_id
WHERE SCHEMA_NAME(t.schema_id) = 'dbo'
  AND OBJECT_NAME(t.object_id) = 'TABLE_NAME'
  AND c.name = 'COLUMN_NAME'
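If it helps, here is a rough Xojo sketch of acting on that query's result (mDatabase and sql are placeholders; the "n" prefix check relies on the SQL Server convention that the Unicode types are nvarchar, nchar and ntext):

// sql = the catalog query above
Dim rs As RecordSet = mDatabase.SQLSelect(sql)
If rs <> Nil And Not rs.EOF Then
  Dim dataType As String = rs.IdxField(1).StringValue
  // nvarchar / nchar / ntext are Unicode-capable; varchar / char / text are not
  Dim columnIsUnicode As Boolean = Left(dataType, 1) = "n"
End If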

Edited: fixed SQLite datatype reference name

sqlite :

ch = "PRAGMA encoding;"
rs = mDatabase.SQLSelect( ch)

postgres:

Select pg_encoding_to_char(Encoding) FROM pg_database WHERE datname = 'myschema'

mysql:

ch = "show variables like ""character_set_database"";"
rs = mDatabase.SQLSelect( ch)

Thanks for all the replies… that was what I was afraid of…

Writing general, truly DB- (and schema-) agnostic code based on Xojo’s DB API really is not possible… I was already not getting enough info from RecordSet.ColumnType for my needs… I think I can work around that (though suboptimally), but text encoding varying by DB, and even by column datatype with some DBs, with different ways of querying it, is an additional level of complication… Even knowing the original column may not be straightforward for recordsets based on views… I was hoping to be able to get by with only the info that is provided in a RecordSet.

I will need to rethink my approach!

Thanks again,

  • karen

If it’s any consolation, I think the sort of solution you’re looking to build isn’t simple on any development platform.

On other development platforms, many developers will take a “Data Access Layer” (DAL) approach to their data interface design, so that in theory they can swap out the type, brand or version of their database, without having to modify their application code (to create a database-agnostic application).

In my experience there’s almost always still some database specific coding required at some level with this kind of switch, whether in the app code, middle-ware or db.

But a DAL could help minimize the pain. I hope that helps.
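For example, the encoding-related part of such a DAL might be a Xojo class interface along these lines (entirely hypothetical names; each database-specific class would implement it using the catalog queries shown elsewhere in this thread):

Interface EncodingAwareDAL
  // Run a query; implementers tag or convert text to UTF-8 before returning
  Function GetRecords(sql As String) As RecordSet
  // True if the column can safely store characters outside the ANSI range
  Function ColumnIsUnicode(tableName As String, columnName As String) As Boolean
End Interface

Your application code would then talk only to the interface, and the engine-specific quirks stay inside each implementation.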

Edited: to emphasize “in theory”

I built a “myDatabase” class exactly for this purpose. It’s not a Database subclass.
Inside, I managed to make all the calls coherent, with the same returned results, across all the different databases I can use.
It’s a lot of work, but really powerful afterwards.

This shows the default encoding of the database (as it is stored), but this is not necessarily the encoding used when interacting with the database (reading and writing). That is controlled by the GUC client_encoding and can be determined with SHOW client_encoding;. You can even set it per session with SET client_encoding TO latin9; if you like. But be aware that not all encodings support the same characters. For example, ‘Œ’ exists in latin9 (ISO 8859-15) but not in latin1 (ISO 8859-1), so accessing it through an incompatible encoding would throw errors like the following:

ERROR: character with byte sequence 0xc5 0x92 in encoding "UTF8" has no equivalent in encoding "LATIN1"
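For reference, the two per-session statements mentioned above, as you would send them to PostgreSQL:

SHOW client_encoding;           -- what this session currently uses for text I/O
SET client_encoding TO 'UTF8';  -- per session; the server converts text to/from UTF-8

If you always SET client_encoding TO 'UTF8' right after connecting, the server does the conversion for you and the Xojo side only ever sees UTF-8 bytes.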