Xojo Databases, Multilingual Writing Systems, and Data Storage and Retrieval

Hi, all!

I'm developing a library online catalog. The catalog will record bibliographic information, subject headings, and links (for digital material) for books, websites, journals, DVDs, etc.
    
The catalog my library is currently using stores records in UTF-8. The cataloging standards require storing a title in both its original script and a transliterated form; and a library may also translate the title into the language of its patrons. The storage is fine; it's the retrieval that is flummoxing us.

If we see a title like the following, I make sure I record all three versions of the title: original, translated (into English), and transliterated.

	242  12 ‡a A Japanese-English-Chinese dictionary of computer terms / ‡c Hitachi. Omika Factory. ‡y eng
	245  10 ‡6 880-01/$1 ‡a Wa-Ei-Chu taiyaku denshi keisanki yogoshu / ‡c Hitachi. Omika Factory.
	880  10 ‡6 245-01 ‡a 和英中対訳電子計算機用語集 / ‡c Hitachi. Omika Factory.
	
	(Ignore the codes. They are library codes analogous after a fashion to XML tags. See MARC cataloging if you're interested or ask me.)
	
The problem is that we can look up the title in the Romanized forms but not the kanji.

The library catalog runs on a Linux server, and the catalog stores its records in a MySQL database. The first filing indicator of the 242, 245, and 880 fields tells the catalog to index that version of the title and link it to this record; that's the "1" after the 242, 245, and 880. To get the catalog to retrieve the kanji, we have to add an extension called Elasticsearch to the Linux box, and then grapple hooks between Elasticsearch and the catalog. This problem occurs in other writing systems as well, e.g., Cyrillic. If I catalog "Война и мир", I'll include the translation "War and Peace" and the transliteration "Voina i mir" just so that we and our patrons can retrieve the work.

But I began thinking about Linux in my own little project. If I develop a Xojo application and compile a Linux version, will I have this type of retrieval problem with multilingual writing systems? Do Xojo's database engines automatically handle storage and retrieval of multiple writing systems, or will I have to rely on something like Elasticsearch? I know I'll have to make sure the interface recognizes and processes multilingual writing systems, but will the database be able to retrieve the non-Roman material as long as I store the data in UTF-8?

Thanks for your help, all.

Scratching head… I read your problem description twice, but I still don't understand what your problem is. Do you have encoding problems? What are you using Elasticsearch for? Most databases have full support for Unicode; MySQL certainly does, too.

What does your code for retrieval look like, and what result does it produce?

First, configure the MySQL table to use utf8mb4 as the encoding, with the right collation to allow searching.

see
https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8mb4.html
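
As a minimal sketch of that first step (the table and column names here are made up for illustration), the table declaration and a script-agnostic search might look like this:

```sql
-- utf8mb4 stores the full Unicode range (including kanji outside the BMP);
-- utf8mb4_unicode_ci gives case-insensitive matching across scripts.
CREATE TABLE titles (
  id INT AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(500) NOT NULL
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- With that collation in place, the same query works for Roman,
-- kanji, or Cyrillic search terms (? is the bound parameter):
SELECT id, title FROM titles WHERE title LIKE CONCAT('%', ?, '%');
```

Note that MySQL's legacy `utf8` charset only covers 3-byte characters; `utf8mb4` is the one that covers all of Unicode.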

Second, if you use the MBS Xojo SQL Plugin, we handle the text encodings for you.

If you use Xojo's built-in MySQL plugin, you need to do more:

  • Run the command SET NAMES 'utf8'; once to set UTF-8 for the connection.
  • Apply ConvertEncoding to UTF-8 to every string you put into the database, whether for an insert or a select.
  • Apply DefineEncoding each time you read text back.

Then you can check whether searching for titles works.

And don't forget: checking error codes or exceptions and using prepared statements is a requirement if you want to avoid trouble.
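
Those steps together might look roughly like this in classic Xojo, using the built-in MySQLCommunityServer class. This is a sketch, not a tested implementation; the host, credentials, table name (titles), and control names (TitleField, ListBox1) are all placeholders:

```xojo
Dim db As New MySQLCommunityServer
db.Host = "localhost"
db.DatabaseName = "catalog"        ' hypothetical database name
db.UserName = "catalog_user"
db.Password = "secret"

If Not db.Connect Then
  MsgBox("Connection failed: " + db.ErrorMessage)
  Return
End If

' 1. Set UTF-8 for this connection, once, right after connecting.
db.SQLExecute("SET NAMES 'utf8'")

' 2. Convert every string to UTF-8 before it goes to the database,
'    and use a prepared statement to avoid quoting/injection trouble.
Dim title As String = ConvertEncoding(TitleField.Text, Encodings.UTF8)
Dim ps As MySQLPreparedStatement = _
  MySQLPreparedStatement(db.Prepare("INSERT INTO titles (title) VALUES (?)"))
ps.BindType(0, MySQLPreparedStatement.MYSQL_TYPE_STRING)
ps.SQLExecute(title)
If db.Error Then
  MsgBox("Insert failed: " + db.ErrorMessage)
  Return
End If

' 3. Define the encoding on everything read back; after that, kanji,
'    Cyrillic, and Roman titles all behave the same in the UI.
Dim rs As RecordSet = db.SQLSelect("SELECT title FROM titles")
If rs <> Nil Then
  While Not rs.EOF
    Dim t As String = DefineEncoding(rs.Field("title").StringValue, Encodings.UTF8)
    ListBox1.AddRow(t)
    rs.MoveNext
  Wend
End If
```

The key point is that ConvertEncoding actually transcodes the bytes on the way in, while DefineEncoding only labels the bytes you get back so Xojo string operations treat them as UTF-8.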

Thank you both.

Beatrix Willius, in answer to your question: the library catalog stores all text in UTF-8, but it does not retrieve non-Roman text properly without a Linux extension called Elasticsearch. The catalog is open source, written in Perl for a Linux platform. I'm trying to make sure that if I were to write a library catalog of my own in Xojo, I would not encounter the retrieval problem.

In summary, the question is how Xojo handles text on the Linux platform.

According to Christian Schmitz, the answer appears to be that as long as I wrap ConvertEncoding and DefineEncoding around the storage and retrieval instructions, and do robust error checking, I ought not to have the problem. (I have been doing this anyhow since the days of REALbasic 5.5.)

Thanks to you both.