Add a new Ansel Encoding - no plugins

Hello Everyone,

I have been asked to create a solution for reading a file that uses special characters encoded in ANSEL (American National Standard for Extended Latin). The code page layout is shown on the ANSEL Wikipedia page.

Is there a way to create a new encoding in native Xojo that works in both Windows and OS X?

Pseudocode for what this might look like in its final form:

[code]Dim AllFiles as New FileType
AllFiles.Name = "GEDCOM"
AllFiles.Extensions = "GED"
AllFiles.MacType = "GED"
AllFiles.MacCreator = "*"

Dim doc As FolderItem = GetOpenFolderitem(AllFiles)
Dim bs As BinaryStream = BinaryStream.Open(doc, False) // Open as read-only

//Convert to correct encoding (ANSEL)
Dim s as String
s = bs.Read(bs.Length)
s = s.DefineEncoding(Encodings.ANSEL) //Ansel encoding here…

// Read the whole BinaryStream
TextArea1.Text = s [/code]

If making this kind of encoding is not possible, then that's OK too.

As far as I can see, it's not so easy. This page describes the structure and translation. It's hard to follow, because some characters take only one byte, some two (e.g. E0+41 = &u1EA2 > Latin Capital Letter A + Combining Hook Above, Ả) and some three (e.g. E0+E3+6F = &u1ED5 > Latin Small Letter O + Combining Circumflex Accent + Combining Hook Above, ổ).

I think it needs a conversion table and some way (I don't know how) to search the binary data for those bytes and replace them. Maybe not in the source file itself, only in a variable that holds the file's contents.

However, I would prefer the new Xojo Framework if there is a way to do it. Can you work with this, Eugene? Or any other Xojo developer?

I think ANSEL is obsolete except for its use in GEDCOM files. I had a look at it when it was discussed in this topic:
Though the topic has more to do with GEDCOM than ANSEL. There's a link near the bottom to a GEDCOM to XML parser project that I wrote. I avoided any attempt to read or convert anything except ASCII or Unicode. However, it does read ANSEL, treating it as single-byte characters.

It would be interesting to see if it’s possible to add a new encoding to Xojo. But, I suspect the most practical solution is to read the text as a binary stream and process it byte by byte.

When reading this file (ANSEL.ged) with Hex Fiend, we can see it's a UTF-8 text file. Now read it into Xojo and save a copy:

[code]Using Xojo.Core
Using Xojo.IO

Dim inputFile As FolderItem = SpecialFolder.Documents.Child("ANSEL.ged")
Dim outputFile As FolderItem = SpecialFolder.Documents.Child("ANSEL out.ged")

Try
Dim input As TextInputStream
input = TextInputStream.Open(inputFile, TextEncoding.UTF8)

Dim output As TextOutputStream
output = TextOutputStream.Create(outputFile, TextEncoding.UTF8)

Dim conv As Text = input.ReadAll
// won't find this combination, because after ReadAll, Xojo has converted F5 41 into EF BF BD
conv = conv.ReplaceAll(&uF5 + &u41, "A" + &u0333) // &uF5 = double low line, &u41 = Latin Capital Letter A

// Save the copy
output.Write(conv)
input.Close
output.Close

Catch e As IOException
MsgBox "File IO Error: " + e.Reason
End Try [/code]

Now, if you compare both files in Hex Fiend and look at, e.g., line 227, you can see that in “ANSEL out.ged” the hex value after 2 PLAC has changed from 2 bytes (a double-byte character)

F5 41

into 4 Bytes


That's wrong. By the way, Xojo converts every invisible byte before the letter's hex value into EF BF BD. Why does Xojo not write the same hex values as in the source file? I need the right values so that I can decode the letters after PLAC into the correct UTF-8 code. After conversion with ReplaceAll it should look like this (in a normal text editor):


As you can see, each letter is built from two characters (an invisible diacritic and the normal letter).

Why do you say that it's a UTF-8 file? I see nothing to indicate that. In fact, the second line in the file is the encoding specification, stating that it's ANSEL. Xojo will assume an input file to be UTF-8 by default if it is not instructed otherwise. If it attempts to interpret the ANSEL file as UTF-8, then you can't really blame it for making wrong substitutions.

Edited to add:

In UTF-8, F5 is an illegal byte. It should never appear. So it is replaced with EF BF BD, which is the UTF-8 encoding of the Unicode replacement character (U+FFFD, a diamond with a question mark inside).
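One way to settle this kind of question is to scan the raw bytes for values that can never occur in UTF-8 (C0, C1, and F5 through FF). This is only a rough sketch, assuming the classic framework's BinaryStream. The function name ContainsNonUTF8Bytes is made up for illustration.

[code]Function ContainsNonUTF8Bytes(f As FolderItem) As Boolean
  // Returns True as soon as a byte is found that is illegal anywhere in UTF-8.
  // A result of False does not prove the file is UTF-8; it only means
  // that none of these particular byte values occurred.
  Dim bs As BinaryStream = BinaryStream.Open(f, False) // read-only
  Dim found As Boolean = False
  While Not bs.EOF And Not found
    Dim b As UInt8 = bs.ReadUInt8
    If b = &hC0 Or b = &hC1 Or b >= &hF5 Then found = True
  Wend
  bs.Close
  Return found
End Function[/code]

If this returns True for ANSEL.ged, the file cannot be valid UTF-8, whatever a viewer reports.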

Hi Robert. Yes, the file does appear to be a UTF-8 file, since there appear to be 256 characters and, when using a Byte-Order-Mark program, the binary data remains UTF-8. The issue seems to be that when the incorrect encoding is used, Xojo automatically converts the unknown character numbers to the question-mark replacement character.

Just for fun, when I use a different encoding, characters appear and there are no errors. The problem is that the characters assigned to the 8-bit values are incorrect. This is the reason for my request to create a new encoding, if creating a new encoding type is possible in Xojo. :)

ANSEL is a single-byte character encoding (256 possible values), and when I looked at the file with a hex editor, all of the bytes >127 were valid ANSEL extended characters or diacriticals. At the same time, there were some bytes, such as F5, that are never valid anywhere in a UTF-8 file. So I don't understand how it can be concluded that it's a UTF-8 file rather than an ANSEL file. What am I missing?

It would be interesting to see if a user can add their own encoding to Xojo, but since there is a steady move away from special purpose encodings towards Unicode, I question whether it’s worthwhile to expend the effort. For a one off project, I would be inclined just to read the file as a binary stream and do a direct translation to Unicode. The translation is not difficult from a programming point of view. Going byte by byte, there are only three cases:

  1. Byte value is <128 – no translation required;
  2. Byte value is in range 161…198 – simple direct translation from input byte to corresponding Unicode character based on a small lookup table;
  3. Byte value is >223 – These are non spacing diacriticals that prefix an upcoming character. So, save the byte, and continue reading and saving diacritical bytes until encountering a byte <128. This combination of character byte plus one or more preceding diacritical bytes will uniquely correspond to one Unicode character, so look it up and replace the byte combination with that character.

The third case is the most work, but only because it requires a large lookup table. However, the work to create the lookups has already been done (and publicly posted) by at least two different developers, and those lookup tables could be machine converted into Xojo code without much effort.

I tried it this way: read the file via TextInputStream and do a replace for every character combination. This works only for the replacements within the loop. The commented-out replacements after the loop would produce some wrong replacements. The result would be the same if you did those replacements before the loop. This works OK for small files. But for large files it's really slow and, as I said, it does not replace all letters correctly.

[code]Using Xojo.Core
Using Xojo.IO

Dim inputFile As FolderItem = SpecialFolder.Documents.Child("ANSEL.ged")

Try
Dim input As TextInputStream
input = TextInputStream.Open(inputFile, TextEncoding.ASCII)

Dim conv As Text = input.ReadAll
Dim letter As Text

' A-Z and a-z (the range &h41 to &h7A also covers the six punctuation characters between Z and a)
for i As Integer = &h41 To &h7A

letter = Text.FromUnicodeCodepoint(i)

' E0 (Unicode: hook above, 0309) / low rising tone mark
conv = conv.ReplaceAll(&uE0 + letter,  letter + &u0309, Text.CompareCaseSensitive)
' E1 (Unicode: grave, 0300) / grave accent
conv = conv.ReplaceAll(&uE1 + letter,  letter + &u0300, Text.CompareCaseSensitive)
' E2 (Unicode: acute, 0301) / acute accent
conv = conv.ReplaceAll(&uE2 + letter,  letter + &u0301, Text.CompareCaseSensitive)
' E3 (Unicode: circumflex, 0302) / circumflex accent
conv = conv.ReplaceAll(&uE3 + letter,  letter + &u0302, Text.CompareCaseSensitive)
' E4 (Unicode: tilde, 0303) / tilde
conv = conv.ReplaceAll(&uE4 + letter,  letter + &u0303, Text.CompareCaseSensitive)
' E5 (Unicode: macron, 0304) / macron
conv = conv.ReplaceAll(&uE5 + letter,  letter + &u0304, Text.CompareCaseSensitive)
' E6 (Unicode: breve, 0306) / breve
conv = conv.ReplaceAll(&uE6 + letter,  letter + &u0306, Text.CompareCaseSensitive)
' E7 (Unicode: dot above, 0307) / dot above
conv = conv.ReplaceAll(&uE7 + letter,  letter + &u0307, Text.CompareCaseSensitive)
' E8 (Unicode: diaeresis, 0308) / umlaut (dieresis)
conv = conv.ReplaceAll(&uE8 + letter,  letter + &u0308, Text.CompareCaseSensitive)
' E9 (Unicode: caron, 030C) / hacek
conv = conv.ReplaceAll(&uE9 + letter,  letter + &u030C, Text.CompareCaseSensitive)

' EA (Unicode: ring above, 030A) / circle above (angstrom)
conv = conv.ReplaceAll(&uEA + letter,  letter + &u030A, Text.CompareCaseSensitive)
' EB (Unicode: ligature left half, FE20) / ligature, left half
conv = conv.ReplaceAll(&uEB + letter,  letter + &uFE20, Text.CompareCaseSensitive)
' EC (Unicode: ligature right half, FE21) / ligature, right half
conv = conv.ReplaceAll(&uEC + letter,  letter + &uFE21, Text.CompareCaseSensitive)
' ED (Unicode: comma above right, 0315) / high comma, off center
conv = conv.ReplaceAll(&uED + letter,  letter + &u0315, Text.CompareCaseSensitive)
' EE (Unicode: double acute, 030B) / double acute accent
conv = conv.ReplaceAll(&uEE + letter,  letter + &u030B, Text.CompareCaseSensitive)
' EF (Unicode: candrabindu, 0310) / candrabindu
conv = conv.ReplaceAll(&uEF + letter,  letter + &u0310, Text.CompareCaseSensitive)

' F0 (Unicode: cedilla, 0327) / cedilla
conv = conv.ReplaceAll(&uF0 + letter,  letter + &u0327, Text.CompareCaseSensitive)
' F1 (Unicode: ogonek, 0328) / right hook
conv = conv.ReplaceAll(&uF1 + letter,  letter + &u0328, Text.CompareCaseSensitive)
' F2 (Unicode: dot below, 0323) / dot below
conv = conv.ReplaceAll(&uF2 + letter,  letter + &u0323, Text.CompareCaseSensitive)
' F3 (Unicode: diaeresis below, 0324) / double dot below
conv = conv.ReplaceAll(&uF3 + letter,  letter + &u0324, Text.CompareCaseSensitive)
' F4 (Unicode: ring below, 0325) / circle below
conv = conv.ReplaceAll(&uF4 + letter,  letter + &u0325, Text.CompareCaseSensitive)
' F5 (Unicode: double low line, 0333) / double underscore
conv = conv.ReplaceAll(&uF5 + letter,  letter + &u0333, Text.CompareCaseSensitive)
' F6 (Unicode: line below, 0332) / underscore
conv = conv.ReplaceAll(&uF6 + letter,  letter + &u0332, Text.CompareCaseSensitive)
' F7 (Unicode: comma below, 0326) / left hook
conv = conv.ReplaceAll(&uF7 + letter,  letter + &u0326, Text.CompareCaseSensitive)
' F8 (Unicode: left half ring below, 031C) / right cedilla
conv = conv.ReplaceAll(&uF8 + letter,  letter + &u031C, Text.CompareCaseSensitive)
' F9 (Unicode: breve below, 032E) / half circle below
conv = conv.ReplaceAll(&uF9 + letter,  letter + &u032E, Text.CompareCaseSensitive)

' FA (Unicode: double tilde left half, FE22) / double tilde, left half
conv = conv.ReplaceAll(&uFA + letter,  letter + &uFE22, Text.CompareCaseSensitive)
' FB (Unicode: double tilde right half, FE23) / double tilde, right half
conv = conv.ReplaceAll(&uFB + letter,  letter + &uFE23, Text.CompareCaseSensitive)
' FE (Unicode: comma above, 0313) / high comma, centered
conv = conv.ReplaceAll(&uFE + letter,  letter + &u0313, Text.CompareCaseSensitive)


Next

' conv = conv.ReplaceAll(&uA1, &u0141)
' conv = conv.ReplaceAll(&uA2, &u00D8)
' conv = conv.ReplaceAll(&uA3, &u0110)
' conv = conv.ReplaceAll(&uA4, &u00DE)
' conv = conv.ReplaceAll(&uA5, &u00C6)
' conv = conv.ReplaceAll(&uA6, &u0152)
' conv = conv.ReplaceAll(&uA7, &u02B9)
' conv = conv.ReplaceAll(&uA8, &u00B7)
' conv = conv.ReplaceAll(&uA9, &u266D)

' conv = conv.ReplaceAll(&uAA, &u00AE) ' ®
' conv = conv.ReplaceAll(&uAB, &u00B1)
' conv = conv.ReplaceAll(&uAC, &u01A0)
' conv = conv.ReplaceAll(&uAD, &u01AF)

' conv = conv.ReplaceAll(&uAE, &u02BC) ' Part of ASCII
' conv = conv.ReplaceAll(&uB0, &u02BB) ' Part of ASCII
' conv = conv.ReplaceAll(&uB1, &u0142)
' conv = conv.ReplaceAll(&uB2, &u00F8)
' conv = conv.ReplaceAll(&uB9, &u00A3)
' conv = conv.ReplaceAll(&uBA, &u00F0)

' conv = conv.ReplaceAll(&uC2, &u2117)
' conv = conv.ReplaceAll(&uC3, &u00A9)
' conv = conv.ReplaceAll(&uC4, &u266F)
' conv = conv.ReplaceAll(&uC5, &u00BF)
' conv = conv.ReplaceAll(&uC6, &u00A1)
' conv = conv.ReplaceAll(&uCF, &u00DF)

TextArea1.Text = conv

Catch e As IOException
MsgBox "File IO Error: " + e.Reason
End Try[/code]
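The slowness comes from running every ReplaceAll over the whole text once per letter. A single pass over the raw bytes should do the same reordering much faster. Here is a minimal sketch, assuming the classic framework and that the file content was read as a raw String (for example with BinaryStream.Read). The function name ANSELCombiningToUnicode is hypothetical, and the mark table is simply copied from the replacements in the code above.

[code]Function ANSELCombiningToUnicode(data As String) As String
  // Combining marks for the ANSEL bytes E0..FF, copied from the replacements above.
  // 0 marks bytes for which no mapping appears in that code (FC, FD, FF).
  Dim marks() As Integer = Array( _
    &h0309, &h0300, &h0301, &h0302, &h0303, &h0304, &h0306, &h0307, _
    &h0308, &h030C, &h030A, &hFE20, &hFE21, &h0315, &h030B, &h0310, _
    &h0327, &h0328, &h0323, &h0324, &h0325, &h0333, &h0332, &h0326, _
    &h031C, &h032E, &hFE22, &hFE23, 0, 0, &h0313, 0)
  Dim parts() As String   // output pieces, joined once at the end
  Dim pending As String   // combining marks waiting for their base character
  Dim n As Integer = LenB(data)
  For i As Integer = 1 To n
    Dim b As Integer = AscB(MidB(data, i, 1))
    If b >= &hE0 Then
      // ANSEL puts the diacritical before the letter, so remember it for later
      If marks(b - &hE0) <> 0 Then pending = pending + Encodings.UTF8.Chr(marks(b - &hE0))
    ElseIf b < 128 Then
      // Unicode wants the base letter first, then its combining marks.
      // Marks are emitted in input order; whether stacked marks need
      // reordering is not checked in this sketch.
      parts.Append(Encodings.UTF8.Chr(b) + pending)
      pending = ""
    Else
      // Bytes 128..223 (the single-byte specials) are not handled here;
      // they are replaced with U+FFFD so the gap stays visible.
      parts.Append(Encodings.UTF8.Chr(&hFFFD))
      pending = ""
    End If
  Next
  // Marks left over at the end of the data without a base letter are dropped.
  Return DefineEncoding(Join(parts, ""), Encodings.UTF8)
End Function[/code]

The result is decomposed text (letter followed by combining marks), the same form the ReplaceAll loop produces, and it could be assigned to TextArea1.Text directly.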

You are right. I can understand your input. But if you have really large files, how do you handle them? I don't want to modify the original file and I don't want to create a copy of this file on the computer.

Could you give a simple, short example of looping through each individual byte of binary files? This would also be interesting, for example, to replace all EndOfLines (CR, LF and CRLF) with &uA.
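Something along these lines is what I have in mind for the line endings. It is only a sketch, assuming the classic framework's BinaryStream, and the function name NormalizeLineEndings is made up. The file is opened read-only and nothing is written back to disk.

[code]Function NormalizeLineEndings(f As FolderItem) As String
  // Reads the file one byte at a time and returns its content with
  // CR, LF and CRLF all replaced by a single LF (&h0A).
  Dim bs As BinaryStream = BinaryStream.Open(f, False) // read-only
  Dim lines() As String
  Dim current As String              // bytes of the line being collected
  Dim lastWasCR As Boolean = False
  While Not bs.EOF
    Dim b As UInt8 = bs.ReadUInt8
    If b = 13 Then
      // CR always ends the line
      lines.Append(current)
      current = ""
      lastWasCR = True
    ElseIf b = 10 Then
      // LF ends the line unless it only completes a CRLF pair
      If Not lastWasCR Then
        lines.Append(current)
        current = ""
      End If
      lastWasCR = False
    Else
      current = current + ChrB(b)
      lastWasCR = False
    End If
  Wend
  bs.Close
  lines.Append(current) // whatever follows the last line ending (may be empty)
  // The result is raw bytes with no encoding defined; the caller decides how to interpret it.
  Return Join(lines, ChrB(10))
End Function[/code]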

Apparently I had too much time on my hands.

[code]Public Sub ANSELtoUnicode()
Dim diacriticals As String = ""
Dim ANSELcodePoint As UInt8
Dim UnicodeCodePoint As String
dim input As BinaryStream
dim myFile As FolderItem
'charset2 is the set of ANSEL accented characters 161…207
'(table data not reproduced here; each element is assumed to hold the
'4 hex digit Unicode codepoint for bytes 161, 162, and so on, in order)
dim charSet2() as string
'charset3 is the set of all possible combinations of
'ANSEL combining diacriticals and final character
'The following data was taken from
'and then machine converted. The format of each element is as follows:
'The first 4 hex digits are the Unicode codepoint output value.
'The remaining hex digit pairs are the ANSEL diacritical prefixes
'and the base character, which make up the dictionary key.
'(table data not reproduced here)
dim charSet3pairs() as string

dim charSet3 As new Dictionary
'Build charSet3 transliteration Dictionary
for i as Integer = 0 to UBound(charSet3pairs)
'Get the Unicode codepoint output value
dim cs3value As String = Encodings.UTF8.Chr(val("&h"+left(charSet3pairs(i),4)))
'Get the ANSEL diacritical prefixes and the base character, which make up the key
dim cs3key As String = mid(charSet3pairs(i),5)
charSet3.Value(cs3key) = cs3value
next

//The program code above this point should probably be placed in a separate
//initialization routine because it doesn't need to run each time a file is read.

'Get an ANSEL encoded input file and open it
myFile = GetOpenFolderItem("")
if myFile = nil then return
'Open file as a read only BinaryStream
input = BinaryStream.Open(myFile, False)

'Now process the input byte by byte
while not input.EOF
ANSELcodePoint = input.ReadUInt8
if ANSELcodePoint<128 Then
'charSet1 - This range of characters is the same in all encodings
if len(diacriticals)>0 then
'There are diacriticals to prefix to this character
diacriticals = diacriticals + hex(ANSELcodePoint)
if charSet3.HasKey(diacriticals) then
UnicodeCodePoint = charSet3.Value(diacriticals)
else
'Not found, so go with the unaccented character
UnicodeCodePoint = chr(ANSELcodePoint)
end if
diacriticals = ""
else
'No diacriticals, so the character needs no translation
UnicodeCodePoint = chr(ANSELcodePoint)
end if
elseif ANSELcodePoint<161 Then
'This range is Invalid, so convert to Unicode Replacement Character
UnicodeCodePoint = Encodings.UTF8.Chr(&hFFFD)
elseif ANSELcodePoint<208 Then
'This is the ANSEL set of accented characters
dim idx As Integer = ANSELcodePoint - 161
if idx <= UBound(charSet2) then
UnicodeCodePoint = Encodings.UTF8.Chr(val("&h"+charSet2(idx)))
else
UnicodeCodePoint = Encodings.UTF8.Chr(&hFFFD)
end if
elseif ANSELcodePoint<224 Then
'This range is Invalid, so convert to Unicode Replacement Character
UnicodeCodePoint = Encodings.UTF8.Chr(&hFFFD)
else
'The ANSEL codepoint is a combining diacritical, so append it to the diacritical string.
diacriticals = diacriticals + hex(ANSELcodePoint)
'Nothing to output until the base character arrives
UnicodeCodePoint = ""
end if
'At this point the input ANSEL character has been converted to Unicode
'So we send it out to be handled by another routine
'For this example we simply display it in a TextArea
myTextArea.AppendText UnicodeCodePoint
wend
input.Close
End Sub