I’ve run into a curious problem. I need to generate a UCS-2 encoded file. UCS-2 is the deprecated predecessor of UTF-16 and is strictly 2 bytes per character, so it can only represent code points U+0000 through U+FFFF, not every Unicode character. Xojo does not have built-in support for UCS-2, so ConvertEncoding alone isn’t an option.
So my goal is to remove or replace 4-byte characters before (or after) ConvertEncoding so that what is produced is valid UCS-2. That’s proving to be difficult with String. Text handles this better, but Text isn’t an option in this case because I could be dealing with megabytes of text and Text on Windows is slower than atoms at absolute zero.
Using RegEx hasn’t worked because it can’t work with multibyte strings. Looping over each character with Mid reads each 4-byte character as two 2-byte characters.
[quote=453402:@Kem Tekinay]Don’t know where you got the idea that RegEx can’t work with multi-byte strings. The trick is to do the replacement in UTF-8, then convert to UTF-16.
rx.SearchPattern = "[\x{10000}-\x{10FFFF}]"
rx.ReplacementPattern = "" // or something
[/quote]
The docs say it can’t handle 2-byte characters, so I assumed it can’t do 4-byte either. Regardless, I found that expression already. Xojo doesn’t like it and raises “character value in \x{...} sequence is too large” instead.
I can’t reproduce that. This code works as intended.
dim rx as new RegEx
rx.SearchPattern = "[\x{10000}-\x{10FFFF}]" // every code point above U+FFFF
rx.ReplacementPattern = ""
rx.Options.ReplaceAllMatches = true
// Test string (UTF-8) with one character outside the 2-byte range
dim s as string = "a" + Chr( &h11000 ) + "b"
s = rx.Replace( s )
MsgBox s
So anyway, the difference appears to be the encoding of the target string. If I call Replace on the string as UTF-8, it works. If the string is already UTF-16, I get the exception.
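Putting that together, here is a rough sketch of the order of operations: strip while the string is UTF-8, convert afterwards. ToUCS2 is a made-up helper name, not anything built in, and I’m assuming the incoming string already has a defined encoding (ConvertEncoding can’t do much with a string whose encoding is Nil). Writing the BOM is left out.

```xojo
// Sketch only: ToUCS2 is a hypothetical helper, not part of Xojo.
Function ToUCS2(source As String) As String
  // Run the RegEx while the text is still UTF-8.
  Dim utf8 As String = ConvertEncoding(source, Encodings.UTF8)

  Dim rx As New RegEx
  rx.SearchPattern = "[\x{10000}-\x{10FFFF}]" // every code point above U+FFFF
  rx.ReplacementPattern = "" // or a placeholder such as "?"
  rx.Options.ReplaceAllMatches = True

  Dim cleaned As String = rx.Replace(utf8)

  // Only now convert to 2-byte code units.
  Return ConvertEncoding(cleaned, Encodings.UTF16LE)
End Function
```

Since nothing above U+FFFF is left by the time of the final conversion, the UTF-16LE result should contain no surrogate pairs, which is what makes it readable as UCS-2.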
As an alternative, you could convert to UTF-16, store in a MemoryBlock, then cycle through the bytes looking for UTF-16 pairs. I’m not sure which would be faster.
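For comparison, a minimal sketch of the MemoryBlock approach, untimed and assuming little-endian output. StripSurrogates is a hypothetical name. It drops every 16-bit unit in the surrogate range &hD800 to &hDFFF, which removes both halves of each pair and any stray lone surrogate.

```xojo
// Sketch only: StripSurrogates is a hypothetical helper, not part of Xojo.
Function StripSurrogates(source As String) As String
  // In UTF-16, every code unit is exactly 2 bytes.
  Dim utf16 As String = ConvertEncoding(source, Encodings.UTF16LE)

  Dim mbIn As MemoryBlock = utf16
  mbIn.LittleEndian = True
  Dim mbOut As New MemoryBlock(mbIn.Size)
  mbOut.LittleEndian = True

  Dim outPos As Integer = 0
  Dim lastUnit As Integer = mbIn.Size - 2
  For inPos As Integer = 0 To lastUnit Step 2
    Dim unit As Integer = mbIn.UInt16Value(inPos)
    // Copy only units outside the surrogate range.
    If unit < &hD800 Or unit > &hDFFF Then
      mbOut.UInt16Value(outPos) = unit
      outPos = outPos + 2
    End If
  Next

  // Keep only the bytes actually written and tag them as UTF-16LE.
  Return DefineEncoding(mbOut.StringValue(0, outPos), Encodings.UTF16LE)
End Function
```

To replace instead of remove, you could write a single placeholder unit when you hit a high surrogate (&hD800 to &hDBFF) and skip the low surrogate that follows.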
Fastest would probably be Dim Converted As String = ConvertEncoding("Source", Encodings.ASCII).ConvertEncoding(Encodings.UTF16LE), add a BOM, and to hell with the special characters!
I’m only half-joking, because I guarantee users are going to get very confused when opening a UCS-2 file in Notepad. I’m certain they will try it.