Need to remove 4-byte characters from String

I’ve run into a curious problem. I need to generate a UCS-2 encoded file. UCS-2 is the obsolete predecessor of UTF-16 that is strictly 2 bytes per character, which means it cannot represent any Unicode character above U+FFFF. Xojo does not have built-in support for UCS-2, so ConvertEncoding alone isn’t an option.

So my goal is to remove or replace 4-byte characters before (or after) ConvertEncoding so that what is produced is valid UCS-2. That’s proving to be difficult with String. Text handles this better, but Text isn’t an option in this case because I could be dealing with megabytes of text and Text on Windows is slower than atoms at absolute zero.

Using RegEx hasn’t worked because it can’t work with multibyte strings. Looping over each character with Mid reads each 4-byte character as two 2-byte characters.

Anybody have any fast ideas?

Don’t know where you got the idea that RegEx can’t work with multi-byte strings. The trick is to do the conversion in UTF-8, then convert to UTF-16.

rx.SearchPattern = "[\x{10000}-\x{FFFFF}]"
rx.ReplacementPattern = "•" // or something

[quote=453402:@Kem Tekinay]Don’t know where you got the idea that RegEx can’t work with multi-byte strings. The trick is to do the conversion in UTF-8, then convert to UTF-16.

rx.SearchPattern = "[\x{10000}-\x{FFFFF}]"
rx.ReplacementPattern = "•" // or something
[/quote]
The docs say it can’t handle 2-byte characters, so I assumed it couldn’t do 4-byte either. Regardless, I had already found that expression. Xojo doesn’t like it and raises “character value in \x{...} sequence is too large” instead.

I can’t reproduce that. This code works as intended.

dim rx as new RegEx
rx.SearchPattern = "[\x{10000}-\x{FFFFF}]" // characters outside the 2-byte range
rx.ReplacementPattern = "•"
rx.Options.ReplaceAllMatches = true

// "a", one 4-byte character (U+11000), "b"
dim s as string = "a" + Chr( &h11000 ) + "b"
s = rx.Replace( s )
MsgBox s

Try this:

[code]Dim Original As String = "Hello

(BTW, the top value is not \x{FFFFF}, but I haven’t stopped to figure it out.)

The forum is falling over displaying my code.

So anyway, the difference appears to be the encoding of the target string. If I call Replace on the string as UTF-8, it works. If the string is already UTF-16, I get the exception.

The high value is \x{10FFFF}.

Yeah. OK, so RegEx is a workable solution, as long as I do it before converting to UTF-16.
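
For anyone landing here later, a minimal sketch of that order of operations. ToUCS2 is a hypothetical helper name, not something built into Xojo. It assumes the incoming String already has a valid encoding defined (ConvertEncoding does nothing useful otherwise), uses the corrected \x{10FFFF} upper bound from the post above, and borrows the "•" placeholder from earlier in the thread.

```xojo
Function ToUCS2(source As String) As String
  // Hypothetical helper: replace every character above U+FFFF, then convert.
  // The RegEx has to run while the string is still UTF-8; running it on a
  // string that was already UTF-16 raised the exception reported above.
  Dim utf8 As String = ConvertEncoding(source, Encodings.UTF8)

  Dim rx As New RegEx
  rx.SearchPattern = "[\x{10000}-\x{10FFFF}]"
  rx.ReplacementPattern = "•" // any character at or below U+FFFF will do
  rx.Options.ReplaceAllMatches = True

  Dim cleaned As String = rx.Replace(utf8)

  // With no supplementary characters left, the UTF-16 output contains no
  // surrogate pairs, so every character is exactly one 2-byte code unit.
  Return ConvertEncoding(cleaned, Encodings.UTF16LE)
End Function
```

The result still needs a BOM written ahead of it if whatever reads the file expects one.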

As an alternative, you could convert to UTF-16, store it in a MemoryBlock, then cycle through the 2-byte code units looking for UTF-16 surrogate pairs (values from &hD800 through &hDFFF). I’m not sure which would be faster.
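
A rough sketch of the MemoryBlock idea, untimed, so no claim here about which approach is faster. StripSurrogates is a made-up name. The sketch assumes little-endian UTF-16 and simply drops every code unit in the surrogate range rather than substituting a replacement character.

```xojo
Function StripSurrogates(source As String) As String
  // Hypothetical helper: drop every UTF-16 surrogate code unit.
  Dim utf16 As String = ConvertEncoding(source, Encodings.UTF16LE)
  If utf16.LenB = 0 Then Return utf16

  Dim src As MemoryBlock = utf16
  src.LittleEndian = True
  // Output can never be longer than the input, so size it once up front.
  Dim dst As New MemoryBlock(src.Size)
  dst.LittleEndian = True

  Dim written As Integer = 0
  For offset As Integer = 0 To src.Size - 2 Step 2
    Dim codeUnit As UInt16 = src.UInt16Value(offset)
    // &hD800 through &hDFFF are surrogates; two of them in a row encode
    // one character above U+FFFF. Copy everything else unchanged.
    If codeUnit < &hD800 Or codeUnit > &hDFFF Then
      dst.UInt16Value(written) = codeUnit
      written = written + 2
    End If
  Next

  Return DefineEncoding(dst.StringValue(0, written), Encodings.UTF16LE)
End Function
```

To substitute a character instead of dropping, one option would be to write a single replacement code unit when a high surrogate (&hD800 through &hDBFF) is seen and skip the low surrogate that follows it.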

Fastest would probably be Dim Converted As String = ConvertEncoding("Source", Encodings.ASCII).ConvertEncoding(Encodings.UTF16LE), add a BOM, and to hell with the special characters!

I’m only half-joking, because I guarantee users are going to get very confused when opening a UCS-2 file in Notepad. I’m certain they will try it.