Need to remove 4-byte characters from String

Thom_McGrath · September 9, 2019, 9:35pm

I’ve run into a curious problem. I need to generate a UCS-2 encoded file. UCS-2 is an obsolete predecessor of UTF-16 that uses exactly 2 bytes per character, so it cannot represent any Unicode character above U+FFFF. Xojo does not have built-in support for UCS-2, so ConvertEncoding alone isn’t an option.

So my goal is to remove or replace 4-byte characters before (or after) ConvertEncoding so that what is produced is valid UCS-2. That’s proving to be difficult with String. Text handles this better, but Text isn’t an option in this case because I could be dealing with megabytes of text and Text on Windows is slower than atoms at absolute zero.

Using RegEx hasn’t worked because it can’t work with multibyte strings. Looping over each character with Mid reads each 4-byte character as two 2-byte characters.

Anybody have any fast ideas?

Kem_Tekinay · September 9, 2019, 9:45pm

Don’t know where you got the idea that RegEx can’t work with multi-byte strings. The trick is to do the replacement while the string is UTF-8, then convert to UTF-16.

rx.SearchPattern = "[\x{10000}-\x{FFFFF}]"
rx.ReplacementPattern = "" // or something

Thom_McGrath · September 9, 2019, 9:46pm

[quote=453402:@Kem Tekinay]Don’t know where you got the idea that RegEx can’t work with multi-byte strings. The trick is to do the replacement while the string is UTF-8, then convert to UTF-16.

rx.SearchPattern = "[\x{10000}-\x{FFFFF}]"
rx.ReplacementPattern = "" // or something[/quote]
The docs say it can’t handle 2-byte characters, so I assumed it can’t handle 4-byte ones either. Regardless, I had already found that expression. Xojo doesn’t like it: it raises “character value in \x{...} sequence is too large” instead.

Kem_Tekinay · September 9, 2019, 9:49pm

I can’t reproduce that. This code works as intended.

dim rx as new RegEx
rx.SearchPattern = "[\x{10000}-\x{FFFFF}]"
rx.ReplacementPattern = ""
rx.Options.ReplaceAllMatches = true

dim s as string = "a" + Chr( &h11000 ) + "b"
s = rx.Replace( s )
MsgBox s

Thom_McGrath · September 9, 2019, 9:51pm

Try this:

[code]Dim Original As String = "Hello

Kem_Tekinay · September 9, 2019, 9:52pm

(BTW, the top value is not \x{FFFFF}, but I haven’t stopped to figure it out.)

Thom_McGrath · September 9, 2019, 9:53pm

The forum is falling over displaying my code.

Thom_McGrath · September 9, 2019, 9:54pm

So anyway, the difference appears to be the encoding of the target string. If I call Replace on the string as UTF-8, it works. If the string is already UTF-16, I get the exception.

Kem_Tekinay · September 9, 2019, 9:56pm

The high value is \x{10FFFF}.

Thom_McGrath · September 9, 2019, 10:03pm

Yeah. OK, so RegEx is a workable solution, as long as I do it before converting to UTF-16.
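
For reference, here is a minimal sketch of that order of operations, pieced together from the snippets in this thread: remove everything above U+FFFF while the string is still UTF-8, then convert. The function name `ToUCS2` and its `Replacement` parameter are made up for illustration, and the sketch assumes `Source` already has a valid encoding defined and that UTF-16LE output without a BOM is what is wanted.

```xojo
Function ToUCS2(Source As String, Replacement As String = "") As String
  // Hypothetical helper, not part of Xojo.
  // Do the replacement in UTF-8: as reported above, calling Replace on a
  // string that is already UTF-16 raises an exception.
  Dim utf8 As String = ConvertEncoding(Source, Encodings.UTF8)

  Dim rx As New RegEx
  rx.SearchPattern = "[\x{10000}-\x{10FFFF}]" // every code point above U+FFFF
  rx.ReplacementPattern = Replacement
  rx.Options.ReplaceAllMatches = True

  Dim cleaned As String = rx.Replace(utf8)

  // With no supplementary characters left, the UTF-16 form contains no
  // surrogate pairs, so every character occupies exactly two bytes.
  Return ConvertEncoding(cleaned, Encodings.UTF16LE)
End Function
```

Calling `ToUCS2(text, "?")` would substitute a question mark for each removed character instead of dropping it. Note that the replacement text is used as a RegEx replacement pattern, so characters such as `\` or `$` in it may be interpreted specially.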

Kem_Tekinay · September 9, 2019, 10:03pm

As an alternative, you could convert to UTF-16, store in a MemoryBlock, then cycle through the bytes looking for UTF-16 pairs. I’m not sure which would be faster.
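
A rough sketch of what that scan might look like, in case someone wants to time it against the RegEx version. It is untested and makes a few assumptions: the helper name `StripSurrogates` is invented, the data is treated as UTF-16LE, and any code unit in the surrogate range U+D800 to U+DFFF is simply dropped rather than replaced.

```xojo
Function StripSurrogates(Source As String) As String
  // Hypothetical helper, not part of Xojo.
  If Source = "" Then Return ""

  // A String converts implicitly to a MemoryBlock holding its bytes.
  Dim src As MemoryBlock = ConvertEncoding(Source, Encodings.UTF16LE)
  src.LittleEndian = True
  Dim dst As New MemoryBlock(src.Size)
  dst.LittleEndian = True

  Dim written As Integer = 0
  Dim offset As Integer = 0
  While offset + 1 < src.Size
    Dim unit As UInt16 = src.UInt16Value(offset)
    // Code units in D800..DFFF are halves of a surrogate pair, which is how
    // UTF-16 encodes a 4-byte character. Copy everything else unchanged.
    If unit < &hD800 Or unit > &hDFFF Then
      dst.UInt16Value(written) = unit
      written = written + 2
    End If
    offset = offset + 2
  Wend

  // Only the first "written" bytes of dst hold output.
  Return DefineEncoding(dst.StringValue(0, written), Encodings.UTF16LE)
End Function
```

Because both halves of a pair fall in the surrogate range, checking each 16-bit unit on its own is enough; there is no need to look ahead for the second half.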

Thom_McGrath · September 9, 2019, 10:06pm

Fastest would probably be Dim Converted As String = ConvertEncoding("Source", Encodings.ASCII).ConvertEncoding(Encodings.UTF16LE), add a BOM, and to hell with the special characters!

I’m only half-joking, because I guarantee users are going to get very confused when opening a UCS-2 file in Notepad. I’m certain they will try it.