Need to remove 4-byte characters from String

  1. Thom M

    Sep 9 Pre-Release Testers Greater Hartford Area, CT

    I've run into a curious problem. I need to generate a UCS-2 encoded file. UCS-2 is the obsolete predecessor of UTF-16 and is strictly 2 bytes per character. This means it cannot represent every Unicode character. Xojo does not have built-in support for UCS-2, so ConvertEncoding isn't an option.

    So my goal is to remove or replace 4-byte characters before (or after) ConvertEncoding so that what is produced is valid UCS-2. That's proving to be difficult with String. Text handles this better, but Text isn't an option in this case because I could be dealing with megabytes of text and Text on Windows is slower than atoms at absolute zero.

    Using RegEx hasn't worked because it can't work with multibyte strings. Looping over each character with Mid reads each 4-byte character as two 2-byte characters.

    Anybody have any fast ideas?

  2. Kem T

    Sep 9 Pre-Release Testers, Xojo Pro, XDC Speakers Connecticut

    Don't know where you got the idea that RegEx can't work with multi-byte strings. The trick is to do the conversion in UTF-8, then convert to UTF-16.

    rx.SearchPattern = "[\x{10000}-\x{FFFFF}]"
    rx.ReplacementPattern = "•" // or something
  3. Thom M

    Sep 9 Pre-Release Testers Greater Hartford Area, CT

    @Kem T Don't know where you got the idea that RegEx can't work with multi-byte strings. The trick is to do the conversion in UTF-8, then convert to UTF-16.

    rx.SearchPattern = "[\x{10000}-\x{FFFFF}]"
    rx.ReplacementPattern = "•" // or something

    The docs say it can't handle 2-byte characters, so I would assume it can't do 4-byte either. Regardless, I found that expression already. Xojo doesn't like it and gives "character value in \x{...} sequence is too large" instead.

  4. Kem T

    Sep 9 Pre-Release Testers, Xojo Pro, XDC Speakers Connecticut

    I can't reproduce that. This code works as intended.

    dim rx as new RegEx
    rx.SearchPattern = "[\x{10000}-\x{FFFFF}]"
    rx.ReplacementPattern = "•"
    rx.Options.ReplaceAllMatches = true
    
    dim s as string = "a" + Chr( &h11000 ) + "b"
    s = rx.Replace( s )
    MsgBox s
  5. Thom M

    Sep 9 Pre-Release Testers Greater Hartford Area, CT

    Try this:

    [code]Dim Original As String = "Hello

  6. Kem T

    Sep 9 Pre-Release Testers, Xojo Pro, XDC Speakers Connecticut

    (BTW, the top value is not \x{FFFFF}, but I haven't stopped to figure it out.)

  7. Thom M

    Sep 9 Pre-Release Testers Greater Hartford Area, CT

    The forum is falling over displaying my code.

  8. Thom M

    Sep 9 Pre-Release Testers Greater Hartford Area, CT

    So anyway, the difference appears to be the encoding of the target string. If I call Replace on the string as UTF-8, it works. If the string is already UTF-16, I get the exception.

  9. Kem T

    Sep 9 Pre-Release Testers, Xojo Pro, XDC Speakers Connecticut

    The high value is \x{10FFFF}.

  10. Thom M

    Sep 9 Pre-Release Testers Greater Hartford Area, CT

    Yeah. OK, so RegEx is a workable solution, as long as I do it before converting to UTF-16.
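
    A minimal sketch of that order of operations, assuming the source string already carries a defined encoding. ToUCS2LE and the "?" placeholder are made up for illustration and are not built into Xojo; the pattern uses the \x{10FFFF} upper bound mentioned above.

    ```xojo
    // Hypothetical helper: ToUCS2LE is not a Xojo built-in.
    Function ToUCS2LE(source As String) As String
      // Work in UTF-8 first, since the replace raised an exception
      // when the target string was already UTF-16.
      Dim utf8 As String = ConvertEncoding(source, Encodings.UTF8)

      // Match every code point above U+FFFF (outside the 2-byte range).
      Dim rx As New RegEx
      rx.SearchPattern = "[\x{10000}-\x{10FFFF}]"
      rx.ReplacementPattern = "?" // assumed placeholder, swap in anything that fits in 2 bytes
      rx.Options.ReplaceAllMatches = True
      utf8 = rx.Replace(utf8)

      // Only 2-byte-representable characters remain, so the UTF-16LE
      // output should contain no surrogate pairs.
      Return ConvertEncoding(utf8, Encodings.UTF16LE)
    End Function
    ```

    Calling it would look like Dim ucs2 As String = ToUCS2LE(original). Writing a BOM in front of the result is left out here.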

  11. Kem T

    Sep 9 Pre-Release Testers, Xojo Pro, XDC Speakers Connecticut

    As an alternative, you could convert to UTF-16, store in a MemoryBlock, then cycle through the bytes looking for UTF-16 pairs. I'm not sure which would be faster.
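
    A rough sketch of that MemoryBlock approach, assuming the input has already been converted to UTF-16LE. StripSurrogates is a hypothetical helper name. It drops every surrogate code unit (&hD800 through &hDFFF) instead of replacing it, and it has not been benchmarked against the RegEx route.

    ```xojo
    // Hypothetical helper: StripSurrogates is not a Xojo built-in.
    Function StripSurrogates(utf16le As String) As String
      // Copy the string's bytes into a MemoryBlock for 2-byte access.
      Dim src As MemoryBlock = utf16le
      src.LittleEndian = True

      // The output can never be longer than the input.
      Dim dst As New MemoryBlock(src.Size)
      dst.LittleEndian = True

      Dim outPos As Integer = 0
      Dim lastOffset As Integer = src.Size - 2
      For inPos As Integer = 0 To lastOffset Step 2
        Dim unit As UInt16 = src.UInt16Value(inPos)
        // Keep only code units that are not half of a surrogate pair.
        If unit < &hD800 Or unit > &hDFFF Then
          dst.UInt16Value(outPos) = unit
          outPos = outPos + 2
        End If
      Next

      // Tag the surviving bytes as UTF-16LE again.
      Return DefineEncoding(dst.StringValue(0, outPos), Encodings.UTF16LE)
    End Function
    ```

    To substitute a placeholder instead of deleting, one option would be to write a single replacement unit whenever a high surrogate (&hD800 through &hDBFF) is seen and skip the low surrogate that follows.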

  12. Thom M

    Sep 9 Pre-Release Testers Greater Hartford Area, CT

    Fastest would probably be Dim Converted As String = ConvertEncoding("Source", Encodings.ASCII).ConvertEncoding(Encodings.UTF16LE), add a BOM, and to hell with the special characters!

    I'm only half-joking, because I guarantee users are going to get very confused when opening a UCS-2 file in Notepad. I'm certain they will try it.
