Parsing and Joining Two Hex/Escaped Multibyte Characters in Shift_JIS

Hi. I’m trying to read data from an RTF file and have found that some charsets, such as Shift_JIS (Japanese), use multibyte characters: two escaped bytes that need to be combined into a single character.

For instance, \'82\'a2 = い.

The \'82 is the “leading” byte and signals there is another “trailing” byte to come.

I think I know the lead-byte ranges that signal a double-byte sequence, (>= &h81 and <= &h9f) or (>= &he0 and <= &hfc), but I need to know how to combine the two escaped bytes into a single UTF-8 encoded character.

I also need to know whether Shift_JIS is the only charset with this structure and, if not, which others have it and whether the lead-byte range differs among them.

Hope someone can help me with this. Thanks.

Dim mb As New MemoryBlock(2)
Dim s As String
mb.LittleEndian = False
mb.Byte(0) = leadByte // e.g. &h82
mb.Byte(1) = trailByte // e.g. &ha2
s = s.DefineEncoding(Encodings.ShiftJIS)
s = mb.StringValue(0, 2) // the two raw bytes as a string
s = s.DefineEncoding(Encodings.ShiftJIS) // yes, I know I did this twice; I have found defensive coding is often needed with encodings
s = s.ConvertEncoding(Encodings.UTF8)
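For anyone following along outside Xojo, here is a rough Python sketch of the same idea: collect the raw bytes from the \'hh escapes, tag them as Shift_JIS, then convert to UTF-8. The function names (is_sjis_lead_byte, decode_rtf_hex_run, pair_to_utf8) are made up for this illustration, and the sketch assumes the run passed in contains only \'hh escapes and that the RTF's code page really is Shift_JIS.

```python
import re

# Matches one RTF hex escape such as \'82 and captures the two hex digits.
HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")


def is_sjis_lead_byte(b: int) -> bool:
    """Lead-byte ranges quoted in the question: &h81-&h9f and &he0-&hfc."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def decode_rtf_hex_run(run: str, codec: str = "shift_jis") -> str:
    """Decode a run of consecutive \\'hh escapes (assumed to be all the run holds)."""
    raw = bytes(int(h, 16) for h in HEX_ESCAPE.findall(run))
    # Decoding the whole run at once lets the codec pair lead and trail bytes.
    return raw.decode(codec)


def pair_to_utf8(lead: int, trail: int) -> bytes:
    """Combine one lead/trail pair and return the UTF-8 bytes of the character."""
    if not is_sjis_lead_byte(lead):
        raise ValueError(f"0x{lead:02x} is not a Shift_JIS lead byte")
    return bytes([lead, trail]).decode("shift_jis").encode("utf-8")
```

With the example from the question, decode_rtf_hex_run(r"\'82\'a2") gives "い", and pair_to_utf8(0x82, 0xA2) gives the three UTF-8 bytes E3 81 84. Gathering a whole run of escapes before decoding means you may not need the lead-byte test at all, since the decoder does the pairing for you.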

As for variable-width character sets, pretty much any encoding whose target language is Chinese, Japanese, or Korean will be variable-width (GB18030 is particularly annoying, with one-byte, two-byte, and four-byte characters). UTF-8 and UTF-16 are variable-width as well.
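To see the variable widths concretely, a small Python sketch like the one below encodes a few sample characters and reports how many bytes each one takes. The helper name encoded_width and the choice of samples are assumptions for illustration only; it relies on Python's built-in codecs rather than anything in Xojo.

```python
from typing import Optional

# Sample characters: ASCII letter, hiragana "i", and a CJK Extension B ideograph.
SAMPLES = ["A", "\u3044", "\U00020000"]
CODECS = ["shift_jis", "gb18030", "utf-8", "utf-16-be"]


def encoded_width(ch: str, codec: str) -> Optional[int]:
    """Return the byte length of ch in codec, or None if it cannot be encoded."""
    try:
        return len(ch.encode(codec))
    except UnicodeEncodeError:
        return None


def width_table() -> dict:
    """Map each codec name to the list of widths for SAMPLES, in order."""
    return {codec: [encoded_width(ch, codec) for ch in SAMPLES] for codec in CODECS}
```

Calling width_table() shows GB18030 using 1, 2, and 4 bytes for the three samples, while Shift_JIS uses 1 and 2 bytes and cannot encode the third at all. Because each encoding lays out its lead bytes differently, it is usually safer to hand the raw bytes to the encoding's own converter than to hard-code one range check for every charset.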

That helped a lot. Thank you!