<https://xojo.com/issue/60642> (currently private because Feedback is bugged; hopefully somebody will fix the case soon)
Edit: Well esoTalk really messed this up.
[code]Var Input As String = Encodings.UTF8.Chr(&h1F349) ' U+1F349 (UTF-8 bytes F09F 8D89); the forum stripped the literal character
Var Styles As New StyledText
Var Run As New StyleRun(Input)
Styles.AddStyleRun(Run)
Var RTFData As String = Styles.RTFData
Styles = New StyledText
Styles.RTFData = RTFData
Var Result As String = Styles.StyleRun(0).Text[/code]
The input bytes are F09F 8D89 but the result comes back as EDA0 BCED BD89. Has anybody found a workaround? I’ve tried all sorts of combinations of DefineEncoding and ConvertEncoding, but they can’t help because the bytes themselves are wrong. I suspect the decoder is to blame, not the encoder.
There’s a discussion about this on another forum, and it seems that different encodings of the data can result in the same user-perceived characters.
What they found was that Pages would encode some text as RTF one way and Word another.
Yet each could open the other’s file, and the perceived result was the same.
Possibly? I tried using TextEdit to create an RTF of the character and it encodes very differently. Xojo fails in exactly the same way when it tries to parse TextEdit’s data, but TextEdit has no trouble correctly parsing Xojo’s data. This is why I think Xojo’s parser is at fault here.
The UTF-8 sequence &hF09F &h8D89 is Unicode Code Point: &h1F349.
Since this is > &hFFFF it has to be represented in the RTF as a UTF-16 surrogate pair. The sequence in the RTF is \u-10180\u-8375, which is correct.
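The arithmetic behind that sequence can be checked by hand. The following is a minimal sketch, not taken from the attached sample; the variable names are made up for illustration, and it assumes the RTF writer emits each UTF-16 code unit as a signed 16-bit decimal, which is what the negative numbers suggest.

```xojo
' Worked example: split U+1F349 into a UTF-16 surrogate pair.
Var codePoint As Integer = &h1F349
Var offset As Integer = codePoint - &h10000                  ' &hF349
Var highSurrogate As Integer = &hD800 + (offset \ &h400)     ' &hD83C (top 10 bits)
Var lowSurrogate As Integer = &hDC00 + (offset Mod &h400)    ' &hDF49 (bottom 10 bits)

' Assumption: \uN holds a signed 16-bit value, so code units above 32767 wrap negative.
Var rtfHigh As Integer = highSurrogate - &h10000             ' -10180
Var rtfLow As Integer = lowSurrogate - &h10000               ' -8375
```

Both results match the \u values in the RTF, which is why the encoder side looks fine.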
Unfortunately, whoever wrote the RTF parser didn’t bother handling surrogate pairs, so instead of being decoded into one character, the two surrogates are each put into the output as separate UTF-8 characters.
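This also accounts for the exact bytes reported earlier in the thread. As a rough sketch (hypothetical variable names, plain integer arithmetic only), pushing a surrogate through the ordinary 3-byte UTF-8 pattern reproduces the first half of EDA0 BCED BD89:

```xojo
' Encode a lone surrogate with the 3-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx.
Var surrogate As Integer = &hD83C                                ' high surrogate of U+1F349
Var byte1 As Integer = &hE0 Or (surrogate \ &h1000)              ' &hED
Var byte2 As Integer = &h80 Or ((surrogate \ &h40) And &h3F)     ' &hA0
Var byte3 As Integer = &h80 Or (surrogate And &h3F)              ' &hBC
' Repeating the same steps for the low surrogate &hDF49 gives &hED &hBD &h89.
```

So the six bytes in the result are the two surrogates encoded one at a time, not a damaged copy of the original four bytes.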
Fortunately, it looks like it is possible to fix the problem by checking for characters in the surrogate pair Unicode range and decoding them correctly.
Please see the attached sample I hacked up which demonstrates the problem and also fixes it.
NOTE. It appears that I couldn’t fix RTF that was imported into a TextArea. I had to import the RTF into a new StyledText object and then assign that to the TextArea afterwards.
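For reference, the decoding step mentioned above boils down to the following arithmetic. This is only a sketch with illustrative names, starting from the two signed values found in the RTF; the full sample does the same thing per character inside each StyleRun.

```xojo
' Turn the two signed \u values back into a single code point.
Var rtfHigh As Integer = -10180
Var rtfLow As Integer = -8375

' Negative values are 16-bit wrap-arounds, so add &h10000 to recover the code units.
Var highSurrogate As Integer = rtfHigh + &h10000                 ' &hD83C
Var lowSurrogate As Integer = rtfLow + &h10000                   ' &hDF49

' Combine the two 10-bit halves and add back the &h10000 offset.
Var codePoint As Integer = ((highSurrogate - &hD800) * &h400) + (lowSurrogate - &hDC00) + &h10000   ' &h1F349
Var fixedChar As String = Encodings.UTF8.Chr(codePoint)
```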
[code]Const kUTF16HighSurrogateStart = &hD800
Const kUTF16LowSurrogateStart = &hDC00
Const kUTF16SurrogateEnd = &hDFFF
Const k10BitsShift = &h400
Const kSigned16BitValueIntoUnsigned16BitValueRange = &h10000

Dim testStyledTextObj As StyledText
Dim styleRunObj As StyleRun
Dim importStyledTextObj As StyledText
Dim brokenText, fixedText As String
Dim styleRunCount, styleRunIndex As Int32
Dim styleRunText As String
Dim styleRunModified As Boolean
Dim i As Int32
Dim theChar As Int32
Dim highSurrogate As Int32
Dim lowSurrogate As Int32

'Make a test StyledText object with three runs; the middle one holds U+1F349
testStyledTextObj = New StyledText
testStyledTextObj.Text = ""
styleRunObj = New StyleRun
styleRunObj.Text = "Hello"
styleRunObj.Italic = True
testStyledTextObj.AppendStyleRun(styleRunObj)
styleRunObj = New StyleRun
styleRunObj.Text = DefineEncoding(ChrB(&hF0) + ChrB(&h9F) + ChrB(&h8D) + ChrB(&h89), Encodings.UTF8)
styleRunObj.Size = 36
testStyledTextObj.AppendStyleRun(styleRunObj)
styleRunObj = New StyleRun
styleRunObj.Text = "There"
styleRunObj.Bold = True
testStyledTextObj.AppendStyleRun(styleRunObj)

'Import the test RTF data into a new StyledText object, which corrupts any characters outside Unicode plane 0
importStyledTextObj = New StyledText
importStyledTextObj.RTFData = testStyledTextObj.RTFData
brokenText = importStyledTextObj.Text

'The Xojo RTF parser does not decode UTF-16 surrogate pairs, so fix any characters in the surrogate range
styleRunCount = importStyledTextObj.StyleRunCount - 1
For styleRunIndex = 0 To styleRunCount
  styleRunObj = importStyledTextObj.StyleRun(styleRunIndex)
  styleRunText = styleRunObj.Text
  styleRunModified = False
  'Use a While loop because the string changes length as it is repaired
  i = 1
  While i <= Len(styleRunText)
    theChar = Asc(Mid(styleRunText, i, 1))
    'Is the character in the surrogate range?
    If (theChar >= kUTF16HighSurrogateStart) And (theChar <= kUTF16SurrogateEnd) Then
      'Convert this character and the next one to high and low surrogate offsets
      highSurrogate = theChar - kUTF16HighSurrogateStart
      lowSurrogate = (Asc(Mid(styleRunText, i + 1, 1)) And &hFFFF) - kUTF16LowSurrogateStart
      'Both offsets must fit in 10 bits: a high surrogate followed by a low surrogate
      If (highSurrogate < k10BitsShift) And (lowSurrogate >= 0) And (lowSurrogate < k10BitsShift) Then
        'Convert the surrogate offsets to a Unicode code point
        theChar = (highSurrogate * k10BitsShift) + lowSurrogate + kSigned16BitValueIntoUnsigned16BitValueRange
        'Replace the two broken characters with the fixed character
        styleRunText = Mid(styleRunText, 1, i - 1) + Encodings.UTF8.Chr(theChar) + Mid(styleRunText, i + 2)
        i = i + 1
      Else
        'Unpaired surrogate: remove it and re-examine whatever now sits at position i
        styleRunText = Mid(styleRunText, 1, i - 1) + Mid(styleRunText, i + 1)
      End If
      'Flag that we have updated the StyleRun
      styleRunModified = True
    Else
      i = i + 1
    End If
  Wend
  'Update the StyleRun if we have modified it
  If styleRunModified = True Then
    styleRunObj.Text = styleRunText
    importStyledTextObj.RemoveStyleRun(styleRunIndex)
    importStyledTextObj.InsertStyleRun(styleRunObj, styleRunIndex)
  End If
Next
fixedText = importStyledTextObj.Text

'Stop in the debugger so brokenText and fixedText can be compared
Break[/code]