StyledText.RTFData encodes multibyte characters incorrectly.

Thom_McGrath · June 13, 2020, 7:21pm

<https://xojo.com/issue/60642> (currently private because Feedback is bugged, hopefully somebody will fix the case soon)

Edit: Well esoTalk really messed this up.

[code]Var Input As String = "🍉"

Var Styles As New StyledText
Var Run As New StyleRun(Input)
Styles.AddStyleRun(Run)

Var RTFData As String = Styles.RTFData

Styles = New StyledText
Styles.RTFData = RTFData
Var Result As String = Styles.StyleRun(0).Text[/code]

The input bytes are F09F 8D89 but the result comes back as EDA0 BCED BD89. Has anybody found a workaround? I’ve tried all sorts of combinations of DefineEncoding and ConvertEncoding, but those can’t help because the bytes themselves are wrong. I suspect the decoder is to blame, not the encoder.

Norman_Palardy · June 13, 2020, 7:41pm

there's a discussion about this on another forum and it seems that different encodings of the data can result in the same user perceived characters
what they found was that some text in Pages would get encoded as rtf one way, and Word as another
yet when viewed both could open the other and the perceived result is the same

is that possibly going on here ?

Thom_McGrath · June 13, 2020, 7:43pm

Possibly? I tried using TextEdit to create an RTF of the character and it encodes very differently. Xojo fails in exactly the same way when it tries to parse TextEdit’s data, but TextEdit has no trouble correctly parsing Xojo’s data. This is why I think Xojo’s parser is at fault here.

Emile_Schwarz · June 13, 2020, 9:39pm

Avoid using Apple’s TextEdit to create RTF files; the results are wrong with accented vowels (é, è, ê, ë, etc.).

kevin_g · June 15, 2020, 1:24pm

The UTF-8 sequence &hF09F &h8D89 is Unicode Code Point: &h1F349.

Since this is > &hFFFF it has to be represented in the RTF as a UTF-16 surrogate pair (&hD83C &hDF49), with each half written as a signed 16-bit decimal. The sequence in the RTF is \u-10180\u-8375, which is correct.
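
For anyone who wants to check the arithmetic, here is a minimal sketch of how that sequence is derived. The variable names are mine for illustration and are not part of the Xojo framework; it only shows the standard UTF-16 split followed by the wrap into the signed 16-bit range that RTF's \u keyword uses.

[code]Dim codePoint As Integer = &h1F349
Dim offset As Integer = codePoint - &h10000   ' &hF349, the 20-bit value split across the pair

' the top 10 bits go into the high surrogate, the bottom 10 bits into the low surrogate
Dim highSurrogate As Integer = &hD800 + (offset \ &h400)    ' &hD83C = 55356
Dim lowSurrogate As Integer = &hDC00 + (offset Mod &h400)   ' &hDF49 = 57161

' RTF writes \u values as signed 16-bit decimals, so anything above 32767 wraps negative
Dim rtfHigh As Integer = highSurrogate - &h10000   ' -10180
Dim rtfLow As Integer = lowSurrogate - &h10000     ' -8375[/code]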

Unfortunately, whoever wrote the RTF parser didn’t bother handling surrogate pairs, so instead of the pair being decoded into one code point, each surrogate is put into the output as its own UTF-8 character.
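
This also seems to account for the EDA0 BCED BD89 bytes reported in the first post. As a rough sketch (the function name is hypothetical, and this is my reading of the symptom rather than anything confirmed about the parser's internals), applying the ordinary three-byte UTF-8 layout to each surrogate half on its own gives exactly those bytes:

[code]Function SurrogateAsUTF8Hex(value As Integer) As String
  ' three-byte UTF-8 layout for 16-bit values: 1110xxxx 10xxxxxx 10xxxxxx
  Dim b1 As Integer = &hE0 Or (value \ &h1000)          ' top 4 bits
  Dim b2 As Integer = &h80 Or ((value \ &h40) And &h3F) ' middle 6 bits
  Dim b3 As Integer = &h80 Or (value And &h3F)          ' bottom 6 bits
  Return Hex(b1) + Hex(b2) + Hex(b3)
End Function[/code]

Calling SurrogateAsUTF8Hex(&hD83C) + SurrogateAsUTF8Hex(&hDF49) works out to EDA0BC followed by EDBD89, whereas a correct decoder would combine the pair first and emit F09F8D89.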

Fortunately, it looks like it is possible to fix the problem by checking for characters in the surrogate pair Unicode range and decoding them correctly.

Please see the attached sample I hacked up which demonstrates the problem and also fixes it.
NOTE. It appears that I couldn’t fix RTF that was imported into a TextArea. I had to import the RTF into a new StyledText object and then assign that to the TextArea afterwards.

  Const kUTF16HighSurrogateStart = &hD800
  Const kUTF16LowSurrogateStart = &hDC00
  Const kUTF16SurrogateEnd = &hDFFF
  Const k10BitsShift = &h400
  Const kSigned16BitValueIntoUnsigned16BitValueRange = &h10000
  
  Dim testStyledTextObj As StyledText
  Dim styleRunObj As StyleRun
  Dim importStyledTextObj As StyledText
  Dim brokenText, fixedText As String
  Dim styleRunCount, styleRunIndex As Int32
  Dim styleRunText As String
  Dim styleRunModified As Boolean
  Dim i As Int32
  Dim theChar As Int32
  Dim highSurrogate As Int32
  Dim lowSurrogate As Int32
  
  'make a test styledtext object
  testStyledTextObj = New StyledText
  testStyledTextObj.Text = ""
  
  styleRunObj = New StyleRun
  styleRunObj.Text = "Hello"
  styleRunObj.Italic = True
  testStyledTextObj.AppendStyleRun(styleRunObj)
  
  styleRunObj = New StyleRun
  styleRunObj.Text = DefineEncoding(ChrB(&hF0) + ChrB(&h9F) + ChrB(&h8D) + ChrB(&h89), Encodings.UTF8)
  styleRunObj.Size = 36
  testStyledTextObj.AppendStyleRun(styleRunObj)
  
  styleRunObj = New StyleRun
  styleRunObj.Text = "There"
  styleRunObj.Bold = True
  testStyledTextObj.AppendStyleRun(styleRunObj)
  
  
  'import the test styledtext rtf data into a new styledtext object which will corrupt the text if we have characters > unicode plane 0
  importStyledTextObj = New StyledText
  importStyledTextObj.RTFData = testStyledTextObj.RTFData
  
  brokenText = importStyledTextObj.Text
  
  
  'the xojo rtf parser does not decode utf-16 surrogate pairs so if the character is in the surrogate range we need to fix it
  styleRunCount = importStyledTextObj.StyleRunCount - 1
  For styleRunIndex = 0 To styleRunCount
    styleRunObj = importStyledTextObj.StyleRun(styleRunIndex)
    
    styleRunText = styleRunObj.Text
    styleRunModified = False
    
    For i = 1 To Len(styleRunText)
      theChar = Asc(Mid(styleRunText, i, 1))
      
      'is the character in surrogate range?
      If (theChar >= kUTF16HighSurrogateStart) And (theChar <= kUTF16SurrogateEnd) Then
        'convert this character and the next one to high and low surrogate values
        highSurrogate = (theChar - kUTF16HighSurrogateStart)
        
        lowSurrogate = Asc(Mid(styleRunText, i + 1, 1)) And &hFFFF
        lowSurrogate = lowSurrogate - kUTF16LowSurrogateStart
        
        
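        'make sure this really is a high surrogate followed by a low surrogate (each half carries exactly 10 bits)
        If (highSurrogate < k10BitsShift) And (lowSurrogate >= 0) And (lowSurrogate < k10BitsShift) Then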
          'convert the surrogates values to a unicode code point
          theChar = (highSurrogate * k10BitsShift) + lowSurrogate + kSigned16BitValueIntoUnsigned16BitValueRange
          
          'replace the two broken characters with the fixed character
          styleRunText = Mid(styleRunText, 1, i - 1) + Encodings.UTF8.Chr(theChar) + Mid(styleRunText, i + 2)
        Else
          'we have an unpaired surrogate so remove this character
          styleRunText = Mid(styleRunText, 1, i - 1) + Mid(styleRunText, i + 1)
        End If
        
        
        'flag that we have updated the stylerun
        styleRunModified = True
      End If
    Next
    
    
    'update the stylerun if we have modified it
    If styleRunModified = True Then
      styleRunObj.Text = styleRunText
      
      importStyledTextObj.RemoveStyleRun(styleRunIndex)
      importStyledTextObj.InsertStyleRun(styleRunObj, styleRunIndex)
    End If
  Next
  
  
  fixedText = importStyledTextObj.Text
  
  
  break

Thom_McGrath · June 15, 2020, 2:07pm

Thank you @Kevin Gale, that seems to be a valid workaround.