RTF Unicode Conversion

Hi everyone,

Looking for a working method to convert Unicode characters > 32767 (emoji etc.) to valid RTF code. Can anyone help, please?

Greetings

I know a solution for OS X, but apparently you want one for all platforms?

Yes, you are right Thomas :wink:

https://en.wikipedia.org/wiki/Rich_Text_Format#Character_encoding

\uXXXX

I know that, Norman. :slight_smile: I'm looking for a method to get the RTF code for a multibyte character.

Have you tried entering such a character in TextEdit, saving the file, and looking at the code?

Yes, I tried. This is the RTF file code for the :slight_smile: emoji:

{\rtf1\ansi\ansicpg1252 \uc0\u55357 \u56842 }

\u specifies the code point, which can, depending on the encoding, be one or more bytes.
You don't put in a \u for each byte.

For instance, the smiley looks like it's \u1F60A.

Norman… no offense… but have you tried using \u1f60A in an RTF document?
it doesn’t do anything but print “f160a”

however

\uc0\u55357 \u56832

DOES result in a :slight_smile:

{\rtf1\ansi\ansicpg1252\cocoartf1404\cocoasubrtf460
{\fonttbl
\f0\fswiss\fcharset0 Helvetica;
\f1\fnil\fcharset0 Monaco;
}

\f0 this is Unicode 1f60a [ \u1f160a ] using Helvetica

\f1 this is Unicode 1f60a [ \u1f160a ] using Monaco

\f0 \uc0\u55357 \u56832 from Helvetica

\f1 \uc0\u55357 \u56832 from Helvetica


}

If I misunderstood what you tried to convey, please show me what I did incorrectly.

You cannot display an emoji with Helvetica, it simply does not contain the character.

Well, If StyledText.RTFData doesn’t do it, maybe StyledText.RTFDataMBS does?
Or TextArea.RTFDataMBS?

Is this about Unicode chars that do not fit into a 16 bit code? Then here’s the way to do that, I suspect (not tested):

First, put the code into a String and convert the string to Encodings.UTF16. Then get the length; it should give 2, not 1. Now get these two chars, probably with Mid. Each of them is a 16 bit char that you can put into a \u code.

If that doesn’t work, put that string into a MemoryBlock, which should then be 4 bytes long, and get the two 16 bit values out of it.
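A rough, untested sketch of the MemoryBlock variant described above, in Xojo. The sample character, the variable names, and the choice of Encodings.UTF16BE (so that the byte order is known) are assumptions for illustration, not code from anyone's project:

```xojo
' Sketch only. Assumes s holds one character that does not fit into 16 bits.
Dim s As String = Encodings.UTF8.Chr(&h1F60A) ' sample character, U+1F60A

' Convert to big-endian UTF-16 so each 16 bit unit has a known byte order
Dim utf16 As String = ConvertEncoding(s, Encodings.UTF16BE)

' Copy the bytes into a MemoryBlock; a surrogate pair occupies 4 bytes
Dim mb As MemoryBlock = utf16
mb.LittleEndian = False

' Emit one decimal \u code per 16 bit unit
Dim rtf As String = "\uc0 "
For offset As Integer = 0 To mb.Size - 2 Step 2
  rtf = rtf + "\u" + Str(mb.UInt16Value(offset)) + " "
Next
```

For U+1F60A the two UTF-16 units are 55357 and 56842, the same pair that appears in the TextEdit file earlier in the thread.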

Well sorry, but obviously you can, either that or RTF format ignores the “\Fx” codes… because the RTF code I posted above DOES work otherwise I would not have posted it.

And I think what the OP is looking for (perhaps I’m wrong) is a way to create, via code, an RTF document containing Unicode characters (not necessarily “smileys”).

Hi.

The code below has been cut directly from one of our RB projects. We had to create Unicode RTF so that we could put styled text onto the clipboard from our app (we don’t use StyledText). The code only converts a String to RTF so you would need to add all of the other standard RTF stuff to the output.

NOTES.

  1. The code probably won’t run by itself but should help you solve your problem.

  2. The platform specific code was needed to make OS X & MS-Windows accept the data on the clipboard. There really shouldn’t be any need to do this for RTF written to a file since RTF files ‘should’ be cross platform.

  3. We used StringHandleMBS in outputTempBuffer for speed. I imagine this variable could be a String array or just a String.

  4. addTagSeparator is used to add a space in the situation where RTF is already on the line and the first character to be written is within the standard ASCII range (i.e. not written as an RTF tag).

  5. The code assumes the input (styleText) is valid UTF-8.

Kev.

Dim outputTempBuffer As StringHandleMBS
Dim styleText As String
Dim theChar As String
Dim theCharVal As Integer
Dim addTagSeparator As Boolean
Dim firstUnicodeChar As Boolean
Dim textLength As Integer
Dim textArray(-1) As String
Dim i2 As Integer

'replace characters within the string to make it more compatible with rtf
styleText = ReplaceAll(styleText, "\", "\\")
styleText = ReplaceAll(styleText, kTab, "\tab ")
styleText = ReplaceAll(styleText, Chr(13), "\" + Chr(13))
styleText = ReplaceAll(styleText, "—", "\emdash ")
styleText = ReplaceAll(styleText, "–", "\endash ")
styleText = ReplaceAll(styleText, "•", "\bullet ")
styleText = ReplaceAll(styleText, "‘", "\lquote ")
styleText = ReplaceAll(styleText, "’", "\rquote ")
styleText = ReplaceAll(styleText, "“", "\ldblquote ")
styleText = ReplaceAll(styleText, "”", "\rdblquote ")

firstUnicodeChar = True

'split the string into an array of characters as this is much faster to parse
textArray = Split(styleText, "")
textLength = UBound(textArray)
For i2 = 0 To textLength
  theChar = textArray(i2)
  theCharVal = Asc(theChar)
  If theCharVal < 128 Then
    If addTagSeparator = True Then
      outputTempBuffer.Add(" ")
      addTagSeparator = False
    End If
    outputTempBuffer.Add(theChar)
  Else
    If firstUnicodeChar = True Then
      #If TargetMacOS Then
        outputTempBuffer.Add("\uc0 ")
      #Else
        outputTempBuffer.Add("\uc1 ")
      #EndIf
      firstUnicodeChar = False
    End If
    
    If theCharVal > 65535 Then
      theCharVal = 65535 - theCharVal
    End If
    
    #If TargetMacOS Then
      outputTempBuffer.Add("\u" + Str(theCharVal) + " ")
    #Else
      'under win32 we need to add a dummy place holder character to make the data more compatible
      outputTempBuffer.Add("\u" + Str(theCharVal) + "\'20")
    #EndIf
    
    addTagSeparator = False
  End If
Next

I just noticed the above did not work correctly when a UTF-8 byte sequence was 4 or more bytes. On a Mac, the solution seems to be to convert the character to UTF-16 and write out the surrogate pair.

 Dim outputTempBuffer As StringHandleMBS
 Dim styleText As String
 Dim theChar As String
 Dim theCharVal, theCharVal2 As Integer
 Dim addTagSeparator As Boolean
 Dim firstUnicodeChar As Boolean
 Dim textLength As Integer
 Dim textArray(-1) As String
 Dim i2 As Integer

styleText = Mid(pText, index, theStyleRun.length)

'replace characters within the string to make it more compatible with rtf
styleText = ReplaceAll(styleText, "\", "\\")
styleText = ReplaceAll(styleText, kTab, "\tab ")
styleText = ReplaceAll(styleText, Chr(13), "\" + Chr(13))
styleText = ReplaceAll(styleText, "—", "\emdash ")
styleText = ReplaceAll(styleText, "–", "\endash ")
styleText = ReplaceAll(styleText, "•", "\bullet ")
styleText = ReplaceAll(styleText, "‘", "\lquote ")
styleText = ReplaceAll(styleText, "’", "\rquote ")
styleText = ReplaceAll(styleText, "“", "\ldblquote ")
styleText = ReplaceAll(styleText, "”", "\rdblquote ")

firstUnicodeChar = True

'split the string into an array of characters as this is much faster to parse
textArray = Split(styleText, "")
textLength = UBound(textArray)
For i2 = 0 To textLength
  theChar = textArray(i2)
  theCharVal = Asc(theChar)
  
  If theCharVal < 128 Then
    If addTagSeparator = True Then
      outputTempBuffer.Add(" ")
      addTagSeparator = False
    End If
    outputTempBuffer.Add(theChar)
  Else
    If firstUnicodeChar = True Then
      #If TargetMacOS Then
        outputTempBuffer.Add("\uc0 ")
      #Else
        'under win32 we need to add a dummy place holder character to make the data more compatible
        outputTempBuffer.Add("\uc1 ")
      #EndIf
      firstUnicodeChar = False
    End If
    
    If LenB(theChar) > 3 Then
      theChar = ConvertEncoding(theChar, Encodings.UTF16)
      theCharVal = Asc(Mid(theChar, 1, 1))
      theCharVal2 = Asc(Mid(theChar, 2, 1))
      
      #If TargetMacOS Then
        outputTempBuffer.Add("\u" + Str(theCharVal) + " ")
        outputTempBuffer.Add("\u" + Str(theCharVal2) + " ")
      #Else
        'under win32 we need to add a dummy place holder character to make the data more compatible
        outputTempBuffer.Add("\u" + Str(theCharVal) + "\'20")
        outputTempBuffer.Add("\u" + Str(theCharVal2) + "\'20")
      #EndIf
    Else
      If theCharVal > 65535 Then
        theCharVal = 65535 - theCharVal
      End If
      
      #If TargetMacOS Then
        outputTempBuffer.Add("\u" + Str(theCharVal) + " ")
      #Else
        'under win32 we need to add a dummy place holder character to make the data more compatible
        outputTempBuffer.Add("\u" + Str(theCharVal) + "\'20")
      #EndIf
    End If
    
    addTagSeparator = False
  End If
Next

I really wouldn’t argue with Michel over font-based topics.

Yes, you can see an emoji if the surrounding font is Helvetica, but you know the actual emoji isn’t Helvetica, right?
That’s what he’s saying.

Mac OS X replaces Helvetica with the next font that contains the desired glyph.

What I am saying… is that the simple RTF code I posted above… shows a smiley (but not for the \u1F160a),
and it shows it in any OS X app that recognizes RTF
AND
in Windows apps such as Word, so if there is any substitution, both OSes manage to do it.

[quote=257568:@Dave S]What I am saying… is that the simple RTF code I posted above… shows a smiley (but not for the \u1F160a),
and it shows it in any OS X app that recognizes RTF
AND
in Windows apps such as Word, so if there is any substitution, both OSes manage to do it.[/quote]

The values in the RTF tags have to be base 10 and not base 16. I have posted some sample code which should help with the conversion.
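To illustrate the base 10 point for a single character, here is a minimal sketch in Xojo. RTFUnicodeEscape is a hypothetical helper name, not part of Xojo or of the project code in this thread, and the sketch assumes \uc0 is already in effect so that no placeholder character follows each tag:

```xojo
' Hypothetical helper (illustration only): returns the decimal RTF \u tag(s)
' for one Unicode code point. Assumes \uc0 has been written earlier.
Function RTFUnicodeEscape(codePoint As Integer) As String
  Dim result As String
  If codePoint > 65535 Then
    ' Does not fit into 16 bits: split into a UTF-16 surrogate pair
    Dim offsetValue As Integer = codePoint - 65536                 ' subtract hex 10000
    Dim highSurrogate As Integer = 55296 + (offsetValue \ 1024)    ' hex D800 + top 10 bits
    Dim lowSurrogate As Integer = 56320 + (offsetValue Mod 1024)   ' hex DC00 + low 10 bits
    result = "\u" + Str(highSurrogate) + " \u" + Str(lowSurrogate) + " "
  Else
    ' Fits into 16 bits: write the value as a single decimal tag
    result = "\u" + Str(codePoint) + " "
  End If
  Return result
End Function
```

For code point 128522 (hex 1F60A) the arithmetic gives 55357 and 56842, so the result is "\u55357 \u56842 ", which is the same pair that TextEdit wrote for the emoji.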

The substitution is a feature of the text layout engine within the operating system (the feature is more commonly known as Font Fallback). Word processors normally interact with Font Fallback and change the font from the one you specified to one containing the glyph. For example, if you choose Times New Roman in MS-Word and paste the smiley, you should find that the font set on the smiley is not the one you chose. If you paste a lot of text (maybe containing different scripts), you might find that multiple fonts get used. I'm not really sure what happens when you open an RTF file, as I’ve never tried that before.

When you use DrawString in Xojo, Font Fallback happens behind the scenes. One of the reasons why trying to generate PDF output from Xojo can be difficult is that PDF does not do Font Fallback and requires that the glyphs exist in the font specified. If you tried to create a PDF from code that used the Helvetica font for a smiley, you would get either a missing glyph symbol or a space when the PDF was rendered. This means that you have to make sure that your text is really displayable using the font, or you can retrieve the data that Font Fallback generates and style the output with the fonts it chose. Using a PDF printer driver doesn’t suffer from these problems, as such drivers normally go via the operating system and so get Font Fallback for free.

@Kevin Gale: works perfectly. I declared outputTempBuffer as a String. No need to use MBS classes.

Thanks to all of you for the Input :slight_smile: