Convert rtf to html

Does anyone have some code to convert rtf to html? Don’t want to reinvent the wheel. I don’t need a full parser, just some colours and styles. textutil on the Mac does the conversion I need. But the CSS differs from execution to execution.

Did you try NSAttributedStringMBS class?

Thanks, NSAttributedStringMBS works almost.

this was on the -RIP- realsoftware forum …

' converts an rtf formatted Text To a pure html String.
' from http://forums.realsoftware.com/viewtopic.php?f=6&t=39729

Dim st As new StyledText
if st<>nil Then
  if IsRTFText( aRTFvalue) Then
    st.RTFData = aRTFvalue
  Else
    st.Text = aRTFvalue
  End If
  
  Dim Result As Boolean
  Dim Count, I As Integer
  Dim AllPiecesOfPage, PieceOfPage As String
  Dim Piece, Temp, WholeWebPage As String
  Dim SR As StyleRun
  
  Count = st.StyleRunCount
  AllPiecesOfPage = ""
  
  For I = 0 To Count - 1
    
    SR = st.StyleRun( I )
    PieceOfPage = SR.Text
    
    If SR.Bold = True Then
      PieceOfPage = "<b>" + PieceOfPage + "</b>"
    End If
    
    If SR.Italic = True Then
      PieceOfPage = "<i>" + PieceOfPage + "</i>"
    End If
    
    If SR.Underline = True Then
      PieceOfPage = "<u>" + PieceOfPage + "</u>"
    End If
    
    PieceOfPage = ReplaceAll (PieceOfPage, kLineFeed+kLineFeed, "<p>" )
    
    PieceOfPage = ReplaceAll (PieceOfPage, kLineFeed, "<br />" )
    
    Piece = "<span style=" + kGuillemet + "font-family: " + SR.Font
    Piece.AppendText ";  font-size: " + Str ( SR.Size) + "pt; color: " + Mid ( Str ( SR.TextColor ), 3 ) 
    Piece.AppendText ";" + kGuillemet + ">" + PieceOfPage + "</span>"
    
    AllPiecesOfPage =  AllPiecesOfPage + Piece
    
  Next I
  
  WholeWebPage = "<html>" + EndOfLine +  "<body style=""font-size: 30pt;"">"
  WholeWebPage = WholeWebPage + AllPiecesOfPage + EndOfLine
  WholeWebPage = WholeWebPage + "</body>" + EndOfLine + "</html>" + EndOfLine
  
  Return WholeWebPage
End If

Perfect, thanks!

Can you tell me the content of

kGuillemet

A guillemet is a french type of quote. Not sure why the original author uses the guillemet instead of a simple quote.

Here is my updated code:

Private Function MakeHtmlFromStyledText(theStyledText as StyledText) as String
  
  dim theText(-1) as String
  
  theText.Append("<p>")
  For currentRun as Integer = 0 To theStyledText.StyleRunCount - 1
    
    dim theRun as StyleRun = theStyledText.StyleRun(currentRun)
    dim StyleRunText as String = EncodingToHTMLMBS(theRun.Text)
    
    'bold, italic and unterline
    If theRun.Bold Then StyleRunText = "<b>" + StyleRunText + "</b>"
    If theRun.Italic Then StyleRunText = "<i>" + StyleRunText + "</i>"
    If theRun.Underline Then StyleRunText = "<u>" + StyleRunText + "</u>"
    
    'tab and br
    StyleRunText = StyleRunText.ReplaceAll(EndOfLine, "<br />")
    StyleRunText = StyleRunText.ReplaceAll(Encodings.UTF8.Chr(9), "&nbsp;&nbsp;&nbsp;&nbsp;")
    
    'font and color
    dim theColor as Color = theRun.TextColor
    dim ColorString as String = "rgb(" + str(theColor.Red) + ", " + str(theColor.Green) + ", " + str(theColor.Blue) + ")"
    dim FontSpan as String = "<span style=""font-family: " + theRun.Font + ";  font-size: " + Str(theRun.Size) + "pt; color: " + ColorString + ";"">" + StyleRunText + "</span>"
    
    theText.AddRow(FontSpan)
    
  Next
  theText.Append("</p>")
  
  return string.FromArray(theText)
End Function
2 Likes

oups sorry !
Public Const kGuillemet as Number = &u22

1 Like

Thank you :slightly_smiling_face:

Thank you, you too :slightly_smiling_face:

I have an old piece of code for html conversion. It is incomlete and the part handling a font is not so good:

Public Function ConversionRTFtoHTML(ta as textarea) as String

// ------------------------------------------------------------------------------------------------------------
// Konvertierung eines RTF-Textes in einen HTML-Text
// Conversion of a RTF text into a HTML text
// Enriched Text funktioniert leider nicht (Doku aus https://en.wikipedia.org/wiki/Enriched_text )
// Enriched text doesn’t work (docu from https://en.wikipedia.org/wiki/Enriched_text )
// ------------------------------------------------------------------------------------------------------------
// This method isn’t complete. The method contains only the formatting rules which I need at first.
//
// This method is useful to transfer a RTF text of a textarea to a html text for transferring this text per
// e-mail. So isn’t CSS useful - as far as I know.
// ------------------------------------------------------------------------------------------------------------
// --> ta Textarea
// <-- string HTML text
// ------------------------------------------------------------------------------------------------------------
if ta is nil then
return “”
end if

Dim sr as StyleRun
dim lv_SRText as String
dim et as string
dim lv_OpenColor as boolean
dim lv_OpenBold as Boolean
dim lv_OpenItalic as Boolean
dim lv_OpenUnderline as Boolean

et = “”

//Get the number of styles
For i as integer = 0 to ta.StyledText.StyleRunCount - 1 //get the StyleRuns

sr = ta.StyledText.StyleRun( i )

if sr.Size = 0 or sr.Size = 13 then          // erst mal das obsolete "font face" nutzen - as a start trying the obsolet "font face"
  et = et + "<font face="+sr.Font+">"        // die Größe ist im HTML ein Wert von 1 bis 7, paßt ja nicht zur Pixel-Größe
                                             // the size is in HTML a value from 1 to 7, doesn't fit to a pixel size
else
  et = et + "<font face="+sr.Font+" size="+sr.Size.ToText+">"
end if

lv_SRText = sr.Text
'lv_SRText = ReplaceAll(lv_SRText ,"<","&lt;")      ' für das Kleiner-Zeichen < 
'lv_SRText = ReplaceAll(lv_SRText ,">","&gt;")      ' für das Größer-Zeichen > 
lv_SRText = ReplaceAll(lv_SRText ,"&","&amp;")     ' für das Und-Zeichen &
lv_SRText = ReplaceAllB(lv_SRText,"ä","&auml;")    ' für ä (Umlaut-a) 
lv_SRText = ReplaceAll(lv_SRText ,"Ä","&Auml;")    ' für Ä (Umlaut-A) 
lv_SRText = ReplaceAllB(lv_SRText,"ö","&ouml;")    ' für ö (Umlaut-o) 
lv_SRText = ReplaceAll(lv_SRText ,"Ö","&Ouml;")    ' für Ö (Umlaut-O) 
lv_SRText = ReplaceAllB(lv_SRText,"ü","&uuml;")    ' für ü (Umlaut-u) 
lv_SRText = ReplaceAll(lv_SRText ,"Ü","&Uuml;")    ' für Ü (Umlaut-U) 
lv_SRText = ReplaceAll(lv_SRText ,"ß","&szlig;")   ' für ß (scharfes s, s-z-Ligatur)

lv_SRText = ReplaceAll(lv_SRText ,"é","&eacute;")  ' für é (e mit Acute-Akzent) 
lv_SRText = ReplaceAll(lv_SRText ,"à","&agrave;")  ' für à (a mit Grave-Akzent) 
lv_SRText = ReplaceAll(lv_SRText ,"ô","&ocirc;")   ' für ô (o mit Circumflex-Akzent) 
lv_SRText = ReplaceAll(lv_SRText ,"ç","&ccedil;")  ' für ç (c mit Cedille) 
lv_SRText = ReplaceAll(lv_SRText ,"ñ","&ntilde;")  ' für ñ (n mit Tilde) 
lv_SRText = ReplaceAll(lv_SRText ,"å","&aring;")   ' für å (a mit Ring)

lv_SRText = ReplaceAll(lv_SRText ,"°","&deg;")     ' für das Grad-Symbol ° (degree) 
lv_SRText = ReplaceAll(lv_SRText ,"©","&copy;")    ' für das Copyright-Symbol © 
lv_SRText = ReplaceAll(lv_SRText ,"®","&reg;")     ' für das Eingetragene-Marke-Symbol ®
lv_SRText = ReplaceAll(lv_SRText ,"§","&sect;")    ' für das deutsche Paragraphen-Zeichen § (section) 
lv_SRText = ReplaceAll(lv_SRText ,"¶","&para;")    ' für das amerikanische Absatz-Zeichen ¶ (paragraph)

lv_SRText = ReplaceAll(lv_SRText,EndOfLine,"<br>") ' neue Zeile

select case sr.TextColor
case color.Red
  et = et+"<span style=""color:red"">"
  lv_OpenColor = true
case color.Blue
  et = et+"<span style=""color:blue"">"
  lv_OpenColor = true
case color.Green
  et = et+"<span style=""color:green"">"
  lv_OpenColor = true
case color.Yellow
  et = et+"<span style=""color:yellow"">"
  lv_OpenColor = true
case color.Cyan
  et = et+"<span style=""color:cyan"">"
  lv_OpenColor = true
case color.Magenta
  et = et+"<span style=""color:magenta"">"
  lv_OpenColor = true
end select

if sr.Bold then
  et = et+"<strong>"
  lv_OpenBold = true
end if

if sr.Italic then
  et = et+"<em>"
  lv_OpenItalic = true
end if

if sr.Underline then
  et = et+"<u>"
  lv_OpenUnderline = true
end if

et = et + lv_SRText

if lv_OpenUnderline then
  et = et+"</u>"
  lv_OpenUnderline = false
end if
if lv_OpenItalic then
  et = et+"</em>"
  lv_OpenItalic = false
end if
if lv_OpenBold then
  et = et+"</strong>"
  lv_OpenBold = false
end if
if lv_OpenColor then
  et = et+"</span>"
  lv_OpenColor = false
end if

et = et + "</font>"

next

et = et + “”

return et

End Function

but it’s handling of the special characters is better !

But that is the only :woozy_face:

I’ve copied the source text into the edit field but some lines of code becomes corrupted. So I try again and look at the preview:

Public Function ConversionRTFtoHTML(ta as textarea) as String
  // ------------------------------------------------------------------------------------------------------------
  // Konvertierung eines RTF-Textes in einen HTML-Text 
  // Conversion of a RTF text into a HTML text 
  // Enriched Text funktioniert leider nicht (Doku aus https://en.wikipedia.org/wiki/Enriched_text ) 
  // Enriched text doesn't work (docu from https://en.wikipedia.org/wiki/Enriched_text ) 
  // ------------------------------------------------------------------------------------------------------------
  // This method isn't complete. The method contains only the formatting rules which I need at first.
  //
  // This method is useful to transfer a RTF text of a textarea to a html text for transferring this text per
  // e-mail. So isn't CSS useful - as far as I know.
  // ------------------------------------------------------------------------------------------------------------
  // --> ta     Textarea 
  // <-- string HTML text
  // ------------------------------------------------------------------------------------------------------------
  if ta is nil then
    return ""
  end if
  
  Dim sr               as StyleRun
  dim lv_SRText        as String
  dim et               as string
  dim lv_OpenColor     as boolean
  dim lv_OpenBold      as Boolean
  dim lv_OpenItalic    as Boolean
  dim lv_OpenUnderline as Boolean
  
  et = "<html><body>"
  
  //Get the number of styles
  For i as integer = 0 to ta.StyledText.StyleRunCount - 1  //get the StyleRuns
    
    sr = ta.StyledText.StyleRun( i )
    
    if sr.Size = 0 or sr.Size = 13 then          // erst mal das obsolete "font face" nutzen - as a start trying the obsolet "font face"
      et = et + "<font face="+sr.Font+">"        // die Größe ist im HTML ein Wert von 1 bis 7, paßt ja nicht zur Pixel-Größe
                                                 // the size is in HTML a value from 1 to 7, doesn't fit to a pixel size
    else
      et = et + "<font face="+sr.Font+" size="+sr.Size.ToText+">"
    end if
    
    lv_SRText = sr.Text
    'lv_SRText = ReplaceAll(lv_SRText ,"<","&lt;")      ' für das Kleiner-Zeichen < 
    'lv_SRText = ReplaceAll(lv_SRText ,">","&gt;")      ' für das Größer-Zeichen > 
    lv_SRText = ReplaceAll(lv_SRText ,"&","&amp;")     ' für das Und-Zeichen &
    lv_SRText = ReplaceAllB(lv_SRText,"ä","&auml;")    ' für ä (Umlaut-a) 
    lv_SRText = ReplaceAll(lv_SRText ,"Ä","&Auml;")    ' für Ä (Umlaut-A) 
    lv_SRText = ReplaceAllB(lv_SRText,"ö","&ouml;")    ' für ö (Umlaut-o) 
    lv_SRText = ReplaceAll(lv_SRText ,"Ö","&Ouml;")    ' für Ö (Umlaut-O) 
    lv_SRText = ReplaceAllB(lv_SRText,"ü","&uuml;")    ' für ü (Umlaut-u) 
    lv_SRText = ReplaceAll(lv_SRText ,"Ü","&Uuml;")    ' für Ü (Umlaut-U) 
    lv_SRText = ReplaceAll(lv_SRText ,"ß","&szlig;")   ' für ß (scharfes s, s-z-Ligatur)
    
    lv_SRText = ReplaceAll(lv_SRText ,"é","&eacute;")  ' für é (e mit Acute-Akzent) 
    lv_SRText = ReplaceAll(lv_SRText ,"à","&agrave;")  ' für à (a mit Grave-Akzent) 
    lv_SRText = ReplaceAll(lv_SRText ,"ô","&ocirc;")   ' für ô (o mit Circumflex-Akzent) 
    lv_SRText = ReplaceAll(lv_SRText ,"ç","&ccedil;")  ' für ç (c mit Cedille) 
    lv_SRText = ReplaceAll(lv_SRText ,"ñ","&ntilde;")  ' für ñ (n mit Tilde) 
    lv_SRText = ReplaceAll(lv_SRText ,"å","&aring;")   ' für å (a mit Ring)
    
    lv_SRText = ReplaceAll(lv_SRText ,"°","&deg;")     ' für das Grad-Symbol ° (degree) 
    lv_SRText = ReplaceAll(lv_SRText ,"©","&copy;")    ' für das Copyright-Symbol © 
    lv_SRText = ReplaceAll(lv_SRText ,"®","&reg;")     ' für das Eingetragene-Marke-Symbol ®
    lv_SRText = ReplaceAll(lv_SRText ,"§","&sect;")    ' für das deutsche Paragraphen-Zeichen § (section) 
    lv_SRText = ReplaceAll(lv_SRText ,"¶","&para;")    ' für das amerikanische Absatz-Zeichen ¶ (paragraph)
    
    lv_SRText = ReplaceAll(lv_SRText,EndOfLine,"<br>") ' neue Zeile
    
    select case sr.TextColor
    case color.Red
      et = et+"<span style=""color:red"">"
      lv_OpenColor = true
    case color.Blue
      et = et+"<span style=""color:blue"">"
      lv_OpenColor = true
    case color.Green
      et = et+"<span style=""color:green"">"
      lv_OpenColor = true
    case color.Yellow
      et = et+"<span style=""color:yellow"">"
      lv_OpenColor = true
    case color.Cyan
      et = et+"<span style=""color:cyan"">"
      lv_OpenColor = true
    case color.Magenta
      et = et+"<span style=""color:magenta"">"
      lv_OpenColor = true
    end select
    
    if sr.Bold then
      et = et+"<strong>"
      lv_OpenBold = true
    end if
    
    if sr.Italic then
      et = et+"<em>"
      lv_OpenItalic = true
    end if
    
    if sr.Underline then
      et = et+"<u>"
      lv_OpenUnderline = true
    end if
    
    et = et + lv_SRText
    
    if lv_OpenUnderline then
      et = et+"</u>"
      lv_OpenUnderline = false
    end if
    if lv_OpenItalic then
      et = et+"</em>"
      lv_OpenItalic = false
    end if
    if lv_OpenBold then
      et = et+"</strong>"
      lv_OpenBold = false
    end if
    if lv_OpenColor then
      et = et+"</span>"
      lv_OpenColor = false
    end if
    
    et = et + "</font>"
    
  next
  
  et = et + "</body></html>"
  
  return et
  
End Function

Cough:

Wow RTF is still alive and among us? May I ask why and from where you are recieving RTF formated text? If asked I would deny to develop any parser or converter.

Perhaps because it is/was an easy way to create a document the end user can open with MS Word, w/o writing an own docx? My understanding is that the PO wants to switch to HTML but in an easy way w/o recreating everything from scratch.

Grin… I’m doing quite a few blog posts on AppleScript. Of course, I want to show the code in the blog post. It looks nicer if the code is formatted. I have a CLI called ASHtml. But the CLI has some drawbacks. After looking at the source code I thought that I could do this myself in Xojo. It’s just one call to the MBS plugin to get the code of the AppleScript as rtf. Then I just need to make some ugly html out of the html.

I’m going to make the project available in the next days.

Any update here ? we have a project where we have a lot of rtf files and we try to migrate them to html in a way or another . some documentation or examples will help a lot .

Thanks .

I made an app that I use for my blog posts on AppleScript: “Convert AppleScript to Html” at Other Applications . Once the rtf is in a TextArea it’s simple to make some html.

1 Like