Please fix extraordinarily slow parser for RTFData

Christian_Schmitz · May 16, 2014, 6:33pm

If you have a good benchmark, you could email me. Maybe I can improve my code.
But it may be simply that walking through style objects is slow.

Mike_D · May 16, 2014, 6:49pm

I uploaded the demo project, modified to use MBS. In my testing, it takes 18 seconds to use the built-in Xojo framework calls, and 11 seconds to use MBS.

50% faster, yes, but still about 1000x as slow as it should be.

Mike_D · May 16, 2014, 6:51pm

Regarding Bob’s Formatted Text Control: http://www.bkeeney.com/formatted-text-control/ I see two issues:

I downloaded the demo app and launched it, and on a Retina MacBook it seems borked: drag & drop has the wrong coordinates, and it seems to stop responding to key presses. Not a comforting first demo.
The license seems to suggest that there’s no ability to use FTC to make compiled apps which are delivered to end customers? I may be reading it wrong, but my end goal is to use it to sell $3 apps in the Mac App Store.

Bob_Keeney · May 16, 2014, 7:10pm

The demo is from an older version. I need to update that. Plus, I have some newer code that hasn’t made it to release yet.

Mike_D · May 16, 2014, 7:18pm

Bob - thanks. I’m hoping to avoid needing FTC as I really just want a solution to my existing code, but if Christian is unable to work some magic and get a 500x speedup, FTC may be in my near future.

Norman_P · May 16, 2014, 7:23pm

If this is Mac only (i.e., Mac App Store), why not use the declares Massimo posted way back when?
That will use the system RTF reader, which is about as quick as you’re going to get.

Mike_D · May 16, 2014, 7:29pm

I’m using StyledText objects, and from reading the thread I’m understanding that the declares won’t work - they only work for TextArea objects. Is that correct?

Norman_P · May 16, 2014, 7:43pm

Ah yeah I must have missed that in your posts
Declares won’t give you styled text objects

Christian_Schmitz · May 16, 2014, 8:01pm

Well, as Michael and I wrote in the feedback case, it comes down to the StringDBCSMid3 function, which takes up most of the time.

Internally we have to use the functions like Paragraph() to get the information. The plugin can be better with string concats and avoid some overhead, but it can’t improve the functions it calls.

Jonathan_Ashwell · May 16, 2014, 8:12pm

Michael, one of the workarounds I’ve developed to deal with this (dreadful) problem is, on OS X, to stuff the StyledText in a hidden TextArea, use the Declares Norman mentioned, and then get the StyledText back out. I’ve also written a little StyledTextToRTF method. It’s all very kludgey, but it improves performance hundreds of fold (or more) when the block of text is large.

Mike_D · May 16, 2014, 8:20pm

While eating lunch, I was thinking along these lines myself: the functionality of StyledText and StyleRun is not that complex (especially as I don’t use all the features in my app). I think I’ll take a stab at writing my own versions. If I come up with anything workable, I’ll post them here.

Michel_Bujardet · May 16, 2014, 8:42pm

The project posted by Axel Schneider in https://forum.xojo.com/10186-troubles-with-rtf works for styled text, uses MacOSLib and is extremely fast.

Mike_D · May 16, 2014, 10:18pm

A few hours later, and I’ve got about half of it done - I’ve written code which takes a StyledText object and converts its StyleRuns to RTF. While doing this, I think I see where the performance hit may come in.

RTF is a 7-bit system, which has been extended for UTF8. As such, during the conversion you must, at some point, scan your UTF8 string and convert any characters outside the range 0…127 to an equivalent RTF string. The algorithm looks something like this:

  dim r as string
  for i as integer = 1 to len(src)
    dim s as string = src.mid(i, 1)   ' fetch the i-th character
    dim x as integer = asc(s)         ' its code point
    if x > 127 then
      r = r + "\uc0 \u" + format(x, "#") + " "  ' \uc0 indicates "there is no substitution character"
    else
      r = r + s                       ' 7-bit characters pass through unchanged
    end if
  next
  return r

For short strings this is no problem. But as the size of the src string grows, the mid() operator becomes a bottleneck, since in a UTF8 string the mid operator needs to scan characters from the beginning of the string each time. To get to the 5000th character requires at least 5000 comparisons. To get to the 5001st requires another 5001 operations, etc. Summed over a string of n characters that is roughly n²/2 operations, which is “Order N-squared” in “Big O” notation.
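To make the quadratic growth concrete, here is a small model of the cost, written in Python rather than Xojo so it can be run standalone. It is only a sketch under the assumption stated above (each lookup rescans the UTF-8 bytes from the start); `utf8_mid` and `total_scan_cost` are hypothetical names, not Xojo framework functions, and this is not a claim about how the framework is actually implemented.

```python
def utf8_mid(data, index):
    """Return (character, bytes_examined) for the 1-based index-th
    character of UTF-8 encoded bytes, scanning from the first byte."""
    seen = 0
    for pos, byte in enumerate(data):
        # Bytes of the form 10xxxxxx are continuation bytes; anything
        # else starts a new character.
        if (byte & 0xC0) != 0x80:
            seen += 1
            if seen == index:
                end = pos + 1
                while end < len(data) and (data[end] & 0xC0) == 0x80:
                    end += 1
                return data[pos:end].decode("utf-8"), pos + 1
    raise IndexError(index)


def total_scan_cost(text):
    """Total bytes examined when every character is fetched one by one."""
    data = text.encode("utf-8")
    return sum(utf8_mid(data, i)[1] for i in range(1, len(text) + 1))


if __name__ == "__main__":
    for n in (100, 200, 400):
        print(n, total_scan_cost("a" * n))
```

Doubling the length of the string roughly quadruples the total number of bytes examined, which is the N-squared behaviour described above.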

I can think of a few optimizations here:

Only process strings in “paragraph” sized chunks: this works great as long as your text has embedded endOfLines, but otherwise is not a good general solution.
Come up with a replacement for the Mid() operator that is faster when dealing with UTF8 strings.
Don’t use UTF8? Perhaps pre-converting the string to a fixed-width format such as UTF32 would speed things up?

I’ll do some tests and report back.

Mike_D · May 17, 2014, 1:57am

I wrote pure Xojo code to convert a StyledText object to RTFData, and to parse RTFData and make a StyledText object. I also tested out Massimo’s declare-based code.

Test results with 8KB of RTF text, compiled app:


                 Time (msec)
              -----------------
Method        FromRTF     ToRTF
-------------------------------
Xojo Native      1300      8400
Custom Class       12         8
Declares            1         1



Test results with 32KB of RTF text, compiled app:


                 Time (msec)
              -----------------
Method        FromRTF     ToRTF
-------------------------------
Xojo Native     80000       ???  (gave up after 10 minutes of beachball)
Custom Class       36        31
Declares            4         4

Conclusions:

The Xojo-based framework is seriously borked in Cocoa builds, and is practically unusable except for tiny amounts of RTF (this is what everyone else has found, no surprise here).
If you only need to work with RTFData going into and out of TextAreas, and don’t care about stand-alone StyledText objects, then use Massimo’s Declares, as they are wicked fast.
If you need more flexibility to work with StyledText objects, it’s relatively easy to write a bare-bones RTF encoder/decoder, and performance is very good, easily 100x faster than the Xojo framework.

Mike_D · May 17, 2014, 3:44pm

I’ve uploaded version 1 of a module that works around the performance issues: https://forum.xojo.com/12034-rtfutils-fixes-slow-styledtext-rtfdata-performance

Jonathan_Ashwell · May 17, 2014, 9:19pm

I think you can speed this up quite a bit by using Split(src, "") instead of Mid(), and then evaluating each element (letter) in the array. Replace any non-ASCII character with the correct RTF entity, then return Join(b, "").
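The Split/Join idea can be sketched as follows. This is Python rather than Xojo so that it runs standalone, and `escape_rtf_unicode` is a hypothetical name; it mirrors the loop posted earlier in the thread, including writing the code point as an unsigned decimal. The point is the shape of the algorithm: split once, visit each character once, join once.

```python
def escape_rtf_unicode(src):
    """Replace every character above 127 with an RTF \\u escape,
    visiting each character exactly once."""
    parts = list(src)                 # counterpart of Split(src, "")
    for i, ch in enumerate(parts):
        code = ord(ch)                # counterpart of Asc()
        if code > 127:
            # \uc0 says that no substitution character follows the escape.
            parts[i] = "\\uc0 \\u" + str(code) + " "
    return "".join(parts)             # counterpart of Join(b, "")


if __name__ == "__main__":
    print(escape_rtf_unicode("caf\u00e9"))
```

Building the result with a single join also avoids the repeated string concatenation of the original loop.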

Mike_D · May 17, 2014, 11:43pm

Oddly enough, I did very little performance optimizing in my code, and it’s already 100x faster than the Xojo framework - so I suspect that there’s something else amiss in the framework. But that’s good news, in a way, as it just means there’s more room for optimization later if needed…

Jonathan_Ashwell · May 17, 2014, 11:48pm

I’ve spent a lot of time optimizing Xojo RTF handling. This is a really (really) basic function if you want to save/restore/manipulate styled text. I’m baffled as to why it hasn’t been addressed by Xojo.

Jonathan_Ashwell · May 18, 2014, 11:58am

One more thing…in your code you should dim x as Int16, not Integer. This will allow it to generate negative RTF unicode entities, which are used for many Asian (and no doubt other) characters.

Mike_D · May 18, 2014, 3:07pm

In Xojo, “Integer” is a signed 32-bit value which can hold positive and negative values. An Int16 is a signed 16-bit value. So, any value that an Int16 can hold, an Integer can also hold. I don’t think this would matter in practice.
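For context, the RTF specification describes the parameter of \uN as a signed 16-bit number, so values above 32767 are written as negative decimals. The variable type does not limit what can be stored, but the decimal text that ends up in the file differs. A minimal sketch in Python (not Xojo), with the hypothetical name `rtf_u_values`; writing characters outside the Basic Multilingual Plane as two UTF-16 code units is an assumption based on how common RTF writers behave:

```python
def rtf_u_values(ch):
    """Return the signed 16-bit numbers that would follow \\u for one
    character, one number per UTF-16 code unit."""
    units = ch.encode("utf-16-le")
    values = []
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "little")
        # Reinterpret the unsigned 16-bit code unit as a signed value.
        values.append(unit - 0x10000 if unit > 0x7FFF else unit)
    return values


if __name__ == "__main__":
    for ch in ("\u00e9", "\u4e2d", "\uac00"):
        print(hex(ord(ch)), rtf_u_values(ch))
```

Here U+00E9 and U+4E2D stay positive (233 and 20013), while U+AC00 (44032 unsigned) comes out as -21504.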