Please fix extraordinarily slow parser for RTFData

Jonathan_Ashwell · May 18, 2014, 7:21pm

But RTF Unicode entities are 16-bit numbers. There is no \u32768.

Christian_Schmitz · May 18, 2014, 7:56pm

But a lot of apps put bigger numbers there for higher Unicode characters.

Mike_D · May 18, 2014, 8:39pm

Ok, the plot thickens. I realized that the code I wrote was ignoring the Paragraph Alignment property. So I added this code at the beginning:

  // unfortunately the Alignment is not stored in the style runs, so we have to get that from the Paragraph() items
  // cache this for speed
  dim i,u as integer
  dim paragraphs() as Paragraph
  u = st.ParagraphCount-1
  for i = 0 to u
    dim p as Paragraph=st.Paragraph(i)
    paragraphs.append p
  next

Suddenly, my code is running nearly as slowly as the built-in Xojo framework code. The culprit is the call to StyledText.Paragraph(i). With any non-trivial amount of data, its performance degrades dramatically as the text grows.

This leaves us with some choices:

- Ignore alignment (or have a simple rule that the entire TextArea must have a single alignment).
- Find some sort of workaround for the Paragraph(i) slowdown, perhaps another Declare?
- Give up and put pressure on Xojo to fix the bug…

Christian_Schmitz · May 18, 2014, 8:58pm

Well, if ignoring paragraphs is okay, I could also speed up my plugin code…

Maybe someone from Xojo could look up what the StringDBCSMid3 function does and why it’s so slow?

Norman_P · May 18, 2014, 9:41pm

I suspect it’s related to UTF-8 “mid” being something you have to calculate every time: each “character” can be one or more bytes, so to get the nth “character” you have to interpret all the bytes before it, then grab the bytes for this character, and then return whatever character they represent.

It’s a well-known issue with multibyte representations like UTF-8 that can use a varying number of bytes.
UCS-2 and UCS-4, which are fixed size, can be quicker as you can quickly compute the correct offset in the byte stream.
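
To illustrate the point (this is not code from the framework), here is a minimal sketch of what finding the nth character in UTF-8 involves. The function name `ByteOffsetOfCharacter` is hypothetical, and the sketch assumes the string holds valid UTF-8. Every lookup has to scan from the first byte, which is why repeated character-indexed access gets expensive on long text.

```xojo
Function ByteOffsetOfCharacter(s As String, charIndex As Integer) As Integer
  // Hypothetical helper, not a framework method.
  // Returns the 1-based byte offset of the charIndex-th character
  // (1-based) in a UTF-8 string, or 0 if there is no such character.
  Dim count As Integer = 0
  Dim lastByte As Integer = LenB(s)
  For bytePos As Integer = 1 To lastByte
    Dim b As Integer = AscB(MidB(s, bytePos, 1))
    // Bytes of the form 10xxxxxx are continuation bytes;
    // any other byte starts a new character.
    If (b And &hC0) <> &h80 Then
      count = count + 1
      If count = charIndex Then Return bytePos
    End If
  Next
  Return 0
End Function
```

With a fixed-width encoding such as UCS-2 the same answer is just (charIndex - 1) * 2 + 1, with no scan at all.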

Christian_Schmitz · May 18, 2014, 10:00pm

Sounds like you could add a function Paragraphs() that returns an array of Paragraphs for a StyledText. This function could then walk over all paragraphs in one pass, using MidB() instead of Mid() and calculating the offsets and lengths itself. Would that work?
This method could then be used in the RTFData function.
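
A rough sketch of the single-pass idea, working on the plain text only. `ParagraphByteOffsets` is a hypothetical name, and the sketch assumes the text is UTF-8 with line endings already normalized to LF. It does not touch StyledText or the plugin at all.

```xojo
Function ParagraphByteOffsets(s As String) As Integer()
  // Hypothetical helper, not a framework method.
  // Assumes s is UTF-8 and uses LF (Chr(10)) as the paragraph separator.
  // Returns the 1-based byte offset at which each paragraph starts.
  Dim starts() As Integer
  starts.Append 1 // the first paragraph starts at byte 1
  
  Dim lf As String = Chr(10)
  // In UTF-8 the byte &h0A never occurs inside a multibyte character,
  // so a byte-wise search is safe here.
  Dim pos As Integer = InStrB(1, s, lf)
  While pos > 0
    starts.Append pos + 1
    pos = InStrB(pos + 1, s, lf)
  Wend
  Return starts
End Function
```

Each paragraph's bytes can then be fetched with MidB() between two consecutive offsets, so the text is walked once instead of once per paragraph. If the text ends with a line feed, the last offset points one past the end and stands for an empty final paragraph.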

Mike_D · May 18, 2014, 10:06pm

Yes - I’m doing some tests and the logic looks like this:

- Get the text from the TextArea and split it into an array of characters.
- Get the NSTextStorage from the TextArea (which is a subclass of NSAttributedString).
- Walk the array of characters, and at each newline that shows up, ask the NSTextStorage for the paragraph styles at that point.

This approach seems feasible, though it again will only work on a TextArea, not a StyledText - unless a StyledText is also internally some sort of Cocoa subclass?

Jonathan_Ashwell · May 18, 2014, 10:54pm

@Christian. I don’t know if many apps output Unicode entities with values > 32767, but Microsoft Word doesn’t. And here’s the relevant passage from p. 11 of RTF Specs 1.9 (it’s from 2007, but I think it is the latest):

“[Unicode] Text is handled using the 16-bit Unicode character-encoding scheme.”

And this is from the Rich Text Format Wikipedia entry:

“For a Unicode escape the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode UTF-16 code unit number.”

Joe_Ranieri · May 19, 2014, 12:00am

I don’t see it in the spec, but Unicode scalar values outside of the basic multilingual plane are broken into two UTF-16 code units (a surrogate pair) and written out as two \u control words. An easy example is opening TextEdit, going to the character picker, inserting GRINNING FACE (U+1F600), and saving it as RTF. You’ll end up with this:

\uc0\u55357 \u56832
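
For reference, those two numbers follow from the standard UTF-16 surrogate arithmetic. A minimal sketch, with the hypothetical name `SplitIntoSurrogates`, assuming the input is a code point between &h10000 and &h10FFFF:

```xojo
Sub SplitIntoSurrogates(codePoint As Integer, ByRef highUnit As Integer, ByRef lowUnit As Integer)
  // Hypothetical helper, not a framework method.
  // Only meaningful for code points from &h10000 to &h10FFFF.
  Dim v As Integer = codePoint - &h10000
  highUnit = &hD800 + (v \ &h400)   // top 10 bits of the 20-bit value
  lowUnit = &hDC00 + (v Mod &h400)  // bottom 10 bits
End Sub
```

For U+1F600, v is &hF600 (62976), so highUnit is 55296 + 61 = 55357 and lowUnit is 56320 + 512 = 56832, matching what TextEdit wrote.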

Jonathan_Ashwell · May 19, 2014, 2:12am

Yes, I’ve seen that as well. But I’ve also seen that Asian characters with a negative value in Word don’t display properly when given a value of > 32767. I think there is a lot of ad hoc interpretation of RTF. But if one adheres to the spec (and the UTF-8 is precomposed), then I think one should use signed Int16.
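
A small sketch of what using a signed Int16 means in practice when writing a code unit. `RTFUnicodeControlWord` is a hypothetical name, and the sketch leaves out whatever fallback characters the current \uc setting would require after the control word.

```xojo
Function RTFUnicodeControlWord(codeUnit As Integer) As String
  // Hypothetical helper, not a framework method.
  // codeUnit is a UTF-16 code unit in the range 0 to 65535.
  Dim value As Integer = codeUnit
  // Values above 32767 wrap around to negative numbers,
  // as they would in a signed 16-bit integer.
  If value > 32767 Then value = value - 65536
  Return "\u" + Str(value)
End Function
```

With this convention the surrogate pair 55357, 56832 would be written as \u-10179 and \u-8704. A reader can undo it by adding 65536 to any negative value.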