str and text split performance

Carlo_Rubini · December 10, 2015, 9:34am

Hello,
I have a big parenthesis-delimited text-file imported into my project, and I have to make an array out of it.
After replacing the classic String code with the new framework's Text equivalent, I notice a big difference in speed:

dim myStr as string = DefineEncoding(myTxtFile, Encodings.UTF8)
// kParenthesis is a constant: "]"

dim tt as Double = Ticks
dim mArray() as string = myStr.splitB(kParenthesis) // resulting ubound = 35500
label1.text = Format(Ticks - tt, "#") // = 1 or 2 ticks

and

dim tt as Double = Ticks
dim mArray() as text = myStr.toText.split(kParenthesis, 1)
label1.text = Format(Ticks - tt, "#") // = more or less 20 ticks

Any suggestion how to speed up the process? Thanks.

Eli_Ott · December 10, 2015, 9:59am

I think in terms of benchmarking you would need to compare Text.Split with String.Split (and not with String.SplitB).

Text.Split and String.Split will compare each character with the separator.
String.SplitB will compare each byte with the separator, which is much faster, but will not find multi-byte characters.
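
A like-for-like comparison along these lines could be sketched as follows. This is only a sketch under assumptions: myStr and kParenthesis are the names from the first post, the other variable names are made up for illustration, Microseconds is used because one tick is only 1/60 of a second, and the String-to-Text conversion is moved outside the timed region so that only the split itself is measured.

```xojo
// Hypothetical benchmark sketch; myStr and kParenthesis come from the first post,
// all other names are illustrative assumptions.
Dim t0 As Double

// Character-wise split on String
t0 = Microseconds
Dim byChar() As String = Split(myStr, kParenthesis)
Dim charTime As Double = Microseconds - t0

// Byte-wise split on String
t0 = Microseconds
Dim byByte() As String = SplitB(myStr, kParenthesis)
Dim byteTime As Double = Microseconds - t0

// Split on Text; the conversion is done before the timer starts
Dim t As Text = myStr.ToText
t0 = Microseconds
Dim asText() As Text = t.Split(kParenthesis, 1) // same call as in the first post
Dim textTime As Double = Microseconds - t0

System.DebugLog("Split: " + Str(charTime) + " microseconds")
System.DebugLog("SplitB: " + Str(byteTime) + " microseconds")
System.DebugLog("Text.Split: " + Str(textTime) + " microseconds")
```

If the Text figure drops noticeably once ToText is outside the timer, part of the gap in the first post is conversion cost rather than split cost.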

Carlo_Rubini · December 10, 2015, 10:09am

I thought that myStr.toText.split(kParenthesis, 1) was the equivalent of splitB.

Anyway, using string.split instead of string.splitB still returns 2 ticks:
dim tt as double = Ticks
dim mArray() as String = split(mBible, kParenthesis)
label71.text = Format(Ticks - tt, "#")

Michel_Bujardet · December 10, 2015, 11:07am

In iOS where String is not available, using a Xojo.Core.MemoryBlock together with a combination of IndexOf and Mid would probably be very fast. http://developer.xojo.com/xojo-core-textencoding will enable going to and from that MemoryBlock to Text.

Could be a bit heavy to construct, but once wrapped into a function, it can be made rather simple to use.

In Desktop or Web, I would simply use String for the splitting and then use ToText. Unless of course the file contains characters that require the Text datatype, such as composite characters.
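
For Desktop or Web, the "split as String, then ToText" idea might look roughly like this. It is a minimal sketch, assuming myStr already has its encoding defined as in the first post; parts and result are hypothetical names.

```xojo
// Sketch only: split with the classic String function, then convert each chunk to Text.
// myStr and kParenthesis are from the first post; parts and result are illustrative names.
Dim parts() As String = Split(myStr, kParenthesis)
Dim result() As Text

For i As Integer = 0 To parts.Ubound
  // ToText relies on the String having a valid, defined encoding
  result.Append(parts(i).ToText)
Next
```

The per-element conversion adds some cost of its own, so it is worth timing the whole loop against a plain Text.Split before committing to it.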

Carlo_Rubini · December 10, 2015, 11:37am

Actually I hoped to refactor the existing code in my apps with the new framework, but it seems that, at least for text.split, the time has not yet come to refactor.
As for using Xojo.Core.MemoryBlock together with a combination of IndexOf and Mid, well, I’m not fluent enough with memoryblocks.

Michel_Bujardet · December 10, 2015, 4:43pm

In iOS, that would be the only solution to speed up the process I guess. In Desktop and Web, we fortunately still have String for a long time to come.

Carlo_Rubini · December 11, 2015, 12:46am

Since I develop only for Desktop, String is my friend.

Marco_Hof · December 11, 2015, 1:36am

Careful with 64-bit builds though.
Split and Join can give very weird results on strings.

Michel_Bujardet · December 11, 2015, 1:55am

[quote=235093:@Marco Hof]Careful with 64-bit builds though.
Split and Join can give very weird results on strings.[/quote]

What “weird results”? There does not seem to be any bug report.

Marco_Hof · December 11, 2015, 2:07am

It took me two days of struggling: the 32-bit version worked perfectly while the 64-bit version had serious issues. The ‘weird’ part was that it happened at random points. The split and join operations themselves don’t crash, but the results are inconsistent. At random times, characters got chopped off or garbage was added. And since there is no 64-bit debugger, I only found out long after the split/joins.

I couldn’t use Text because of another 64-bit issue (with encoding; I filed a bug report for that), but I finally noticed the very last line at the bottom of this page, which I had totally overlooked: http://developer.xojo.com/64-bit-guidelines

Michel_Bujardet · December 11, 2015, 2:23am

You are right. I did not notice either.

Carlo_Rubini · December 11, 2015, 4:32am

And that is the original reason why I intended to “refactor” Strings into Text. When compiling for 64 bits, splitting Bengali strings returns messed-up chunks of text, while with String.SplitB the output is OK. Text.Split is OK too, but it is too slow with big texts, as I mentioned above.

Marco_Hof · December 11, 2015, 5:05am

Right.
I can’t remember exactly, because I stumbled on two issues in 64 bits. When trying to work around the first, I ran into the second. Both were a pita because I was not able to use the debugger.

Using Text, above in 32 bits. Same Text below in 64 bits.

Other issues as well like crashing when doing ConvertEncoding or DefineEncoding.

Even with standard string. Different output with same encoding in 64 bits plus the join/split issues.

I know, 64 bits is Beta. I tried but had to give up.

Michel_Bujardet · December 11, 2015, 7:39am

These examples brilliantly demonstrate the advantages of Text over String.

Sure, Text is slower, but in some cases, vastly superior.