Split function in 64-bit is terribly slow

I have used the Split function for many years to take a string and split it up into an array of strings, each holding a separate line. For example, the following code

line() = string.split(endofline)

used to carry this process out very rapidly, even for very large strings containing thousands of lines. But when compiled for 64-bit using 2015R4, this function becomes 100 times slower than in 32-bit applications. I am wondering if anyone has a suggestion on how I might speed this up. NthField() is even slower, so I am at a loss how to solve this problem.

my observations
splitting 104,944 lines
in the IDE : 0.0177 seconds
32 bit compiled : 0.0206 seconds
64 bit compiled : 60.32 seconds - WOW! more like almost 3000x slower!!!

to quote the Mythbusters… CONFIRMED :)

I think there’s already been a thread on this. I believe it’s a known issue and that fixing it is complicated by text encoding issues. I think there might have been a suggestion to optimize the current code where possible and add a SplitB or something.

Of course I could have dreamt all of that so search the forum if you want actual facts. :)

Happy New Year!

my observations (using SplitB)
splitting 104,944 lines
in the IDE : 0.07853 seconds
32 bit compiled : 0.02144 seconds
64 bit compiled : 0.01739 seconds <------- MUCH BETTER

Still an issue… but if you have no UNICODE and can use SPLITB it works in both 32/64 bit

can’t optimize the code too much… after all it’s ONE LINE

v=splitb(stuff,EndOfLine)

Be careful with using split/join in 64bit. I had some serious issues with it.
Later, I found the remark at ‘known issues’ at the bottom of this page: http://developer.xojo.com/64-bit-guidelines

Use Text instead of String.

Yes, that’s the only solution until Xojo fixes the Split function.

This sounds like the same bug I logged during the beta for the first 64 bit release.

Using Text instead of String is one solution but Text is generally a lot slower than String so still expect a significant drop in performance here (also logged as a bug).

If the binary ‘B’ functions work correctly in 64 bit builds and your strings are all UTF-8 then it should be safe to use those.
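A minimal sketch of why a byte-level split should be safe on UTF-8 text. It is written in Python purely as an illustration, not in Xojo, and the sample text is made up. It splits the raw bytes on a line feed, decodes each piece, and compares the result with an ordinary text split.

```python
# Illustration only (Python, not Xojo): split UTF-8 data at the byte level,
# then decode each piece. The sample text is invented for this sketch.
text = "caf\u00e9\nna\u00efve \u270d\n\u65e5\u672c\u8a9e"

raw = text.encode("utf-8")            # the bytes a binary split would see
byte_lines = raw.split(b"\n")         # byte-oriented split on 0x0A
decoded = [piece.decode("utf-8") for piece in byte_lines]

text_lines = text.split("\n")         # character-oriented split
print(decoded == text_lines)
```

This only says something about the encoding itself. Whether the 64-bit SplitB behaves correctly is a separate question.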

If you submitted a feedback report on this problem, let me know the number so I can sign on to it.

Although SplitB(…) works fine for many of my text files, it will fail for some, so I am hoping Xojo can fix this function. Slower is OK, but 3000 times slower suggests there is room for improvement.

Here are three 64-bit Split related Feedback cases that are Verified:

<https://xojo.com/issue/40961> 64-Bit Split Bug
<https://xojo.com/issue/41227> Xojo x64 Split lose last letter
<https://xojo.com/issue/40702> 64-bit Splitting Large Strings Fails

Let’s hope all of these get fixed and that performance is improved in the next release.

In short, the 64-bit string code needs a lot of work and we hope to get it done for 2016r1 but it may slip to a later release. The good news is that this will likely fix a number of long-standing bugs with String, especially with non-UTF-8 encodings.

I have used the Split function on strings quite a lot in previous 32-bit Windows projects. All of them can be compiled as 64-bit now and seem to be fully functional, but I dare not release a 64-bit version to the field right now due to this and some other known issues.
Anyway, it was stated that 64-bit should still be treated as beta.

[quote=238823:@Robert Birge]If you submitted a feedback report on this problem, let me know the number so I can sign on to it.

Although SplitB(…) works fine for many of my text files, it will fail for some, so I am hoping Xojo can fix this function. Slower is OK, but 3000 times slower suggests there is room for improvement.[/quote]

40702 is the case I logged
<https://xojo.com/issue/40702>

40840 is related to the performance of Text functions in comparison to String
<https://xojo.com/issue/40840>

I’m surprised that SplitB fails for some of your files. Can you tell me more?

SplitB may fail because some UTF-8 characters exceed one byte in size.

Don’t think that should be a problem. UTF-8 is designed so that every byte of a multi-byte sequence has its high bit set, so an ASCII byte such as a line ending can never appear inside one and you shouldn’t get false matches.

A quick try shows accented characters work just fine with SplitB.

A problem will arise if you are splitting by line ending 0x0D or 0x0A and your text contains UTF-8 where 0x0D or 0x0A is a valid part of the encoding (0x270D - writing hand etc)… is this most likely going to be rare? sure… but there are at least a few dozen if not more UTF-8 characters that will cause problems.

This may occur (will occur) with other single character split values… it’s just a matter of how confident you are that your input data will not contain them

[quote=238892:@Dave S]A problem will arise if you are splitting by line ending 0x0D or 0x0A and your text contains UTF-8 where 0x0D or 0x0A is a valid part of the encoding (0x270D - writing hand etc)… is this most likely going to be rare? sure… but there are at least a few dozen if not more UTF-8 characters that will cause problems.

This may occur (will occur) with other single character split values… it’s just a matter of how confident you are that your input data will not contain them[/quote]

0x0A & 0x0D are ASCII characters. They should not appear anywhere else in a UTF-8 sequence.

Unicode 0x270D is UTF-8 E2 9C 8D.

The following is from the UTF-8 Wikipedia page:
“Any byte oriented string searching algorithm can be used with UTF-8 data, since the sequence of bytes for a character cannot occur anywhere else.”
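The claim is easy to check by brute force. The sketch below is in Python, as an illustration rather than Xojo, and the helper name is made up for this example. It encodes U+270D and then walks every non-ASCII code point to see whether any of them produces a byte below 0x80 (which would include 0x0A and 0x0D).

```python
# Illustration only (Python, not Xojo).
# U+270D (writing hand) encodes to three bytes, none of which is 0x0D.
hand = "\u270d".encode("utf-8")
print(hand.hex())  # e29c8d

def ascii_byte_in_any_multibyte_sequence() -> bool:
    """Hypothetical helper: True if any non-ASCII code point's UTF-8
    encoding contains a byte below 0x80 (such as 0x0A or 0x0D)."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable in UTF-8
        if any(b < 0x80 for b in chr(cp).encode("utf-8")):
            return True
    return False

print(ascii_byte_in_any_multibyte_sequence())  # False
```

Since no multi-byte sequence contains an ASCII byte, a byte-level search for 0x0A or 0x0D can only match a real line ending in valid UTF-8.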

[quote=238830:@Frederick Roller]Here are three 64-bit Split related Feedback cases that are Verified:

Feedback Case #40961 64-Bit Split Bug
Feedback Case #41227 Xojo x64 Split lose last letter
Feedback Case #40702 64-bit Splitting Large Strings Fails[/quote]

Any chance this will get done for 2016r1?

No, sorry. The delay with retina support pushes my expectation back to 2016r2.