Performance of NthField vs. Split and array access

I’m refactoring a project that takes in a huge number of lines of text (possibly millions) and uses NthField to get 3 out of 7 fields (separated by “|”). On a few hundred lines of input, the performance hit is not noticeable. However, in real-world tests, the profiler identifies the method containing the loop that reads and processes the lines as the culprit in a serious slowdown (I read 1000 lines at a gulp). Before I spend a day refactoring my NthField logic, has anyone compared splitting the fields into an array and then accessing specific members of the array against using NthField?

Thanks,
Tim

Yes, it’s significantly faster. Like, blow-you-away type faster.

Every time you call NthField, it has to scan from the start of the string, counting delimiters until it reaches the field you asked for. By splitting the line into an array, you do that scan once and effectively create an index you can reuse for every field.
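To illustrate the difference, here is a minimal sketch in Python rather than Xojo. The `nth_field` helper is a hypothetical stand-in that mimics the rescanning behaviour described above (1-based, starting from the beginning of the string on every call); it is not Xojo's actual implementation. The choice of fields 2, 4, and 6 is also just an assumption for the example.

```python
def nth_field(source: str, sep: str, n: int) -> str:
    """Hypothetical stand-in for NthField: 1-based, rescans from the start on every call."""
    start = 0
    # Skip over the first n-1 delimiters, always beginning at position 0.
    for _ in range(n - 1):
        pos = source.find(sep, start)
        if pos == -1:
            return ""  # the line has fewer than n fields
        start = pos + len(sep)
    end = source.find(sep, start)
    return source[start:] if end == -1 else source[start:end]


line = "a|b|c|d|e|f|g"

# Three lookups means three separate scans from the start of the line.
rescanned = (nth_field(line, "|", 2), nth_field(line, "|", 4), nth_field(line, "|", 6))

# One split scans the line once; after that each field is a plain array access.
fields = line.split("|")
indexed = (fields[1], fields[3], fields[5])
```

Note the off-by-one when converting: the field numbers passed to the helper are 1-based, while the array indexes are 0-based, which as far as I know is also the case for NthField versus the array returned by Split in Xojo.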

Blow me away faster is exactly what I’d like to see. Sold - thanks, Kem!

If you’re only interested in 3 of the 7 fields, another option for you is a regular expression applied to each line.

\A([^|]+)\|([^|]+)\|([^|]+)\|([^|]+)\|([^|]+)\|([^|]+)\|([^|]+)\z

You only need the parentheses around the fields you actually want, and those would show up in match.SubExpressionString( x ), where x = 1 to 3.
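As a rough sketch of the "parens only around the fields you want" idea, here is the same pattern in Python rather than Xojo, capturing fields 1, 3, and 5. Which three fields are needed is not stated in this thread, so that choice is an assumption, and `pick_fields` is a hypothetical helper name. Python's `re` spells the end-of-string anchor `\Z`.

```python
import re

# Seven "|"-separated fields; only fields 1, 3 and 5 are captured (assumed choice).
PATTERN = re.compile(r"\A([^|]+)\|[^|]+\|([^|]+)\|[^|]+\|([^|]+)\|[^|]+\|[^|]+\Z")


def pick_fields(line: str):
    """Return the three captured fields as a tuple, or None if the line does not match."""
    m = PATTERN.match(line)
    return m.groups() if m else None
```

One thing to keep in mind: because every field is written as `[^|]+`, a line with an empty field or the wrong number of fields does not match at all, whereas Split would still return an array. Whether that strictness helps or hurts depends on the data.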

Between that and Split, I’m not sure if you’d notice the difference, but it’s something to test.

By using the split option, I can get the fields explicitly. And that makes the rest of the parse/match logic much more straightforward (and removes two inner loops!).

To report back on this, my per-loop time went from an average of 430ms per iteration to 11ms per iteration. Multiply that 419ms saving by 1.5 million iterations and you can see why I am excited by the refactor.

Thanks again, @Kem Tekinay