Speeding up text handling

TimStreater · July 2, 2016, 11:46am

I have the following method. I call it with haystack containing about 67k chars, needle containing about 5 chars, and the method returns a value of 100 or so. In other words, from the start’th character in haystack, a character contained in needle is about 100 chars further on.

According to the profiler, the function takes about 700 msec to do this, which seems to me to be ridiculously long. Any comments as to why would be appreciated.

More generally: in analysing this particular haystack (that is, in the method that calls my version of strcspn), I’m making heavy use of the text methods, and it’s taking about 18 secs of one core of my 2.5GHz Intel core i5 running at 100% to do so. Am I likely to be better off converting haystack to an array of characters at the beginning of the process, and just indexing my way through it? For comparison, my original PHP version of this code completes in 100msec or so.

function strcspn (haystack as text, needle as text, start as integer) as integer

// The PHP strcspn function. Scan through haystack from start, until any character
// from needle is reached, counting chars all the while. Then return that count.

dim  i, count, len As Integer, char As Text

count = 0
len   = haystack.Length - start

for i=1 to len
  char = haystack.Mid (start, 1)
  if  (needle.IndexOf(char)>-1)  then return count
  count = count + 1
  start = start + 1
next

return count

end function

Greg_O_Lone · July 2, 2016, 11:50am

Why don’t you just use Pos?

Kem_Tekinay · July 2, 2016, 12:09pm

Is a regular expression an option?

TimStreater · July 2, 2016, 12:12pm

It doesn’t appear to be a text method or reserved word or show up in a documentation search. Point me at it and I’ll take a look.

Kem_Tekinay · July 2, 2016, 2:07pm

With your code, I get about 12,000 microsecs with my test data. With a function based on a regular expression, I get about 4,000 microsecs on the first run, and about 900 microsecs on subsequent runs.

Function strcspn(haystack As Text, needle As Text, start As Integer) As Integer
  dim source as string = haystack
  
  dim pattern as string = needle.ReplaceAll( "\", "\\\" ).ReplaceAll( "-", "\\-" ).ReplaceAll( "]", "\\]" )
  pattern = "[" + pattern + "]"
  
  if start > 0 then
    start = source.Left( start ).LenB
  end if
  
  dim rx as new RegEx
  rx.SearchPattern = pattern
  
  dim match as RegExMatch = rx.Search( source, start )
  if match is nil then
    return -1
  else
    dim part as string = _
      source.LeftB( match.SubExpressionStartB( 0 ) ).DefineEncoding( Encodings.UTF8 )
    return part.Len
  end if
End Function

Kem_Tekinay · July 2, 2016, 2:35pm

This code is faster with the test data (about 500 microsecs) but much slower when there is no match. However, still much faster than your original code where there is no match. About 1,200 microsecs for the RegEx version, and 85,000 with this code. (I had to stop timing your code as it was taking much too long.)

Function strcspn(haystack As Text, needle As Text, start As Integer) As Integer
  dim pos as integer = 0
  for each testChar as text in haystack.Characters
    if pos >= start and needle.IndexOf( testChar ) <> -1 then
      return pos
    end if
    pos = pos + 1
  next
  
  return -1
  
End Function

Eli_Ott · July 2, 2016, 2:49pm

Function strcspn(haystack As Text, needle As Text, start As Integer) As Integer Dim haystackPart As Text = haystack.Mid(start) Dim count As Integer = 0 For Each char As Text In haystackPart.Characters If needle.IndexOf(char) > -1 Then Exit End count = count + 1 Next Return start + count End

Will_Shank · July 2, 2016, 3:20pm

I believe so. In my tests 99.9% of the time is splitting hayStack. Do that once and the rest should scream.

Norman_P · July 2, 2016, 3:56pm

or just dump it into a memory block & iterate through the bytes

Kem_Tekinay · July 2, 2016, 5:17pm

Eli, while testing I tried something like what you posted but found the Mid function to be very expensive. It takes less time to just iterate.

Eli_Ott · July 2, 2016, 5:34pm

I haven’t used the Mid function (with the exception of once at the beginning to split the text into to parts).

Kem_Tekinay · July 2, 2016, 5:41pm

Yes, that’s exactly what I mean.

TimStreater · July 2, 2016, 9:50pm

Doing some timing tests, on my main method, I’m concluding that at least indexof can be quite slow. E.g. if haystack is 70,000 chars long, and loc is small, then:

startind = haystack.indexof (loc, "<")

takes much less than 1msec. But near the end of haystack, it’s taking closer to 10msec.

So I’ll try splitting haystack to an array of chars.

Some discussion of these matters in the documentation would be useful.

Norman_P · July 2, 2016, 10:05pm

indexof starts at the beginning and hunts for the given char
basically a linear search as you cant really do much more
not surprising that it would be slower if the found char is near the end than at the beginning
an array of chars is not likely to perform better with similar configurations
in fact I would suspect it will be slower as indexof is already written in optimized C++ in the framework

TimStreater · July 3, 2016, 3:10pm

In my case, haystack is the HTML body of an email. I’m looking through it to find certain strings and replace them with others. What I’m doing now is to split() the text into a character array. Overall, for the example haystack that was taking to 18 secs or so to analyse, it’s down to about 80msec. I use indexof() on the array which keeps things simple.

In an 80k piece of HTML, there might be just a few actual replacements to be done. Since there is no array.slice() method, which I might use to replace a chunk of array elements with another array, I’m still doing the replacements on the original text because the replacements won’t in general be the same length as the section of original text. So after each such replacement I’m doing split() again, which is slightly inefficient, but at the minute things are sufficiently well optimised for that not to matter.

Where do I send suggestions for new methods?

Mike_D · July 7, 2016, 1:02am

Are the needle strings guaranteed to be regular old ASCII characters (by which I mean, “things which are represented as a single byte in UTF8”)?

If so, then I think Norman’s suggestion above would work - treat the haystack as UTF8 (or convert it to UTF8 if not) and then just do a byte search. Should be super fast.