Speed Up String Strip Method

Have-you checked if the "unwanted block of leading data” always have the same length ?
(after reading the thread, I was thinking it can be garbage like what was in some old files at their beginning, 40 years ago…)

You mean like a UTF BOM?

No, there was no UTF so long ago.

I never knew why there was a block of garbage at the beginning of the file (512 or 1024Bytes of binary garbage).

The “application file owner” skipped these garbage at read time (and so never display it).

@Douglas_Handy it’s faster by a few tens of microseconds per call on average according to the IDE’s profiler. This translates to a second or so over a batch run of my app, which has undergone heavy restructuring since I first posted, to reduce the number of calls to this method and improve overall speed in other ways.

I’m going to stay with regex since it’s so beautifully compact.

I appreciate your time in creating the regex for me and explaining it - it would have taken me some tutorial-reading time to figure out on my own.

I’m actually surprised it is only that much. Regular Expressions are wonderfully powerful things, and some search patterns can be quite obtuse. However, simple expressions like this one are relatively easy to learn and still extremely useful.

Our resident RegEx expert @Kem_Tekinay will create much more sophisticated patterns than I have ever used. But even for what I use them for, they are terrific compared to alternatives.

Indeed, I use regex for serial data parsing in some apps, which it’s great for. The downside is its “write-only” nature, especially if you use it infrequently :slight_smile:

I did a little informal testing and I have a RegEx pattern that is about 15% faster than the Replace method:

r.Options.Greedy=true
r.Options.DotMatchAll=true
r.SearchPattern="[\x21-\xFF].*$"

resultString=r.Search(testString).SubExpressionString(0)

The only downside to this method is that it will fail if the testString is comprised solely of the excluded ASCII characters.

I not by my computer to confirm but I think this has a few issues. I’ll test and post later.

I’m wrong, the only issue is the one you mentioned, which is a NilObjectException where the entire string is made up of “bad” characters. That can be overcome too by adding this:

r.SearchPattern="[\x21-\xFF].*$|^"

Still, I’d like to test this further because it’s hard to imagine that this would be faster than straight Replace on many repetitions only because it has to create a new RegExMatch on each iteration. But maybe Replace does that internally too.

Hmm - that search pattern doesn’t work for " this is the test string" (the spaces are just spaces). I don’t think there’s a way in RegEx to return “” if the pattern matches " ", which is what I would expect the original poster would want.

Probably the easiest thing to do is to wrap it in an exception handler and just return an empty string when one is thrown - or, since this is really likely to be an edge case, add in some tests for “string is all unacceptable characters”, etc. and handle each one appropriately.

The test I set up definitely shows about 15% improvement in Xojo 2020 r1.2 by finding and returning instead of replacing and returning. This make some sort of sense to me, since replacing and returning is actually finding and THEN replacing and returning; so why not change the finding step to find what you want and return it instead of also replacing? :slight_smile:

If XOJO is too slow, make an external in something like C or C++ and run the external within XOJO and capture the console output or have the C/C++ app save to a file or something. C++ could process a massive number of these pretty fast;

Modify the following to do exactly what you need and compile it to a console app in a C++ compiler like MS Visual Studio or something.

#include <iostream>
#include <string>
//#include <fstream>

int main(int argc, char* argv[])        // entry point with arguments
{
    std::string myString = "";      // declare a string, make it empty
    
    myString = argv[1];   // argv[0] is the calling app name and argv[1] is the first argument
    // if there are more arguments they will be argv[2], argv[3] and so on...
    int pos = 0;                    // declare an integer and set it to zero
    if (myString != "")                 // as long as we have an actual string to work on... (string not equal to empty)
    {
        std::string sub = "";               // we need a substring to store our trimmed string into
        char c;                             // we need a char to store a single character
        int len = myString.length();        // calc the length of the string
        for (int i = 0; i < len; i++)       // we are going to loop through every character in the string
        {
            c = myString[i];            // get the current character and store as a char
            if ((int)c > 32)            // if the ASCII code is above 32 then...
            {
                pos = i;                // record the position (the value of i)
                break;                  // break out of the for loop
            }  
        }

        sub = myString.substr(pos, myString.length() - pos);   // this just copies the sub string into a string called sub

        std::cout << sub << std::endl;         // output string to console
        //std::ofstream fout("string.txt");
        //fout << sub;
        //fout.close();
    }  
     return 0;        // quit the app
}

The fundamental problem with the original implementation is that it is O(N*N). Performance will be OK for short strings but get progressively worse as string lengths increase.

Any code that loops over the length of a string, while also calling s.Left or s.Mid or s.Right within the loop, will suffer from this because those calls are not constant time calls: Xojo has to actually check the string for multibyte characters each time to find the offset to the desired string.

(Compare to e.g. LeftB/MidB/RightB, which are constant time. However, if you use those your code won’t work properly with multibyte characters.)

So the main goal here is to figure out how to iterate through the string only once. Regex is one way to do that. Another way is to use String.Codepoints. Codepoints is close to ideal for this task, since it iterates the string once, and it’s returning unicode codepoints, which are the same as asc() values for characters < asc(128). So something like this should be fairly quick.

var stringToTrim as String = "..."
var newStart as Integer = -1
var index = 1

for each codepoint as UInt32 in stringToTrim
  if codepoint > 32 then
    newStart = index
    exit
  end if
  index = index + 1
next

if newStart = -1 then
  return ""
else
  return stringToTrim.Mid(newStart)
end if

This assumes the potential for a massive amount of possible characters at the beginning of the strings.
I’m curious to know why they are there in the first place… if the files come from a contactable source, it’s not unreasonable to ask the providers to stop sending files that need that kind of processing.

in my OWN experience, the only text files I have encountered with bytes < 32 at the beginning, have been the BOM found at the start of UTF files.
(during 20 years of being provided with files from various sources as my job),

Emile says there used to be files with up to 1K of binary garbage at the beginning.
Never seen one ‘described as a text file’ personally: to me, that’s a binary file that happens to have some clear text in it.
I would expect that binary data to contain a pointer to the start position of the clear text.

However we do not know from Julia what amount of this pre-data exists.
We don’t know if the files are UTF or not.
We don’t know how long these methods take.

The perhaps splitting the string to two halves (or more) would help? :man_shrugging:

1 Like

This was solved by Jim Meyer in the third reply, and speed was improved by not modifying the source in the loop.

Using byte characters was suggested by KarenA, it’s a good thing to know.

Because XML Plists have indentation, as mentioned in Post 28. Probably only tabs, but I guess the original authors figured that if they’re stripping tabs, they might as well strip everything.

I did mention that the average string length is 22 characters, but indentation depends on the nesting of the Plist.

Plists (at least those conforming to Apple’s specs) are UTF-8.

You were right!!!

Worthwhile speed gains were achieved by following Jim Meyers’ suggestion and by putting the code inline as suggested by KarenA, to eliminate the overhead of the function call. I ended up using regex because it’s the most concise and at least as fast as anything else.

The biggest speed improvements were achieved by restructuring the app to reduce the number of calls by a factor of ≈5.

I’ve created a project to test the various functions in a consistent way. So far, the fastest I’ve found is using a MemoryBlock with Ptr, followed by RegEx.Replace.

I’ll post the project to GitHub soon, but here is the MB code:

Private Function StripMemoryBlock(s As String) As String
  if s = "" then
    return s
  end if
  
  var origEncoding as TextEncoding = s.Encoding
  var useEncoding as TextEncoding = if( origEncoding is nil, nil, Encodings.UTF8 )
  
  var mb as MemoryBlock = s.ConvertEncoding( useEncoding )
  var p as ptr = mb
  var lastByteIndex as integer = mb.Size - 1
  
  var result as string
  
  for byteIndex as integer = 0 to lastByteIndex
    if p.Byte( byteIndex ) > 32 then
      result = mb.StringValue( byteIndex, mb.Size - byteIndex, useEncoding )
      exit
    end if
  next
  
  result = result.ConvertEncoding( origEncoding )
  return result
  
End Function
1 Like

Here is the repo:

5 Likes