Fast String Concatenation with a defined encoding

I’ve been using the MemoryBlock / BinaryStream method for fast string concatenation. Take the following function as an example:

Function GenerateString() as String
   Var cResult As New MemoryBlock( 0 )
   Var bsResult As New BinaryStream( cResult )

   For iNumber as Integer = 1 to 100000
       bsResult.Write( Str( iNumber ) )
       bsResult.Write( " " )
   Next

   bsResult.Close
   Return cResult
End Function

Obviously the contents of the function are only an example. The real functions are more convoluted, but this shows the principle. The problem I’ve got is that the returned string has a nil encoding.

I would like to define the encoding as UTF8, but I can’t seem to find a way to do it.

Return CType( cResult, String ).DefineEncoding( Encodings.UTF8 )

Works, but I’m worried it will duplicate the MemoryBlock before defining the encoding. Another way is:

Return String( cResult ).DefineEncoding( Encodings.UTF8 )

That would seem like a viable option, but it doesn’t work. A third option:

Return cResult.StringValue( 0, cResult.Size - 1 ).DefineEncoding( Encodings.UTF8 )

I presume this doesn’t duplicate, but I have to get the length for the string, which could be time consuming. It produces a syntax error anyway.

I’d rather not put the .DefineEncoding( Encodings.UTF8 ) on every call of the function.

Just put it in a string first and then define the encoding on the string before returning it…

IIRC DefineEncoding does not copy the string, and strings are passed by reference, not by value, so there is no copying when returning it either.

-Karen

Will that duplicate the memory block into the string, or is it just copying a pointer? If it is just a pointer, that is the answer.

I believe converting a MemoryBlock to a string makes a copy in Xojo, because strings are immutable and MemoryBlocks are mutable…

So if you need to wind up with the data in a string, you have no other choice AFAIK.

In that case the only option to avoid the copying might be a plugin of some sort.

-Karen

There needs to be a FAQ about this, but for tight loops like this, always add:
#pragma DisableBackgroundTasks

You can see 2x to 10x speedups basically for free.

See the Pragma Directives page in the Xojo documentation.
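
As a rough sketch of where the pragma goes, here is the original example with it added at the top of the method. The names mem and stream are just illustrative, and the size of any speedup will depend on the project. This sketch does not address the encoding question: the returned string still has a nil encoding.

```xojo
Function GenerateString() As String
  // Turn off automatic background task switching for the rest of this method.
  #Pragma DisableBackgroundTasks

  Var mem As New MemoryBlock( 0 )
  Var stream As New BinaryStream( mem )

  For i As Integer = 1 To 100000
    stream.Write( Str( i ) )
    stream.Write( " " )
  Next

  stream.Close

  // Implicit MemoryBlock-to-String conversion; the encoding is still nil here.
  Return mem
End Function
```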

Edit to add: I think KarenA is right, you are going to have to take a single MemoryBlock to String conversion no matter what. The best thing is to do it at the end, after the string is complete (rather than doing it 100,000 times in the loop).

Function GenerateString() as String
   ...
   Return cResult
End Function

The return line is already copying the memoryblock into a string (memoryblock auto converts to string), so you might as well define the encoding right there.

Edit: to be clear, since your function returns a String, returning a MemoryBlock will implicitly create a copy of the contents of the MemoryBlock as a String in memory and double the memory footprint of the results. The memory of the original memoryblock will be released when it goes out of scope, but there will be a moment in time when both copies are in memory.

Public Function GenerateString() As String

  Var mem As New MemoryBlock( 0 )
  Var stream As New BinaryStream( mem )
  
  For i as Integer = 1 to 10
    stream.Write( Str( i )+" " )
  Next
  
  stream.Close
  
  Var ret As String = mem
  
  Return ret.DefineEncoding(Encodings.UTF8)
  
End Function

@Mike_D
The loop is bogus. It was a simplification, but it still made the point that the string could be big.

@KarenA and others
So long as I’m not doing one copy in assigning to the string and another in returning it, that would be fine.

Return CType( cResult, String ).DefineEncoding( Encodings.UTF8 )

Would seem to do the job without the need for an extra variable, then.

Thanks all

That will still result in a second copy of the data in memory.

Surely no more than the string variable solution? It has one less variable pointer hanging around.

MemoryBlock.StringValue will take a text encoding as a third parameter.

Being a function call with several parameters, MAYBE it’s more expensive than the CType way. Someone should compare. (The CType way also contains a call, to DefineEncoding.)

I just whipped up a little test. I create a memory block and then perform the conversion 100,000 times in a loop, creating a DateTime before and after and taking the difference. Background tasks are disabled around the loop. Times are in milliseconds; each column is a run of the method, with a long pause in between.

Method        Run 1        Run 2        Run 3        Run 4
CType         49,667,968   49,250,976   51,186,035   52,678,955
StringValue   32,847,900   35,890,136   34,487,060   27,796,875
Variable      49,415,039   60,269,042   60,821,044   62,688,964

StringValue is surprisingly the fastest.

CType contains:
cString = CType( mMemoryBlock, String ).DefineEncoding( Encodings.UTF8 )

StringValue contains:
cString = mMemoryBlock.StringValue( 0, mMemoryBlock.Size -1, Encodings.UTF8 )

Variable contains:
aString = mMemoryBlock
aString = aString.DefineEncoding( Encodings.UTF8 )
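
For anyone wanting to repeat the comparison, a minimal timing sketch might look like the following. It is not the code used for the numbers above: it swaps the DateTime differencing for System.Microseconds, and the method name TimeStringValue, the constant kIterations and the parameter name mem are made up for illustration. It passes Size as the length argument.

```xojo
Sub TimeStringValue( mem As MemoryBlock )
  // Keep other tasks from interrupting the timed loop.
  #Pragma DisableBackgroundTasks

  Const kIterations = 100000
  Var s As String

  Var startTime As Double = System.Microseconds
  For i As Integer = 1 To kIterations
    // Swap this line for the CType or variable version to time those instead.
    s = mem.StringValue( 0, mem.Size, Encodings.UTF8 )
  Next
  Var elapsed As Double = System.Microseconds - startTime

  // System.Microseconds counts microseconds, so divide by 1000 for milliseconds.
  System.DebugLog( "StringValue: " + Str( elapsed / 1000 ) + " ms" )
End Sub
```

Running each variant several times, as in the table above, helps smooth out run-to-run variation.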

FYI, that’s a bug - the second parameter is Length, so don’t subtract one, instead:

cString = mMemoryBlock.StringValue( 0, mMemoryBlock.Size, Encodings.UTF8 )
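
A small illustration of the difference, assuming a three-byte MemoryBlock built from the literal "ABC" (the variable names are only for the example):

```xojo
// Assigning a String to a MemoryBlock copies its bytes; "ABC" is 3 bytes in UTF-8.
Var mem As MemoryBlock = "ABC"

// Length of Size - 1 reads only 2 bytes, so the last byte is dropped.
Var truncated As String = mem.StringValue( 0, mem.Size - 1, Encodings.UTF8 )

// Length of Size reads all 3 bytes.
Var complete As String = mem.StringValue( 0, mem.Size, Encodings.UTF8 )

System.DebugLog( truncated ) // "AB"
System.DebugLog( complete )  // "ABC"
```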

FYI, there’s a bug:

<https://xojo.com/issue/66897>
MemoryBlock.StringValue encoding is nil but should be UTF8

This makes sense in that the others have to copy the string to set the encoding whereas StringValue (probably) doesn’t.

IIRC DefineEncoding does not copy, but ConvertEncoding does.

Pretty sure an Xojo engineer said that long ago.

Strings are immutable, so it has to. Anything else would be a bug.

Now maybe they did some magic on the backend where strings with a single reference are treated differently, I don’t know, but I wouldn’t count on that.

You mean this?

Part 4. Wich is faster, ConvertEncoding or TextConvert

https://nug.xojo.narkive.com/JxLZ6tJV/encoding-issue

Just so it is clear to anyone coming back to this, the fastest solution at the present time is the following. This also has the -1 bug fixed, as pointed out by @Mike_D.

Return mMemoryBlock.StringValue( 0, mMemoryBlock.Size, Encodings.UTF8 )
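
To avoid repeating that line in every function, one possibility is a small wrapper. This is only a sketch: MemoryBlockToUTF8 is a hypothetical helper name, not something from the Xojo framework, and the Nil and size guards are an added assumption about how empty input should be handled. Given the issue linked earlier in the thread, it may be worth checking the Encoding of the result in your Xojo version.

```xojo
// Hypothetical helper: one MemoryBlock-to-String conversion with the encoding set.
Function MemoryBlockToUTF8( mem As MemoryBlock ) As String
  // Guard against a missing or empty block (assumed behavior: return "").
  If mem Is Nil Then Return ""
  If mem.Size <= 0 Then Return ""

  // The second argument is a byte length, so pass Size, not Size - 1.
  Return mem.StringValue( 0, mem.Size, Encodings.UTF8 )
End Function
```

A function such as GenerateString could then end with Return MemoryBlockToUTF8( cResult ) after closing the stream.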