Fast String Concatenation with a defined encoding

A very helpful thread, thanks!

I have an app that uses String concatenation quite heavily. I add the String components to an array, but inspired by this thread I replaced the built-in Join method with this, which is 25% faster in my scenario:

Public Function Concat(s() As String, delimiter As String) As String
  #Pragma DisableBackgroundTasks
  
  If s.LastIndex >= 0 Then
    // Accumulate the bytes in a MemoryBlock-backed stream.
    Var mem As New MemoryBlock(0)
    Var stream As New BinaryStream(mem)
    Var lim As Integer = s.LastIndex - 1
    
    // Every element except the last is followed by the delimiter.
    For i As Integer = 0 To lim
      stream.Write(s(i) + delimiter)
    Next
    
    stream.Write(s(s.LastIndex))
    
    stream.Close
    
    // The result is tagged as UTF-8, so the inputs are assumed to be UTF-8 already.
    Return mem.StringValue(0, mem.Size, Encodings.UTF8)
  End If

  Return ""
  
End Function
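
For anyone who wants to check the speed-up on their own data, here is a rough timing sketch. It assumes the Concat function above is in scope; the test data, the element count, and the variable names are made up for illustration, and the result will depend heavily on string sizes and platform.

```xojo
// Build some hypothetical test data.
Var parts() As String
For i As Integer = 0 To 99999
  parts.Add("item " + i.ToString)
Next

// Time the custom Concat.
Var t0 As Double = System.Microseconds
Var viaConcat As String = Concat(parts, ",")
Var t1 As Double = System.Microseconds

// Time the built-in join (String.FromArray in API 2).
Var viaJoin As String = String.FromArray(parts, ",")
Var t2 As Double = System.Microseconds

System.DebugLog("Concat: " + (t1 - t0).ToString + " µs")
System.DebugLog("FromArray: " + (t2 - t1).ToString + " µs")

// Sanity check that both approaches produce the same text.
If viaConcat <> viaJoin Then
  System.DebugLog("Results differ!")
End If
```

A single run like this is noisy, so repeat it a few times before trusting the numbers.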

I was talking about the difference between ConvertEncoding and DefineEncoding, not between ConvertEncoding and TextConverter… But I think it was Joe Strout who said, in a different post, that DefineEncoding does not copy.

What I took that to mean was that while the bytes of the characters in a string are immutable, the encoding assigned to them was not.

Maybe a current Xojo engineer can comment on that.

BTW, back in the day there were sometimes good technical discussions on the beta mailing list that were not always beta-related, so the discussion I think I remember would not have been archived if it took place there.

-Karen

Typically you’d use DefineEncoding on a string with no encoding (such as a string read from a socket). Doing so sets the encoding and does not otherwise modify the string.

ConvertEncoding always involves a source string and a destination string. The source string will be untouched but the destination string’s contents will have been converted - modified from that in the original string - according to what the new encoding is set to.
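
To make the difference concrete, here is a minimal sketch. The two bytes &hC3 &hA9 are "é" in UTF-8, while ISO Latin-1 stores "é" as a single byte, so the byte counts logged at the end should differ. The variable names are only for illustration.

```xojo
// Two raw bytes, read back as a string with no encoding defined.
Var mb As New MemoryBlock(2)
mb.Byte(0) = &hC3
mb.Byte(1) = &hA9
Var raw As String = mb.StringValue(0, 2)

// DefineEncoding only labels the existing bytes as UTF-8.
Var labelled As String = raw.DefineEncoding(Encodings.UTF8)

// ConvertEncoding produces a string whose bytes are re-encoded for the target encoding.
Var converted As String = labelled.ConvertEncoding(Encodings.ISOLatin1)

// Compare the byte counts of the labelled and the converted string.
System.DebugLog("labelled bytes: " + labelled.Bytes.ToString)
System.DebugLog("converted bytes: " + converted.Bytes.ToString)
```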

I noticed last night in the Plugin SDK that there is a SetEncoding routine.

So, from that, a plugin at least can set the encoding hint on an existing string without getting a new one. It's therefore quite possible that you are getting the same string back from DefineEncoding.

Not the bytes behind the string, but it will create a copy (again, unless they’ve done something to accommodate strings with only one reference, like s = s.DefineEncoding(enc)).

From what I remember that is not the case, but I may be wrong…

Just tested with some code… It does copy!

In any case it’s a shame Xojo requires a copy just to SET the encoding of a string… I can understand having the contents immutable but the encoding?

-Karen

That’s so irritating.

Why?

I presume because having to duplicate a potentially large chunk of data in memory simply to define what encoding is in use for the string seems a waste of time and effort. The bytes of the string are not changing in any way (as they would for ConvertEncoding). They are simply being tagged with the correct encoding.

I assume a string is a structure with an internal variable containing the encoding, and another would be a pointer to a chunk of memory containing the string’s bytes. And possibly others. I see no reason why the encoding can’t just be set without the string’s bytes being disturbed and without needing them to be copied. Especially if the string is some megabytes.

If there was a method:

Sub DefineEncoding( extends S as String, Encoding as TextEncoding )

That would be nice.

Setting the encoding would not disturb any bytes, since DefineEncoding is only a “promise” and has no effect on the string itself.

It's like “I, Time Steater, hereby promise that this string actually is UTF8”

Your promise may or may not be true. Either way, the string itself is unchanged. A false promise can make users of the string do bad things, though: if you feed a string with a false encoding to a Listbox, you could get a crash in some cases, since the Listbox will trust your promise about how to interpret the actual bytes in the string.

This is the difference between DefineEncoding and ConvertEncoding: DefineEncoding just stamps the string with a “promise”, while ConvertEncoding actually converts and changes bytes.

Even the Xojo Plugin SDK has a function to set the encoding of a string without making a new one. I am not really sure it's a good idea, though; I think the string should be kept immutable in every way, including the promise of encoding.

That’s what I said upthread.

Why?

Mutating increases the chance of creating bugs and strangeness.

You could have passed the string around 10 times in your application, so that it has a ref count of 10, and then suddenly, in some subroutine deep in your application, you change the encoding of this string. In any good programming practice, that's just asking for trouble.

And of course one of the foundations of modern unit testing is to avoid globals and routines that have global effects.

It's for exactly this reason that almost nothing in the Cocoa base structures is mutable, for example (unless it's a specifically mutable version of the class, like NSMutableDictionary vs NSDictionary).

It makes more sense to have specialized structures for dealing with something you need to mutate.

Good points, however the following code makes a copy and has all of the problems you have just described:

s = s.DefineEncoding( Encodings.UTF8 )

The problem is in code like this, that @Björn_Eiríksson just described:

// s1 has some undefined string

var s2 as string = s1
DefineEncoding( s2, newEncoding ) // Proposed code

Now s1 is set to newEncoding too, and that may not be what you want.

But Kem, you say, how could that be? s2 is a copy of the string!

It is not.

When you create a string, only one exists in memory. You can “copy” it to 100 different variables, but you are just copying a reference to the same string. From your point of view it’s a copy of the string precisely because strings are immutable, and any operation on a string, including defining its encoding, will create another string.

This is what lets you pass a 10 MB string to a function in the same amount of time it takes to pass a 1 character string, and what makes it seem like you are copying the string, which is easy to understand, rather than passing a reference to a string, which is not.
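
A small sketch of what that means with today's function-style DefineEncoding, which returns a string rather than changing one in place. The variable names are only for illustration.

```xojo
// Start with a string that has no encoding defined.
Var mb As New MemoryBlock(3)
mb.Byte(0) = &h61
mb.Byte(1) = &h62
mb.Byte(2) = &h63
Var s1 As String = mb.StringValue(0, 3)

// "Copying" the string only hands s2 a reference to the same string.
Var s2 As String = s1

// DefineEncoding returns a string; only s2 is pointed at the result.
s2 = s2.DefineEncoding(Encodings.UTF8)

// s1 still refers to the original string and keeps its undefined encoding.
If s1.Encoding Is Nil Then
  System.DebugLog("s1 has no encoding")
End If

If Not (s2.Encoding Is Nil) Then
  System.DebugLog("s2 encoding: " + s2.Encoding.InternetName)
End If
```

With the proposed in-place Sub, the same sequence would leave s1 tagged as UTF-8 as well, which is the surprise being described.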

Again, maybe they added some optimization where s1 = s1.DefineEncoding will actually change the encoding rather than copy it if s1 is the only reference, but that would be an implementation detail, and we’d need an engineer to weigh in.

In the meantime, DefineEncoding is a relatively rare occurrence compared to, say, passing a string around, and the functions that get a string often have parameters that let you define the encoding at the outset.

Which is now marked as Fixed, so hopefully it will fix just returning the memory block as a string (which returns a Nil encoding).

There is no global effect of s = s.DefineEncoding( Encodings.UTF8 )

Unless your s is a global variable. And every other part of your code that referred to the old s will still refer to the old s after your call, not the new one.

You specifically called out a global variable. So yes, s is global. It is still changed throughout the whole system, just as it would be without a copy.