Is this a bug or expected behaviour?

Var a As String = "☺️"
Var aLength As Integer = a.Length
Var b As String = "😀"
Var bLength As Integer = b.Length

aLength is reported as 2 not 1 whereas another emoji (b) is correctly reported as being length 1.

Should I report this as a bug or is there some other method I need to use to get the length of a string? According to the docs:

Returns the number of characters in the specified string.

Check the debugger.
The encoding could be the issue here, since in UTF-8 a character can take one or more bytes.
Try .Bytes; you’ll get the actual byte count.

You can also check whether the string is corrupted (has invisible chars in it) by selecting the string and right-clicking -> “Clean invisible ascii characters”

It’s typed directly into the IDE using the macOS emoji picker. It’s not an encoding issue.

Well what’s the encoding in the debugger?

I imagine String.Bytes will give you the correct length of the emojis.

It’s a bug if Length is supposed to return the number of characters. A simple test: put it into a TextArea. If pressing the arrow key once moves over the whole thing, i.e. it behaves as 1 character, then Length should be 1.

Things like Left, Middle etc. should then work as if it’s one character and not cut the thing in half like they do now (another bug).

Emojis came later to the Unicode encoding schemes. If Xojo uses Unicode libs to handle them, maybe those libs need updating.

Looks correct to me.

‘a’ consists of 2 UTF-8 sequences: E298BA & EFB88F while ‘b’ consists of 1 UTF-8 sequence: F09F9880
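
To check this on your own machine, here is a rough, untested sketch that logs the code points and byte count of ‘a’. It assumes String.Codepoints yields UInt32 values and that Hex and System.DebugLog behave the way I remember them, so treat those details as assumptions rather than confirmed API behaviour.

```xojo
// Same string as in the original post, typed from the emoji picker.
Var a As String = "☺️"

// Length and Bytes as reported by the framework.
System.DebugLog("Length: " + a.Length.ToString)
System.DebugLog("Bytes: " + a.Bytes.ToString)

// Log each code point in hex.
// Assumption: Codepoints can be iterated as UInt32 values.
For Each cp As UInt32 In a.Codepoints
  System.DebugLog("U+" + Hex(cp))
Next
```

Going by the byte sequences quoted above, I would expect this to log two code points, U+263A and U+FE0F (a variation selector), and a byte count of 6.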

The Xojo string functions have always operated on code points rather than user perceived characters which are a higher level text concept.

If you want user perceived characters you will probably have to count them using the Characters iterator. I’m sure manipulating strings at a user perceived character level would also be possible with a bit of work.
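
A minimal sketch of that counting approach, assuming the Characters iterator really does group code points into user perceived characters (I have not verified this for every emoji sequence):

```xojo
// Count user perceived characters by walking the Characters iterator.
// Assumption: each item yielded by Characters is one user perceived character.
Var a As String = "☺️"
Var count As Integer = 0

For Each ch As String In a.Characters
  count = count + 1
Next

// Compare the iterator count with what Length reports.
System.DebugLog("Characters: " + count.ToString + ", Length: " + a.Length.ToString)
```

If the two numbers differ for a given string, that string contains at least one character built from more than one code point.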

Is it a bug? I would say not. You can process strings either as code points or as user perceived characters. Many years ago, Realbasic chose code points for some reason. Maybe the concept of user perceived characters didn’t exist at the time or maybe the string processing code just didn’t support them.

If Xojo changed how this worked then it would break lots of code. Some of our apps interact with 3rd party libraries which also process text the same way so changing this would be a show stopper for us.

From what I can remember, the Xojo text data type did operate on user perceived characters so you could say that deprecating it was a step backwards. Maybe an enhancement to String would be to introduce an additional set of string methods (or maybe parameters on the existing methods) that specified code point mode or user perceived character mode.

I would say yes.
For me, str.length() should return the length in chars, as stated in the manual, and str.bytes() the number of bytes. Code points need another method, like str.codepoints().

Probably this bug is just related to libs needing updates.

For better or worse, this is correct. The bytes behind the string represent two code points that are presented as one visible character.

I’ve updated the docs to make that clear.

Your suggestion wouldn’t be fixing a bug though. It would be changing existing functionality, which could break lots of existing code. Remember that this isn’t just Length. It would also affect Left, Right, IndexOf / Instr, Mid / Middle & Split (possibly others).

If Length was changed then the others would have to be changed. This would mostly convert String into the deprecated Text data type. I’m sure this would be useful for some but it would also break lots of code and would make string manipulation much slower for everyone. Remember, String won the String vs Text battle so that means living with the limitations.

If anything, the documentation is wrong as it should clearly mention code points but the whole concept of code points vs user perceived characters could be confusing to a lot of users.

The Characters iterator is the way to work with user perceived characters.

And Left, Right, Middle etc. ?

What a dog’s dinner, in the name of backwards compatibility.

I have to admit, I had to look that one up. 🙂

It’s not just for backwards compatibility though. Processing strings at such a high level is not always required. It is also much slower to process strings when you are taking grapheme clusters into account, as you have to apply Unicode rules to determine whether a code point belongs with the previous one.

Any code “working” with a broken function is already broken. Fixing the function will simply fix correct code that currently contains an unnoticed anomaly waiting to happen. If it breaks wrong code, so be it; the good part is that it fixes correct code.

It’s been this way for a long time, but I guess it’s worth a Feedback request.

OMG! This culture of “let’s not fix the bug, let’s say that it’s a feature!” must end.

Sure, but the docs should not present the current wrong behavior as expected behavior, or people will start another round of workarounds that will break when it is fixed.
