Is this a bug or expected behaviour?

GarryPettet · July 26, 2021, 10:54am

Var a As String = "☺️"
Var aLength As Integer = a.Length
Var b As String = "😀"
Var bLength As Integer = b.Length

aLength is reported as 2, not 1, whereas the other emoji (b) is correctly reported as having length 1.

Should I report this as a bug or is there some other method I need to use to get the length of a string? According to the docs:

Returns the number of characters in the specified string.

DerkJ · July 26, 2021, 11:06am

Check the debugger.
The encoding could be the issue here: in UTF-8, a character can take one or more bytes.
Try .Bytes; you’ll get the actual byte count.

You can also check whether the string is corrupted (has invisible chars in it) by selecting the string, right-clicking, and choosing “Clean invisible ascii characters”.

GarryPettet · July 26, 2021, 11:46am

It’s typed directly into the IDE using the macOS emoji picker. It’s not an encoding issue.

DerkJ · July 26, 2021, 11:47am

Well what’s the encoding in the debugger?

Martin_T · July 26, 2021, 12:02pm

I imagine String.Bytes will give you the correct length of the emojis.

anon20074439 · July 26, 2021, 12:09pm

It’s a bug if Length is supposed to return the number of characters. A simple test is to put the string into a TextArea: if pressing the arrow key once moves over the whole character, then it should count as 1.

Things like Left, Middle etc. should then treat it as one character and not cut it in half like they do now (another bug).

Rick_Araujo · July 26, 2021, 12:48pm

Emojis came later to the Unicode encoding schemes. If Xojo uses Unicode libs to handle them, maybe they need to update them.

kevin_g · July 26, 2021, 12:50pm

Looks correct to me.

‘a’ consists of 2 UTF-8 sequences: E298BA & EFB88F (code points U+263A and U+FE0F, the latter being a variation selector), while ‘b’ consists of 1 UTF-8 sequence: F09F9880 (code point U+1F600).

The Xojo string functions have always operated on code points rather than user perceived characters which are a higher level text concept.

If you want user perceived characters you will probably have to count them using the Characters iterator. I’m sure manipulating strings at a user perceived character level would also be possible with a bit of work.
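
A minimal sketch of that counting approach, assuming the Characters iterator yields one item per user perceived character (as described above, not verified here). It logs Length, Bytes and the iterator count side by side for the string from the original post; the variable names are just for illustration.

```xojo
// Sketch only: compare three ways of measuring the same string.
// Assumption: Characters yields one item per user perceived character.
Var a As String = "☺️"

// Count user perceived characters by walking the Characters iterator.
Var perceived As Integer = 0
For Each ch As String In a.Characters
  perceived = perceived + 1
Next

// Length counts code points; Bytes counts bytes in the string's encoding.
System.DebugLog("Length: " + a.Length.ToString)
System.DebugLog("Bytes: " + a.Bytes.ToString)
System.DebugLog("Characters: " + perceived.ToString)
```

For this string the thread reports Length as 2, and the two 3-byte UTF-8 sequences listed above add up to 6 bytes.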

Is it a bug? I would say not. You can process strings either as code points or as user perceived characters. Many years ago, Realbasic chose code points for some reason. Maybe the concept of user perceived characters didn’t exist at the time or maybe the string processing code just didn’t support them.

If Xojo changed how this worked then it would break lots of code. Some of our apps interact with 3rd party libraries which also process text the same way so changing this would be a show stopper for us.

From what I can remember, the Xojo text data type did operate on user perceived characters so you could say that deprecating it was a step backwards. Maybe an enhancement to String would be to introduce an additional set of string methods (or maybe parameters on the existing methods) that specified code point mode or user perceived character mode.

Rick_Araujo · July 26, 2021, 1:06pm

I would say yes.
For me str.length() should return length in chars as stated in the manual. str.bytes() the number of bytes. Codepoints needs another method like str.codepoints().

Rick_Araujo · July 26, 2021, 1:15pm

Probably this bug is just related to libs needing updates.

Kem_Tekinay · July 26, 2021, 1:16pm

For better or worse, this is correct. The bytes behind the string represent two code points that are presented as one visible character.

Kem_Tekinay · July 26, 2021, 1:32pm

I’ve updated the docs to make that clear.

kevin_g · July 26, 2021, 1:37pm

Your suggestion wouldn’t be fixing a bug though. It would be changing existing functionality, which could break lots of existing code. Remember that this isn’t just Length. It would also affect Left, Right, IndexOf / InStr, Mid / Middle & Split (possibly others).

If Length was changed then the others would have to be changed. This would mostly convert String into the deprecated Text data type. I’m sure this would be useful for some but it would also break lots of code and would make string manipulation much slower for everyone. Remember, String won the String vs Text battle so that means living with the limitations.

If anything, the documentation is wrong as it should clearly mention code points but the whole concept of code points vs user perceived characters could be confusing to a lot of users.

The Characters iterator is the way to work with user perceived characters.
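
As a rough sketch of what Left-style slicing at that level could look like: LeftPerceived below is a hypothetical helper name, not a Xojo method, and it again assumes Characters yields one item per user perceived character.

```xojo
// Hypothetical helper (not part of Xojo): like Left, but counts
// user perceived characters instead of code points.
Function LeftPerceived(s As String, count As Integer) As String
  Var parts() As String
  For Each ch As String In s.Characters
    // Stop once the requested number of characters has been collected.
    If parts.Count >= count Then Exit
    parts.Add(ch)
  Next
  // Join the collected characters back together with no separator.
  Return String.FromArray(parts, "")
End Function
```

If the assumption holds, LeftPerceived("☺️ hello", 1) would return the whole emoji rather than only its first code point. It walks the string from the start on every call, so it will be slower than Left on long strings.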

anon20074439 · July 26, 2021, 1:37pm

And Left, Right, Middle etc. ?

What a dog’s dinner, in the name of backwards compatibility.

Kem_Tekinay · July 26, 2021, 1:39pm

I have to admit, I had to look that one up.

kevin_g · July 26, 2021, 1:50pm

It’s not just for backwards compatibility though. Processing strings at such a high level is not always required. It is also much slower to process strings when you are taking grapheme clusters into account, as you have to apply Unicode rules to determine whether a code point belongs with the previous one.

Rick_Araujo · July 26, 2021, 2:07pm

Any code “working” with a broken function is already broken. Fixing it will just fix correct code that currently contains an anomaly not yet noticed, waiting to happen. If it breaks wrong code, so be it; the good part is that it fixes correct code.

Kem_Tekinay · July 26, 2021, 2:12pm

It’s been this way for a long time, but I guess it’s worth a Feedback request.

Rick_Araujo · July 26, 2021, 2:13pm

OMG! This culture of “let’s not fix the bug, let’s say that it’s a feature!” must end.

Rick_Araujo · July 26, 2021, 2:16pm

Sure, but the docs should not state that the current wrong behavior is expected behavior, or people will start another round of workarounds that will break when it is fixed.