Check the debugger.
The encoding could be the issue here: in UTF-8 a single character can take one or more bytes.
Try .Bytes and you’ll get the actual byte count.
You can also check whether the string is corrupted (has invisible chars in it) by selecting the string, right-clicking, and choosing “Clean Invisible ASCII Characters”.
It’s a bug if Length is supposed to return the number of characters. The simple test is to put it into a textarea: if pressing the arrow key once moves over the whole thing, it is 1 character and Length should be 1.
Things like Left, Middle etc. should then treat it as one character and not cut it in half like they do now (another bug).
‘a’ consists of 2 UTF-8 sequences: E298BA & EFB88F while ‘b’ consists of 1 UTF-8 sequence: F09F9880
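To make those numbers concrete, here is a small sketch in Python (not Xojo; Python's `len` happens to count code points, which is the behaviour being discussed). The hex strings are the byte sequences quoted above; the helper name `describe` is made up for illustration.

```python
# The byte sequences quoted above, written as hex and decoded as UTF-8.
a = bytes.fromhex("E298BA" "EFB88F").decode("utf-8")
b = bytes.fromhex("F09F9880").decode("utf-8")

def describe(s):
    """Return (UTF-8 byte count, code point count, code points as U+XXXX)."""
    return (len(s.encode("utf-8")), len(s), [f"U+{ord(c):04X}" for c in s])

print(describe(a))  # (6, 2, ['U+263A', 'U+FE0F'])
print(describe(b))  # (4, 1, ['U+1F600'])
```

So ‘a’ is a base character followed by a variation selector (two code points, six bytes), while ‘b’ is a single code point encoded in four bytes. Both look like one character on screen.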
The Xojo string functions have always operated on code points rather than user perceived characters which are a higher level text concept.
If you want user perceived characters you will probably have to count them using the Characters iterator. I’m sure manipulating strings at a user perceived character level would also be possible with a bit of work.
Is it a bug? I would say not. You can process strings either as code points or as user perceived characters. Many years ago, Realbasic chose code points for some reason. Maybe the concept of user perceived characters didn’t exist at the time or maybe the string processing code just didn’t support them.
If Xojo changed how this worked then it would break lots of code. Some of our apps interact with 3rd party libraries which also process text the same way so changing this would be a show stopper for us.
From what I can remember, the Xojo text data type did operate on user perceived characters so you could say that deprecating it was a step backwards. Maybe an enhancement to String would be to introduce an additional set of string methods (or maybe parameters on the existing methods) that specified code point mode or user perceived character mode.
I would say yes.
For me str.length() should return length in chars as stated in the manual. str.bytes() the number of bytes. Codepoints needs another method like str.codepoints().
Your suggestion wouldn’t be fixing a bug though. It would be changing existing functionality, which could break lots of existing code. Remember that this isn’t just Length. It would also affect Left, Right, IndexOf / Instr, Mid / Middle and Split (possibly others).
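As a rough illustration of why the slicing functions are tied to whatever Length counts, here is a Python sketch (not Xojo). The `left` helper is hypothetical and only mimics a code-point-based Left; it is not how Xojo implements it.

```python
def left(s, n):
    """Return the first n code points of s (Python slices by code point)."""
    return s[:n]

# U+263A followed by the variation selector U+FE0F: one visible character.
smiley = "\u263a\ufe0f"

print(len(smiley))                   # 2
print(left(smiley, 1) == "\u263a")   # True: the variation selector was cut off
print(left(smiley, 2) == smiley)     # True
```

If Length reported 1 for this string while Left still counted code points, `left(s, length)` would no longer return the whole string, which is why the functions would have to change together.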
If Length was changed then the others would have to be changed. This would mostly convert String into the deprecated Text data type. I’m sure this would be useful for some but it would also break lots of code and would make string manipulation much slower for everyone. Remember, String won the String vs Text battle so that means living with the limitations.
If anything, the documentation is wrong as it should clearly mention code points but the whole concept of code points vs user perceived characters could be confusing to a lot of users.
The Characters iterator is the way to work with user perceived characters.
It’s not just for backwards compatibility though. Processing strings at such a high level is not always required. It is also much slower to process strings when you take grapheme clusters into account, as you have to apply Unicode rules to determine whether a code point belongs with the previous one.
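To give a feel for the extra work, here is a deliberately simplified sketch in Python (not Xojo) of grouping code points into user perceived characters. It is an assumption-laden toy: it only attaches combining marks, variation selectors and zero width joiner sequences to the previous code point, and ignores the rest of the Unicode segmentation rules (flags, Hangul, emoji skin tone modifiers and so on). The function names are made up for illustration.

```python
import unicodedata

ZWJ = "\u200d"  # zero width joiner: glues neighbouring emoji into one unit

def extends_previous(cp):
    """Simplified rule: does this code point attach to the one before it?"""
    # Categories Mn/Me/Mc are combining marks; variation selectors such as
    # U+FE0F are in Mn. ZWJ is category Cf, so it is checked separately.
    return unicodedata.category(cp) in ("Mn", "Me", "Mc") or cp == ZWJ

def split_clusters(s):
    """Group code points into approximate user perceived characters."""
    clusters = []
    join_next = False  # True right after a ZWJ: next code point joins too
    for cp in s:
        if clusters and (join_next or extends_previous(cp)):
            clusters[-1] += cp
        else:
            clusters.append(cp)
        join_next = (cp == ZWJ)
    return clusters

print(len("\u263a\ufe0f"), len(split_clusters("\u263a\ufe0f")))  # 2 1
print(len("e\u0301"), len(split_clusters("e\u0301")))            # 2 1
```

Even this toy version has to look up a Unicode property for every code point, whereas counting code points needs no lookups at all.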
Any code “working” with a broken function is already broken. Fixing the function will just fix correct code that currently contains an anomaly not yet noticed, waiting to happen. If it breaks wrong code, so be it; the good part is that it fixes correct code.
Sure, but the docs should not present the current wrong behavior as expected behavior, or people will start another round of workarounds that will break once it is fixed.