Is this a bug or expected behaviour?

Kem_Tekinay · July 26, 2021, 2:17pm

There is no such “culture” to my knowledge. But when you file Feedback, be sure to mention the updated docs in case that should be reversed or further clarified.

kevin_g · July 26, 2021, 2:18pm

It’s not broken.

Rick_Araujo · July 26, 2021, 2:20pm

Xojo has a culture of “if there’s a workaround, it’s not relevant”. Bugs stay there for years and people on the forum say “It’s a known thing, we have a workaround for that”. It has already become a cultural thing.

Kem_Tekinay · July 26, 2021, 2:21pm

Kindly keep this conversation on track.

Rick_Araujo · July 26, 2021, 2:21pm

Completely broken until redefined in docs to appear less broken.

anon20074439 · July 26, 2021, 2:22pm

The whole point of a high level language is to make it easy for users to get things done. If the user now goes outside their “safe bubble” and gets some text in from a 3rd party, be that a web service or an untrained user’s input, there is a possibility that their code will not perform as expected.

If any of those inputs has a multibyte encoding, then any string manipulation will fragment it, as happens just by selecting code in the IDE. Nice.

This makes the simplest of things like indexof, length, middle, left, right etc. pretty much pointless going forward, in a society where the use of diverse languages and pictures is only going to increase.

If people want speed, they should be using the Bytes variants of these, much like you would use a memoryblock for speed.

kevin_g · July 26, 2021, 2:23pm

It isn’t broken though. The string functions have never supported grapheme clusters so expecting them to return a grapheme cluster length is wrong.

Kem_Tekinay · July 26, 2021, 2:29pm

Right, which is why I said above “for better or worse”.

At any rate, I think @GarryPettet has his answer, and anyone who thinks this is the wrong behavior should file Feedback and post the link here.

Rick_Araujo · July 26, 2021, 2:30pm

You know that if we don’t process, split, etc. chars (here meaning any of the possible grapheme clusters), we will have numerous bugs from processing a pack of meaningless bytes instead of the expected chars (in byte terms, a grapheme cluster), don’t you?

Kem_Tekinay · July 26, 2021, 2:31pm

Yes, this has been a challenge practically since Unicode was introduced in the REALbasic days. That’s why understanding how it works is important.

Rick_Araujo · July 26, 2021, 2:32pm

So, let’s fix the bugs.

Kem_Tekinay · July 26, 2021, 2:33pm

Not a bug, and your continually calling it that doesn’t make it so.

But file Feedback so you can get the engineers involved. In the meantime, this is getting repetitive and unhelpful.

Rick_Araujo · July 26, 2021, 2:35pm

As I said, we have a sick cultural thing of tolerance of bugs, and rewriting the rules to make them “features” needing workarounds. I give up, again.

kevin_g · July 26, 2021, 2:36pm

It all depends on how the 3rd party deals with text. For example, the 3rd party libraries we integrate with handle text the same way Xojo does.

That isn’t correct. The problem only occurs when you have characters composed of multiple code points (for example, emoji & decomposed characters).
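To make that distinction concrete, here is a rough sketch in Python rather than Xojo, chosen only because Python strings are also indexed by code point. The `left` helper is a hypothetical stand-in for a code-point-based Left function and is not Xojo's implementation.

```python
def left(text: str, n: int) -> str:
    """Return the first n code points of text (hypothetical stand-in for Left)."""
    return text[:n]


# "été" with precomposed letters: each "é" is a single code point (U+00E9),
# even though it takes two bytes in UTF-8.
precomposed = "\u00e9t\u00e9"

# The same word with decomposed letters: "e" followed by U+0301
# (COMBINING ACUTE ACCENT), so each "é" is two code points.
decomposed = "e\u0301te\u0301"

print(len("\u00e9".encode("utf-8")))      # 2 bytes, yet still one code point
print(left(precomposed, 1) == "\u00e9")   # True: the multibyte character stays whole
print(left(decomposed, 1) == "e")         # True: the accent has been split off
```

Under these assumptions, a multibyte encoding alone does not cause fragmentation; splitting only goes wrong when one visible character spans several code points.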

No. Using the Bytes variants would mean that you would have to write your own UTF-8 processing code which I imagine would be quite slow if 100% Xojo code.

I would say there are 3 levels of text processing:
a) Bytes
b) Code Points
c) Grapheme Clusters

Xojo provides functions for ‘a’ & ‘b’, with ‘b’ being the default. ‘c’ can be implemented via the Characters iterator.
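The three levels listed above can be illustrated with a small Python sketch (Python, not Xojo, and not how Xojo implements any of this). The `count_levels` function is a hypothetical helper, and its grapheme cluster count is a deliberate simplification: it only skips combining marks, so it will miscount emoji sequences built with modifiers or joiners. Full segmentation follows the rules in Unicode Standard Annex #29.

```python
import unicodedata


def count_levels(text: str) -> tuple[int, int, int]:
    """Return (UTF-8 bytes, code points, approximate grapheme clusters)."""
    # Level a: bytes in the UTF-8 encoding.
    byte_count = len(text.encode("utf-8"))
    # Level b: code points (what len() reports for a Python str).
    code_point_count = len(text)
    # Level c (approximation): count only code points that are not combining
    # marks, so a base letter plus its accents counts once.
    cluster_count = sum(1 for ch in text if unicodedata.combining(ch) == 0)
    return byte_count, code_point_count, cluster_count


decomposed_e_acute = "e\u0301"  # "e" + COMBINING ACUTE ACCENT
print(count_levels(decomposed_e_acute))  # (3, 2, 1)
print(count_levels("abc"))               # (3, 3, 3)
```

For plain ASCII all three levels agree, which is why the difference tends to surface only once accented or emoji input arrives.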

kevin_g · July 26, 2021, 2:36pm

Saying something is a bug doesn’t make it a bug though.

Rick_Araujo · July 26, 2021, 2:37pm

Well, saying that a bug is not a bug doesn’t make it so, either.

kevin_g · July 26, 2021, 2:39pm

It isn’t a bug. The OP is expecting Xojo to return the length of a string based on a different level of text handling.

TimStreater · July 26, 2021, 2:40pm

@Rick_Araujo doesn’t appear to appreciate that the problem, if any, lies in Unicode’s acceptance of the notion of combining code points to give what to a human looks like a single character.
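As a small illustration of combining code points, the Python sketch below (not Xojo) shows the same visible “é” stored either as one code point or as two, and how NFC normalization maps the decomposed form onto the precomposed one. The `compose` helper is a hypothetical name, and normalization is shown only to make the two forms visible: many clusters, emoji sequences among them, have no precomposed form.

```python
import unicodedata


def compose(text: str) -> str:
    """Normalize text to NFC, composing base + combining mark where possible."""
    return unicodedata.normalize("NFC", text)


decomposed = "e\u0301"         # "e" + COMBINING ACUTE ACCENT
composed = compose(decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u00e9")            # True: now the single code point U+00E9
```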

Rick_Araujo · July 26, 2021, 2:42pm

Code points combined into one char are a char for both humans and computers.
Not Rick’s opinion, just a fact.

Kem_Tekinay · July 26, 2021, 2:45pm

You’ve all made your points, and this is now turning into an argument. I’m asking you all to stop here unless you have something new to contribute to @GarryPettet’s question.