Please "Un-deprecate" `Text`

The Text datatype was deprecated in Xojo 2021r1. Whilst Xojo do tend to support deprecated functions etc. for a long time, I cannot bring myself to build a brand new, complex app on a deprecated class as it’s too risky. I’m therefore asking Xojo to consider un-deprecating the Text datatype, and I want to explain why.

I’m writing a code editor which, as you can imagine, has to accept arbitrary textual data from users, including not only emojis but also characters from non-English languages. The String class (whilst much faster than Text) simply does not work when manipulating text that contains these characters and there are no good workarounds. I will give you two examples:

Example 1 - slicing characters:

Let’s say the user enters this text: "😀+☺️".
This is three visible characters.

Suppose the user presses the backspace key in my editor (i.e. they want to remove the second emoji). Xojo implies we can do this:

Var s As String = "😀+☺️"
s = s.Left(2) // Get the first two characters

The trouble is this silently fails. Since the second emoji has a length of 2 (not 1) the returned string is broken. This is really nicely demonstrated in Xojo’s IDE (which also fails to handle non-English characters correctly):

[Image: Broken String Left]

Notice how pressing the delete key in the actual Xojo IDE breaks the string. I suspect Xojo Inc are using Strings under the hood for data storage so they are also seeing this bug.

Further manipulation of the s String returned will continue to lead to broken and unpredictable results. This behaviour affects all the String functions that manipulate the length of a string (Left, Middle, Right, etc).

Example 2 - getting the length of a string

How long is this piece of text:

Var s As String = "☺️"
// What is s.Length?

“Length is 1, right?” Wrong. In this case Length is 2. This is because this particular emoji is composed of two code points, not one. How about this emoji?

Var s As String = "😀"
// s.Length must be 2 right?

Nope, s.Length in this case is 1 (which is correct).

The “solution” to this dilemma using String is to iterate over all characters in the string using the String.Characters method and manually count them:

Function CharacterCount(Extends s As String) As Integer
  Var count As Integer = 0
  For Each c As String In s.Characters
    count = count + 1
  Next c

  Return count
End Function

Whilst this “solves” the problem of getting an incorrect length, it doesn’t fix the broken .Left, .Right and .Middle functions which will break because they also work on code points and not characters.

API 2.0 really confuses things here

The Text datatype solves all of these issues (that’s why it was created in the first place!) but it has now been deprecated. Xojo keep saying to make feature requests to add new functionality to the String datatype to handle things that Text used to. The trouble is Xojo have made odd choices with renaming methods on the String datatype in API 2.0 which confuse the issue. For example, Xojo renamed String.Mid to String.Middle but didn’t change the functionality. What would have been sensible would be to have String.Middle act like the old Text.Mid function that actually correctly returned the middle characters for multi-codepoint strings.

I know people will complain that Text is slow (and it is) but there is a real use case for it to not be deprecated. I am now facing the prospect of rewriting my entire code editor (a multi-thousand-line project) to use Text rather than String because I was duped into thinking that String had been updated to correctly handle these characters (the docs have subsequently been updated). I will have to accept that Text is slower than String (hopefully the editor will still be usable).

Can I get some support for this?

I would be really interested to hear from @Geoff_Perlman about the reasoning behind the Text datatype and why it could not be “un-deprecated”.

Not so. Middle was added to String and is not the same as Mid. Mid counts from 1 and Middle counts from 0.

Good morning @GarryPettet,

all the things you describe are possible with the string data type. I have done this myself with the TextInputCanvas and String.

The trick is to use String.Bytes to get the number of bytes in a (composite) character, i.e. a grapheme cluster. If you’re working with tokens in your code editor, you can just read the current token under the cursor and then use Graphics.TextWidth together with the String.Characters iterator to work out, step by step, when you have reached the character/emoji under the cursor. At the same time you can buffer the byte length of the String and then manipulate it using String.MiddleBytes/LeftBytes/RightBytes, etc. This works wonderfully.
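
A minimal sketch of the byte-based idea for the backspace case, assuming (as described in this thread) that String.Characters yields whole grapheme clusters and that the string uses a single encoding throughout. RemoveLastCharacter is a hypothetical helper name, not part of the Xojo framework, and it would need to live in a module because it uses Extends:

```xojo
// Hypothetical helper: remove the last user-perceived character from s.
// Assumption: String.Characters returns whole grapheme clusters.
Function RemoveLastCharacter(Extends s As String) As String
  Var lastBytes As Integer = 0

  // Walk the clusters and remember the byte size of the most recent one
  For Each c As String In s.Characters
    lastBytes = c.Bytes
  Next c

  // Cut on a byte boundary so the final cluster is removed as a whole
  Return s.LeftBytes(s.Bytes - lastBytes)
End Function
```

This walks the whole string on every call, so a real editor would more likely cache cluster boundaries per line or per token rather than recompute them on each key press.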

BTW, you’re right, Xojo has unfortunately not implemented even its own code editor well as far as processing emojis or international character sets is concerned. But it is possible.

If you have any questions, feel free to write.

True but it still doesn’t count characters. That would have been more sensible.

Thanks for the offer Martin (I may take you up on it).

I too could use String, but I suspect just switching to Text would fix a lot of the issues. I’m not going to do that, though, without some reassurance that the data type is no longer deprecated.

Which issues do you have?

Nasty bugs that only surface when certain characters are entered. Mostly doing string joins and splits with Left and Right that return part characters unexpectedly. I can fix them (now I know that String doesn’t behave as I expected it to with extension methods) but what I’m asking Xojo to do is explain why Text has been deprecated since it already solves all the issues I’m having.

I’m very happy to use Text instead of String but I’m not going to use a deprecated class. I honestly don’t understand why it’s deprecated.

Speed? I do a lot of processing with strings and less speed would make me very upset.

For sure which is why people can use String if speed is important to them (or they know that they won’t be dealing with multi-codepoint characters).

Having a fast but inaccurate datatype is worse than having a slow but accurate one.

We need the right tool for the job.

No, someone mentioned above a language which has 3 different ways of dealing with string/text/whatever. This is a bad idea because you never know if you are dealing with emojis or not.

Example: my app deals with emails. Those emails can be printed or exported to PDF. The name of the PDFs can contain emojis. I’ve had some fun with emojis before where some cloud solutions didn’t do emojis. See “Mail Archiver, PDFs, Dropbox, BoxCryptor and Emojis”. So I really don’t want to think about the underlying technology.

OT

I would be happy if we did not have umlaut issues in 2021 …

Email data is HTML anyway. I’m using an HTML viewer to create the raw PDF and then I add header information. MBS, of course. WKWebView has been much fun.

If you want to guarantee the correct processing of emojis, Text is the solution. The current implementation of Xojo String handles code points, and SOME emojis are made of clusters (2+ code points). That breaks text processing and causes bad things: for example, a string with 10 characters can have String.Length return 14, whereas Text.Length gets it right and returns 10.

Crikey.

And I thought Xojo had made a mess of Timers (which I now cannot get to work on any platform).

Please stop these shenanigans with different (inconsistent) APIs and “deprecation”. It would be a lot more helpful if you just deleted the stuff from both the compiler and the supporting documentation and made a clean break instead of having things “deprecated”.

Right now I have no idea what’s supposed to work, or not, in several object classes.

The 2021 releases have also broken my main app and I’ve had to go back to the 2019 versions just to be able to compile what has worked for years.

The other really annoying thing - and the Timer is a good example - is that you have managed to make something that is fundamentally simple so tedious, with a stupidly verbose syntax. In several cases I have rolled my own classes just to hide that horrible API syntax and use straightforward single-word names for properties/methods.

I would want to see methods added to string to handle the other way rather than have 2 strings. Having two sets is just simply terrible in so many ways.

That’s another possibility to explore.

Right now we have this:

Var facepalmCluster As String = &u1f926+&u1f3fc+&u200d+&u2642+&ufe0f

Var facepalmClusterText As Text = facepalmCluster.ToText

MessageBox facepalmCluster + EndOfLine + _
"String length: " + facepalmCluster.Length.ToString + EndOfLine +_
"Text length: " + facepalmClusterText.Length.ToString

I also see the degree of verbosity of API 2 as a negative. While in a vacuum it may in some cases be more readable, for me, in a line or page of code, the visual density of the text makes it harder to read/digest at a glance.

While some names in API 1 could have been better, the degree of verbosity was about right - a good compromise between explicitness, convenience and overall code readability.

You turned this into Feedback? I’ll add comments there.

With the old String class, we had Len (which returns # of codepoints) and LenB (which returns # of bytes), and similar functions (Mid, MidB, etc.)

I think it would make sense to simply add some new methods which operate on characters, e.g.:
LenC
LeftC
RightC
etc.

If I were in your situation, I would probably just write my own extension functions for the String class which do this.
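
As a rough sketch of what such extension functions might look like: the names LenC, LeftC and RightC follow the suggestion above and are hypothetical (they are not part of the Xojo framework). The code assumes String.Characters iterates whole characters, as used earlier in this thread, and the functions would need to be placed in a module:

```xojo
// Hypothetical extension methods; not framework functions.
// Assumption: String.Characters yields whole (multi-codepoint) characters.

Function LenC(Extends s As String) As Integer
  // Count user-perceived characters rather than code points
  Var count As Integer = 0
  For Each c As String In s.Characters
    count = count + 1
  Next c
  Return count
End Function

Function LeftC(Extends s As String, count As Integer) As String
  // Return the first count characters, never splitting a cluster
  Var result As String
  Var taken As Integer = 0
  For Each c As String In s.Characters
    If taken >= count Then Exit
    result = result + c
    taken = taken + 1
  Next c
  Return result
End Function

Function RightC(Extends s As String, count As Integer) As String
  // Skip the leading characters, then keep the last count characters
  Var skip As Integer = s.LenC - count
  Var result As String
  Var index As Integer = 0
  For Each c As String In s.Characters
    If index >= skip Then result = result + c
    index = index + 1
  Next c
  Return result
End Function
```

Each call iterates the string from the start, so these trade speed for correctness; for heavy string processing the byte-based functions would still be the faster choice.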

What verbosity?

Everyone works with timer just fine.