do I need to convert text to memoryblock to get lenb()?

James_Sentman · May 25, 2016, 11:51am

Working to convert some protocol handling code to iOS and woof, the new framework is kicking my backside. I've forgotten how frustrating it is not to know immediately how to do the most basic stuff. But learning is good for the brain, so I'm not complaining about that, though having to ask such simple questions is a little embarrassing…

How do I get the actual byte length of a Text? Do I have to convert it to a MemoryBlock via the TextEncoding conversion method and then take the size of the MemoryBlock? Doesn't that actually translate the data in the text and make a copy of it in memory? I don't want to copy the data or do any encoding conversion on it at all. Any conversion would ruin the byte count anyway, as it may no longer match the original text that I'll be writing out a socket in a moment, and I need to write the length of the data before the data itself.

In the past I've been using strings. I'm feeling like I should be using memory blocks for all this data now, but it's mostly human-readable strings. They may contain multi-byte Unicode characters, so I need to know how big the data actually is before writing it, without doing any encoding conversion on it.

Beatrix_Willius · May 25, 2016, 12:01pm

How about using encoding = nil?

Michel_Bujardet · May 25, 2016, 12:16pm

No need to create a memoryblock :

Function LenB(Extends T As Text) As Integer
  Const allowLossy As Boolean = True
  Return TextEncoding.UTF8.ConvertTextToData(T, allowLossy).Size
End Function

I have used allowLossy to prevent possible errors, as TextEncoding can be finicky with some characters.

As for setting the encoding to nil, I don’t think that is possible with the new framework Text type.

Kem_Tekinay · May 25, 2016, 12:45pm

Right. Text represents the actual text, hence there is no encoding (or byte count).

If you open a word processing document, the paragraphs you see are the equivalent of Text. You might want a character count, but a “byte count” would not make sense. On the other hand, if you open that same document in a low-level file editor, you’ll see how the data was actually stored on disk, i.e., how the same text was “encoded”. This is the equivalent of String or using a MemoryBlock.

Kem_Tekinay · May 25, 2016, 12:47pm

Also see

https://forum.xojo.com/conversation/post/187814

Michel_Bujardet · May 25, 2016, 2:34pm

Actually, I believe Text always has an encoding; that is one of the things that makes it different from String. But indeed Text.Length reflects only visible glyphs, and that can be very different from String, as Text can render composite characters such as an è formed of `+e.

The method I posted above does indeed measure the bag of bytes:

whatever 8
whateveré 10
whatever 12

I have added LenB in both forms, LenB(Text) and Text.LenB, to XojoiOSWrapper on GitHub (Mitchboo/XojoiOSWrapper), a module that brings legacy and additional functions to Xojo iOS.

Joe_Ranieri · May 25, 2016, 3:08pm

An encoding describes how code points are represented as bytes. Text is always a series of Unicode scalar values, but no byte representation (encoding) is specified.

No. Text.Length represents Unicode grapheme clusters, which can be thought of as a user-perceived character. Non-printable characters like U+0008 BACKSPACE still count towards Text.Length. The exact logic is described in Unicode Standard Annex #29 - Unicode Text Segmentation.
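
As a rough illustration of that distinction (a minimal sketch, assuming the new framework's Xojo.Core namespace; the variable names are made up for the example), the same user-perceived character can be built from one or two code points, and a byte count only appears once an encoding is chosen:

```xojo
// "é" written two ways: one precomposed code point (U+00E9),
// or "e" followed by a combining acute accent (U+0301).
Dim precomposed As Text = Text.FromUnicodeCodepoint(&hE9)
Dim plainE As Text = "e"
Dim decomposed As Text = plainE + Text.FromUnicodeCodepoint(&h301)

// Text.Length counts grapheme clusters, so by the UAX #29 rule
// both values should report a length of 1.
Dim lengthPrecomposed As Integer = precomposed.Length
Dim lengthDecomposed As Integer = decomposed.Length

// Bytes only exist after choosing an encoding. U+00E9 takes 2 bytes in UTF-8.
Dim utf8Bytes As UInteger = Xojo.Core.TextEncoding.UTF8.ConvertTextToData(precomposed).Size
```

The byte count of the decomposed form is deliberately not stated here, since it depends on whether the framework normalizes the text during conversion.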

Joe_Ranieri · May 25, 2016, 3:09pm

[quote=268018:@Michel Bujardet]No need to create a memoryblock :

Function LenB(Extends T As Text) As Integer
  Const allowLossy As Boolean = True
  Return TextEncoding.UTF8.ConvertTextToData(T, allowLossy).Size
End Function
[/quote]

This does create a MemoryBlock but immediately throws it out after measuring the size.

Michel_Bujardet · May 25, 2016, 3:16pm

To be complete, zero-width space characters are neither visible to nor perceived by the user.

James_Sentman · May 26, 2016, 2:50pm

I'm still having some basic conceptual problem with this, or missing something obvious. Sorry.

Something I do very often is write strings/text/arbitrary binary data to a binary stream. Usually I take the lenb() and write that as a UInt16 or something first, so that when reading back in I can read the length of the string or binary data and then read that many bytes. That doesn't seem possible with Text without doing an unnecessary conversion to a memory block to get the lower-level length?

BinaryStream.WriteText will write a Text to the binary stream. I assume it writes the actual bytes. And BinaryStream.ReadText requires that you pass the count, which the docs say is in bytes, not a count of grapheme clusters. So in order to be able to read a Text I must already know its binary length, but there is no way to get its binary length without doing the dance with the memory block? I'm either not understanding how this is meant to be used or there is a hole in the implementation somewhere.

Thom_McGrath · May 27, 2016, 5:42am

Usually you would use some kind of type-length-data markup before each value. A fixed-length header that tells you how to read the next chunk.

Greg_O_Lone · May 27, 2016, 9:55am

The thing you are missing is the reason for the difference between Text and MemoryBlock now. Text is meant for representing, well, text: things that require an encoding to display them correctly on a computer screen. MemoryBlock is now the only way to represent data in terms of bytes for communicating with other services, whether it be a file or a socket or whatever else where the encoding wouldn’t get transmitted. In the past, String blurred that line, but it led to many confusing situations. We tried to make the distinction more obvious in the new framework.

To answer your question, to write that binary data to a file, yes, you need to convert it to a MemoryBlock first. You should be doing this with the understanding that you’ll need the same encoding when you reload the data later to get it back to its original state.
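
A minimal sketch of that round trip, assuming UTF-8 as the agreed encoding and a UInt16 length header as described earlier in the thread. WriteLengthPrefixedText and ReadLengthPrefixedText are hypothetical helper names, not framework methods:

```xojo
// Hypothetical helper: writes a UInt16 byte count, then the UTF-8 bytes.
Sub WriteLengthPrefixedText(stream As Xojo.IO.BinaryStream, value As Text)
  // Encode once; the resulting MemoryBlock provides both the length and the payload.
  Dim data As Xojo.Core.MemoryBlock = Xojo.Core.TextEncoding.UTF8.ConvertTextToData(value)
  If data.Size > 65535 Then
    // The payload would not fit in a UInt16 header.
    Raise New Xojo.Core.InvalidArgumentException
  End If
  Dim byteCount As UInt16 = data.Size
  stream.WriteUInt16(byteCount)
  stream.Write(data)
End Sub

// Hypothetical helper: reads the header, then exactly that many bytes.
Function ReadLengthPrefixedText(stream As Xojo.IO.BinaryStream) As Text
  Dim byteCount As UInt16 = stream.ReadUInt16
  Dim data As Xojo.Core.MemoryBlock = stream.Read(byteCount)
  // Decode with the same encoding that was used when writing.
  Return Xojo.Core.TextEncoding.UTF8.ConvertDataToText(data)
End Function
```

Because the Text is encoded only once, the conversion that supplies the length is the same one that supplies the bytes to write, so no extra copy is made beyond the encoded data itself. Reader and writer also need to agree on the stream's byte order for the header.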

James_Sentman · May 27, 2016, 12:26pm

While I'm doing this forced refactor I'll spend some time reworking where I'm using strings (now Text) as opposed to binary data, but that really isn't the core of the problem. I understand that.

I'm still of the opinion that we should have access to the raw text data without having to do a conversion and copy in memory. Perhaps a Ptr reference or a .Data property that would return a MemoryBlock without the unnecessary conversion and copy.

Sending text over a socket or writing it to a binary file are, I expect, very common tasks, and at the moment the only way to do that is to make an extra copy of the text in memory and pass it through a TextEncoding converter? I should be able to get the raw data of the text so that I can write it or send it just the way it is, shouldn't I?

The convention of putting binary data in a string was confusing, but it made certain things easier. In a slightly related refactor in another part of the program, I'm managing a buffer for an incoming TCP socket. It used to be that with each DataAvailable event I'd do a ReadAll and concatenate it to the end of the buffer string, then see whether I'd got enough of it to read the whole packet or not. I can't concatenate memory blocks, so do I create a new one the size of the first two put together, copy in all the data already received, and then add the next block to it? That may have been what concatenating strings did behind the scenes; I don't know. Or do I convert both blocks to text and then add them together? Yuck… Still struggling with just basic questions of things I used to know how to do, which is frustrating.
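
One possible shape for such a receive buffer, as a sketch only: it assumes Xojo.Core.MutableMemoryBlock offers Append, Mid and Remove as I understand them from the framework documentation (check the signatures against your Xojo version), and it assumes a packet layout of a UInt16 length header followed by the payload. mBuffer, HandleIncoming and ProcessPacket are hypothetical names:

```xojo
// Hypothetical property on the socket handler class:
//   mBuffer As Xojo.Core.MutableMemoryBlock
// created once with: mBuffer = New Xojo.Core.MutableMemoryBlock(0)

Sub HandleIncoming(chunk As Xojo.Core.MemoryBlock)
  // Grow the buffer in place instead of building a new block by hand.
  mBuffer.Append(chunk)

  // Pull out every complete packet currently sitting in the buffer.
  While mBuffer.Size >= 2
    Dim payloadSize As UInt16 = mBuffer.UInt16Value(0)
    Dim needed As UInteger = 2 + payloadSize
    If mBuffer.Size < needed Then Exit While // wait for more data

    Dim payload As Xojo.Core.MemoryBlock = mBuffer.Mid(2, payloadSize)
    mBuffer.Remove(0, needed)
    ProcessPacket(payload) // hypothetical packet handler
  Wend
End Sub
```

The buffer stays as bytes throughout; converting to Text would only happen inside the packet handler, for the fields that are actually text. The header's byte order follows the MemoryBlock's LittleEndian setting, which has to match what the sender uses.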

KarenA · May 27, 2016, 12:38pm

[quote=268478:@James Sentman]Sending text over a socket or writing it to a binary file are, I expect, very common tasks, and at the moment the only way to do that is to make an extra copy of the text in memory and pass it through a TextEncoding converter? I should be able to get the raw data of the text so that I can write it or send it just the way it is, shouldn't I?

[/quote]

It does seem Text introduces what SEEMS to me to be significant, theoretically unnecessary overhead… but that is how I feel about much of the new framework.

It makes sure we are (over)protected from ourselves… and in doing so it adds both more execution overhead and the need for more coding, which is less RAD and introduces more complexity.

As I said before, in general, outside of a few inconsistencies and rough edges, the old framework IMO strikes a good balance between safety and expediency. The new framework not so much IMO…

Karen

Michel_Bujardet · May 27, 2016, 12:54pm

Except on iOS, where there is no choice, it can be wise to keep using the classic framework unless you need features that only the new framework provides.

James_Sentman · May 27, 2016, 1:15pm

My main day job is a project in Xojo that has a very rich interconnection protocol over TCP. I have all the classes to talk to the app that will be necessary to build an iOS client; unfortunately, I can't use any of them as they are. It's not just a big project but more of a completely new implementation in an entirely new language to build an iOS client. On days when I look at how many hundreds of strings I just stripped out of some of the basic data handling classes, and how I'm having to write an entirely new packet parser mucking about with memory blocks rather than just concatenating strings, I get rather discouraged about it. I'm not sure that by this afternoon I won't be dusting off the Swift programming book that is sitting on the shelf, if I'm going to learn a whole new language anyway. Or maybe I'll drop this for the moment and work on the Raspberry Pi client for a while instead, since I already have a perfectly good Web Edition based client that runs fine on iOS (and Android) anyway… Does nobody else send Text over sockets or write it to files? You can't do either of those things without finding the length of the underlying data and then accessing the underlying data. It makes no sense to copy it just to get access to its binary storage.

Kem_Tekinay · May 27, 2016, 1:44pm

James, no disrespect, but your comments show a fundamental misunderstanding of the Text type. I don’t blame you because, for us as programmers who learned to deal with “strings” all our lives, it’s a tough concept.

There is no additional “underlying data” to which to have access. You are assuming that, somewhere, there is a series of bytes that Xojo is somehow hiding from us, but there isn’t, any more than they are “hiding” the bytes that make up a Dictionary.

Look at it this way. Suppose you had an array of integers that you needed to transmit somewhere. How would you do it? Most likely, the intended recipient would expect a certain form, but let’s say you had control of that. Here are some possibilities:

Use a JSON array of the values in the form of [x,y,z].
Depending on the highest expected value of each element, pad each value to 1, 2, 4, or more, bytes, either big-endian or little-endian, and transmit as a stream.
Encode each value as a hex string of known length.

There are more, of course, but at no time would you think, “Xojo should just give us access to the underlying data so we can transmit this.” It wouldn’t make sense.

The Text type is very much like that array of integer. The best analogy is not a series of bytes like a String, but an array of Unicode code points. It’s the final, human-readable product and it’s up to you to convert that to something that can be transmitted properly, so it’s up to you to encode the code points properly. Does the recipient expect UTF-8, UTF-16, or maybe ISO Latin1? Xojo doesn’t know, just like it doesn’t know how you’d want to encode that array of integer.
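
To make that concrete, here is a minimal sketch that encodes the same Text three ways. It assumes the Xojo.Core.TextEncoding shared members UTF8, UTF16 and FromIANAName; the variable names are illustrative only:

```xojo
Dim base As Text = "whatever"
Dim sample As Text = base + Text.FromUnicodeCodepoint(&hE9) // "whateveré", precomposed é

// Same Text, three encodings, three different byte representations.
Dim asUTF8 As Xojo.Core.MemoryBlock = Xojo.Core.TextEncoding.UTF8.ConvertTextToData(sample)
Dim asUTF16 As Xojo.Core.MemoryBlock = Xojo.Core.TextEncoding.UTF16.ConvertTextToData(sample)
Dim latin1 As Xojo.Core.TextEncoding = Xojo.Core.TextEncoding.FromIANAName("ISO-8859-1")
Dim asLatin1 As Xojo.Core.MemoryBlock = latin1.ConvertTextToData(sample)

// asUTF8.Size is 10 here (8 ASCII bytes plus 2 for é), matching the
// figure reported earlier in the thread. The other two sizes differ,
// and which one is "right" depends on what the recipient expects.
```

None of these three blocks is the "real" data behind the Text; each is just one agreed way of writing the same code points down.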

Yes, it requires you to think about it more than you had to with String, which was already a series of bytes. But because you have to think about it, you are far less likely to transmit the wrong thing in error.

James_Sentman · May 27, 2016, 1:53pm

I believe I admitted far above in this thread that this was probably a core misunderstanding on my part. Thank you, that's very interesting and explains exactly what I needed to understand to get over the frustration of thinking it just wasn't fleshed out enough.

So instead I'll make a feature request for an Operator_Add implementation for memory blocks, so I can concatenate them like I did strings in the past.

Joe_Ranieri · May 27, 2016, 1:57pm

A bug report against the documentation would be useful, explaining what you found to be misleading or confusing and why Kem’s post cleared it up. The documentation about Text should be able to explain these concepts without needing to read forum posts.

Joe_Ranieri · May 27, 2016, 2:02pm

This seems reasonable. Would it help your use case if there were a function on TextEncoding to get the number of bytes that would be used if the conversion to MemoryBlock were to happen?