Split that "string" into an array

I have a file of what I will loosely describe as strings delimited by a “special value”. We all know how to split comma- or tab-delimited strings; in my instance I have a string of data delimited by hex FF, something like "EVERY#GOOD#BOY#DOES#FINE", where the ‘#’ character here represents a hex FF byte, ChrB(255). Attempting to split the string into an array by this value does not work.

I’ve looked at the source data (even in the Xojo debugger) and I can see exactly one byte between each word, with the value hex FF. I can see my delimiter string variable also holds a single hex FF byte. If I change this value to some other letter, like “D” or “O”, the string splits as you would expect. My encoding, for what it’s worth, is set to UTF-8.

Everything else about the file is ordinary text, linefeed-terminated, etc. I wanted to leverage the simple one-line solution of splitting each line by the “special character”. This hex FF character is “safe” in that it is guaranteed not to exist in the data, unlike the comma, tab, single and double quotes, vertical bar, etc. I simply want to split a “string” by hex FF.

I get that maybe this isn’t really a valid “string”, since it contains a byte not defined in the encoding. BUT Xojo had no problem READING this character from a FILE into a string, and Xojo had no problem with me setting a string variable to ChrB(255) — so… why doesn’t it just split the string by the byte value I passed in? Thoughts?
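For reference, here’s a minimal sketch of what I’m attempting (the variable names are mine; the real data comes from the file):

Var delim As String = ChrB(255)                  ' one raw &hFF byte, no encoding
Var rec As String = "EVERY" + delim + "GOOD" + delim + "BOY"
Var parts() As String = rec.ToArray(delim)       ' does not split as expected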

Try converting the encoding to Windows.ANSI.


This is probably your answer right here. You’re telling Xojo that this string containing 0xFF bytes randomly sprinkled throughout is UTF-8 when it almost certainly is not. So when you try to use the encoding-aware Split command, it understandably fails.

Tell us more about your input data. Do you have control over its contents? What is the character range of the rest of the data – is it ASCII, full UTF-8, etc? Are there a consistent number of fields per line?

Try this:

Var tis As TextInputStream = TextInputStream.Open(SpecialFolder.Desktop.Child("sep.txt"))
' WindowsANSI is a single-byte encoding, so every byte value 0-255 is a valid character
Var s As String = tis.ReadAll(Encodings.WindowsANSI)
' Split on the raw &hFF byte
Var parts() As String = s.ToArray(String.ChrByte(255))

tis.Close

I don’t know what the API 2 equivalent is, but the API 1 SplitB should work regardless of encoding.
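Something along these lines (untested; myRecord stands in for the line read from the file):

Dim delim As String = ChrB(255)                 ' raw &hFF byte
Dim parts() As String = SplitB(myRecord, delim) ' SplitB compares raw bytes, ignoring encoding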

-Karen


I haven’t tried SplitB as suggested by KarenA, but Wayne Golding and Martin T both suggested applying the WindowsANSI encoding, which worked just fine. Thanks, folks. Appreciated.

Bit 7 (the high bit) of a byte has special meaning in UTF-8 and should not be set arbitrarily.

You should not insert raw “bytes” into a UTF-8 string unless you know exactly the implications of what you are doing.

If you insert Chr(255) into a UTF-8 string, you are inserting the code point 255 (ÿ), and it is encoded as a UTF-8 sequence of two bytes (&hC3 &hBF) that use bit 7 to mark the continuation of the code point across multiple bytes.

So never insert ChrB(255), or any raw byte above 127 (&h7F), into a UTF-8 string.

You broke the encoding rules. Pay attention: Chr(255) and ChrB(255) are two different things.
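You can see the difference by checking the byte counts (a sketch; Bytes is the API 2 byte-length property):

Var a As String = Chr(255)   ' code point 255 ("ÿ"): stored as two bytes in UTF-8 (&hC3 &hBF)
Var b As String = ChrB(255)  ' a single raw byte, &hFF
' a.Bytes = 2, b.Bytes = 1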

You know, ASCII has special control characters for delimiting fields; you could use one, like the Group Separator Chr(29) or the Record Separator Chr(30) (there are also the File Separator Chr(28) and Unit Separator Chr(31)). Their ChrB() versions are allowed because, for any code point below &h80, ChrB() produces exactly the same single byte as Chr().
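A sketch of that approach (names are mine):

Var RS As String = Chr(30)   ' ASCII Record Separator: below &h80, safe inside UTF-8
Var rec As String = "EVERY" + RS + "GOOD" + RS + "BOY"
Var parts() As String = rec.ToArray(RS) ' splits normally; the encoding stays valid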


Before you declare victory, I suggest you test your solution with non-ASCII characters, like é and ñ.

Some foreign words will certainly break in non-UTF-8 (or non-UTF-32) encodings. Sometimes an accent, sometimes a Greek letter, sometimes a math symbol…

Some like those certainly will cause mess: “𝗑 𝗒 𝗓” or “𝓍 𝓎 𝓏”

Well, I did note that setting my string to Chr(255) resulted in TWO bytes, whereas ChrB(255) resulted in a single byte with the value hex FF. I’m actually processing records that come from a NoSQL MultiValue database. Anyone familiar with Rocket Universe, Reality, or any PICK-type database originally designed in the 1960s will note that fields are separated by ATTRIBUTE delimiters (hex FE); each attribute can be multivalued by separating values with hex FD; each value can be further divided into subvalues (hex FC); and in this particular implementation, each subvalue can be divided into sub-subvalues (hex FB) and possibly sub-sub-subvalues (hex FA). When I want to return a row set of multiple RECORDS, I opted to use what is often referred to as an “END OF ITEM” character, hex FF.
I may opt to deal with this via MemoryBlocks. What I will say is that when it gets down to the actual “value”, there are no data types: no integers, floating point, dates, etc. Fundamentally it’s a string, in the C-string sense. It could be a NAME or ADDRESS or ZIP CODE, etc. It’s data that I need to manipulate as a string. So it seemed like a good first effort to “split” this into an array of strings; then I may need to split that “string” into an array of strings. My initial “string” may have any number of attributes. It may be large, maybe small, and may differ for each record. Some records may have 1000 attributes and no multivalued fields. Some may have only three fields, each with 1000+ multivalues. It’s impractical to DIM an array beforehand and/or constantly REDIM a multi-dimensional array. The best way I’ve found to treat this data and pass it from place to place is in its native form – a contiguous stream of bytes from which values can be extracted or inserted as a multidimensional data model.
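Given the WindowsANSI fix from earlier in the thread, the delimiter hierarchy could then be unpacked with nested splits along these lines (a sketch; rec is assumed to already carry the WindowsANSI encoding):

Var AM As String = String.ChrByte(254) ' attribute mark (hex FE)
Var VM As String = String.ChrByte(253) ' value mark (hex FD)
Var SM As String = String.ChrByte(252) ' subvalue mark (hex FC)
Var attributes() As String = rec.ToArray(AM)
For Each attr As String In attributes
  Var values() As String = attr.ToArray(VM)
  For Each v As String In values
    Var subvalues() As String = v.ToArray(SM)
    ' ... ChrByte(251) / ChrByte(250) for the deeper levels if needed
  Next
Next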

As I explained.

Ok, so you are working with legacy data, basically ASCII, no international support, no accents, nothing.

In this case just setting the encoding to WindowsANSI would suffice to preserve those bytes.
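For data that’s already in a string, DefineEncoding re-tags the existing bytes without converting them (a sketch; s stands in for the raw record):

s = s.DefineEncoding(Encodings.WindowsANSI)            ' relabel the bytes, don't convert them
Var parts() As String = s.ToArray(String.ChrByte(255)) ' now splits on the &hFF byte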
