IndexOf with text containing 4-byte UTF-8 can give off-by-one error

TimStreater · June 2, 2018, 8:41pm

Consider this piece of text, where I am obliged to have @-signs instead of the actual characters because they won’t display properly when I submit the post:

Intro stuff@Car stuff@end stuff.

The two @-signs are actually little cars that are 4-byte UTF-8 characters with this hex: f0 9f 9a 97 (Unicode code point U+1F697), and may be seen here:

https://www.utf8-chartable.de/unicode-utf8-table.pl

(look for the code page “Transport and Map symbols”, where it is described as “Automobile”.)

Now: what I find is that if I search for a string before the little cars (such as <tok1 in the above) then IndexOf returns the right value. However if I search for a string after the little cars (such as <tok2 in the above), then IndexOf returns a value which is too small by 1. I have a tiny project to demo this to submit as a feedback, but before doing that I wanted to see if this rings any bells. I couldn’t find anything in the forum or feedback cases that looked similar.

Using macOS (Mavericks and High Sierra). This happens with 2018r1.1, and also with 2017r2.1.
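Since the car character does not survive posting, here is a minimal sketch of how the sample text could be rebuilt from the code point instead of being pasted. It assumes the new-framework `Text.FromUnicodeCodepoint` shared method; the variable names are made up for illustration, the sample does not include the search tokens from the original file, and `System.DebugLog` is only one way to show the values.

```xojo
// U+1F697 (Automobile) built from its code point, so nothing depends on
// how the forum or the source editor renders the character.
Dim car As Text = Text.FromUnicodeCodepoint(&h1F697)
Dim sample As Text = "Intro stuff" + car + "Car stuff" + car + "end stuff."

// Counting each car as one character, "end stuff." should begin at
// zero-based position 22 (11 + 1 + 9 + 1).
Dim pos As Integer = sample.IndexOf("end stuff.")
Dim total As Integer = sample.Length
System.DebugLog("IndexOf: " + pos.ToText + ", Length: " + total.ToText)
```

Comparing the logged position with 22 on each platform would show whether the search, or a later slice of the text, is where the count goes wrong.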

Fixed Link

Kem_Tekinay · June 2, 2018, 8:57pm

However you posted that link, it doesn’t take me to where it should.

Kem_Tekinay · June 2, 2018, 9:00pm

I assume that you are dealing with Text, not String, since you are using IndexOf, yes? If that’s the case, text encoding, UTF-8 or otherwise, doesn’t factor into it. IndexOf is zero-based, not one-based. Could that have something to do with it?

Otherwise, it would be better to see your code or the example project.

TimStreater · June 2, 2018, 9:09pm

If you mean the two links related to the “@-sign” text, I’ve no idea how they got converted to links. It wasn’t my doing.

Kem_Tekinay · June 2, 2018, 9:10pm

No, this one:

https://www.utf8-chartable.de/unicode-utf8-table.pl

anon20074439 · June 2, 2018, 9:16pm

Post the demo you’ve made so we can see what you’re doing with IndexOf.

TimStreater · June 2, 2018, 9:19pm

I’m dealing with text and I know it’s zero-based. I can submit a feedback if you think that is better.

Ah, the UTF-8 pages. OK. There is a popup menu in the setup area of that page where you choose which UTF-8 page you want to look at. The one with the little Automobiles is about 15 up from the bottom of that menu.

TimStreater · June 2, 2018, 9:21pm

Can I post a zip file here? If not I’ll do a feedback case. The text to be searched is in a file.

anon20074439 · June 2, 2018, 9:24pm

You can only post links so you’ll need some sort of online storage like dropbox or (does a quick google) https://www.filedropper.com/ (never used them, just tested it, should work)

TimStreater · June 2, 2018, 9:31pm

Feedback case 52370.

Kem_Tekinay · June 2, 2018, 9:32pm

For convenience:

<https://xojo.com/issue/52370>

TimStreater · June 2, 2018, 9:43pm

[quote=390353:@Kem Tekinay]For convenience:

<https://xojo.com/issue/52370>[/quote]

Ah, is that how you do it? Thanks, I was wondering about that.

anon20074439 · June 2, 2018, 10:09pm

Looks like you’ve found a bug there. It works fine on Windows 10:

but not on my El Capitan laptop with 2017r3.

TimStreater · June 3, 2018, 7:03am

Now, looking at those numbers, I wonder whether IndexOf is actually providing the right value and it is Mid that then returns the wrong slice of the text.

TimStreater · June 3, 2018, 8:35am

You might like to modify the statement in findToken from:

endpos = app.htmlBody.IndexOf (start+tokstr.length, ">") // Look for token terminator

to:

endpos = app.htmlBody.IndexOf (start, ">") // Look for token terminator

which should not affect matters except to burn a few cycles. However, what actually happens is that with start=40 we get endpos=39.

The modified statement is what I originally had in my app (slightly lazy not to advance the start, I agree), and having endpos less than start caused a loop which took a while to pin down to the 4-byte UTF-8 chars.

Seems to me this must also be a bug: IndexOf returning a value lower than the starting position (except -1 for “not found”, of course).
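Until that is sorted out, a scanning loop can protect itself against a result that lies before the start position. The following is only a sketch: `IndexOfChecked` is a hypothetical helper name, not something in the demo project or the framework, and it simply reports such a result as "not found" so the caller stops instead of looping.

```xojo
Public Function IndexOfChecked(Extends t As Text, start As Integer, other As Text) As Integer
  // Hypothetical wrapper around Text.IndexOf (assumed to live in a module,
  // as any Extends method must).
  Dim pos As Integer = t.IndexOf(start, other)
  
  // -1 means "not found" and passes through unchanged.
  If pos <> -1 And pos < start Then
    // A hit before the requested start would send the caller backwards,
    // so report it as a failed search instead.
    Return -1
  End If
  
  Return pos
End Function
```

With this in place the call would read `endpos = app.htmlBody.IndexOfChecked(start, ">")`. It hides the symptom rather than fixing the position, so it is only useful for avoiding the endless loop.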

anon20074439 · June 3, 2018, 9:22am

It’s the text-cutting functions that seem to be in error (Mid, Left, etc.); they don’t seem to be accounting for U+1F697 correctly. If you put more of them in the text file, the cutting is out by even more. IndexOf seems to be working as expected; it’s just that IndexOf under Text returns a 0-based position, while InStr under String returns a 1-based position.

Interestingly, if you put app.htmlBody into a String and use the String version of Mid, it works as expected on both Mac and Windows.

That might be why no one has found this yet, as there doesn’t seem to be much uptake of Text in the new framework.

anon20074439 · June 3, 2018, 9:28am

Try replacing the whole of findToken with this:

[code]// Searches for the token in the supplied string. Having found it, compares it to the
// requested token (they should match) and reports whether it does or not.

Dim start, endpos As Integer, msg As Text

start = app.htmlBody.IndexOf(tokstr) // Look for the token
If (start = -1) Then
  msg = "Token start not found"
  writeLogging(msg)
  Return
End If

endpos = app.htmlBody.IndexOf(start + tokstr.length, ">") // Look for token terminator
If (endpos = -1) Then
  msg = "Token end not found"
  writeLogging(msg)
  Return
End If

writeLogging("Token: '" + app.htmlBody.mid(start, (endpos - start)) + "' found starting at " + start.ToText)
writeLogging("Token end found at " + endpos.ToText)
msg = "String found " + (If(app.htmlBody.mid(start, (endpos - start)) = tokstr, "MATCHES", "DOES NOT MATCH")) + " the search token"
writeLogging(msg)

// Same search again using String and InStr for comparison
Dim s As String
s = app.htmlBody
start = s.InStr(tokstr)
endpos = s.instr(start + tokstr.length, ">")
writeLogging("STRING Token: '" + s.mid(start, (endpos - start)).totext + "' found starting at " + str(start).ToText)
writeLogging("STRING Token end found at " + str(endpos).ToText)
msg = "STRING String found " + (If(s.mid(start, (endpos - start)) = tokstr, "MATCHES", "DOES NOT MATCH")) + " the search token"
writeLogging(msg)[/code]

Kem_Tekinay · June 3, 2018, 1:03pm

Try this until it’s fixed:

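Public Function MidFixed(Extends t As Text, start As Integer, length As Integer = -1) As Text
  // Split into individual characters so that slicing is done per array element
  Dim chars() As Text = t.Split
  Dim builder() As Text
  
  // A negative length means "to the end of the text"
  Dim stopIndex As Integer
  If length < 0 Then
    stopIndex = chars.Ubound
  Else
    stopIndex = start + length - 1
  End If
  
  // Clamp so an over-long length cannot index past the last character
  If stopIndex > chars.Ubound Then stopIndex = chars.Ubound
  
  For i As Integer = start To stopIndex
    builder.Append chars(i)
  Next
  
  Return Text.Join(builder, "")
End Function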

Or just use String if you can.

TimStreater · June 3, 2018, 5:44pm

Yes, this is probably what I should do. I’ll need a working .Left too. Oddly enough, I’ve already had to convert some texts to arrays of chars to speed the app up.
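For the .Left replacement, something along the same lines as MidFixed might do. This is a sketch only: `LeftFixed` is a made-up name, it uses the same split-into-characters approach, and it has not been run against the demo project.

```xojo
Public Function LeftFixed(Extends t As Text, count As Integer) As Text
  // Hypothetical companion to MidFixed: take the first `count` characters
  // by walking the character array instead of calling Text.Left.
  Dim chars() As Text = t.Split
  Dim builder() As Text
  
  // Clamp so asking for more characters than exist returns the whole text
  Dim stopIndex As Integer = count - 1
  If stopIndex > chars.Ubound Then stopIndex = chars.Ubound
  
  // If count is zero or negative the loop body never runs
  For i As Integer = 0 To stopIndex
    builder.Append chars(i)
  Next
  
  Return Text.Join(builder, "")
End Function
```

Usage would be `app.htmlBody.LeftFixed(start)` wherever `.Left(start)` is called now. Splitting on every call is not cheap, so if the text is already held as an array of chars it is probably better to slice that array directly.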