[Desktop] Creating a Table of Contents for an HTML file

That is what I was doing when I fell into a trap, and now I am at its bottom.

What is wrong with the code below?


// Real search fail: <a name=" | "></a>
Var Tag_Start  As String  = "<img src="""
Var Tag_End    As String  = """ border=1>"
Var Pos_Actual As Integer = 0
Var Pos_End    As Integer
Var Sel_Len    As Integer

TA_Results.Text = ""

Do Until TA_Source.Text.IndexOf(Pos_Actual, Tag_Start) = -1
  Pos_Actual = TA_Source.Text.IndexOf(Pos_Actual, Tag_Start)
  
  Pos_End = TA_Source.Text.IndexOf(Pos_Actual, Tag_End)
  
  Sel_Len = (Pos_End - Pos_Actual) - Tag_End.Length
  
  TA_Source.SelectionStart  = Pos_Actual + Tag_Start.Length
  TA_Source.SelectionLength = Sel_Len + 1
  
  TA_Results.AddText TA_Source.SelectedText + EndOfLine
  
  // Start the next search out of the current found
  Pos_Actual = Pos_End + 1
  
  // To avoid Infinite oops…
  If UserCancelled Then Exit
Loop

The example project and its two supporting .txt files:
Test Reader.zip (9.2 KB)

The first .txt file works perfectly. But this is not real life, so I added some data at the end of each line (of different lengths).

Try the "Alternate.txt" file (the second one) and see that the result is different from the one with the other file.

In my real-life project, even the start of the extracted data is wrong!

Real search:

Tag_Start: "<a name="""
Tag_End:  """></a>"

For some unknown reason, I used the image-loading line instead of the name block…

Ideas?

PS: I create the HTML data that I analyze later, so a correct search should find each and every occurrence, because they are correctly formatted.

Xojo 2025r1.1
Sequoia 15.5
MacBook 13" m1

Here is the wrong result using the “Alternate.txt” file:

Using the other test file (not “Alternate”), the result is correct, which is what I was expecting.
The lines in the other test file all have the same width.

I have not looked at the code since I posted it (20 hours ago), and I still do not see anything wrong in it. (And I changed my glasses in between.) :wink:

What is “wrong” is that you are using accented characters without setting the locale to ‘fr-FR’. Once you add the locale to IndexOf, it works correctly.

Edit: I also had to fix the start position (Pos_Actual) after this line of code:

Pos_Actual = TA_Source.Text.IndexOf("<a name=""ToC""></a>") // There is one.

as your sample text does not have a ToC anchor, so the code skipped 01.png

Hi Alberto,

Thank you for your answer.

I tried the Locale way, but failed.

I just realized that the code skips the 01.png file :wink:

I’ve made another test using a ListBox as the target, then removed all non-ASCII characters (anything outside 32-126) and ran the test: the results do not show any error.

What I do not understand is … the documentation. It says TextArea is UTF-8:

Text encoding

TextAreas store all text internally in Unicode, which is able to represent a mixture of characters from different writing systems. When you extract the text via the Text or SelectedText properties, this text is returned in UTF-8.

TextArea.SelectionLength says nothing about Locale either.

In fact, in the Desktop.TextArea page of the documentation, there is no “Locale”, and the only time an Encoding appears (besides the text above) is in sample code that loads text from disk.
And I do that while loading the text (in the Open button).

I will explore this right now, before lunch.

Here’s the result of a Locale addition:

After that result and after consulting the documentation, I gave up on the idea of using a Locale.

I think that I found the missing part:

, ComparisonOptions.None before the locale, and now it compiles. Will it achieve the goal?

BTW: IndexOf is wild: it needs one call without a position, and then it works for the other lines.

Unless you load it from a string with a different or unknown encoding, such as when you read from a file or paste text into it. Then things get weird.

The error is because you are missing the Comparison option.

Not sure why you want ComparisonOptions.None when the default is CaseInsensitive. Are you sure you want to use that one? I used CaseInsensitive.

I guess you are referring to an OutOfBoundsException that you got when you changed your code to add the comparison and locale. Is this the case?
It looks like a Xojo bug; I have not had time to file a bug report with an example yet.

You need to make sure that Position is >= 0 when you use Comparison and Locale. If Position is -1, the application breaks with an OutOfBoundsException.

If this is not clear, you can take a look at the code:

Code

// Real search fail: <a name=" | "></a>
Var Tag_Start  As String  = "<img src="""
Var Tag_End    As String  = """ border=1>"
Var Pos_Actual As Integer = 0
'Var Pos_Start  As Integer = 0 // Surprise: not used !
Var Pos_End    As Integer
Var Sel_Len    As Integer

TA_Results.Text = ""

//no need to avoid the position on first IndexOf (added 0 here)
Pos_Actual = TA_Source.Text.IndexOf(0, "<a name=""ToC""></a>") // There is one.

//check Pos_Actual as -1 creates an OutOfBoundsException
If Pos_Actual > 0 Then
  //preserve the 'skip some characters'
  Pos_Actual = Pos_Actual + 10 // Just to skip some characters !
Else
  //avoid the OutOfBoundsException when using Comparison and Locale,
  //without them Position -1 is fine
  Pos_Actual = 0
End If

Do Until TA_Source.Text.IndexOf(Pos_Actual, Tag_Start) = -1
  //to add locale we need to provide the ComparisonOptions too
  Pos_Actual = TA_Source.Text.IndexOf(Pos_Actual, Tag_Start, ComparisonOptions.CaseInsensitive,New Locale("fr-FR"))
  
  Pos_End = TA_Source.Text.IndexOf(Pos_Actual, Tag_End, ComparisonOptions.CaseInsensitive, New Locale("fr-FR"))
  
  Sel_Len = (Pos_End - Pos_Actual) - Tag_End.Length
  
  TA_Source.SelectionStart  = Pos_Actual + Tag_Start.Length
  TA_Source.SelectionLength = Sel_Len + 1
  
  TA_Results.AddText TA_Source.SelectedText + EndOfLine
  
  // Start the next search out of the current found
  Pos_Actual = Pos_End + 1
  
  // To avoid Infinite oops…
  If UserCancelled Then Exit
Loop
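
A more defensive variant of the loop above, as a sketch only and not tested against the sample project. It assumes the same controls (TA_Source, TA_Results) and the same IndexOf signature used in the code above; the names Source, Loc, Found, Pos_Start and Value_Start are made up for illustration. It searches once per pass, checks for -1 before using a position, never passes a negative start position together with a Locale, and reads the value with String.Middle instead of going through the TextArea selection. The length is computed from Tag_Start.Length, because the original Sel_Len formula is only right when Tag_End is exactly one character longer than Tag_Start: with <a name=" (9 characters) and "></a> (6 characters) it selects 4 characters too many.

```xojo
// Sketch only: tags and control names are taken from the thread,
// everything else is a hypothetical name.
Var Tag_Start  As String  = "<img src="""
Var Tag_End    As String  = """ border=1>"
Var Source     As String  = TA_Source.Text
Var Loc        As New Locale("fr-FR")
Var Found()    As String
Var Pos_Actual As Integer = 0 // never -1 when a Locale is passed

While Pos_Actual < Source.Length
  // Look for the next opening tag
  Var Pos_Start As Integer = Source.IndexOf(Pos_Actual, Tag_Start, ComparisonOptions.CaseInsensitive, Loc)
  If Pos_Start = -1 Then Exit // no more opening tags
  
  // The value begins right after the opening tag
  Var Value_Start As Integer = Pos_Start + Tag_Start.Length
  If Value_Start >= Source.Length Then Exit
  
  Var Pos_End As Integer = Source.IndexOf(Value_Start, Tag_End, ComparisonOptions.CaseInsensitive, Loc)
  If Pos_End = -1 Then Exit // opening tag without a closing tag
  
  // Length depends only on where the value starts and ends
  Found.Add(Source.Middle(Value_Start, Pos_End - Value_Start))
  
  // Continue after the closing tag
  Pos_Actual = Pos_End + Tag_End.Length
Wend

TA_Results.Text = String.FromArray(Found, EndOfLine)
```

Collecting the values in an array and writing TA_Results.Text once also avoids changing the selection of TA_Source on every pass.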

What correction are you making here?
I specifically said the text was read from disk and the encoding set to UTF-8.
I did not write a documentation text; it is a question about why my code does not work as it should.

I'll be back in 5 minutes.

I’m just saying that a TextArea is UTF8 (IF YOU TYPE INTO IT), it will not be otherwise. TextArea will not automatically transform the text into UTF8. You get back what you put in. Using DefineEncoding in this situation is incorrect, as Alberto has shown.

There are two ways to represent an accented character, for example é.
In your text, é is represented the first way:

it is stored as e followed by a combining acute accent (65 CC 81)

The other way is:

where é is a single precomposed character (C3 A9)

There are a number of posts in the forum about this, even code to ‘normalize’ the accented characters.

I have not tested it, but I guess that if your accented characters were in the second format, you might not get the problems you are having.
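
To see the two forms side by side, here is a small sketch. It assumes the &u code point literal, String.Bytes and EncodeHex behave as documented; the variable names are made up for illustration. Both strings display as é, but they are different byte sequences, so a search that compares them without any locale-aware handling may treat them as different text.

```xojo
// Sketch only: build both representations of é by hand.
Var Composed   As String = &u00E9        // precomposed é, UTF-8 bytes C3 A9
Var Decomposed As String = "e" + &u0301  // e + combining acute accent, UTF-8 bytes 65 CC 81

// Same visible character, different storage
System.DebugLog("Composed:   " + Composed.Bytes.ToString + " bytes, hex " + EncodeHex(Composed))
System.DebugLog("Decomposed: " + Decomposed.Bytes.ToString + " bytes, hex " + EncodeHex(Decomposed))
```

Dumping a suspicious line of the loaded file with EncodeHex in the same way shows which of the two forms the file really contains.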

Hope this helps.

I do not get that. No, apparently, the first IndexOf call must be:

I am now unable to find the part of the documentation where it is used.

As I remember it (I read this two or three days ago), the start position does not have to be given, since IndexOf starts the search at position 0 by default. And this gentle compiler complains when one is set (that was because I added a search for <a name="Toc" etc. without a start position).

BTW: your revised code works fine with the Alternate.txt text (with non ASCII characters).

In the project I was working on (which did not work, and that is the reason I opened this thread), I am creating the HTML code with Xojo, so I know the case of the search string (lowercase). That is why I used ComparisonOptions.None.

From the beginning, I was sure the problem came from non-ASCII characters, but with bad documentation… (I am not talking about my bad vision).

The IndexOf examples I have on screen do not use a start position value.
Of course, these are not real-life examples…

IndexOf has 2 signatures, one without startPosition and one with startPosition:

The examples for the second signature (with startPosition) do include a start position value:

Maybe you didn’t scroll down a little bit to see that while reviewing the documentation?

Issue created, as this is not consistent with other IndexOf options.
It is only triggered if we use a Locale.
#79152 - IndexOf can’t use < 0 as start position only with locale definition