[Desktop] Creating a Table of Contents for html file

Emile_Schwarz · May 15, 2025, 8:25am

That is what I was doing when I fell into a trap, and now I am at its bottom.

What is wrong with the code below?


// Real search fail: <a name=" | "></a>
Var Tag_Start  As String  = "<img src="""
Var Tag_End    As String  = """ border=1>"
Var Pos_Actual As Integer = 0
Var Pos_End    As Integer
Var Sel_Len    As Integer

TA_Results.Text = ""

Do Until TA_Source.Text.IndexOf(Pos_Actual, Tag_Start) = -1
  Pos_Actual = TA_Source.Text.IndexOf(Pos_Actual, Tag_Start)
  
  Pos_End = TA_Source.Text.IndexOf(Pos_Actual, Tag_End)
  
  Sel_Len = (Pos_End - Pos_Actual) - Tag_End.Length
  
  TA_Source.SelectionStart  = Pos_Actual + Tag_Start.Length
  TA_Source.SelectionLength = Sel_Len + 1
  
  TA_Results.AddText TA_Source.SelectedText + EndOfLine
  
  // Start the next search out of the current found
  Pos_Actual = Pos_End + 1
  
  // To avoid Infinite oops…
  If UserCancelled Then Exit
Loop

The example project and its two supporting .txt files:
Test Reader.zip (9.2 KB)

The first txt file works perfectly. But this is not real life, so I added some data at the end of each line (of different lengths).

Try the "Alternate.txt" file second and you will see that the result is different from the one with the other file.

In my real-life project, even the start of the extracted data is wrong!

Real search:

Tag_Start: "<a name="""
Tag_End:  """></a>"

For some unknown reason, I used the image-loading line instead of the name block…

Ideas?

PS: I create the HTML data that I analyze later, so a correct search should find each and every occurrence, because they are correctly formatted.

Xojo 2025r1.1
Sequoia 15.5
MacBook 13" m1

Emile_Schwarz · May 16, 2025, 4:03am

Here is the wrong result using the “Alternate.txt” file:

Using the other test file (not “Alternate”), the result is correct / what I was expecting.
The lines in the other test file have all the same width.

I have not looked at the code since I posted it (20 hours ago), and I still do not see anything wrong in it. (And I changed my glasses in between.)

AlbertoD · May 16, 2025, 12:39pm

What is “wrong” is that you are using accented characters and not defining the locale to ‘fr-FR’. Once you add the locale to IndexOf it works correctly.

Edit: I also had to fix the start position (Pos_Actual) after this line of code:

Pos_Actual = TA_Source.Text.IndexOf("<a name=""ToC""></a>") // There is one.

as your sample text does not contain ToC, so 01.png was skipped.

Emile_Schwarz · May 18, 2025, 9:34am

Hi Alberto,

thank you for your answer.

I tried the Locale way, but failed.

I just realized that the code skips the 01.png file.

I ran another test using a ListBox as the target, then removed all non-ASCII characters (anything outside 32-126) and ran the test again: the results do not show any error.

What I do not understand is … the documentation. It says TextArea is UTF8:

Text encoding

TextAreas store all text internally in Unicode, which is able to represent a mixture of characters from different writing systems. When you extract the text via the Text or SelectedText properties, this text is returned in UTF-8.

TextArea.SelectionLength says nothing about Locale either.

In fact, in the Desktop.TextArea page of the documentation, there is no “Locale”, and the only time an Encoding appears (besides the text above) is in sample code, to load text from disk.
And I do that while loading the text (in the Open button).

I will explore this right now, before lunch.

Emile_Schwarz · May 18, 2025, 9:41am

Here’s the result of a Locale addition:

After that result and consulting the documentation, I gave up on the idea of using a Locale.

Emile_Schwarz · May 18, 2025, 9:54am

I think that I found the missing part:

Adding , ComparisonOptions.None before the locale makes it compile. Now, will it achieve the goal?

BTW: IndexOf is wild, as it needs one use without a position, then it works for the other lines.

Tim_Hare · May 18, 2025, 10:02am

Unless you load it from a string with a different or unknown encoding, such as when you read from a file or paste text into it. Then things get weird.

AlbertoD · May 18, 2025, 11:24am

The error is because you are missing the Comparison option.

Not sure why you want Comparison None when the default is CaseInsensitive. Are you sure you want to use that one? I used CaseInsensitive.

I guess you mean that you are getting an OutOfBoundsException when you changed your code to add the comparison and locale. Is this the case?
It looks like a Xojo bug; I have not had time to file a bug report with an example yet.

You need to make sure that Position is >= 0 when you use Comparison and Locale; if Position is -1, the application breaks on an OutOfBoundsException.

If this is not clear, you can take a look at the code:


// Real search fail: <a name=" | "></a>
Var Tag_Start  As String  = "<img src="""
Var Tag_End    As String  = """ border=1>"
Var Pos_Actual As Integer = 0
'Var Pos_Start  As Integer = 0 // Surprise: not used !
Var Pos_End    As Integer
Var Sel_Len    As Integer

TA_Results.Text = ""

//no need to avoid the position on first IndexOf (added 0 here)
Pos_Actual = TA_Source.Text.IndexOf(0, "<a name=""ToC""></a>") // There is one.

//check Pos_Actual as -1 creates an OutOfBoundsException
If Pos_Actual > 0 Then
  //preserve the 'skip some characters'
  Pos_Actual = Pos_Actual + 10 // Just to skip some characters !
Else
  //avoid the OutOfBoundsException when using Comparison and Locale,
  //without them Position -1 is fine
  Pos_Actual = 0
End If

Do Until TA_Source.Text.IndexOf(Pos_Actual, Tag_Start) = -1
  //to add locale we need to provide the ComparisonOptions too
  Pos_Actual = TA_Source.Text.IndexOf(Pos_Actual, Tag_Start, ComparisonOptions.CaseInsensitive, New Locale("fr-FR"))
  
  Pos_End = TA_Source.Text.IndexOf(Pos_Actual, Tag_End, ComparisonOptions.CaseInsensitive, New Locale("fr-FR"))
  
  Sel_Len = (Pos_End - Pos_Actual) - Tag_End.Length
  
  TA_Source.SelectionStart  = Pos_Actual + Tag_Start.Length
  TA_Source.SelectionLength = Sel_Len + 1
  
  TA_Results.AddText TA_Source.SelectedText + EndOfLine
  
  // Start the next search out of the current found
  Pos_Actual = Pos_End + 1
  
  // To avoid Infinite oops…
  If UserCancelled Then Exit
Loop
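
For comparison only, here is the same extraction loop as a minimal Python sketch (not Xojo). It slices the string that was searched instead of going through a text control's selection, so only one indexing scheme is involved. The function name `extract_between` and the sample text are made up for illustration; whether the same approach removes the offset problem in the Xojo project is an assumption, not something tested in this thread.

```python
def extract_between(text, tag_start, tag_end):
    """Return every substring found between tag_start and tag_end."""
    results = []
    pos = 0
    while True:
        start = text.find(tag_start, pos)
        if start == -1:
            break  # no more opening tags
        value_start = start + len(tag_start)
        end = text.find(tag_end, value_start)
        if end == -1:
            break  # opening tag without a closing tag
        # Slice the same string that was searched, so offsets always agree.
        results.append(text[value_start:end])
        # Resume the search just past the closing tag.
        pos = end + len(tag_end)
    return results


# Hypothetical sample with decomposed accents (e + U+0301) around the tags.
sample = (
    'Re\u0301sume\u0301 <img src="01.png" border=1> e\u0301te\u0301\n'
    '<img src="02.png" border=1> fin'
)
names = extract_between(sample, '<img src="', '" border=1>')
print(names)
```

The loop stops as soon as either tag is missing, so it needs no separate cancel check to avoid running forever.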

Emile_Schwarz · May 18, 2025, 11:29am

What correction are you making here?
I specifically said the text was read from disk and the encoding set to UTF8.
I did not write a documentation text; this is a question about why my code does not work as it should.

I will be back in 5 minutes.

Tim_Hare · May 18, 2025, 11:41am

I’m just saying that a TextArea is UTF8 (IF YOU TYPE INTO IT), it will not be otherwise. TextArea will not automatically transform the text into UTF8. You get back what you put in. Using DefineEncoding in this situation is incorrect, as Alberto has shown.

AlbertoD · May 18, 2025, 11:42am

There are two ways to represent an accented character, for example é.
In your text é is represented in one way:

it is stored as e followed by a combining acute accent (65 CC 81)

The other way is:

where é is a single precomposed character (C3 A9)

There are a number of posts in the forum about this, even code to ‘normalize’ the accented characters.

I have not tested it, but my guess is that if your accented characters were in the second format, you would not get the problems you are having.
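
The two representations can be checked in any language with Unicode support. A minimal Python sketch (not Xojo; the sample line and variable names are made up for illustration) shows that the two forms have different bytes and different lengths, so the same tag sits at a different character offset depending on which form the text uses:

```python
import unicodedata

decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # single code point for é

# The UTF-8 bytes match the two byte sequences quoted above.
print(decomposed.encode("utf-8").hex(" "))   # 65 cc 81
print(precomposed.encode("utf-8").hex(" "))  # c3 a9

# Normalizing to NFC turns the decomposed form into the precomposed one.
normalized = unicodedata.normalize("NFC", decomposed)

# Hypothetical line: accented text in front of the tag being searched.
line_nfd = "caf" + decomposed + ' <img src="01.png" border=1>'
line_nfc = unicodedata.normalize("NFC", line_nfd)

offset_nfd = line_nfd.find('<img src="')
offset_nfc = line_nfc.find('<img src="')
print(offset_nfd, offset_nfc)  # 6 5
```

Every decomposed accent before the tag shifts the offset by one more position, which would fit the symptom of results drifting further off on lines with more accented characters.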

Hope this helps.

Emile_Schwarz · May 18, 2025, 12:29pm

I do not get that. No, apparently, the first IndexOf call must be:

I am now unable to find the documentation part where it is used.

As I remember it (I read this two or three days ago), the start position does not have to be given, since IndexOf starts the search at position 0 by default. And this gentle compiler complained when one was set (that was because I added a search for <a name="Toc" etc. without a start position).

BTW: your revised code works fine with the Alternate.txt text (with non ASCII characters).

In the project I was working on (which did not work, the reason I opened this thread), I create the HTML code with Xojo, so I know the case of the search string (lowercase). That is why I used ComparisonOptions.None.

From the beginning, I was sure the problem came from non-ASCII characters, but with bad documentation… (I am not talking about my bad vision).

The IndexOf examples I have on screen do not use a Start Position value.
Of course, these are not real life examples…

AlbertoD · May 18, 2025, 4:11pm

IndexOf has 2 signatures, one without startPosition and one with startPosition:

The examples for the second signature (with startPosition) do show a start position value:

Maybe you didn’t scroll down a little bit to see that while reviewing the documentation?

AlbertoD · May 19, 2025, 3:48pm

Issue created, as this is not consistent with other IndexOf options.
It is only triggered if we use a Locale.
#79152 - IndexOf can’t use < 0 as start position only with locale definition

Emile_Schwarz · July 27, 2025, 9:58am

I put this on my plate yesterday - as a brand new project - and apparently I found the solution (locale based…).

So, with a little luck, I was able to change the code in the larger API2 project and it seems to work fine (though slowly).

Now, I have to remove the DebugLog lines…

Thank you all for your answers.