Bug with variant and some encodings?

Bonjour,

When reading big text files (DXF drawing files) using variants and their values, reading can take much more time depending on the encoding assigned to the file.

For example, with a file of more than 8,000,000 lines, doing this:

Var f as TextInputStream
Var time0, Time1 As Integer
Var LVariant1, LVariant2 As Variant
Var Name As String
Var x1, y1, x2, y2 As Double

f = TextInputStream.Open(mFile)
f.Encoding = Encodings.WindowsLatin1

time0 = System.Ticks

Do
  
  LVariant1 = f.readLine
  LVariant2 = f.readLine
  
  select case LVariant1
  case 6   //Linetype name
    Name = LVariant2
    
  case 10  //X
    x1 = LVariant2 / 2
    
  case 20  //Y
    y1 = -LVariant2 / 2
    
  case 11   //X 2ème point
    x2 = LVariant2 / 2
    
  case 21   //Y 2ème point
    y2 = -LVariant2 / 2
    
  end select
  
Loop until f.EndOfFile

time1 = System.Ticks
Label4.Text = Str(time1 - time0)

I get these times:

Without encoding: 103 ticks

UTF8: 107 ticks

ISOLatin1: 2940 ticks

and with ISOLatin1 but with String instead of Variant and Val() when necessary: 142 ticks

Is that a bug, or a mistake on my part?

Thanks.

I would use strings instead of variants as you tested.

I do that… 142 ticks.

I can’t tell the exact reason for the 2940 ticks, but I would just eliminate the variants. What you are doing is not necessarily wrong, but the combination of the encoding with the conversion to variants is the problem.

The slowness is coming from the repeated use of TextInputStream.ReadLine, which is doing the encoding translation each time you call it. Instead, consider doing it like this:

Var Tin as TextInputStream
Var AllData as String
Var AllLines() as String

Tin=TextInputStream.Open(mFile)
Tin.Encoding=Encodings.WindowsLatin1

AllData=Tin.ReadAll

Tin.Close

AllLines=AllData.ToArray(EndOfLine)

Var LineType as String
Var LineValue as Double

For i as Integer = 0 to AllLines.LastIndex - 1 Step 2
    LineType = AllLines(i)
    LineValue = AllLines(i+1).ToDouble

    // processing goes here
Next

This will be faster for several reasons. First, you aren’t continually hitting the file system to read the data; reading it all at once is considerably more efficient. Second, you’re only translating it to WindowsLatin1 once instead of countless times (once per ReadLine call). Third, the built-in ToArray function is going to work much faster than reading through a delimiter in an input stream.

Give it a shot and let us know.


And as Brandon pointed out: it is good coding practice to avoid variants wherever possible, as they are designed to circumvent a lot of the very useful type checking that Xojo provides you. Generally speaking, you should choose variable types that are as restrictive as possible for the data they are intended to hold. For example, if you are storing positive numbers, choose a numeric data type that disallows negative numbers.
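As a small illustration of that principle, a sketch using Xojo’s built-in numeric types (the variable names are just examples):

```xojo
// Prefer the most restrictive type that fits the data
Var lineCount As UInteger  // counts can never be negative
Var x As Double            // coordinates may be fractional or negative
Var code As Integer        // DXF group codes are small whole numbers
```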

Variants do have their uses. However, in your example, the variant data types for the line type indicators serve no purpose because they are only ever assigned string values. Hence, they should simply be strings. The other lines can be treated more directly by utilizing (for example) ToDouble:

Var LineValue as Double

LineValue = Line2.ToDouble

I’ve edited my original post to show this.




I see one major problem with reading it all into memory. If the file is truly 8 million lines, the amount of memory being used could be huge. Make sure you are clearing out the things you are not using as soon as you can. I suggest writing a function that reads the file and returns the string array. Also, if you’re only reading forward, remove the lines from the array as you go by using a while loop.

Var Line1 As String

While allLines.Count > 0
  Line1 = allLines(0)
  allLines.RemoveAt(0)
  // ...do other things...
Wend

In general, yes, reading 8 million lines of text into memory would be reason to pause. But consider what he’s reading: at least half of those lines are type indicators (“6”, “10”, etc) and will be 2-3 bytes in length. That adds up to a measly 11MB. If the remaining lines are 8 bytes each (imagining that they are textual numeric values), that’s still only an additional 30MB.

For this purpose, I think that reading into memory is just fine. The files would have to be truly epic in proportion to run into any memory issues.


That’s going to cause a huge slowdown, right? Removing AllLines(0) causes the entire array to be rearranged every time you go through the loop. It would be more efficient to go through the array from the bottom up, discarding elements as you go, and then inverting the resulting array if needed.

However, I’d forgo doing any of this until there was a demonstrated need for it. I’m not convinced that this code is going to run into any resource constraints.
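If memory ever did become a concern, a sketch of the bottom-up approach described above might look like this (assuming the lines can be processed in reverse order, or inverted afterward):

```xojo
// Consume the array from the end; RemoveAt(LastIndex) never
// forces the remaining elements to be shifted down.
While allLines.Count > 0
  Var line As String = allLines(allLines.LastIndex)
  allLines.RemoveAt(allLines.LastIndex)
  // process line here (lines arrive in reverse file order)
Wend
```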


Thank you all for your responses, but the question is: why is there such a big difference between UTF8 and ISOLatin1?

Cordialement.

The “difference” isn’t actually a difference between the encodings themselves. The slowdown comes from the fact that a TextInputStream is UTF-8 by default and you are asking it to convert the data to WindowsLatin1.

There’s nothing wrong with doing this, but it does incur a speed penalty. In your original code, this conversion was happening every time you executed TextInputStream.ReadLine – that’s 8 million times! One reason my code is likely to be faster is that the conversion only happens once, when TextInputStream.ReadAll is called.

By the way, if your input data consists only of lines that look like this:

6
12.3456
21
-4.5e10

…then you very likely don’t need to do the conversion at all. ASCII data is identical at the byte level between UTF-8 and almost all typical encodings, including WindowsLatin1.
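In that case, a sketch of what this could look like (assuming the file really does contain only ASCII characters):

```xojo
// For pure-ASCII data, the default UTF-8 interpretation is byte-identical
// to WindowsLatin1, so skipping the encoding assignment avoids the
// conversion (and its cost) entirely.
Var f As TextInputStream = TextInputStream.Open(mFile)
// f.Encoding is deliberately left at its default (UTF-8)
Var allText As String = f.ReadAll
f.Close
```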


But if I use String instead of Variant, and Val() when necessary, the time is 142 ticks even reading line by line… not 2940!

OK. For starters, 142 ticks is nothing; that’s very fast.

Secondly, using String instead of Variant isn’t likely to make a huge speed difference. It’s the text encoding conversion from UTF-8 to WindowsLatin1 that is the slowdown. When you read the file as straight UTF-8, you avoid the conversion and thus the speed increases. We’re suggesting you use String instead of Variant for solid coding reasons, not for huge performance gains.

You might see an even larger performance jump if you implement my ReadAll approach.

Thank you Eric, I will use ReadAll, ToArray (of String) and Val() if necessary.

Cordialement.

I was curious and took the liberty of modifying your code, applying the changes suggested in this thread, via an LLM. How many ticks do you get with this version? (I see around 450 on Windows 11) :slight_smile:

Var f As TextInputStream
Var time0, time1 As Integer
Var fileContent As String
Var lines() As String
Var i As Integer
Var lineCode As String
Var lineValue As String
Var Name As String
Var x1, y1, x2, y2 As Double

f = TextInputStream.Open( mFile )
f.Encoding = Encodings.UTF8

time0 = System.Ticks

// Read entire file content at once
fileContent = f.ReadAll
f.Close

// Split content into lines array
lines = fileContent.Split( EndOfLine )

// Process lines in pairs
// Even index contains code, odd index contains value
For i = 0 To lines.LastIndex Step 2
  If i + 1 > lines.LastIndex Then Exit
  
  lineCode = lines( i )
  lineValue = lines( i + 1 )
  
  Select Case lineCode
  Case "6" // Linetype name
    Name = lineValue
    
  Case "10" // X coordinate
    x1 = Val( lineValue ) / 2
    
  Case "20" // Y coordinate
    y1 = -Val( lineValue ) / 2
    
  Case "11" // X coordinate 2nd point
    x2 = Val( lineValue ) / 2
    
  Case "21" // Y coordinate 2nd point
    y2 = -Val( lineValue ) / 2
    
  End Select
Next

time1 = System.Ticks
Label4.Text = Str(time1 - time0)

I don’t personally use LLMs for code that’s used in production, but I like to test from time to time how far Claude.ai and similar LLMs have come. :wink:


For those of us who love to squeeze every last bit of performance out of code, I’ve packed the whole thing into a small project with a text file containing 8 million lines. :smiley:

8mlines.zip (123.7 KB)

Thank you Sasha!

I modified your code a little because lineCode could contain extra spaces, so I use Val(lineCode) and Case with integers.
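For reference, an alternative (assuming the stray characters are only leading/trailing whitespace) would be to trim the code and keep the string comparison; this fragment relies on the loop variables from the code above:

```xojo
// String.Trim strips leading and trailing whitespace, so "10 " matches "10"
Var lineCode As String = lines(i).Trim

Select Case lineCode
Case "10" // X coordinate
  x1 = Val(lines(i + 1)) / 2
End Select
```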

My code with no encodings: 106 ticks

Your code with UTF8: 49 ticks

Your code with ISOLatin1: 134 ticks

My code with string instead of variant: 139 ticks

This file of 8,000,000 lines doesn’t have much text in it.

Times with another file (4,910,000 lines) containing some text with accented characters:

My code with no encodings: 62 ticks

Your code with UTF8: 50 ticks

Your code with ISOLatin1: 96 ticks

My code with strings instead of variants: 81 ticks

My project with your code and these two files is here.

Cordialement.


This is a great example of why you have to know what you’re doing if you’re going to use an LLM to create code. Since this code is intended for maximum performance, can you look at it and identify the line of code that is completely redundant and can be removed? It does a little work, and over a file with several million lines, that adds up.

If i + 1 > lines.LastIndex Then Exit

Yes, but Claude.ai justifies it with the following reasoning:

This line serves as a safety check in case the file has an odd number of lines (incomplete pair at the end). Without it, accessing lines(i + 1) could cause an out-of-bounds error if the last line doesn’t have a pair.

This time I see it exactly the same way as Claude.ai. :wink:

If you really want to optimize for speed, your situation of reading a long text file and parsing it is very similar to the 1BRC: Xojo1BRC (Xojo One Billion Row Challenge)

You could adopt some of those techniques.


OK, so I’ll concede that an odd number of lines is a consideration. However, this is a pretty crude and time-consuming way to prevent an error, since the check executes needlessly on every pass through the loop.

How about something like this, executed before the loop begins:

If Lines.LastIndex Mod 2 = 0 Then // an even LastIndex means an odd element count
  Lines.RemoveAt(Lines.LastIndex)
End If

That will guarantee that the Lines array only contains an even number of elements.
