When reading big text files (DXF drawing files), using Variants and their values can take much more time depending on the encoding applied to the file.
For example, with a file of more than 8,000,000 lines, doing this:
Var f As TextInputStream
Var time0, time1 As Integer
Var LVariant1, LVariant2 As Variant
Var Name As String
Var x1, y1, x2, y2 As Double

f = TextInputStream.Open(mFile)
f.Encoding = Encodings.WindowsLatin1

time0 = System.Ticks
Do
  LVariant1 = f.ReadLine
  LVariant2 = f.ReadLine
  Select Case LVariant1
  Case 6 // Linetype name
    Name = LVariant2
  Case 10 // X
    x1 = LVariant2 / 2
  Case 20 // Y
    y1 = -LVariant2 / 2
  Case 11 // X, 2nd point
    x2 = LVariant2 / 2
  Case 21 // Y, 2nd point
    y2 = -LVariant2 / 2
  End Select
Loop Until f.EndOfFile
time1 = System.Ticks

Label4.Text = Str(time1 - time0)
I get these times:
Without encoding: 103 ticks
UTF8: 107 ticks
ISOLatin1: 2940 ticks
and with ISOLatin1 but with String instead of Variant (and Val() when necessary): 142 ticks
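For reference, here is a minimal sketch of what that String + Val() variant of the loop might look like (assuming the same declarations, mFile and timing code as above):

Var LString1, LString2 As String
Do
  LString1 = f.ReadLine
  LString2 = f.ReadLine
  Select Case LString1
  Case "6" // Linetype name
    Name = LString2
  Case "10" // X
    x1 = Val(LString2) / 2
  Case "20" // Y
    y1 = -Val(LString2) / 2
  Case "11" // X, 2nd point
    x2 = Val(LString2) / 2
  Case "21" // Y, 2nd point
    y2 = -Val(LString2) / 2
  End Select
Loop Until f.EndOfFile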
I can’t tell the exact reason for the 2940 ticks, but I would just eliminate the variants. What you are doing is not necessarily wrong, but the combination of the encoding with the conversion to variants is the problem.
The slowness is coming from the repeated use of TextInputStream.ReadLine, which is doing the encoding translation each time you call it. Instead, consider doing it like this:
Var Tin As TextInputStream
Var AllData As String // ReadAll returns a single String, not an array
Var AllLines() As String

Tin = TextInputStream.Open(mFile)
Tin.Encoding = Encodings.WindowsLatin1
AllData = Tin.ReadAll
Tin.Close

AllLines = AllData.ToArray(EndOfLine)

Var LineType As String
Var LineValue As Double

For i As Integer = 0 To AllLines.LastIndex - 1 Step 2
  LineType = AllLines(i)
  LineValue = AllLines(i + 1).ToDouble
  // processing goes here
Next
This will be faster for several reasons. First, you aren’t continually hitting the file system to read the data – reading it all at once is considerably more efficient. Second, you’re only translating to WindowsLatin1 once instead of millions of times (once per ReadLine call). Third, the built-in ToArray function is going to work much faster than scanning for a delimiter in an input stream.
And as Brandon pointed out: it is good coding practice to avoid variants wherever possible, as they are designed to circumvent a lot of the very useful type checking that Xojo provides you. Generally speaking, you should choose variable types that are as restrictive as possible for the data they are intended to hold. For example, if you are storing positive numbers, choose a numeric data type that disallows negative numbers.
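As a trivial illustration of that principle (the names here are made up for the example):

// A count of points can never be negative, so UInteger is a better
// fit than Integer, and far more restrictive than Variant
Var PointCount As UInteger

// A Variant, by contrast, accepts anything and defers errors to runtime
Var Anything As Variant
Anything = "oops" // compiles fine even where a number was expected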
Variants do have their uses. However, in your example, the variant data types for the line type indicators serve no purpose because they are only ever assigned string values. Hence, they should simply be strings. The other lines can be treated more directly by utilizing (for example) ToDouble:
Var Line2 As String = "12.3456" // e.g. a value line read from the file
Var LineValue As Double
LineValue = Line2.ToDouble
I see one major problem with reading it all into memory. If the file is truly 8 million lines, the amount of memory being used could be huge. Make sure you are clearing out the things you are not using as soon as you can. I suggest writing a function that reads the file and returns the string array (see the sketch after the loop below). Also, if you’re only reading forward, remove the lines from the array as you go by using a while loop:
Var Line1 As String
While allLines.Count > 0
  Line1 = allLines(0)
  allLines.RemoveAt(0)
  // ...do other things...
Wend
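And a minimal sketch of the helper-function idea mentioned above (the function name is illustrative). Because the full file content is a local variable, it can be released as soon as the function returns and only the array survives:

Function ReadFileLines(f As FolderItem) As String()
  Var Tin As TextInputStream = TextInputStream.Open(f)
  Tin.Encoding = Encodings.WindowsLatin1
  Var AllData As String = Tin.ReadAll
  Tin.Close
  // AllData goes out of scope here; only the line array is kept
  Return AllData.Split(EndOfLine)
End Function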
In general, yes, reading 8 million lines of text into memory would be reason to pause. But consider what he’s reading: at least half of those lines are type indicators (“6”, “10”, etc.) and will be 2-3 bytes in length. Four million lines at under 3 bytes each adds up to a measly 11MB. If the remaining four million lines are 8 bytes each (imagining that they are textual numeric values), that’s still only an additional 30MB or so.
For this purpose, I think that reading into memory is just fine. The files would have to be truly epic in proportion to run into any memory issues.
That’s going to cause a huge slowdown, right? Removing AllLines(0) causes the entire array to be rearranged every time you go through the loop. It would be more efficient to go through the array from the bottom up, discarding elements as you go, and then inverting the resulting array if needed.
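A rough sketch of that bottom-up approach, assuming the pairs can be processed in reverse order (the names are illustrative):

Var LineType, LineValue As String
While AllLines.LastIndex >= 1
  // Removing from the end is cheap; no elements have to shift down
  LineValue = AllLines(AllLines.LastIndex)
  AllLines.RemoveAt(AllLines.LastIndex)
  LineType = AllLines(AllLines.LastIndex)
  AllLines.RemoveAt(AllLines.LastIndex)
  // ...process LineType / LineValue...
Wend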
However, I’d forgo doing any of this until there was a demonstrated need for it. I’m not convinced that this code is going to run into any resource constraints.
The “difference” isn’t that there is actually a difference between the encodings. The slowdown is coming from the fact that TextInputStreams are UTF-8 by default and you are asking it to convert the data to WindowsLatin1.
There’s nothing wrong with doing this, but it does incur a speed penalty. In your original code, this conversion was happening every time you executed TextInputStream.ReadLine – that’s 8 million times! One reason my code is likely to be faster is that the conversion only happens once, when TextInputStream.ReadAll is called.
By the way, if your input data consists only of lines that look like this:
6
12.3456
21
-4.5e10
…then you very likely don’t need to do the conversion at all. ASCII data is identical at the byte level between UTF-8 and almost all typical encodings, including WindowsLatin1.
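In that case, a sketch of the no-conversion variant (everything else as in the ReadAll example above):

Tin = TextInputStream.Open(mFile)
// Leave Tin.Encoding at its UTF-8 default: for pure-ASCII data the
// bytes are identical to WindowsLatin1, so no conversion pass is needed
AllData = Tin.ReadAll
Tin.Close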
OK. For starters, 142 ticks is nothing; that’s very fast.
Secondly, using String instead of Variant isn’t likely to make a huge speed difference. It’s the text encoding conversion from UTF-8 to WindowsLatin1 that is the slowdown. When you read the file as straight UTF-8, you avoid the conversion and thus the speed increases. We’re suggesting you use String instead of Variant for solid coding reasons, not for huge performance gains.
You might see an even larger performance jump if you implement my ReadAll approach.
I was curious and took the liberty of modifying your code with the changes suggested in this thread, via an LLM. How many ticks do you get with this version? (I see around 450 on Windows 11.)
Var f As TextInputStream
Var time0, time1 As Integer
Var fileContent As String
Var lines() As String
Var i As Integer
Var lineCode As String
Var lineValue As String
Var Name As String
Var x1, y1, x2, y2 As Double

f = TextInputStream.Open(mFile)
f.Encoding = Encodings.UTF8

time0 = System.Ticks

// Read entire file content at once
fileContent = f.ReadAll
f.Close

// Split content into lines array
lines = fileContent.Split(EndOfLine)

// Process lines in pairs:
// even index contains the code, odd index contains the value
For i = 0 To lines.LastIndex Step 2
  If i + 1 > lines.LastIndex Then Exit

  lineCode = lines(i)
  lineValue = lines(i + 1)

  Select Case lineCode
  Case "6" // Linetype name
    Name = lineValue
  Case "10" // X coordinate
    x1 = Val(lineValue) / 2
  Case "20" // Y coordinate
    y1 = -Val(lineValue) / 2
  Case "11" // X coordinate, 2nd point
    x2 = Val(lineValue) / 2
  Case "21" // Y coordinate, 2nd point
    y2 = -Val(lineValue) / 2
  End Select
Next

time1 = System.Ticks
Label4.Text = Str(time1 - time0)
I don’t personally use an LLM for code that’s used in production, but I like to test from time to time how far Claude.ai and similar LLMs have come.
For those of us who love to squeeze every last bit of performance out of code, I’ve packed the whole thing into a small project with a text file containing 8 million lines.
This is a great example of why you have to know what you’re doing if you’re going to use an LLM to create code. Since this code is intended for highest performance: can you look at it and identify the line of code that is completely redundant and can be removed? It does a little work on every pass, and if the file has several million lines, that will add up.
Yes, but Claude.ai is correct here, and justifies that line as follows:
This line serves as a safety check in case the file has an odd number of lines (incomplete pair at the end). Without it, accessing lines(i + 1) could cause an out-of-bounds error if the last line doesn’t have a pair.
This time I see it exactly the same way as Claude.ai.
If you really want to optimize for speed, your situation of reading a long text file and parsing it is very similar to the 1BRC: Xojo1BRC (Xojo One Billion Row Challenge)
Ok, so I’ll concede that an odd number of lines is a consideration. However, this is a pretty crude and time-consuming way to prevent an error, since the check executes needlessly on every one of the roughly 4 million loop iterations.
How about something like this, executed before the loop begins:
If Lines.LastIndex Mod 2 = 0 Then // an even LastIndex means an odd number of elements
  Lines.RemoveAt(Lines.LastIndex)
End If
That will guarantee that the Lines array only contains an even number of elements.