ReadLine failure mystery

Mike_Linacre1 · February 13, 2024, 5:52am

Here’s the problem:

2023R4 Windows using a TextInputStream:

I have a text file of about 5,000 lines. Each line is about 60 ASCII characters long and ends with CRLF (0D0A).

Input the file with .ReadLine. The 2,000th line (or so, different every time) is input with only 20 or 30 characters (again different every time).

This happens in Debug or Build mode, with or without threads active.

Could see no reason for this, so switched from TextInputStream to BinaryStream.Read() with suitable buffer management. Data is input correctly.

Has anyone met this type of problem?

Tim_Hare · February 13, 2024, 6:01am

With a file that small, I would ReadAll and then Split on CRLF. Loop through the resulting array and handle each line. It will be much faster than reading line by line.

But no, I’ve never seen that problem.

Jean-Yves_Pochez · February 13, 2024, 7:23am

Bad text encoding, for sure?
Is some "ASCII" character not really ASCII?

Sascha_S · February 13, 2024, 7:56am

Do you know the Encoding of the File and do you set it correctly in your Code?
Can we see your Code please?

Emile_Schwarz · February 13, 2024, 8:43am

Better: can you share a simple project with the text file ?

MarkusR · February 13, 2024, 12:16pm

Is there maybe an ASCII code 13, 10 or 0 inside the row, before the normal line end?
Have you looked into the file with Notepad++ or a hex editor?
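
A minimal sketch of that check done from Xojo itself rather than in a hex editor. It assumes the file is expected to contain only CRLF line endings, reads it in 64 KB chunks with a BinaryStream, and logs the byte offset of any NUL, any CR not followed by LF, and any LF not preceded by CR. The variable names and the chunk size are illustrative and not taken from the original project.

```xojo
// Hypothetical diagnostic: find stray CR, LF or NUL bytes in a CRLF text file.
Var f As FolderItem = FolderItem.ShowOpenFileDialog("text/plain")
If f = Nil Then Return

Var bs As BinaryStream = BinaryStream.Open(f, False) // open read-only
Var offset As Integer = 0          // absolute offset of the byte being examined
Var prevWasCR As Boolean = False   // carried across chunk boundaries

While Not bs.EndOfFile
  Var data As String = bs.Read(65536)
  If data.Bytes = 0 Then Exit
  Var chunk As MemoryBlock = data  // byte-level access to the chunk
  
  For i As Integer = 0 To chunk.Size - 1
    Var b As Integer = chunk.Byte(i)
    
    If prevWasCR And b <> 10 Then
      Var crPos As Integer = offset - 1
      System.DebugLog("CR without LF at byte " + crPos.ToString)
    End If
    
    If b = 10 And Not prevWasCR Then
      System.DebugLog("LF without CR at byte " + offset.ToString)
    ElseIf b = 0 Then
      System.DebugLog("NUL at byte " + offset.ToString)
    End If
    
    prevWasCR = (b = 13)
    offset = offset + 1
  Next
Wend

If prevWasCR Then System.DebugLog("CR without LF at end of file")
bs.Close
```

If nothing is logged, the line endings in the file are consistent and the short lines would have to come from somewhere else.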

Sascha_S · February 13, 2024, 12:21pm

As @Tim_Hare said.

Read the whole file (using the correct Encoding, e.g. “t.Encoding = Encodings.UTF8”)
Replace all LineEndings
Split by LineEndings
Process the Array/Dictionary/etc…

For Each s As String In myArray
   ...
Next

Rick_Araujo · February 13, 2024, 12:30pm

If this is a Xojo bug, please don’t try to work around it before submitting a report with a sample showing the problem.

Before opening the Issue report you could share the sample here for community investigation. Once proven it’s a Xojo bug, you should submit the report for a fix.

Sascha_S · February 13, 2024, 12:36pm

I don’t believe in a bug yet.

@Mike_Linacre1: Try something like the following with your Text File please:

Var f As FolderItem
Var textInput As TextInputStream

f = FolderItem.ShowOpenFileDialog("text/plain")
If f <> Nil Then
  
  If f.Exists Then
    
    Try
      
      textInput = TextInputStream.Open(f)
      
      Var strInput As String = textInput.ReadAll(Encodings.UTF8)
      strInput = strInput.ReplaceLineEndings(EndOfLine.UNIX)
      
      Var sArray() As String
      sArray = strInput.Split(EndOfLine.UNIX)
      
      If sArray.Count > 0 Then
        
        ListBox1.AddAllRows sArray
        
      End If
      
    Catch e As IOException
      MessageBox("Error accessing file...")
    End Try
    
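    // Open may have raised before textInput was assigned, so guard against Nil
    If textInput <> Nil Then textInput.Close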
    
  End If
  
End If

*The above Code was taken in parts from the Documentation…

Rick_Araujo · February 13, 2024, 12:38pm

I don’t work with beliefs, that’s why I asked for a sample proving the allegation. But I also can’t assume that there’s no bug.

Mike_Linacre1 · February 13, 2024, 12:42pm

Thanks everyone. I generated many similar input files and looked at them with a hex editor. Every line looks the same except for different numbers. They all behaved the same. And yes, .ReadAll would work for these small files, but they are test data for the real files which can be several GB large.

Here is the code, omitting all the irrelevant processing of the input line.

var fitem as folderitem
var fstream as textinputstream // declared here so it is still in scope for the loop below
fitem = getfolderitem(fname)
if fitem<>Nil then
  If fitem.Exists Then
     fstream = textinputstream.Open(fitem)
     fstream.encoding = Encodings.UTF8
  end if
end if
......
while not fstream.EndOfFile
    var s as string = fstream.ReadLine
    ......
wend

It is strange because this code has been used in production for several months without problem, then suddenly this occurred.

Sascha_S · February 13, 2024, 12:47pm

The above Code looks fine. Maybe it’s not (always) UTF-8 or there are overlooked exceptions to the 0D0A/CRLF “rule”?

But it really sounds like the Encoding is the issue.

Rick_Araujo · February 13, 2024, 1:05pm

I’ve created a sample file with 5000 lines of 60 chars as

123456789012345678901234567890123456789012345678901234567890

And it passed this test:

var f as new folderitem("C:\Users\Rick\Desktop\txt5000\txt5000x60.txt")

If not f.exists then
  MessageBox "txt not found"
  Quit
End

var fs as textinputstream = textinputstream.Open(f)

fs.encoding = Encodings.UTF8

var numLines As Integer = 0

Do until fs.EndOfFile
  var s as string = fs.ReadLine
  If s.Length = 0 Then Continue // Ignore empty ones
  numLines = numLines + 1
  If s.Length <> 60 Then
    MessageBox "Error! line(" + numLines.ToString + ") " + s
    Quit
  End
Loop

MessageBox "Done OK. Read " + numLines.ToString + " data lines"

Quit

So I guess your problem is in your data.

Sascha_S · February 13, 2024, 1:19pm

Did the same with a 3.5GB file (49.500.000.000 Lines with CR/LF ending) without any issues.

(I should mention that I did it in the past with my own project and that the processing took more than 2 days on a weak Windows 10 machine.)

Emile_Schwarz · February 13, 2024, 1:31pm

Thus the need to create a simple project to demonstrate the behavior WITH the text file.

Ian_Kennedy · February 13, 2024, 1:34pm

One thing comes to mind. The above code is slower than it needs to be and could also be causing an issue in terms of memory utilisation. By declaring the variable inside the while loop, you are destroying and recreating it every time the loop iterates. Not only does this slow the code down, it could be causing an issue for the memory manager.

The following code will only create the string variable once and then change its contents each time the loop iterates.

var s as string
while not fstream.EndOfFile
    s = fstream.ReadLine
    ......
wend

It should be faster and has less chance of old strings not being cleared up during the loop. The old code should be perfectly safe, but, perhaps something is failing to free up the memory and resulting in problems? Just a thought. If it does solve the problem then it would seem like a bug in Xojo that would be worth reporting.

Another question comes to mind. Do you have access to a hexdump program? I know there is one as standard on Mac and Linux, not sure about Windows. If you use it on your files you should be able to spot any “odd characters” within the file that could cause issues. Given that your original post suggested the issue occurs in a different place on each run, the first option is perhaps more likely. [Actually, you say you’ve done this]

I suppose another option is that the hard drive is starting to fail and giving bad data from time to time.

Rick_Araujo · February 13, 2024, 1:56pm

None of those hypotheses is realistic.

A user code bug, referencing an out-of-scope variable, would end as an error at compile time. So no runtime errors. As you said, moving the instantiation into the loop could penalize speed… a tiny bit. That’s it.

Unrecoverable hard drive / SSD failures would raise exceptions. If recoverable, the read would just be OK.

His source data is damaged. The events that produced it made it this way. It’s not a reading problem.

Ian_Kennedy · February 13, 2024, 2:04pm

I would love to live in such a world. But I’ve seen hard drives, when failing, do the most odd things over the years. Including non-deterministic results for a given sector. Times where it reads and reads a sector attempting to get a result and then “succeeds” in reading but with somewhat random results. Yes, unrecoverable failures would result in an exception, however, “recoverable” issues, that still fail to read correctly can occur in the early stages of drive issues.

We have seen other issues where memory cleanup within a loop fails to take place until the loop is exited. That was a problem with Date objects not being destroyed until the loop ends. This “could” be happening here, but I agree it would be a bug and not normal.

And yet the description says he’s looked at the data with a hex editor and it is OK. Also the original description is random failures at different points with each run. Data format issues would not produce that result?

@Mike_Linacre1 For clarity. Does a single file process in the same way each time or does it change each time you run?

Rick_Araujo · February 13, 2024, 2:46pm

Send us a copy. Let us inspect it.

If we can’t find a failure, probably there are more issues on his “complete code”.

If we can’t rule it out here, he can open a private issue for Xojo inspection.

I’m not ruling out some possible Xojo bug triggered by some very specific combination of factors.

Arnaud_N · February 13, 2024, 4:07pm

Yes, but since you’re in a debug stage anyway, you could try that suggestion and inspect the resulting array. If you have fewer entries in the array than lines in your file, you’ll know the error doesn’t relate to the use of ReadLine. You could then even examine the entries whose length is greater than the average and spot the difference.

If you see as many entries in the array as you have lines in the file, the issue would logically point to the use of ReadLine. Definitely worth trying the suggestion of ReadAll and Split for this debug phase, is it not?
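
A minimal sketch of that comparison, assuming a small test file that fits in memory and is UTF-8 encoded. It reads the file once with ReadAll and Split, then again with ReadLine, and logs the first line where the two disagree. The variable names are illustrative only.

```xojo
// Hypothetical debug check: compare ReadAll + Split against ReadLine.
Var f As FolderItem = FolderItem.ShowOpenFileDialog("text/plain")
If f = Nil Then Return

// Pass 1: whole file at once, split into lines.
Var tis As TextInputStream = TextInputStream.Open(f)
Var wholeText As String = tis.ReadAll(Encodings.UTF8)
tis.Close
wholeText = wholeText.ReplaceLineEndings(EndOfLine.UNIX)
// Note: a file ending in a line break yields one extra, empty last element.
Var expected() As String = wholeText.Split(EndOfLine.UNIX)

// Pass 2: line by line, compared against pass 1.
tis = TextInputStream.Open(f)
tis.Encoding = Encodings.UTF8
Var lineNo As Integer = 0
While Not tis.EndOfFile
  Var s As String = tis.ReadLine
  If lineNo > expected.LastIndex Then
    System.DebugLog("ReadLine returned more lines than Split")
    Exit
  End If
  If s.Compare(expected(lineNo), ComparisonOptions.CaseSensitive) <> 0 Then
    System.DebugLog("Mismatch at line " + lineNo.ToString + _
      ": ReadLine length " + s.Length.ToString + _
      ", Split length " + expected(lineNo).Length.ToString)
    Exit
  End If
  lineNo = lineNo + 1
Wend
tis.Close
System.DebugLog("Compared " + lineNo.ToString + " lines")
```

A mismatch where ReadLine returns the shorter string would point at ReadLine. If both passes agree on a short line, the data is the more likely culprit.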