Read text data with BinaryStream until a delimiter

As the subject says:

I want to read text data with BinaryStream until a delimiter. Currently, the delimiter (for me) is quote+EndOfLine, but anyone who wants to do this could have a different delimiter (or set of delimiters).

The idea is to open the text file as a BinaryStream (BS), then load the data 1024 bytes at a time and search each block until I find a Quote followed by an EndOfLine character. I must also remember to set BS.Position to the first character past my delimiter for the next “record”.

Any better idea is welcome.

[code]Public Function ReadUntil(extends bs as BinaryStream, endchar as Integer) as String
// Reads a string until the character endchar is found

dim res as string
dim car as integer
dim found as boolean

found = false
res = ""
do
car=bs.ReadUInt8
if (car=endchar) then
found=true
else
res=res+chr(car)
end if
loop until ((found=true) or (bs.EOF=true))

return res
End Function
[/code]

Another useful one:

[code]Public Function ReadBetween(extends ts as string, startstring as string, endstring as string) as String
' Reads the string found between the two given strings
' e.g. ReadBetween("bonjour tous","jour","tous") returns " "

dim res as string
dim o1,o2,o3 as integer

o1 = InStr( ts, startstring)
if o1=0 then // InStr returns 0 when startstring is not found
res = ""
else
o3 = len(startstring)
o2 = InStr( o1+o3, ts, endstring)
if o2=0 then // endstring not found after startstring
res = ""
else
res = mid( ts, o1+o3, o2-o1-o3)
end if
end if

return res

End Function
[/code]

That ReadUntil function would be really slow. First, you’re reading a byte at a time. Second, you are concatenating each byte to a string.

Also, it wouldn’t work in this case since Emile is looking to match more than one character.

I recommend something like this (untested):

dim arr() as string

dim lastBytes as string
while not bs.EOF
  dim chunk as string = bs.Read( 1000000 ) // Or some other arbitrary value
  dim pos as integer = InStrB( lastBytes + chunk, delimiter )
  if pos <> 0 then
    arr.Append chunk.LeftB( pos - lastBytes.LenB - 1 )
    exit while
  else
    lastBytes = chunk.RightB( delimiter.LenB - 1 )
    arr.Append chunk
  end if
wend

return join( arr, "" )

If the delimiter is more than one character, as in this case, this ensures that a Read won’t split the delimiter, making it impossible to find.
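One case the untested snippet above may not cover is a delimiter that really does straddle two reads: the bytes that begin it have already been appended to the array by the time the match is found. Here is a minimal sketch of one way around that, holding back the tail of each chunk instead of appending it right away. The function name and the `carry` variable are my own (hypothetical), the delimiter is assumed to be non-empty, and like the original this is untested.

[code]Function ReadUntilDelimiter(bs As BinaryStream, delimiter As String) As String
  // Hypothetical helper; assumes delimiter is not empty.
  Dim parts() As String
  Dim carry As String // tail of the previous chunk that could begin a delimiter
  
  While Not bs.EOF
    // carry holds the bytes just before the new data, so the end of
    // chunk always lines up with the current bs.Position
    Dim chunk As String = carry + bs.Read( 1000000 )
    Dim pos As Integer = InStrB( chunk, delimiter )
    If pos <> 0 Then
      parts.Append LeftB( chunk, pos - 1 )
      // Rewind to the first byte after the delimiter for the next call
      bs.Position = bs.Position - ( LenB( chunk ) - ( pos - 1 ) - LenB( delimiter ) )
      Return Join( parts, "" )
    End If
    // Hold back the last LenB(delimiter) - 1 bytes: they might be the
    // start of a delimiter that is completed by the next read
    Dim keep As Integer = Min( LenB( delimiter ) - 1, LenB( chunk ) )
    parts.Append LeftB( chunk, LenB( chunk ) - keep )
    carry = RightB( chunk, keep )
  Wend
  
  // No delimiter before the end of the file: return everything that is left
  parts.Append carry
  Return Join( parts, "" )
End Function
[/code]

Only `carry` (at most a few bytes) is concatenated with each chunk, and because Position is moved back after a match, the function can be called again to get the next record.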

Edited (found a better example)
I knew I had some code for this, but it took some digging.

Function ReadDelimited(input As TextInputStream, delimiter As String) As string
  ' Reads a text input stream up to and including the specified delimiter text
  ' then returns the text string. If end of file is encountered before delimiter is
  ' found then all remaining text is returned.
  Static buffer As string=""
  dim result As string
  dim posn As Integer
  do
    posn = InStr(buffer,delimiter)
    if posn>0 then
      result=Left(buffer,posn+delimiter.len-1)
      buffer=Mid(buffer,posn+delimiter.len)
      return result
    elseif input.EOF then
      result=buffer
      buffer=""
      return result
    else
      buffer=buffer+input.Read(10000)
    end if
  loop
End Function

I checked that it does work correctly for multi character delimiters. Will need to change text stream to binary stream.

Kem, do you really gain any efficiency using arrays considering that you still end up concatenating strings inside the InStr function?

My experience is that it is faster to use a MemoryBlock and to collect the starting positions of all delimiter occurrences:

[code] Dim delimiter As MemoryBlock = """" + EndOfLine.UNIX
Dim delimiterLength As Integer = delimiter.Size

Dim fi As FolderItem = GetFolderItem("…", FolderItem.PathTypeNative)

If fi Is Nil Then Return // Error
If Not fi.Exists Then Return // File does not exist

Dim bis As BinaryStream = BinaryStream.Open(fi, False)
bis.LittleEndian = True
Dim mb As MemoryBlock = bis.Read(1024, Encodings.UTF8)
bis.Close()

Dim delimiterPositions() As Integer

For index As Integer = 0 To mb.Size - delimiterLength
If mb.Byte(index) = delimiter.Byte(0) Then
// Only check for the first byte of the delimiter
If mb.StringValue(index, delimiterLength) = delimiter.StringValue(0, delimiterLength) Then
// The entire delimiter is found
delimiterPositions.Append(index)
End
End
Next

…[/code]

You’re right about that, and I didn’t test it. I assume you’ll get some speed benefits because you’re concatenating smaller amounts, but maybe not.

Honestly, the fastest way to handle this is to read the whole file into memory and Split it by the delimiter. I can envision a ReadBuffer class that will handle large files something like this:

Function ReadNext() As String
  dim result as string
  if NextIndex <= Data.Ubound then
    result = Data( NextIndex )
    NextIndex = NextIndex + 1
  elseif not Stream.EOF then
    dim chunk as string = Stream.Read( 1000000 )
    Data = chunk.SplitB( Delimiter )
    if Data.Ubound > 0 then
      Stream.Position = Stream.Position - Data( Data.Ubound ).LenB
      Data.Remove Data.Ubound
    end if
    result = Data( 0 )
    NextIndex = 1    
  end if
  return result
End Function

Such a class would be constructed with the file and the delimiter and NextIndex and Data would be properties of the class.

You could also avoid the string concatenation of my first example by manipulating Position as I did in the code above.
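For what it's worth, the constructor of such a ReadBuffer class might look something like the sketch below. The property list follows the description above; the parameter names are hypothetical, and error handling is left out (opening the stream can fail, for example if the file is missing or locked).

[code]// Properties assumed on the ReadBuffer class:
//   Stream As BinaryStream
//   Delimiter As String
//   Data() As String
//   NextIndex As Integer

Sub Constructor(f As FolderItem, delim As String)
  Stream = BinaryStream.Open( f, False ) // read-only
  Delimiter = delim
  NextIndex = 0 // Data starts empty, so the first ReadNext reads from the stream
End Sub
[/code]

You would then create it with something like Dim rb As New ReadBuffer( f, Chr(34) + EndOfLine ) and call rb.ReadNext in a loop. Note that ReadNext as written returns an empty string once the data is exhausted, so an empty record cannot be told apart from the end of the file without an extra check.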

Honestly, I wrote my ReadUntil method a long time ago and I am not really using it anymore.
Now I SplitB the file to import into lines, then SplitB each line into fields, which I store in an array before processing them.
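That two-level split could be sketched roughly as follows. The variable names are hypothetical, `data` is assumed to already hold the whole file, and the tab used as field separator is only an assumption for the example.

[code]// data: the whole file, already read into a String (assumption)
Dim rows() As String = SplitB( data, EndOfLine )

For Each row As String In rows
  // Tab-separated fields are an assumption; use your own separator
  Dim fields() As String = SplitB( row, ChrB( 9 ) )
  // ... process fields here ...
Next
[/code]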

Thank you all for your answers. My mind was blank yesterday, and stayed that way until Kem's answer opened it up. Read what I wrote below.

Kem's answer inspired the following code.
TA_Data is a TextArea to watch data for debug purposes
FT_Text is a File Type Set (with RAW TEXT and CSV).

I tested it with a 520KB csv file and the data appears as soon as the dialog closes (fast, very fast).

My current use is to keep Returns in csv fields (fields are surrounded by quotes, separated with comma, ended with EndOfLine).

This I do not know how to do. Care to expand, Kem?

Jean-Yves:
I used TextInputStream.ReadLine until I added a Return inside one field: ReadLine does not work in that case (IMHO). That is the reason why I want to read until my (sort of) EndOfLine (the closing quote of the last field + EndOfLine).

[code] //
// Read Text Data until Delimiter
//
Dim OpenDlg As New OpenDialog
Dim OpenFI As FolderItem
Dim OpenBS As BinaryStream
Dim Delimiter As String
Dim DataArr() As String

// 0. Set the Read Delimiter String
Delimiter = Chr(34) + EndOfLine

// 1. Let the user choose a text file
#If Not TargetLinux Then
OpenDlg.InitialDirectory = SpecialFolder.Documents.Parent.Child("Downloads")
#Else //open Home directory on linux
OpenDlg.InitialDirectory = SpecialFolder.Home
#Endif

OpenDlg.Title = "Select a Text file"
OpenDlg.Filter = FT_Text.All
OpenFI = OpenDlg.ShowModal
If OpenFI = Nil Then
// User Cancelled
Return
End If

// 2. Get a BinaryStream Reference
OpenBS = BinaryStream.Open(OpenFI, False)
If OpenBS = Nil Then
MsgBox "Read Data Until Delimiter" + EndOfLine + EndOfLine + _
"An error occurred while I was trying to get a BinaryStream."
Return
End If

// 3. Get the whole data into an array
// DataArr.Append Split(OpenBS.Read(OpenFI.Length), Chr(34) + EndOfLine)
DataArr = Split(OpenBS.Read(OpenFI.Length,Encodings.UTF8), Delimiter)

// 4. Report its contents into the TextArea (TextArea)
TA_Data.Text = Join(DataArr , Delimiter)[/code]

Is the code above OK?

PS: I have to adapt the code to insert the result into a Listbox, but I have the first steps of the stairs :)

Done.

Thank you all.

I removed the TA_Data.Text = line, then I compute the number of entries in the array, take DataArr(0) as the Listbox Heading, and in a loop, I LB.AddRow DataArr(LoopIdx)…
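For anyone following along, those steps could look roughly like the sketch below. It assumes LB is a Listbox with HasHeading turned on and that DataArr was filled by the Split shown earlier; each record goes into the first column as one string, so splitting a record into its fields is left out.

[code]// DataArr comes from the Split above; LB is assumed to be a Listbox on the window
If DataArr.Ubound >= 0 Then
  // The first record is used as the heading
  LB.Heading( 0 ) = DataArr( 0 )
  
  // The remaining records become rows
  For LoopIdx As Integer = 1 To DataArr.Ubound
    LB.AddRow DataArr( LoopIdx )
  Next
End If
[/code]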

Niiiice. Now I can Read / Save multilines csv files into/from Listbox.

For completeness, I thought of another way to do this.

Public Function ReadField (Extends stream As BinaryStream, delimiter As String) as String
  if delimiter = "" then
    //
    // Raise an exception or something
    //
  end if
  
  dim result as string
  
  dim initialPosition as integer = stream.Position
  
  dim block as string
  dim pos as integer
  
  while not stream.EOF and pos = 0
    block = block + stream.Read( 32 * 1024 )
    pos = block.InStrB( delimiter )
  wend
  
  if pos <> 0 then
    result = block.LeftB( pos - 1 )
    stream.Position = initialPosition + result.LenB + delimiter.LenB
  end if
  
  return result
  
End Function

You’d call this like item = bs.ReadField( ChrB( 9 ) ).

But if you need to do this repeatedly, the fastest way is still to read the whole file into memory, if you can, and parse it.

One correction:

  if pos <> 0 then
    result = block.LeftB( pos - 1 )
    stream.Position = initialPosition + result.LenB + delimiter.LenB
  else
    result = block
  end if