Large CSV file to ListBox & 2D Array

Hello,

I need to work with some large csv files. Some of these files have columns exceeding several hundred and rows exceed several thousand. I am moving these needs off Excel VBA and onto XOJO. At this time Excel VBA is faster so I need to make some improvements

Once I load these files I will make modifications and then create another csv file and also create a MySQL CREATE/UPDATE capability.

But first improving the speed of bringing in csv file so I can then perform other work

I am using the following code to separate and placed data into a 2D string array

[code] tis = TextInputStream.Open(f) ’ read file into grid
while not tis.EOF ’ while not end-of-file
ListBox2.AddRow “” ’ add row to grid
s=tis.ReadLine ’ read line from file
fields=Split(s,",") ’ put items in fields() array
currentRow = currentRow + 1 ’ increment 2D array row pointer

Redim my_2d_array1(currentRow, Ubound(fields))
for i=0 to ListBox2.ColumnCount-1                              ' copy to grid
  ListBox2.Cell(ListBox2.ListCount-1,i)=Trim(fields(i))        ' Fill listbox to limit of columns
  my_2d_array1(currentRow, i) = fields(i)                      ' Fill 2D Array second element
next

wend
[/code]

That is painfully slow with a file that has more than 400 columns and 300 rows. It is excruciatingly slow when I grab a file that is thousands of lines long. and hundreds of more columns long.

This construct works quickly

fields=Split(s,",") 

However it is only good for a 1D array

So I tried to do something like this

 my_2d_array1(currentRow, i) = Split(s,",")

However I could not find a way to do this on a full row due to the need for the second element and there is no ADD or APPEND for a multi dimensional array

So is the 2D array the best way to bring in large files?
Is there a way to speed up multi dimensional arrays input

Or what so I can speed this up and leave Excel VBA in the dust.

Thanks

Carl

I made the following changes to enable grabbing all columns into the 2D array

tis = TextInputStream.Open(f)                                      ' read file into grid
  while not tis.EOF                                                ' while not end-of-file
    ListBox2.AddRow ""                                             ' add row to grid
    s=tis.ReadLine                                                 ' read line from file
    fields=Split(s,",")                                            ' put items in fields() array
    currentRow = currentRow + 1                                    ' increment 2D array row pointer
    
    Redim my_2d_array1(currentRow, Ubound(fields))
    for i=0 to  UBound(fields)                                     ' copy to grid
      If i <  ListBox2.ColumnCount-1    Then
        ListBox2.Cell(ListBox2.ListCount-1,i)=Trim(fields(i))      ' Fill listbox to limit of columns
      End
      my_2d_array1(currentRow, i) = fields(i)                      ' Fill 2D Array second element
    next
  
  wend

redimming on every row is going to slow you down

since you cannot simply append to a 2d array I’d suggest NOT using a 2d array but a class that behaves like one that you CAN just append to. designing such a class is a grey exercise in learning class design & data type implementation.

you might make the listbox NOT visible while you fill it - sometimes this has surprising boosts to performance

My guess is that it’s the constant Redim’ing of the array that’s taking a lot of time. What if you look through the file first to get a line count, then redim accordingly?

I also made some other adjustments.

dim lastRowIndex as integer = -1
dim lastColumnIndex as integer = ListBox2.ColumnCount - 1

tis = TextInputStream.Open( f )
while not tis.EOF
  s = tis.ReadLine
  lastRowIndex = lastRowIndex + 1
wend
tis = nil

redim my_2d_array( lastRowIndex, ListBox2.ColumnCount - 1 )

tis = TextInputStream.Open(f)                                      ' read file into grid
  while not tis.EOF                                                ' while not end-of-file
    ListBox2.AddRow ""                                             ' add row to grid
    s=tis.ReadLine                                                 ' read line from file
    fields=Split(s,",")                                            ' put items in fields() array
    currentRow = currentRow + 1                                    ' increment 2D array row pointer
    
    for i=0 to  lastColumnIndex                                     ' copy to grid
      ListBox2.Cell(currentRow,i)=Trim(fields(i))      ' Fill listbox to limit of columns
      my_2d_array1(currentRow, i) = fields(i)                      ' Fill 2D Array second element
    next
  
  wend

This is a different approach: process the file and store the column separator positions and the line break positions immediately after reading the file, but extract the values later on only when needed.

Remarks:

  • The code is not very well tested.
  • Since I only use it with “self formed” CSV files, some things are not considered (empty lines don’t work; their must be a “record separator” after the last record, and so on…)

[code]Class CSVImport

Private mColumnCount As Integer
Private mColumns() As Integer
Private mLines() As Integer
Private mMemoryBlock As MemoryBlock

Private Sub Constructor()
// prevent creation without FolderItem
End Sub

Sub Constructor(file As FolderItem)
Static encoding As TextEncoding = Encodings.UTF8
Static doubleQuote As UInt8 = Asc("""")
Static recordSeparator As UInt8 = Asc(EndOfLine)
Static columnSeparator As UInt8 = Asc(",")

Dim tis As TextInputStream = TextInputStream.Open(file)
mMemoryBlock = tis.ReadAll(encoding)
tis.Close()

Dim inQuotedString As Boolean = False
Dim ub As Integer = mMemoryBlock.Size - 1

// Get the count of columns
For i As Integer = 0 To ub
Dim char As UInt8 = mMemoryBlock.UInt8Value(i)
If char = doubleQuote Then
If inQuotedString And i < ub And mMemoryBlock.UInt8Value(i + 1) <> doubleQuote Then
inQuotedString = False
Else
inQuotedString = True
End
ElseIf Not inQuotedString Then
If char = columnSeparator Then
mColumnCount = mColumnCount + 1
ElseIf char = recordSeparator Then
mColumnCount = mColumnCount + 1
Exit
End
End
Next

// Get the offset of all columnSeparator and all recordSeparator positions
For i As Integer = 0 To ub
Dim char As UInt8 = mMemoryBlock.UInt8Value(i)
If char = doubleQuote Then
If inQuotedString And i < ub And mMemoryBlock.UInt8Value(i + 1) <> doubleQuote Then
inQuotedString = False
Else
inQuotedString = True
End
? ElseIf Not inQuotedString Then
If char = columnSeparator Then
mColumns.Append(i)
ElseIf char = recordSeparator Then
mColumns.Append(i)
mLines.Append(i)
End
End
Next
End Sub

Function Value(row As Integer, column As Integer) As String
Dim index As Integer = mColumns.IndexOf(mLines(row))
Dim pos1 As Integer = mColumns(index + column) + 1
Dim pos2 As Integer = mColumns(index + column + 1)
Return mMemoryBlock.StringValue(pos1, pos2 - pos1).DefineEncoding(Encodings.UTF8)
End Function

Property ColumnCount As Integer
Get
Return mColumnCount
End Get
End Property

Property RowCount As Integer
Get
Return mLines.Ubound
End Get
End Property

End Class[/code]

or you could use my csv file reader http://www.great-white-software.com

Although you did not mention it in the body of your post, I noticed that you mention the listbox in the title as well as hundreds of columns in the post…

In case you forgot just a reminder that the Xojo listbox can only have 64 columns

Karen: Yep - The list box limit is something I would l really like to find a solution for. No matter how much you do in code people are mainly from Missouri and would like to see the data just like they do in Excel.

Norm & Ken: The ReDim definitely was causing significant delay. I have just modified code to address this and for files with more than 3000 rows and several hundred columns the work is completed in less than 2 minutes verses hours. Thanks

Eli: It looks like you suggest something along the lines Norm was suggesting and it is something I need to take a serious look as since I will be dealing with files like this for a while.

Norm: I just took a look at your CSV parser and I think I am going to take longer look at it.

I also noticed you have a Update UI from Thread which may be helpful. I have a ProgressBar and have tried to use the Timer and Task examples to update the ProgressBar but neither works while file IO is occurring. Could be something I am doing.

[quote=78263:@Carl Fitzsimmons]Karen: Yep - The list box limit is something I would l really like to find a solution for. No matter how much you do in code people are mainly from Missouri and would like to see the data just like they do in Excel.
[/quote]
Won’t happen anytime soon
The listbox is a beast of a code base to deal with hence why its still stuck at 64 columns
There are some other alternatives like Einhugurs that can handle more columns & do some other interesting things like styling etc

Norman,

Bummer on the re-code. Sounds like a large but important effort. I am looking at Einhugurs Plugins and that may offer a solution

Thanks to everyone who responded - have a great day