Split with escaped items?

I normally use Split to take the contents of a CSV and fill an array. It works fine until my CSV data contains a CrLF as part of a field.

Sort of like this:

FullFile as String = Readall from my CSV
Items as String() would = Split(FullFile,CrLF)

Does anyone have a replacement for Split that would recognize that a CrLf is inside of quotes and not treat it as a delimiter?

my Tadpole SQL editor needed exactly that feature… turns out it can be done, but you need to scan the file data character by character, checking for quoted bound (a CR or LF INSIDE of quotes is data), outside of quotes is the line/record delimiter

while on that subject there are other character sequences that need to be possibly dealt with… the data may contain escape sequences “\0x123” that you would want to convert … so CSV is more complicated that Split by lineend, then split by delimiter (because the delimiter COULD also be inside quotes) … (plus does you file use Single or Double quotes? are they escaped?)

Not meaning to rain on your parade… just pointing out some things you need to be aware of.

Look on this forum. A while back Norman wrote something like this, Simon Berridge and I both created versions that were a bit faster and added some additional checks… I think we both posted our solutions (it would be a year or so back)

I posted a method in the Just Code challenge: CSV File Editor
It uses 3 levels of the Split function, first using the quote character as the delimiter, then the EndOfLine as delimiter, and finally the comma as a delimiter. This is the code for the actual CSV parsing function:

[code]Public Function ImportCSV(csvText As TextInputStream) as DataRecord()
'Parses comma delimited TextInputStream into an array of “record” arrays.

'Return type from this routine is an array of type DataRecord.
'DataRecord is a class containing nothing but a string array property: dataField()
'So, an array of DataRecord is essentially a general two dimensional array
'that can have a variable number of rows and columns.
dim outData() as DataRecord
dim delimField As string = “,”
dim delimQuote As string = chr(34)
dim rawInput,FieldData As String
columnCount = 0
While not csvText.EOF
'Read a line of text
rawInput = csvText.ReadLine
'Read more if pending line has embedded line endings
While (max(1,CountFields(rawInput,delimQuote)) mod 2=0) 'While quote parity is odd…
if csvText.EOF then 'Big trouble!
MsgBox “Encountered EOF while processing quoted text. Closing quote is missing.”
'Could handle this by returning outData as is,
'or add a closing quote to the last record.
'Or…
return nil 'which means bad file data regardless.
end if
rawInput = rawInput + EndOfLine + csvText.ReadLine
Wend
’ ********** Start new record
outData.Append(new DataRecord)
FieldData=""
dim currentRecordNo As Integer = UBound(outData)
dim Qgroup() As String = split(rawInput,delimQuote) 'Odd numbered elements are quoted text
dim nQgroups As Integer = UBound(Qgroup)
for i as integer = 0 to nQgroups step 2 'Skip over quoted text …for now
dim field() As string = split(Qgroup(i),delimField)
if UBound(field)<0 Then field.Append("") 'fix inconsistency in how Split() handles null string
dim nFields As Integer = UBound(field)
for j as Integer = 0 to nFields
if j<>0 then
'********** Save field data for current field in current record
outData(currentRecordNo).dataField.Append(UnQuote(FieldData,delimQuote))
FieldData=""
end if
FieldData = FieldData+field(j)
if j=nFields and i<nQgroups then 'This is where we include the quoted text
FieldData=FieldData+delimQuote+Qgroup(i+1)+delimQuote
end if
next
next
'********** Save field data for last field in record
outData(currentRecordNo).dataField.Append(UnQuote(FieldData,delimQuote))
columnCount=max(columnCount,UBound(outData(currentRecordNo).dataField))
Wend
Return outData
End Function
[/code]
DataRecord is a class containing just a single property, a string array dataField(). This functions as a two dimensional array with resizable rows and columns.

what happens with an Escaped Quote?

123 , “test “stuff”” , fred

the meaning of " , ’ \ as well as CR/LF depends on the context surrounding each character

Quotes toggle a context (in quotes or out), as long as its not an Escaped Quote for example

The code that derived from what Norman designed, and Simon and I altered handles all these correctly

Quotes in CSV files are normally escaped by using a pair of quotes together: “” and this is correctly parsed in the above function. If you do have quotes that are escaped with a backslash, then the file could be preprocessed using a ReplaceAll to convert " to “”.

for the record… the words CSV and “normal” never belong in the same sentence…
and using replaceALL to preprocess CSV data is a recipe for disaster…

example… \" is not ""

You’re right. There’s no real standard for a CSV file, so anything is possible. However, you have to draw the line somewhere when it comes to trying to predict what kind of format that someone may invent. For example what if someone decides to use the backslash character just as a backslash? I based my code on the format of CSV files produced from Microsoft Excel which uses the “” for embedded quotes and uses the backslash as a backslash.

which is in fact valid 80% of the time anyways… as I said “context”
when a slash is encountered its context, and therefore what is or is not altered is based on the character(s) the immediatly follow it.

  • \’ single quote.
  • " double quote.
  • \\ backslash.

  • new line.
  • \r carriage return.
  • \t tab.
  • \b backspace.
  • \f form feed.

but something like “\z” is,… well “\z”
and "\
" is "
" not a newline

[quote=410297:@Dave S]what happens with an Escaped Quote?

123 , “test “stuff”” , fred

the meaning of " , ’ \ as well as CR/LF depends on the context surrounding each character

Quotes toggle a context (in quotes or out), as long as its not an Escaped Quote for example

The code that derived from what Norman designed, and Simon and I altered handles all these correctly[/quote]

Hi Dave,

Is the code you described available?

The code I currently use is simply:

Dim AllFileContent as String //Will hold a ReadAll of the file
Dim DataFileLines() as String //An array with each row of data
DataFileLines = SplitEx(AllFileContent,CrLf)

Which obviously can’t deal with a CrLf in the middle of a data row

Thanks, Bill

I believe it is posted on this forum in a database discussion that occurred a year or so ago. At that time Simon and I were both working on our own SQL Editors, and this became a topic of conversation over a period of a week or so.

Norman posted his version, then Simon and I independantly altered it.
Another thing is does is read the file in chunks as some of our test files were 100megs and larger :slight_smile:

If it is not here (and I think it is)… I would have to parse it out of the Tadpole code, which might prove difficult

Hi Dave,

[quote=410308:@Dave S]I believe it is posted on this forum in a database discussion that occurred a year or so ago. At that time Simon and I were both working on our own SQL Editors, and this became a topic of conversation over a period of a week or so.

Norman posted his version, then Simon and I independantly altered it.
Another thing is does is read the file in chunks as some of our test files were 100megs and larger :slight_smile:

If it is not here (and I think it is)… I would have to parse it out of the Tadpole code, which might prove difficult[/quote]

Looks like this is the discussion:
https://forum.xojo.com/40860-csv-parser/0

It references a 404 link:
http://www.rdS.com/CSVImporter.xojo_xml_project.zip

But further checking leads me to:
http://www.great-white-software.com/Xojo_Code.html
with link:
http://great-white-software.com/gws-rb-samples/CSVParser.zip

Is this the the one?

the 404 link is/was mine… I will see if I can restore it
The great-white is Norman’s original version

I have restored my copy
http://www.rdS.com/csvimporter.xojo_xml_project.zip

Hi Dave,

That link still 404’s.

Thanks, Bill

sorry… typo on the server… .it works now

Thank you sir! Got it.

Bill, I’m rather curious about which software produces CSV files where the quote character is escaped with a backslash. Could you tell me what you’re using that generates these files?

per Wikipedia

[quote]
CSV is a delimited text file that uses a comma to separate values (many implementations of CSV import/export tools allow other separators to be used). Simple CSV implementations may prohibit field values that contain a comma or other special characters such as newlines. More sophisticated CSV implementations permit them, often by requiring " (double quote) characters around values that contain reserved characters (such as commas, double quotes, or less commonly, newlines). Embedded double quote characters may then be represented by a pair of consecutive double quotes,[13] or by prefixing a double quote with an escape character such as a backslash (for example in Sybase Central).[/quote]

seems a number of database engines use " instead of “” and as previously stated there is no “standard”, so you need to assume and deal with “worst case” scenarios

That’s interesting, because neither LibreOffice nor OpenOffice will import a CSV file correctly (or at all in some cases) if the quote is escaped with a backslash. They require the quote to be escaped as per the original CSV method by adding a second quote. Although there is no official CSV standard, there is RFC4180 Common Format and MIME Type for Comma-Separated Values (CSV) Files which doesn’t make any allowance for backslash escaping. The RFC only specifies the usual three special characters: The end of line as a record delimiter, the comma as a field delimiter, and the quote character as a literal delimiter, and for escaping embedded quotes. This would imply that RFC4180 compliant parsers should treat a backslash as an ordinary non-escape character. It would seem that in order to have the ability to deal with both RFC4180 compliant and Sybase files, the user would need to be presented with the option to select whether a backslash should be treated as an escape character or not.

none of this discussion alters the facts.
There are many instances where " is 100% valid (and I personally prefer it, as I hate counting “”""" , but thats me)
but if there is even ONE datasource that does it that way, you need to be able to deal it (ie. “worst case” scenario)

and I will bet $$$ to donuts that that chances of running across " in the datastream where it should be interpeted as " are going to be be very very very small (but again, thats me)

FYI… its not just Sybase… many DB engines work that way [google it]

I’m not arguing the fact that a backslash makes more sense than two quotes. I agree that a backslash is more sensible, but it doesn’t change the fact that the quote pair is well established as a standard way of embedding quotes, at least in the spreadsheet world. You could certainly make the assumption that a backslash should always be treated as an escape character, but that goes contrary to the closest thing that there is to a standard (RFC4180), and guaranteed as soon as you make such an assumption, a counter example is going to turn up and cause the parser to fail.

I was considering whether I should revise my csv parser to optionally deal with backslash escaping. I might put that on my to-do list. BTW, It can still be done without having to parse character by character.