EndOfLine.Windows on a Mac

Marc_COURAUD · April 12, 2018, 7:29pm

I want to read a text file on a Mac.

The line endings are CR+LF.
Each line contains some data separated by a SEMICOLON (or a TAB), and some values may contain a CR surrounded by double quotes.

example :
ggg;hhh;jjj;kkk CRLF
eeee;ddd;“ggg CR ggggggg”; CRLF
aaaaa;“bbb CR ccc”;dddd; CRLF

dim file as FolderItem = GetOpenFolderItem("")
dim t as TextInputStream = TextInputStream.open(file)
t.Encoding = Encodings.WindowsLatin1
while not t.EOF
dim wline as string = t.ReadLine
–> the first line is correctly read, not the second : wline contains only : eeee;ddd;"ggg

I thought the line ending was always CRLF when the encoding is declared as WindowsLatin1, even on a Mac.
Is this not the case?

Michel_Bujardet · April 12, 2018, 8:07pm

You may want to do something like this:

[code]dim file as FolderItem = GetOpenFolderItem("")
dim t as TextInputStream = TextInputStream.open(file)
t.Encoding = Encodings.WindowsLatin1
// ReadAll returns the whole file at once, so no loop is needed
dim wline as string = ReplaceLineEndings(t.ReadAll, endofLine)
dim lines() as string = split(wline, endofline)
// Then you read from the lines() array
[/code]

By the way, it would be an excellent idea to select the code and click on the code icon above the editor next time. Notice how much more legible the same code is in my post than in yours.

http://documentation.xojo.com/index.php/ReplaceLineEndings

Philippe_Schmid · April 12, 2018, 8:10pm

Or, if the file is not too big, you could read it into memory and then split the lines with:

[code]textfile = t.ReadAll( Encodings.WindowsLatin1 )
...
lines = split( textfile, EndOfLine.Windows )
[/code]
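
A fuller sketch of the same idea, for illustration only. The variable names (f, t, textfile, lines, fields) are made up here, and the field split is deliberately naive. The point is that splitting on EndOfLine.Windows only cuts at CR+LF pairs, so a lone CR inside a quoted value stays inside its line.

[code]' Sketch only: f, t, textfile, lines and fields are illustrative names.
dim f as FolderItem = GetOpenFolderItem("")
if f <> nil then
  dim t as TextInputStream = TextInputStream.Open(f)
  dim textfile as string = t.ReadAll(Encodings.WindowsLatin1)
  t.Close
  ' Only CR+LF pairs split here; a lone CR inside a quoted value is left alone.
  dim lines() as string = Split(textfile, EndOfLine.Windows)
  for each wline as string in lines
    ' Naive field split: breaks if a quoted value itself contains a semicolon.
    dim fields() as string = Split(wline, ";")
    System.DebugLog(Str(UBound(fields) + 1) + " fields")
  next
end if
[/code]

If quoted values can contain the delimiter, a quote-aware parser is the safer route.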

Robert_Weaver · April 12, 2018, 11:46pm

[quote=382236:@Marc COURAUD]–> the first line is correctly read, not the second : wline contains only : eeee;ddd;"ggg

I thought the line ending was always CRLF when the encoding is declared as WindowsLatin1, even on a Mac.
Is this not the case?[/quote]
From the language reference:
TextInputStream.ReadLine
Returns the next line of text (as a string) from the TextInputstream. Any valid end-of-line indicator is used to identify a line.

So, regardless of the encoding, ReadLine stops when it encounters the CR.
Your file format appears to be a CSV file except for the use of a semicolon (or TAB) instead of a comma for the field delimiter. So, you may wish to check out some of the discussions on reading CSV files.

Emile_Schwarz · April 13, 2018, 12:21am

Why ?

Marc_COURAUD · April 13, 2018, 9:20am

Thanks for answering.

Yes, it’s a solution.

I cannot find a topic with “csv” in the forum: do you have a link?

Because for me, Windows line endings are associated with the WindowsLatin1 encoding.

Emile_Schwarz · April 13, 2018, 9:55am

And that is the reason for the question.
Let me rephrase it: why don't you use UTF-8?

It represents the 0 to 255 character set as used by Windows (1 thru XP ?).

Emile_Schwarz · April 13, 2018, 9:57am

Probably a bug in the Forum software search feature. Search: comma separated value

Some info here:
https://en.wikipedia.org/wiki/CSV_application_support

kevin_g · April 13, 2018, 10:13am

Character encoding and line endings are completely separate.

CR+LF is the standard line ending on MS-Windows.
Latin1 is a very common character encoding.

If you generate ASCII text files on MS-Windows there is a high probability that they will be Latin1 encoded with CR+LF as the line ending. However, it would be perfectly valid for an application to generate a Latin1 file that used CR (Mac) or LF (Unix) as the line ending.

If you are reading text files you should be prepared to support any of the 3 line ending variations. Supporting different encodings really depends on how much effort you want to put into your file reading code.
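
As a rough illustration of supporting all three variations, here is a minimal sketch of a helper that guesses which line ending a text uses. DetectLineEnding is a hypothetical name, not a built-in Xojo function, and the guess assumes the file uses one convention for its real line breaks.

[code]' Hypothetical helper (not part of Xojo): guess which line ending a text uses.
Function DetectLineEnding(s As String) As String
  ' Test CR+LF first, because it contains both of the other two markers.
  If InStr(s, EndOfLine.Windows) > 0 Then Return EndOfLine.Windows
  If InStr(s, EndOfLine.Unix) > 0 Then Return EndOfLine.Unix
  If InStr(s, EndOfLine.Macintosh) > 0 Then Return EndOfLine.Macintosh
  Return EndOfLine ' no line ending found: fall back to the platform default
End Function
[/code]

It could then be used as split(text, DetectLineEnding(text)). Because CR+LF is tested first, a file like the one in the original question is split on CR+LF only, and the lone CRs inside quoted values are left untouched.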

Jeff_Tullin · April 13, 2018, 10:14am

ReplaceLineEndings(t.ReadAll, endofLine)

I suspect that this will have the same problem, because it might convert both the wanted line endings and the CR embedded in a string field.
(Will endofline ONLY hit the Windows CRLF ones?)

If it does have a problem, I might suggest this multi-pass version (change the Windows ones to something odd, change the CR to something else, change the something odd back to Windows line endings, then process the file):

[code]dim Wholefile as string

Wholefile = t.ReadAll

Wholefile = replaceall(Wholefile,endofline.windows,"||") // I used || assuming it doesn't appear in the file!
Wholefile = replaceall(Wholefile,chr(13),"~")
Wholefile = replaceall(Wholefile,"||",endofline.windows)

dim lines() as string = split(Wholefile , endofline.windows)
[/code]
Now iterate through the lines array, and change ~ back to chr(13) before use.
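
That last step could look something like this. It is only a sketch, continuing from the lines() array built above, and it assumes that "~" never occurs in the real data.

[code]' Continuation sketch: put the embedded CRs back.
' Assumes "~" never appears in the real data.
for i as integer = 0 to UBound(lines)
  lines(i) = replaceall(lines(i), "~", chr(13))
next
[/code]

In practice you would probably split each line into fields first and restore the CRs per field, but the replacement itself is the same.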

Michel_Bujardet · April 13, 2018, 11:54am

[quote=382309:@Marc COURAUD]Thanks for answering.

Yes, it’s a solution.

I cannot find a topic with “csv” in the forum: do you have a link?

Because for me, Windows line endings are associated with the WindowsLatin1 encoding.[/quote]

What happened? Didn't you read my reply, the first one in the conversation, which contains a code suggestion?

[code]dim file as FolderItem = GetOpenFolderItem("")
dim t as TextInputStream = TextInputStream.open(file)
t.Encoding = Encodings.WindowsLatin1
// ReadAll returns the whole file at once, so no loop is needed
dim wline as string = ReplaceLineEndings(t.ReadAll, endofLine)
dim lines() as string = split(wline, endofline)
// Then you read from the lines() array
[/code]

DaveS · April 13, 2018, 5:49pm

This is what I usually do:

[code]dim Wholefile as string

Wholefile = t.ReadAll

Wholefile = replaceall(Wholefile,endofline.windows,endofline.unix) // normalize eol to be 0x0A
Wholefile = replaceall(Wholefile,endofline.macintosh,endofline.unix) // 99% of the time this is not required
dim lines() as string = split(Wholefile, endofline.unix)
[/code]

Jeff_Tullin · April 13, 2018, 6:38pm

Again, that code breaks in this instance… because the data has CR in the middle of a line, surrounded by quotes

eeee;ddd;“ggg CR ggggggg”; CRLF

Which means that

[quote]ggg;hhh;jjj;kkk
eeee;ddd;“ggg CR ggggggg”;
aaaaa;“bbb CR ccc”;dddd;[/quote]

becomes

[quote]ggg;hhh;jjj;kkk
eeee;ddd;“ggg
ggggggg”;
aaaaa;“bbb
ccc”;dddd;[/quote]

The OP needs to replace each CR that stands on its own with something else before parsing the file line by line.
If CR is replaced first, it breaks the existing CRLF at the real end of lines.
So:
Replace the CRLF with something odd (a placeholder).
Replace CR with something else.
Replace the placeholder back to CRLF.
Then the file can be split on line endings.

Jason_Parsley · April 13, 2018, 7:00pm

The CSV Parser here might help:

http://www.great-white-software.com/Xojo_Code.html

It may already handle the CR LF and embedded CR in field values.

Robert_Weaver · April 13, 2018, 10:25pm

[quote=382422:@Jeff Tullin]Again, that code breaks in this instance… because the data has CR in the middle of a line, surrounded by quotes
eeee;ddd;“ggg CR ggggggg”; CRLF[/quote]
Norman’s parser (that Jason linked to) will fix this, but needs to be modified slightly to change the field delimiter to a semicolon.
Another option is to make use of the split function to isolate the quoted material. This is the code that I use for reading CSV files (modified to use a semicolon):

[code]Function ImportCSV(csvText As TextInputStream) as DataRecord()
  'Parses semicolon delimited TextInputStream into an array of "record" arrays.
  'Return type from this routine is an array of type DataRecord.
  'DataRecord is a class containing nothing but a string array property: dataField()
  'So, an array of DataRecord is essentially a general two dimensional array
  'that can have a variable number of rows and columns.
  dim outData() as DataRecord
  dim delimField As string = ";" 'For CSV change this to "," or for TAB delimited to chr(9)
  dim delimQuote As string = chr(34)
  dim rawInput,FieldData As String
  While not csvText.EOF
    'Read a line of text
    rawInput = csvText.ReadLine
    'Read more if pending line has embedded line endings
    While (max(1,CountFields(rawInput,delimQuote)) mod 2=0) 'While quote parity is odd…
      if csvText.EOF then 'Big trouble!
        MsgBox "Encountered EOF while processing quoted text. Closing quote is missing."
        'Could handle this by returning outData as is,
        'or add a closing quote to the last record.
        'Or…
        return nil 'which means bad file data regardless.
      end if
      rawInput = rawInput + EndOfLine + csvText.ReadLine
    Wend
    ' ********** Start new record
    outData.Append(new DataRecord)
    FieldData=""
    dim currentRecordNo As Integer = UBound(outData)
    dim Qgroup() As String = split(rawInput,delimQuote) 'Odd numbered elements are quoted text
    dim nQgroups As Integer = UBound(Qgroup)
    for i as integer = 0 to nQgroups step 2 'Skip over quoted text …for now
      dim field() As string = split(Qgroup(i),delimField)
      if UBound(field)<0 Then field.Append("") 'fix inconsistency in how Split() handles null string
      dim nFields As Integer = UBound(field)
      for j as Integer = 0 to nFields
        if j<>0 then
          ' ********** Save field data for current field in current record
          outData(currentRecordNo).dataField.Append(UnQuote(FieldData,delimQuote))
          FieldData=""
        end if
        FieldData = FieldData+field(j)
        if j=nFields and i<nQgroups then 'This is where we include the quoted text
          FieldData=FieldData+delimQuote+Qgroup(i+1)+delimQuote
        end if
      next
    next
    ' ********** Save field data for last field in record
    outData(currentRecordNo).dataField.Append(UnQuote(FieldData,delimQuote))
  Wend
  Return outData
End Function

Function UnQuote(s As String, Qchar As String) as string
  ' Remove enclosing quotes (if any) from CSV field and unEscape embedded quotes
  ' Called by ImportCSV and ImportCSVdb
  dim temp As String = s
  if left(s,1)=Qchar and right(s,1)=Qchar then
    temp=Mid(s,2,Len(s)-2)
  else
    temp=s
  end if
  Return ReplaceAll(temp,Qchar+Qchar,Qchar)
End Function

Class DataRecord
  Property
    dataField() As String
  EndProperty
End Class
[/code]
I’ve fed it a lot of messy CSV text, and haven’t managed to break it so far. This example puts the data into an array of ‘DataRecord’ objects, which is essentially a variable-size two-dimensional string array. To handle the field data differently, you need to alter the code following the 3 comment lines that begin with ' **********.
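
For what it's worth, calling it on the file from the original question might look roughly like this. This is a hypothetical usage sketch: it assumes ImportCSV and DataRecord exist in the project as posted, and the variable names (f, t, records, rec) are made up for the example.

[code]' Hypothetical usage sketch; assumes ImportCSV and DataRecord exist as posted above.
dim f as FolderItem = GetOpenFolderItem("")
if f <> nil then
  dim t as TextInputStream = TextInputStream.Open(f)
  t.Encoding = Encodings.WindowsLatin1
  dim records() as DataRecord = ImportCSV(t)
  t.Close
  ' Not handled here: ImportCSV returns nil when a closing quote is missing.
  for each rec as DataRecord in records
    ' Show the fields of each record, separated by " | "
    System.DebugLog(Join(rec.dataField, " | "))
  next
end if
[/code]

Note that ImportCSV rejoins continuation lines with the platform EndOfLine, so an embedded CR comes back as whatever EndOfLine is on the machine running the code.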