Remove BOM characters from csv file

Craig_Grech · May 6, 2024, 9:17am

My app is reading a csv file then comparing the first row of headers with a list.
Trouble is the file is saved as UTF8 and has BOM characters in front so when I compare the first column header it doesn’t match any of the choices even though the matching text looks the same (binary doesn’t).
How do I remove the BOM characters?

Christian_Schmitz · May 6, 2024, 9:29am

you may read the bytes and then decide which text encoding to use.
And for UTF-8 you get 1 character as the first letter in the text, which encoded takes 3 bytes. First character may then be 65279.

e.g.

if ASC(s.left(1)) = 65279 then
   s = s.mid(2)
end if

TimStreater · May 6, 2024, 10:20am

Arrange to have it saved without the BOM, which is not needed (or desired) for UTF-8.

Jeff_Tullin · May 6, 2024, 11:50am

I know from experience that it is rare to have control over the file creation.
I have in the past read the first line and compared not the entire line, but instead checked whether the read line contained enough source fields, using


readline = thefile.readline
if  instr(readline, "FIELD1,FIELD2,ETC") > 0 then
//i like it

end if

When I know the file is intended correctly, I can open and use it.
Usually you can decide that it is UTF8 , and still open a straight ASCII file without issue.
The problems happen in the other direction - assuming ASCII, getting ‘something else’

MarkusR · May 6, 2024, 2:15pm

you could test the first column with and without BOM as valid file.

i remember notepad app at windows pc add a BOM in front of a text/batch file.

Brian_O_Brien · May 6, 2024, 2:28pm

Can you not, redefine the encoding?
Kem?

Craig_Grech · May 6, 2024, 3:56pm

The csv file usually looks like this (but may have different headers and columns in any order) -

name,color,breed,DOB, …
Fluffy,red,labrador,2021-03-02, …

After reading the first line and splitting I compare with common terms to find matching column numbers. This fails on a UTF-8 file created on a Mac and won’t find a match for the first word “name”.

'Store column number from user title
Do Until arrFile(0, i) = ""
  Dim selectField as string = trim(Lowercase(arrFile(0, i)))
    
  Select Case  selectField
  Case "uid","id","oldid"
    hPrimaryId = i
  Case "name","animal name"
    hName = i
  Case "color", "colour"
    hColour = i
  Case "breed", "strain", "species"

Craig_Grech · May 6, 2024, 4:25pm

I’ve tried UTF16 and ASCII and it adds even more bytes to the front of the file.
What encoding would you suggest?

Craig_Grech · May 6, 2024, 4:29pm

Unfortunately most of my customers have saved an Excel spreadsheet on Mac as a csv file. There isn’t any other options on the list. And yes Excel for Mac is where the problem starts.

Craig_Grech · May 6, 2024, 4:31pm

Thanks that works. I don’t know about whether other encodings will pop up but that is certainly a good start.

TimStreater · May 6, 2024, 4:49pm

Erm, you should mark @Christian_Schmitz 's post as the solution, not your own.

Brian_O_Brien · May 6, 2024, 5:54pm

ASCII-7bit.

Brian_O_Brien · May 6, 2024, 6:02pm

While the subject is being discussed. This BOM (Byte Order Marker.)
When there are two 16 bit characters, might the BOM be ‘indicating’ the endianness of the data.
I found the Memory Block class to be very useful in this context.

I sympathies with you on this, I had MS Notepad mess up a document on me as I was not paying attention to the encoding at the bottom right of the application.

Watch out for CR/LF ENDOFLINE .
Isn’t there a new Text Class for this?

TimStreater · May 6, 2024, 6:08pm

Yes, exactly. That’s why one uses UTF-8, which doesn’t need it.

Serge_Louvet1 · May 11, 2024, 8:34pm

The BOM gives two information: 1/ the file is UTF-encoded, 2/how the code points are saved to file. AFAIK, that character was introduced when UCS-2 encoding was used, and this is the origin of its name. For UTF-8, there are no question about byte order. The presence of the BOM tells you the file is UTF encoded, not encoded using one of the zillions of ‘extended’ ascii, hence its importance.