Windows text files NULL character ... something changed?

I ran into unexpected behaviour with text files on Windows in Xojo 2017r2.1, compared with the behaviour of Xojo 2015r2.4.

I’m finding a NULL character, chr(0), in a text file written on Windows by the latest Xojo. In files produced by the same app compiled with earlier versions of Xojo on Windows, that character doesn’t appear. This caused some very strange file parsing behaviour in one of my apps, only on Windows. So I had to add a method to remove the null character.

Is it a bug? Or by design? Something changed?

Thanks.

UTF8 encoding as opposed to ANSI

Chr(0) is the same in all 8-bit encodings.

Are you sure you convert all strings to the right encoding?

Normally you only get extra null bytes when you write UTF16 or UTF32 to a file mixed within other encodings.
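To illustrate where those extra null bytes come from, here is a small sketch in Python (not Xojo, just because it is quick to run). It assumes nothing about the app in question; it only shows that plain ASCII text carries zero bytes once it is encoded as UTF16 or UTF32, while the same text in UTF8 has none.

```python
# The same ASCII text encoded three ways; only the wide encodings
# contain zero bytes.
text = "Hi"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no BOM
utf32_bytes = text.encode("utf-32-le")  # little-endian, no BOM


def count_nulls(data: bytes) -> int:
    """Count the zero bytes in a byte string."""
    return data.count(0)


for label, data in (("UTF8", utf8_bytes),
                    ("UTF16", utf16_bytes),
                    ("UTF32", utf32_bytes)):
    print(label, data, "null bytes:", count_nulls(data))
```

If bytes like these are written into a file that is later read as an 8-bit encoding, every one of those zero bytes shows up as a chr(0).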

Since this particular file is written and read only by my app, I wasn’t worrying about encodings at all. I assumed Xojo always used UTF8.

The Chr(0) character confuses Xojo’s split and join functions. I got very weird results with the character there.

Xojo doesn’t always use UTF8. You must set the encoding yourself.

I don’t think this is a Xojo problem. I noticed a couple of months ago that notepad on Windows adds an extra character at the beginning of a textfile. I didn’t look what character, I just don’t want any extra characters. Since then I use a (Xojo-) program to look at texts.

I work with data supplied as text files a lot.
In recent years, now and then I get a text file that behaves strangely (some apps refuse to believe it contains any data), and every time I check, it is because of invisible prefix characters in the first byte or two of the file.

You can check easily:
Open the file in notepad, then go to file/save as
If the file type says UTF8, change it to ANSI and save.
That ‘fixes’ the problem if you can’t accept a UTF8 file.

I usually find this for files generated by apps like Business Objects… if there is any kind of special character in a name field or similar, you get a unicode file when exporting to text.

In reverse, if you save an ANSI file as UTF8 in notepad, it prefixes &hEF, &hBB, &hBF before the first letter.

It’s the byte order mark.
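A quick way to see this for yourself, sketched in Python rather than Xojo: Python’s "utf-8-sig" codec writes the same three-byte prefix that notepad adds, and drops it again on reading. The sample text is made up for the illustration.

```python
# Encode the same text with and without a UTF8 byte order mark.
text = "name,value"  # hypothetical file contents

with_bom = text.encode("utf-8-sig")  # UTF8 with a leading BOM
without_bom = text.encode("utf-8")   # plain UTF8

# The only difference is the three-byte prefix EF BB BF.
print(with_bom[:3].hex())

# A plain UTF8 decode keeps the BOM as an invisible U+FEFF character,
# which is what confuses parsers that expect data in the first column.
decoded_plain = with_bom.decode("utf-8")

# The "utf-8-sig" codec removes a leading BOM if one is present.
decoded_sig = with_bom.decode("utf-8-sig")

print(repr(decoded_plain))
print(repr(decoded_sig))
```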

What I find baffling about this is that it never happened before, and I can’t find any reason that it should just start happening, that chr(0) somehow gets into the file. This is only on Windows, and appears to be only with the latest version of Xojo.

Xojo writes the text file, and Xojo reads the text file. I do some string parsing after reading the file. This has worked for over a decade. Now all of a sudden when Xojo reads the file on Windows, it sees a chr(0) there. Well, that’s just bizarre. How did it get there? I’m not setting or converting any encoding. Ever since I have been using Xojo, I haven’t had to with files that are only used internally this way by the app. The user never touches the file. Seems like a bug to me, or an API change that leads to a “by design” difference in operation. Either way I’d just like to know what’s going on.

Take a look at the code base for Custom Edit Field (CEF)… (it’s mentioned many times on this forum)… In there is a method used to “guess” a file’s encoding, including the detection (and removal) of a BOM if one should exist

and BOM can be encoded in a few different ways…

  • 0xFFFE0000 - UTF32 (little)
  • 0x0000FEFF - UTF32 (big)
  • 0xFFFE - UTF16 (little)
  • 0xFEFF - UTF16 (big)
  • 0xEFBBBF - UTF8
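As a rough idea of what such a BOM check can look like, here is a minimal sketch in Python. It is not the CEF method, and the function name `detect_bom` is made up for this example. It uses the BOM constants from Python’s standard `codecs` module and tests the four-byte UTF32 marks first, because the UTF32 little-endian mark begins with the same two bytes as the UTF16 little-endian one.

```python
import codecs

# Longest signatures first: the UTF32 LE BOM (FF FE 00 00) starts with
# the UTF16 LE BOM (FF FE), so order matters.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_without_bom(data: bytes, fallback: str = "utf-8") -> str:
    """Strip a leading BOM and decode; fall back to UTF8 when there is none."""
    name, length = detect_bom(data)
    return data[length:].decode(name or fallback)


print(detect_bom(b"\xef\xbb\xbfhello"))
print(decode_without_bom(b"\xff\xfeh\x00i\x00"))
```

This only looks at the first few bytes. A file without a BOM still needs its encoding guessed or known some other way, and a UTF16 little-endian file whose first character is U+0000 would be mistaken for UTF32 by this check.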

How are you writing the file? TextOutputStream? BinaryOutputStream? Writeline? Write?

Can you reproduce the issue in a small project?

Where are you getting the data from in the first place? Perhaps it’s something else - socket, db, etc - that is introducing the stray byte.

[quote=353572:@Tim Hare]How are you writing the file? TextOutputStream? BinaryOutputStream? Writeline? Write?

Can you reproduce the issue in a small project?

Where are you getting the data from in the first place? Perhaps it’s something else - socket, db, etc - that is introducing the stray byte.[/quote]

It’s a TextOutputStream using WriteLine, and thank you Tim Hare; I think you just gave me the clue I needed to solve this problem…

stream.WriteLine EncodeBase64(myTextField.text)

How much do you want to bet that the byte is getting entered into the TextField? Probably by using copy/paste from something like Notepad.

I can’t test this now, but if that’s what’s happening, then this is a real forehead-slapper.

In my defense, the reason I didn’t “sanitize” the TextField contents is that it gets approved by a server query. Looks like the server ignores chr(0), but the file doesn’t.
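If that theory holds (it is still unproven at this point), the mechanism would look something like the Python sketch below. Base64 encodes every byte it is given, so a null character pasted into the field survives the trip through the file untouched. The field contents and the `sanitize` helper are hypothetical, made up for the illustration.

```python
import base64


def sanitize(text: str) -> str:
    """Drop NUL characters from the field text before it is stored."""
    return text.replace("\x00", "")


# Hypothetical field contents with a stray chr(0) pasted in.
field_text = "license\x00key"

# Base64 preserves every byte, including the zero byte.
encoded = base64.b64encode(field_text.encode("utf-8"))
round_trip = base64.b64decode(encoded).decode("utf-8")
print("null survives:", "\x00" in round_trip)

# Sanitizing before encoding keeps the null out of the file entirely.
clean_encoded = base64.b64encode(sanitize(field_text).encode("utf-8"))
clean_round_trip = base64.b64decode(clean_encoded).decode("utf-8")
print(repr(clean_round_trip))
```

Cleaning the text before it is encoded, rather than after it is read back, means the stored file never contains the stray byte in the first place.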

I can’t prove it yet, but I think this mystery is solved.

Thanks, everyone.

Try running your app in 2017r1.1. It’s possible that the String refactor in 2017r2 changed or broke something that no one else has come across yet.

If so, please file a bug report.