Encodings... not sure what is happening. Any ideas?

I’m trying to track down an encodings issue. I’ll say up front: don’t ask exactly why I’m doing what I’m doing here; I’m more interested in understanding what is going on so I can get things consistent. I’ve tried to be clear about how I think things are working in terms of characters vs. bytes.

Here’s what happens:

  1. Read a PNG file in a binary stream into a string; encoding is nil. First bytes in hex are 89 50 4e 47
  2. ConvertEncoding to UTF-8; string binary content stays the same but encoding changes to UTF-8
  3. Add double-quotes on both ends as follows: """"+png_string+""""; first bytes are now 22 89 50 4e 47
    – Having converted to UTF-8, my understanding is that this statement concatenates a double-quote character in the front and back of the string
  4. I now test the first character of the string to see if it is a double-quote:
    – if I test png_string.left(1)="""", it fails; why? I’m testing the first character, which I previously concatenated with the same syntax
    – if I test png_string.left(1).asc = &h22, it succeeds. However, I don’t see the difference, since asc examines the first character, according to the documentation.
  5. I take png_string and execute png_string.mid(2) to remove the double-quote. However, it removes the first two bytes, so I have hex bytes: 50 4e 47…
  6. Later, I assign png_string (with the double-quote) to a textarea’s .value. If I look at the textarea.value in the debugger immediately after this assignment, it contains bytes 22 EF BF BD 50 4e… it has replaced 89 with EF BF BD. The encoding on png_string is still UTF-8 and on the textarea.value is UTF-8. What happened here?

More importantly, is there any way at this point to recover the original data? It would be OK if there were a ConvertEncoding I could use to get it back, but I don’t know what transform it did because it says UTF-8.

Thanks -

Don’t convert the encoding. That will alter the contents of your data. Just leave it nil-encoded. Or, if you know that you have valid UTF-8 data, use DefineEncoding. That will preserve the data and just set the encoding. But I don’t believe hex 89 is valid for UTF-8.
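To back up the point about hex 89: here is a small sketch in Python (not Xojo, just an illustration of the byte-level rule) that checks whether the first bytes of a PNG are valid UTF-8. The helper name `is_valid_utf8` is made up for this example. In UTF-8, a byte of the form 10xxxxxx can only continue a multi-byte sequence, so it cannot start a character.

```python
# First four bytes of a PNG file, as quoted in the question.
data = bytes([0x89, 0x50, 0x4E, 0x47])

def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(data))      # False: 0x89 cannot start a character
print(is_valid_utf8(data[1:]))  # True: 50 4e 47 is plain ASCII "PNG"

# 0x89 has the bit pattern 10xxxxxx, which marks a continuation byte.
print(format(0x89, "08b"))      # 10001001
```

So tagging this data as UTF-8 (by either function) leaves you with a string whose first byte is not a legal character, which is probably why character-based functions behave oddly on it.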

I originally wasn’t converting encodings. But actually, it made no difference. I assumed that, if it was nil encoded, convert encoding would render it as something valid in UTF-8, while DefineEncoding would assert that it is UTF-8 (Which would be wrong, because as you point out, 89 probably isn’t valid). But, convert did absolutely nothing other than change the indicated encoding on the string to UTF-8.
I would happily leave it as nil, except that all of the other issues still happen, especially the unexplained conversion happening on assignment to the textarea.value.

TextArea automatically converts to UTF-8, which makes it a poor choice for data storage. You’re not guaranteed to get out of it exactly what you put in. What are you using the textarea for? It appears you’re putting binary data into it.
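As a rough illustration of what that conversion does to the bytes, here is a Python sketch (not Xojo). It assumes the control acts like a lenient UTF-8 decoder that swaps each undecodable byte for U+FFFD, the Unicode replacement character; that is an assumption about the control, but it matches the 22 EF BF BD 50 4e… seen in the debugger.

```python
# Bytes from the thread: a double-quote, then the start of the PNG signature.
quoted = b'"' + bytes([0x89, 0x50, 0x4E, 0x47]) + b'"'

# Assumption: the control behaves like a lenient UTF-8 decoder that
# substitutes U+FFFD for each byte sequence it cannot decode.
text = quoted.decode("utf-8", errors="replace")
stored = text.encode("utf-8")

print(stored.hex(" "))  # 22 ef bf bd 50 4e 47 22

# A different invalid byte collapses to the same replacement character,
# so the original value cannot be worked out afterwards.
other = bytes([0x8A]).decode("utf-8", errors="replace")
print(other == text[1])  # True
```

EF BF BD is simply U+FFFD written out in UTF-8. Because every invalid byte maps to that same character, the substitution is one-way.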

That was the “don’t ask what I’m doing” part :wink:

But, in short, it may be the case that a user opens a binary file through this process. I would like to have a single, consistent process for capturing whatever data is selected. But, yes, I’m pretty sure this isn’t how I’m going to proceed. I was really more interested in figuring out what was going on with encodings throughout this scenario.

I assumed it might be, but if that is the case, why didn’t ConvertEncoding originally make these changes to the data?
The follow-up question is, if this is a conversion to UTF-8, can I recover the binary string? I tried WindowsANSI, which does change the data, but not correctly (the 89 that became EF BF BD turns into something else – one byte, but incorrect).
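On the recovery question, a small Python sketch (an analogy at the byte level, not Xojo code and not a claim about what ConvertEncoding does internally) suggests why going back through a Windows code page cannot restore the 89. It assumes the EF BF BD in the string is U+FFFD, the replacement character, and uses Python's `cp1252` codec as a stand-in for WindowsANSI.

```python
replacement = "\ufffd"  # what the invalid byte 0x89 turned into

# Windows-1252 has no code point for U+FFFD, so a strict conversion fails...
try:
    replacement.encode("cp1252")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...and a lenient one substitutes a stand-in byte, not the original 0x89.
lenient = replacement.encode("cp1252", errors="replace")

print(strict_ok)      # False
print(lenient.hex())  # 3f
```

Here the stand-in is a question mark (hex 3f). The single wrong byte seen in Xojo may be something similar, but either way the original value is no longer in the string.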

Store your data into a memory block BEFORE you try to manipulate it for display. This will allow you to try multiple conversions without the need to reload the physical data.
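The same idea sketched in Python rather than Xojo, as an illustration only: keep the raw bytes untouched in one buffer and derive a text-safe form (hex or Base64 here, both my choice for the example) for anything that has to go through a text control. The sample bytes are the standard 8-byte PNG signature.

```python
import base64

# Raw bytes kept untouched; this plays the role of the memory block.
raw = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

# Text-safe views derived from the raw bytes. Both are pure ASCII,
# so a UTF-8 text control can hold them without altering anything.
display_hex = raw.hex(" ")
display_b64 = base64.b64encode(raw).decode("ascii")

print(display_hex)  # 89 50 4e 47 0d 0a 1a 0a
print(display_b64)

# Either view converts back to exactly the original bytes.
print(bytes.fromhex(display_hex) == raw)     # True
print(base64.b64decode(display_b64) == raw)  # True
```

Unlike pushing the binary data straight into the control, both of these round-trip exactly.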

Also, what if you create a multi-line label with selectable set and place the data into that for display?

ConvertEncoding uses a different mechanism than the control does. The control does its own thing based on the OS.

Hmmm… that either means there’s more than one right way to do the same thing that gives different results, or… I think I’ll just leave that alone :wink: