Some additional information:
The word “This” appears in the original UTF-16 encoded HTML file as “7KLV”, or the hex bytes 37 4b 4c 56. Note that the rest of the file, as viewed in BBEdit, renders as normal HTML (again, BBEdit recognizes the file as UTF-16, though). Only the display strings within the file show up in this encoding. Somehow, the browser knows how to convert this back to “This” for display.
Sorry, yes I did that. I’m actually opening the HTML file as a BinaryStream and then testing the data against a few encodings; this works to identify the encoding (UTF-16BE in this case).
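For anyone following along outside Xojo, here is a rough Python sketch of that kind of check. It is not the code used above: `guess_encoding` is a hypothetical helper, and the zero-byte heuristic assumes the file is mostly ASCII-range text.

```python
import codecs

def guess_encoding(data: bytes) -> str:
    """Best-guess encoding name for raw file bytes (hypothetical helper)."""
    # A byte order mark, when present, is the most reliable signal.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    # Assumption: mostly ASCII-range text. In UTF-16 such text has a zero
    # byte in every other position, and the position gives the byte order.
    even_zeros = data[0::2].count(0)
    odd_zeros = data[1::2].count(0)
    threshold = max(len(data) // 4, 1)
    if even_zeros > odd_zeros and even_zeros >= threshold:
        return "utf-16-be"
    if odd_zeros > even_zeros and odd_zeros >= threshold:
        return "utf-16-le"
    # Otherwise accept UTF-8 if the bytes decode cleanly.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"  # single-byte fallback that never fails to decode

sample = "<div>7KLV</div>".encode("utf-16-be")
print(guess_encoding(sample))
```

A heuristic like this only identifies the byte-level encoding; it says nothing about the extra layer that turns “This” into “7KLV”.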
I’ve made some progress; the above “7KLV” is the UTF-8 encoding of the data from the HTML file. Using ConvertEncoding, I get back to the proper UTF-16BE encoding in the string (so I can’t just use DefineEncoding, because the data has already been re-encoded by the browser, I guess). So now I have a UTF-16BE string with content that exactly matches the original HTML file (verified with a hex editor). The question now is, how do I recover the plaintext? There’s some other layer of encoding here, and I’m not sure what it is, but the browser can figure it out and render the text properly.
HTML file: _7_K_L_V (where underscore is a 00 in the UTF-16BE encoding)
Copy from clipboard: 7KLV
Convert to UTF-16BE: _7_K_L_V
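The same three steps can be sketched in Python as a sanity check. The byte values are the ones quoted in this thread; the variable names are mine, and `decode`/`encode` here only stand in for what ConvertEncoding does in Xojo.

```python
# Bytes of the text node as stored in the UTF-16BE HTML file.
original = bytes.fromhex("0037004b004c0056")

# Bytes handed back by the browser / clipboard (UTF-8).
clipboard = bytes.fromhex("374b4c56")

# Interpret the clipboard bytes as UTF-8 text, then re-encode as UTF-16BE.
text = clipboard.decode("utf-8")
recovered = text.encode("utf-16-be")

print(text)                   # the string as pasted
print(recovered == original)  # True when the round trip loses nothing
```

Both forms carry the same four characters, so converting between them changes the bytes but not the text. The mapping to “This” has to come from somewhere else.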
Yes. The problem is the browser. It is passing me back the underlying data (as UTF-8) rather than the actual decoded text. So when Xojo receives the text, the string’s encoding is correctly set: it is a UTF-8 string, but it is a UTF-8 encoding of the original UTF-16BE string, so I need to call ConvertEncoding to get back to that (simply defining the encoding is wrong in this case).
The problem is, there is still another layer of encoding. The original HTML file contains “7KLV” to represent the word “This” for some reason. I have no idea what this encoding is. It’s like Base64 or something (but not). However, the browser seems to know what it is because it has no trouble displaying “This”. I just want to do whatever it is doing to get from “7KLV” --> “This”.
I have an HTML file encoded in UTF-16BE (I don’t know why… it just is). It contains a text node with the content “_7_K_L_V” (hex: 0037 004b 004c 0056). The browser loads this and properly renders it as the word “This”.
I am getting the user’s selection of the word “This” that they see on the screen in two ways, but both result in the same thing:
b. A simple copy/paste via the clipboard
Both return a UTF-8 version of the underlying string: “7KLV” (hex: 37 4b 4c 56)
Starting with this string in Xojo, I can get back to UTF-16BE with ConvertEncoding (not sure if this will be necessary or important to the final solution, but I wanted to make sure I could recover exactly what was in the HTML file).
What I need to do now is take this string “7KLV” and convert it to “This” – I just have no idea what the encoding is. However the browser clearly does and is doing it – does anyone know what that is?
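One way to probe an unknown mapping like this is to line up a known pair and compare code points. A minimal Python sketch, using only the “This” / “7KLV” pair from this thread (`code_point_offsets` is a hypothetical helper name):

```python
def code_point_offsets(shown: str, stored: str) -> list:
    """Difference in code points between displayed and stored characters."""
    if len(shown) != len(stored):
        raise ValueError("strings must have the same length")
    return [ord(s) - ord(t) for s, t in zip(shown, stored)]

offsets = code_point_offsets("This", "7KLV")
print(offsets)  # [29, 29, 29, 29]
```

A constant difference across all four characters suggests a simple fixed shift rather than something like Base64. Four characters is a very small sample, though, so it would be worth repeating this on longer runs of text from the file.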
I did the BBEdit save-as-UTF8 experiment. It still opens fine in Safari, but the text in BBEdit is still encoded. Interestingly, I think BBEdit was smart for me and actually changed the meta charset by itself to “utf-8”… I didn’t edit it, but it was changed when I saved.
Yes, the plain text between the tags is this gibberish. In this case, “7KLV” comes right after a <div> tag.
Your observation about adding dec 29 is intriguing. I’m wondering how it would work for the unprintable characters. Adding the next three characters in the string, “This in”, we have:
Well, ConvertEncoding successfully takes the UTF-8 version here back to the UTF-16 original. So at least in whatever is going on through the browser and copy/paste, no data is being lost, which was a concern since it would make the original unrecoverable.
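To make the add-29 idea concrete, here is a small Python sketch of what a fixed shift would do in both directions. It assumes the offset really is a constant 29 for every character, which so far is only confirmed for “This”; `decode_shift` and `encode_shift` are hypothetical names.

```python
OFFSET = 29  # assumption: the same shift applies to every character

def decode_shift(stored: str, offset: int = OFFSET) -> str:
    """Stored text -> displayed text, by adding the offset to each code point."""
    return "".join(chr(ord(ch) + offset) for ch in stored)

def encode_shift(shown: str, offset: int = OFFSET) -> str:
    """Displayed text -> stored text. Raises ValueError for code points below the offset."""
    return "".join(chr(ord(ch) - offset) for ch in shown)

print(decode_shift("7KLV"))           # This
print(repr(encode_shift("This in")))  # '7KLV\x03LQ'
```

Under this rule a space (decimal 32) would be stored as code point 3, an unprintable control character, so that is what to look for in the hex editor right after “7KLV”. If the file has something else there, the shift is not uniform and the mapping needs a lookup table instead.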