Encoding issue with text from HTML

Matthew_Dinmore1 · January 30, 2021, 2:10pm

I have an HTML file I’m loading into an HTMLViewer, and I have javascript I execute against it to retrieve the user’s selection so I can copy it to a text field. One file I’ve opened is encoded in UTF16BE. I get the selected data back from the javascript as it appears in the HTML file (at least, when viewed in BBEdit), but I don’t know how to convert it to plain text. (One interesting thing here is that the HTML file is in UTF16BE, and has the charset meta set to utf16-be, but the actual text elements appear as gibberish in the file itself. However, it renders properly in the browser. I’m not sure I understand what’s happening there).

What I’ve tried so far:

on the returned string, .DefineEncoding(UTF16)
ConvertEncoding(returnedString,UTF16)
ConvertEncoding(returnedString.DefineEncoding(UTF16),UTF8)
ConvertEncoding(ConvertEncoding(returnedString.DefineEncoding(nil),UTF16),UTF8)

These all basically turn it into Chinese (the last one actually gives me back the original (UTF16) string, more or less).

Any thoughts on how to recover the text (in UTF8)?
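
For reference, here is a minimal Python sketch (not Xojo) of why reinterpreting the bytes yields Chinese-looking text. It assumes the returned string holds the bytes 37 4b 4c 56 quoted elsewhere in this thread. Decoding those bytes as UTF-16BE is only a rough analogue of DefineEncoding, not a statement about how Xojo works internally.

```python
# Bytes returned for the selected word (taken from the hex dump in this thread).
raw = bytes.fromhex("374b4c56")

# Reinterpreting the same bytes as UTF-16BE (a rough analogue of DefineEncoding)
# pairs them up into two 16-bit code units instead of four characters.
misread = raw.decode("utf-16-be")
code_points = [hex(ord(ch)) for ch in misread]

# Both code points land in CJK Unified Ideographs Extension A (U+3400..U+4DBF),
# which is why the result looks like Chinese.
in_cjk_ext_a = all(0x3400 <= ord(ch) <= 0x4DBF for ch in misread)

# Treating the bytes as UTF-8 keeps the four ASCII characters intact.
as_text = raw.decode("utf-8")

print(code_points, in_cjk_ext_a, as_text)
```

If this matches what the application receives, the bytes are already valid UTF-8 and only need transcoding, not relabelling.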

Matthew_Dinmore1 · January 30, 2021, 3:17pm

Some additional information:
The word “This” appears in the original UTF-16 encoded HTML file as “7KLV” or the hex characters 37 4b 4c 56. Note that the rest of the file, as viewed in BBEdit, renders as normal HTML (again, BBEdit recognizes the file as UTF16, though). Just the display strings within the file show in this encoding. Somehow, the browser knows how to convert this back to “This” for display.

The javascript returns “7KLV” as well. My question is, what do I have to do with the encodings to get it back to “This”?

Mike_D · January 30, 2021, 4:03pm

Why not use UTF16BE? Xojo has this specific encoding available. See https://documentation.xojo.com/api/text/encoding_text/encodings.html

Matthew_Dinmore1 · January 30, 2021, 4:13pm

Sorry, yes I did that. I’m actually opening the HTML file as a binarystream and then testing the data against a few encodings; this is working to identify the encoding (UTF-16BE in this case).

I’ve made some progress; the above “7KLV” is the UTF8 encoding of the data from the HTML file. Using ConvertEncoding, I get back to the proper UTF-16BE encoding in the string (so, can’t just DefineEncoding because the data has already been re-encoded by the browser, I guess). So, now I have a UTF16-BE string with content that exactly matches the original HTML file (verified with a hex editor). The question now is, how do I recover the plaintext? There’s some other layer of encoding here, and I’m not sure what it is, but the browser can figure it out and render the text properly.

TimStreater · January 30, 2021, 4:22pm

What do you mean by the plaintext, here? The text that the browser displays? Or the html file in UTF8?

Matthew_Dinmore1 · January 30, 2021, 4:28pm

Yes, the text that the browser displays:

HTML file: _7_K_L_V (where underscore is a 00 in the UTF-16BE encoding)
Browser: This
Copy from clipboard: 7KLV
Convert to UTF-16BE: _7_K_L_V

What’s happening is the browser displays “This” and the user is selecting the word “This” to copy. The string I receive (through both Javascript to get the selected text in the HTMLViewer or normal copy via the clipboard) is “7KLV” which is a UTF-8 encoding. I can convert that back to the original UTF-16BE encoding ("_7_K_L_V"), but I still need to get back to “This” somehow. I’m not clear on what the last layer of encoding is.
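
The byte-level round trip described above can be sketched in Python (used here only for illustration; the Xojo counterpart would be ConvertEncoding). The variable names are hypothetical.

```python
# Text as received from the clipboard or from the JavaScript selection call.
copied = "7KLV"

utf8_bytes = copied.encode("utf-8")          # 4 bytes
utf16be_bytes = copied.encode("utf-16-be")   # 8 bytes, each preceded by 00

print(utf8_bytes.hex(" "))
print(utf16be_bytes.hex(" "))

# Transcoding changes the byte layout, not the characters themselves.
same_text = utf8_bytes.decode("utf-8") == utf16be_bytes.decode("utf-16-be")
print(same_text)
```

Because both byte sequences decode to the same four characters, switching between UTF-8 and UTF-16BE cannot by itself turn "7KLV" into "This". The remaining step has to be a mapping at the character level.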

TimStreater · January 30, 2021, 4:46pm

Why do you say that 7KLV is specifically UTF8? Upthread you listed the bytes as being 37 4b 4c 56. These are ASCII values (which also happen to be the same in UTF8).

I would have expected you to get (for “This”): 54 68 69 73.

Just add 29 decimal to each byte

If the string “This” is visible to be selected then that’s what it will be like in the html file. Can you post the html file as seen in BBEdit? (or some of it)
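
A quick way to check the offset is the following Python sketch. The helper name shift_text is made up for illustration, and the offset of 29 is only what is observed for this one word, not a known property of the file.

```python
OFFSET = 29  # decimal difference observed between the stored and the expected characters

def shift_text(text: str, offset: int = OFFSET) -> str:
    """Add a fixed offset to every code point (hypothetical helper)."""
    return "".join(chr(ord(ch) + offset) for ch in text)

# Per-character difference between what the file holds and what the browser shows.
differences = [ord(shown) - ord(stored) for stored, shown in zip("7KLV", "This")]

decoded = shift_text("7KLV")
print(differences, decoded)
```

Whether the same offset holds for every character in the file would still need to be checked against more samples.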

Beatrix_Willius · January 30, 2021, 4:49pm

Do you already have an encoding defined for your html? The order for fixing encodings normally is
s = DefineEncoding(s, Encodings.UTF16BE) // or Encodings.UTF16LE, whichever matches the data
s = ConvertEncoding(s, Encodings.UTF8)

Matthew_Dinmore1 · January 30, 2021, 5:15pm

Yes. The problem is the browser. It is passing me back the underlying data (and as UTF8) rather than the actual decoded text. So, when Xojo receives the text, it has the proper encoding for the string – it is a UTF8 string, but it is a UTF8 encoding of the original UTF16BE string, so I need to do a convertencoding to get back to that (simply defining the encoding in this case is wrong).

The problem is, there is still another layer of encoding. The original HTML file contains “7KLV” to represent the word “This” for some reason. I have no idea what this encoding is. It’s like Base64 or something (but not). However, the browser seems to know what it is because it has no trouble displaying “This”. I just want to do whatever it is doing to get from “7KLV” --> “This”.

Mike_D · January 30, 2021, 5:18pm

Weird. Can you give us a small zipped, sample HTML file we can take a look at?

TimStreater · January 30, 2021, 5:19pm

If you’re loading the html file into the Viewer, why don’t you have a UTF8 version to load in?

By the way I wasn’t kidding about the 29 decimal. Each byte is off by that amount.

Edit: Hmm, it’s not some odd ROT-13 business is it?

Matthew_Dinmore1 · January 30, 2021, 5:22pm

What’s happening is this:

1. I have an HTML file encoded in UTF-16BE (I don’t know why… it just is). It contains a text node with the content “_7_K_L_V” (hex: 0037 004b 004c 0056). The browser loads and properly renders this as the word “This”.
2. I am getting the user’s selection of the word “This” that they see on the screen in two ways, but both result in the same thing:
   a. I have a javascript that runs in the HTMLViewer that uses the selection API to recover the text selection
   b. A simple copy/paste via the clipboard
3. Both return a UTF-8 version of the underlying string: “7KLV” (hex: 374b 4c56).
4. Starting with this string in Xojo, I can get back to UTF-16BE with ConvertEncoding (not sure if this will be necessary or important to the final solution, but I wanted to make sure I could recover exactly what was in the HTML file).

What I need to do now is take this string “7KLV” and convert it to “This” – I just have no idea what the encoding is. However the browser clearly does and is doing it – does anyone know what that is?

TimStreater · January 30, 2021, 5:26pm

OK so one word, supposed to be “This” is “7KLV” in the original file. What about all the rest of the original file? Does that look like garbage too?

By the way I’m sure that if BBEdit can read the html file properly and show its contents (and show the encoding at the bottom of the screen as UTF16BE), then you can tell it to save a copy as UTF8.

Matthew_Dinmore1 · January 30, 2021, 5:27pm

Unfortunately, I can’t – the original HTML file is private. What I can say about it is that the entire file is encoded in UTF-16BE, the header meta charset = utf-16be, and most of the HTML markup looks just fine in BBEdit. However, the text nodes are all in this strange encoding, even in the HTML file. The browser seems to know what to do with it, though. I’m just trying to figure out what it is doing so I can replicate it to convert “7KLV” --> “This”. The added wrinkle that had me stuck for a while is that copy/paste (and javascript) were giving me a UTF-8 version, which is the same (other than being 4 bytes vs 8) for “7KLV”, but there are other characters after that which are unprintable. However, ConvertEncoding to UTF-16BE gets me back to exactly what is in the raw HTML file on disk that the browser is reading, so if I could figure out what it is doing to get to “This”, I could replicate it.

Matthew_Dinmore1 · January 30, 2021, 5:28pm

The rest of the original file is “plaintext” – reads like normal HTML in BBEdit. It’s just the text nodes in it are in the strange encoding.

TimStreater · January 30, 2021, 5:30pm

What do you mean by text nodes? Things like <p></p> ? So you have (say) <p>7KLV</p> (or other html element) somewhere within the html file?

Does the html file contain any javascript that might be messing with DOM nodes after the file is loaded into the viewer?

Matthew_Dinmore1 · January 30, 2021, 5:33pm

I did the BBEdit save-as-UTF8 experiment. It still opens fine in Safari, but the text in BBEdit is still encoded. Interestingly, I think BBEdit was smart for me and actually changed the meta charset by itself to “utf-8”… I didn’t edit it, but it was changed when I saved.

Matthew_Dinmore1 · January 30, 2021, 5:43pm

Yes, the plain text between the tags is this gibberish. In this case, “7KLV” comes right after a <div> tag.

There’s a lot of javascript in there, so that’s a good question. This is the first UTF-16 file I’ve run into. Normally, they are UTF-8, and the text is plain. They are generated by a PDF-to-HTML conversion process (pdf2htmlEX); I do not know why this one came out in UTF16 – probably something with the source PDF file. Unfortunately, I also can’t change that.

Your observation about adding dec 29 is intriguing. I’m wondering how it would work for the unprintable characters. Adding the next three characters in the string, “This in”, we have:

UTF-16 source (hex): 0037 004b 004c 0056 e003 004c 0051
UTF-8 (from copy): 374b 4c56 ee80 834c 51

Again, ConvertEncoding will get me from the UTF-8 back to the UTF-16 just fine. I just need to be able to decode one of them back to plaintext.
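
As a sanity check on the two hex dumps above, this small Python sketch (illustration only, with hypothetical variable names) decodes each one and compares the resulting code points.

```python
utf16_hex = "0037 004b 004c 0056 e003 004c 0051"
utf8_hex = "374b 4c56 ee80 834c 51"

# Strip the spaces used for grouping, then decode each dump with its own encoding.
from_utf16 = bytes.fromhex(utf16_hex.replace(" ", "")).decode("utf-16-be")
from_utf8 = bytes.fromhex(utf8_hex.replace(" ", "")).decode("utf-8")

code_points = [hex(ord(ch)) for ch in from_utf16]
print(code_points)
print(from_utf16 == from_utf8)
```

Both dumps decode to the same seven code points, which is consistent with ConvertEncoding being a lossless round trip here.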

TimStreater · January 30, 2021, 5:52pm

Yes, the “in” part matches the 29 business, but e003 is odd. Adding 29 to that gets you e020 (the low byte becomes hex 20, decimal 32), and that low byte is a space. I’m just looking through the UTF8 table (not sure that e003 or e020 are valid UTF8).

… or ee80 either, but I’ll check.

EDIT: actually it’s ee 80 83 which is a valid UTF8 character, but is one in the “private use” area.

EDIT: OK belay my comment about the e003, I was looking in the wrong place.
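
The private-use observation can be confirmed with a short Python sketch (illustration only). The range constants are the Basic Multilingual Plane private use area. The idea of applying the offset to the low byte alone is an assumption drawn from this one sample.

```python
PUA_START, PUA_END = 0xE000, 0xF8FF  # BMP private use area

# The three UTF-8 bytes from the copied text decode to a single character.
ch = bytes.fromhex("ee8083").decode("utf-8")
code_point = ord(ch)
is_private_use = PUA_START <= code_point <= PUA_END

# Adding 29 to the whole code point stays inside the private use area...
shifted_code_point = code_point + 29
# ...whereas adding 29 to the low byte alone gives the ASCII space.
shifted_low_byte = (code_point & 0xFF) + 29

print(hex(code_point), is_private_use)
print(hex(shifted_code_point), hex(shifted_low_byte))
```

So the expected space only appears if the high byte e0 is dropped before the offset is applied. A single sample cannot tell whether that is a general rule for this file.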

Matthew_Dinmore1 · January 30, 2021, 5:57pm

Well, ConvertEncoding successfully takes the UTF-8 version here back to the UTF-16 original. So at least in whatever is going on through the browser and copy/paste, no data is being lost, which was a concern since it would make the original unrecoverable.