Encoding issue with text from HTML

TimStreater · January 30, 2021, 5:58pm

Meanwhile, the JavaScript. Can you make any sense of it?

TimStreater · January 30, 2021, 6:00pm

(The e003 is the Unicode code point that the UTF-8 byte sequence ee 80 83 encodes.)
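
For anyone who wants to check that correspondence, here is a minimal Python sketch. It only confirms the byte-to-code-point relationship stated above; the variable names are illustrative and nothing here comes from the original page. U+E003 falls in the Unicode Private Use Area, so the character has no standard meaning and whatever is drawn for it depends on the font.

```python
# Verify that U+E003 and the UTF-8 bytes ee 80 83 are the same character.
code_point = 0xE003

# Code point -> UTF-8 bytes
encoded = chr(code_point).encode("utf-8")
print(encoded.hex(" "))          # ee 80 83

# UTF-8 bytes -> code point
decoded = bytes.fromhex("ee8083").decode("utf-8")
print(f"U+{ord(decoded):04X}")   # U+E003

# U+E000..U+F8FF is the Private Use Area of the Basic Multilingual Plane
print(0xE000 <= code_point <= 0xF8FF)  # True
```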

Matthew_Dinmore1 · January 30, 2021, 6:02pm

It’s minified. I looked to see if there was anything obvious being applied. I also looked in the source for pdf2htmlEX, but I actually think all of this may happen in one of its external dependencies, and I haven’t traced it through.

Here’s the thing: the “7KLV” is the content of the text node in the DOM at the time of the selection. So, if some JavaScript were transforming the text, I’d expect to see “This” in the node after the transformation. It has to be the browser understanding what that encoding is and then properly rendering the text.

TimStreater · January 30, 2021, 6:04pm

Hmm. I’m out of ideas at this point.

Matthew_Dinmore1 · January 30, 2021, 6:07pm

Thanks. I’ll play with the “add 29” idea and see what happens. Maybe “good enough.” There’s got to be something in the HTML encoding standard about this, if the browser knows what to do. Will try pursuing that, as well.

Matthew_Dinmore1 · January 30, 2021, 7:48pm

Well, not sure why, but the “add 29” solution largely works. Convert to UTF-16, then add 29 to the second byte of each two-byte unit (ignoring the first), and concatenate the results into a new string. There are some issues with quotes, which I’m going to assume are the typical smart-quote problems.
I have no idea why this works, but thanks for that insight!
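
A rough Python sketch of the “add 29” steps, for illustration only. The function name `shift_text`, the choice of big-endian UTF-16 (so that the “second byte” is the low byte), and the wrap-around at 256 are all assumptions; the thread does not say which byte order or overflow handling was actually used.

```python
OFFSET = 29  # empirical shift reported in this thread


def shift_text(garbled: str, offset: int = OFFSET) -> str:
    """Add `offset` to the low byte of each UTF-16 code unit, dropping the high byte."""
    # Big-endian UTF-16 (an assumption): each BMP character becomes [high, low]
    raw = garbled.encode("utf-16-be")
    # Keep only every second byte, i.e. the low byte of each code unit
    low_bytes = raw[1::2]
    # Wrapping at 256 is an assumption to keep each result within one byte
    return "".join(chr((b + offset) % 256) for b in low_bytes)


print(shift_text("7KLV"))  # This
```

For the sample in this thread, "7" (0x37) plus 29 gives 0x54 ("T"), and the other three letters shift the same way. Characters that do not follow the fixed offset, such as the quotes mentioned above, will come out wrong with this approach.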

Matthew_Dinmore1 · February 1, 2021, 1:37pm

Mystery solved (for posterity): the HTML file uses an embedded web font that has an alternate character mapping for some reason. It has nothing to do with the fact that the file was also UTF-16 encoded. Fun times.
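
Given that explanation, a lookup table is probably a safer way to recover the text than a fixed offset, since a font's custom mapping need not be a uniform shift. The sketch below is hypothetical: `GLYPH_MAP` contains only the four pairs visible in this thread (“7KLV” to “This”), and a real table would have to be built from the embedded font itself, which is not shown here.

```python
# Hypothetical table: only the four pairs seen in this thread ("7KLV" -> "This").
GLYPH_MAP = {"7": "T", "K": "h", "L": "i", "V": "s"}


def decode_with_table(garbled, table=GLYPH_MAP):
    """Replace each character found in `table`; leave all others unchanged."""
    return garbled.translate(str.maketrans(table))


print(decode_with_table("7KLV"))  # This
```

Characters that are missing from the table pass through untouched, which makes the gaps in the mapping easy to spot in the output.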