Encoding issue with text from HTML

Meanwhile, the JavaScript. Can you make any sense of it?

(The e003 is the Unicode code point, U+E003, that corresponds to the UTF-8 bytes ee 80 83.)
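
A quick way to check that correspondence, as a small Python sketch (the variable names are just for illustration). Note that U+E003 falls in the Unicode Private Use Area (U+E000 to U+F8FF), so it has no standard meaning on its own:

```python
# Confirm that code point U+E003 encodes to the UTF-8 bytes ee 80 83.
ch = "\ue003"
utf8_bytes = ch.encode("utf-8")

# Print the code point and its UTF-8 bytes as space-separated hex.
print(f"U+{ord(ch):04X}", utf8_bytes.hex(" "))

# Private Use Area check: U+E000 through U+F8FF.
in_private_use_area = 0xE000 <= ord(ch) <= 0xF8FF
print(in_private_use_area)
```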

It’s minified. I looked to see if there was anything obvious being applied. I also looked in the source for pdf2htmlEX, but I actually think all of this may happen in one of its external dependencies, and I haven’t traced it through.

Here’s the thing: the “7KLV” is the content of the text node in the DOM at the time of the selection. So, if there were JavaScript transforming the text, I’d expect to see “This” in the node after the transformation. It has to be the browser understanding what that encoding is and then properly rendering the text.

Hmm. I’m out of ideas at this point.

Thanks. I’ll play with the “add 29” idea and see what happens. Maybe “good enough.” There’s got to be something in the HTML encoding standard about this, if the browser knows what to do. Will try pursuing that, as well.

Well, not sure why, but the “add 29” solution largely works. Convert to UTF-16, then add 29 to the second byte of each two-byte unit (ignore the first), and concatenate the results into a new string. There are some issues with quotes, which I’m going to assume are typical smart-quote problems.
I have no idea why this works, but thanks for that insight!
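
For anyone landing here later, here is a minimal Python sketch of the “add 29” workaround as described. It assumes big-endian UTF-16 (so the second byte of each unit is the low byte) and that every character fits in a single two-byte unit; the function name `add_offset` and the wrap-around at 256 are my own choices, not anything from pdf2htmlEX. It is a stopgap that happens to fit this file, not a general decoder.

```python
def add_offset(garbled: str, offset: int = 29) -> str:
    """Shift the low byte of each UTF-16 unit by `offset` (assumed workaround)."""
    # Big-endian UTF-16: bytes come in (high, low) pairs.
    raw = garbled.encode("utf-16-be")
    # Keep only the second byte of each pair, ignoring the first.
    low_bytes = raw[1::2]
    # Add the offset to each byte and join the results into a new string.
    # The modulo keeps the value in byte range; that wrap-around is an assumption.
    return "".join(chr((b + offset) % 256) for b in low_bytes)


decoded = add_offset("7KLV")
print(decoded)
```

With the text node content from earlier in the thread, `add_offset("7KLV")` gives “This”, since each of those characters sits exactly 29 code points below the intended one. Characters that do not follow the same offset, such as the quotes, will come out wrong.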

Mystery solved (for posterity): the HTML file uses an embedded web font that has an alternate character mapping for some reason. It has nothing to do with the fact that the file was also UTF-16 encoded. Fun times.
