Getting the contents of HTMLViewer

As part of my (seemingly never-ending) quest to create a cross-platform Markdown parser for Xojo (iOS module done - GitHub) I need to access the contents of an HTMLViewer in desktop projects. Is this possible? It seems like it should be, but I can’t see a property or method in the docs to let me do this. I don’t want to use a plugin. I don’t mind declares, but it must work on macOS, Windows and Linux.

Why am I trying to get the contents of an HTMLViewer, you ask? Well, I am trying to use JavaScript in an HTMLViewer to parse Markdown. I can do this (I have it working excellently on iOS), but I need access to the source code of the HTMLViewer after the JavaScript has run.

Any help is appreciated.

Thanks,

Do you want to use the MBS Plugin for desktop?
First, you will notice that you need different code for:

  • Windows with IE
  • Windows with Chrome
  • Linux with WebKit
  • Mac with WebKit

The current method of getting data out of an HTML Viewer is to push the data through window.status so you can capture it in the StatusChanged event. If MBS is an option I would really recommend it. I went through a lot of trouble to make sure HTML Edit can get the data out reliably on both Mac and Windows (more an issue on Windows, really).

According to StackOverflow you should be able to get the source out quite easily. The only thing you won’t get is the doctype apparently.

window.status = document.documentElement.innerHTML; and then capture the result in StatusChanged.
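As a rough sketch of that one-liner wrapped in a function: the name pushSourceToStatus is made up for illustration, and the window and document objects are passed in as parameters only so the function can be tried outside a browser. Nothing here is taken from an actual project in this thread.

```javascript
// Hypothetical helper (not from the thread): copy the rendered page source
// into window.status so the host's StatusChanged event can pick it up.
// `win` and `doc` are parameters so the function can be exercised outside a
// browser; inside the HTMLViewer you would pass `window` and `document`.
function pushSourceToStatus(win, doc) {
  // innerHTML of the root element: everything inside <html>, without the doctype.
  var source = doc.documentElement.innerHTML;
  win.status = source;
  return source;
}
```

Inside the page this would be called as pushSourceToStatus(window, document) once the script that modifies the page has finished.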

However, if you’re using JavaScript to parse Markdown, you’re just going to get HTML elements back.
You may want to try the headless browser approach as well. You can get results back from it through a shell without hacking the HTML Viewer.

I’m still really surprised that Markdown is hard to parse, it seems so simple™
(I don’t disbelieve you, I’m just legitimately surprised.)

@Christian Schmitz As I said in the question, I don’t want to use plugins. I’m trying to do this for two reasons:

1). It should be doable in Xojo
2). I want to open source the module and I can’t do that idealistically if the code requires a proprietary license.

I have no issues with your plugins Christian - mostly they are excellent. In fact, I own a full license to them already. I’m also aware that you have a Markdown plugin but that has a couple of issues:

1). I don’t know what renderer you’ve ported but it doesn’t correctly parse John Gruber’s seminal syntax text (link) and it also doesn’t support various newer elements of Markdown such as code fences, tables or checklists
2). See point (2) above.

@Tim Parnell The hack to get the source code via the StatusChanged event seems too convoluted. Not sure the approach would work as the value I’m interested in (the source code) would be available asynchronously in the StatusChanged() event making it difficult for me to utilise. Essentially I want to be able to grab the contents synchronously like:

theContents = myHTMLViewer.contents

Is this doable with declares? It seems bananas that it’s not possible.

Regarding Markdown parsing - I’m really starting to get fed up of it! Here’s a brief summary of what I’ve been doing and why it’s hard:

1). I need a truly cross-platform (desktop and iOS) way to parse Markdown. This rules out plugins because they don’t work on iOS
2). Even if iOS wasn’t an issue (technically it’s not anymore because I’ve published an open source working fast Markdown parser for it that’s essentially a wrapper for remarkable.js) I can’t use the MBS plugins because of the reasons cited above as well as the fact that they don’t offer feature parity with the iOS implementation.
3). I tried (and have had moderate success with) writing my own parser in native Xojo. It made heavy use of RegEx. Trouble is it’s slow, incomplete and I’ve reached an impasse.
4). I’ve written a wrapper for Pandoc, which is a great cross-platform (desktop) command line utility that parses Markdown. The problem with this is that it’s not as fast as JavaScript on iOS (100 ms vs 10 ms) and it requires bundling the Pandoc binary, which is > 75 MB in size, with the module
5). I’m trying not to re-invent the wheel. There are loads of robust 3rd party libraries out there (particularly in javascript) that handle every edge case. Sadly, the only one currently “ported” to Xojo is the MBS one which isn’t great.

You mentioned the HTMLEdit control. How does that get the contents of an HTML viewer?

I’ve read this topic a few times, and am confused as to the intent.
If you wish to get the “HTML” behind a webpage to perform some processing on it, why not bypass the HTMLViewer altogether and just use a socket to download the page into a string variable directly?

Or am I missing something else here?

I already have a Markdown engine running in an HTMLViewer for a client.
It works without problems, but it uses plugin functions to put the source text in a form field, call JavaScript to convert it, and grab the result from another form field. We use text areas in the form to transfer multi-line text.
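The JavaScript side of that form-field hand-off might look roughly like the sketch below. The element ids md-source and md-result and the function name convertFormFields are hypothetical, and the converter is passed in as a plain function, so this is not the code from the client project mentioned above.

```javascript
// Hypothetical sketch of the textarea hand-off: the host writes Markdown into
// one textarea, calls this function, then reads the HTML from another.
// The ids "md-source" and "md-result" are assumptions for illustration.
function convertFormFields(doc, convert) {
  var input = doc.getElementById("md-source");   // filled in by the host
  var output = doc.getElementById("md-result");  // read back by the host
  // `convert` is any function taking Markdown text and returning HTML text.
  output.value = convert(input.value);
  return output.value;
}
```

Textareas keep line breaks intact, which is presumably why they are used for multi-line text here.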

[quote=307203:@Garry Pettet]@Tim Parnell The hack to get the source code via the StatusChanged event seems too convoluted. Not sure the approach would work as the value I’m interested in (the source code) would be available asynchronously in the StatusChanged() event making it difficult for me to utilise. Essentially I want to be able to grab the contents synchronously like:

theContents = myHTMLViewer.contents

Is this doable with declares? It seems bananas that it’s not possible.[/quote]
I have just a little experience with HTML Viewer hacks. There are a few feature requests in Feedback for things like access to the page source, JavaScript return values, and updates to the CEF, but we’re not holding our collective breath.

It is possible with declares to get the result of a JavaScript call, and the Mac declare is simple. @shao sean was helping me with the Windows declare, but we were both stuck on it. Should a way to get the return value of JavaScript become available, you’d be able to grab the source of the page with a method similar to the one that uses StatusChanged. Personally, however, I’m much less interested in an HTMLViewer.Contents property.

Hours of fiddling around with StatusChanged and dealing with the shortcomings, er, intricacies of Windows.

As for using RegEx to parse it, I don’t understand why it would be slow. I don’t use Markdown regularly, so I frequently check this reference, which makes it seem like there isn’t really too much to it. This is where my confusion comes from. I haven’t researched it or anything; at a quick glance it just feels like it shouldn’t be hard.

It would seem that the most efficient way for a cross platform solution is to do just the same as what you do in iOS : use HTMLViewer.

On macOS, it is possible through declares to do exactly the same as the iOS declare.

See Shao Sean’s contribution here: https://forum.xojo.com/14481-should-changing-an-htmlviewer-s-window-status-work-in-windows/0

On Windows, you must use StatusChanged, as described by Tim, or TitleChanged. As far as I know there is no equivalent to evaluateJavaScript available for Windows HTMLViewer. Frankly, it is not as convoluted as you think.
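One way to keep the asynchronous StatusChanged route manageable is to tag each message so the receiving code can tell a result apart from ordinary status text and match it to a request. The sketch below is only an illustration: the MDRESULT| prefix, the request id, and both function names are assumptions, and in a real project the decoding half would be written in Xojo inside StatusChanged rather than in JavaScript.

```javascript
// Assumed message format: PREFIX + requestId + "|" + URI-encoded payload.
// The prefix and function names are hypothetical, chosen for illustration.
var STATUS_PREFIX = "MDRESULT|";

// Page side: build the string that would be assigned to window.status.
// requestId must not contain "|"; the payload is URI-encoded, so any "|"
// or line break inside it is escaped.
function encodeStatusMessage(requestId, payload) {
  return STATUS_PREFIX + requestId + "|" + encodeURIComponent(payload);
}

// Receiver side (shown in JavaScript only to document the format).
// Returns null for status text that is not one of our messages.
function decodeStatusMessage(status) {
  if (status.indexOf(STATUS_PREFIX) !== 0) return null;
  var rest = status.slice(STATUS_PREFIX.length);
  var sep = rest.indexOf("|");
  if (sep < 0) return null;
  return {
    requestId: rest.slice(0, sep),
    payload: decodeURIComponent(rest.slice(sep + 1))
  };
}
```

In the page, the result would be sent with window.status = encodeStatusMessage(id, html). Encoding the payload is a precaution against line breaks and special characters in the status text, not something the thread confirms is required.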

He’s using a local copy of the Javascript, and it’s not a static page.

I advise using StatusChanged over TitleChanged

OK. I think I’ve got my head around how to use JS to push the contents of a div into window.status and then retrieve it. Trouble is, I can’t get it to work.

Here’s a link to a completely stripped down example project - link.

The project has two modules: (1) Shakespeare (the module for using JS to parse Markdown) and (2) MarkdownKit (my Xojo native attempt at a Markdown parser). Running the app will let you try to parse with either Shakespeare or MarkDownKit.

Basically, Shakespeare creates an instance of my HTMLViewer subclass (BetterHTMLViewer) and loads a string constant containing a bare bones HTML document, the embedded remarkable.js script and a custom JS function to call remarkable.js. The constant is stored in Shakespeare.MARKDOWN_CONVERTER_HTML.

You’ll notice that no answer is given when using Shakespeare. In fact, the StatusChanged event never seems to fire.

What am I doing wrong here?

Tested adding actual HTML elements to the Shakespeare constant.
It’s not loading up, so there’s a bug in your code somewhere.

Still looking, just thought you might like a status update.

@Tim Parnell Thanks a lot for taking a look. Much appreciated. Dumping the contents of the constant into a .html file and visiting it in a browser seems to work so I can’t see the error. Then again, I’ve been staring at this project too long!

Oh. I see.
You can’t use HTML Viewer to render anything without actually displaying it.

I tested that when people wanted to print HTML Edit contents without MBS. We ended up gathering the contents, opening a new HTML Viewer and printing from it.

Aaaargh!!

That’s different from iOS, it would seem. The same trick works on iOS by only instantiating iOSHTMLViewer in code. I hadn’t realised that the desktop HTMLViewer must be instantiated in the IDE.

Hmmm. I feel like I’m SOL with this approach.

@Christian Schmitz Fancy wrapping Pandoc in a plugin? The Markdown engine is way better than the one you currently have (no offence) plus as a bonus, it offers a bunch of other conversion formats…

@Christian Schmitz Failing that, perhaps add Win/Linux support for your Javascript plugin?

I just looked at the iOS project.

If you use the macOS Method I linked to above, you can do exactly the same.

In Windows, I think all you need to do is to replace the JavaScript command that returns the value.

return xxx;

by

window.status = xxx;

I could not really find the proper return in the huge ball of spaghetti that is the JavaScript with no line break, but that is all that is needed.

It could even be conditional based on platform.
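A minimal sketch of that conditional swap, assuming the host tells the script which route to use. The function name deliverResult and the useStatus flag are hypothetical; how the flag gets set (for example, by the Xojo code that builds the JavaScript call) is left open.

```javascript
// Hypothetical helper: hand the converted HTML back to the host either as a
// normal return value (where the host can read JavaScript return values) or
// through window.status (where it has to listen for StatusChanged).
// `win` is passed in so the function can be tried outside a browser.
function deliverResult(html, win, useStatus) {
  if (useStatus) {
    win.status = html;   // picked up by the StatusChanged event
    return undefined;    // nothing meaningful to return on this route
  }
  return html;           // read directly by the caller
}
```

The conversion function would then end with return deliverResult(xxx, window, useStatus); in place of a plain return xxx;.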

I could make an example that uses the JavaScript engine.

Pandoc is GPL, so I can’t use it in a plugin.
(And neither can you, in a commercial app!)