Extracting data from a web page?

KarenA · December 7, 2017, 4:46am

I need to pull a number of pieces of information that may or may not be present, some of them from tables, from a web page.

This is from a free, publicly accessible chemical database that does not have a way of getting data back except as the HTML that displays the page… and I need to pull out information for a few thousand compounds… so it needs to be done in code… not manually using a browser and copy and paste…

But the string I get for the HTML of the page is such a mess that there is no obvious way to pull out what I need… It would be a different story if the text was line by line in the same order as on the displayed web page…

Does anyone have a suggestion of how I might approach this?

Is there a free command-line tool I can call from a Xojo app to help with this?

Thanks,
Karen

shao_sean · December 7, 2017, 5:53am

Perhaps try emailing them to see if there is some kind of API (they use something)… Otherwise, it’s just stripping the bits you need out of the HTML…

Beatrix_Willius · December 7, 2017, 7:07am

If there is no API, you may have to use Tidy from the MBS plugin for parsing the HTML.

Emile_Schwarz · December 7, 2017, 8:11am

I do that for an image from a web page.

I check what surrounds that image in the HTML code, then I fine-tune the search (InStr, not even RegEx) and compute the image URL.
Then I download it.

I get errors sometimes, but I feel the error comes from the web server since the image URL is correct (I put it in the Clipboard for debugging first, then I keep it there in case of error, so I can fetch it manually).

It is doable only if you can do a blind search for the HTML code around that image (in your case, around your data) and the data is always in the same place.

If the data are in an HTML <table>, you only have to search for the table, then fine-tune the search with further criteria.
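A rough sketch of that narrowing-down search, written in Python rather than Xojo purely for illustration (the same idea should carry over to InStr and Mid). The `text_between` helper, the sample markup and the "Melting point" label are all made up here; the real page will look different.

```python
def text_between(html, start_marker, end_marker, start_at=0):
    """Return the text between two markers, or None if either is missing."""
    begin = html.find(start_marker, start_at)
    if begin == -1:
        return None
    begin += len(start_marker)
    end = html.find(end_marker, begin)
    if end == -1:
        return None
    return html[begin:end].strip()


# Hypothetical page fragment; the real markup will differ.
sample = '<table id="props"><tr><td>Melting point</td><td> 80.1 </td></tr></table>'

# Narrow down step by step: the table first, then the label, then the next cell.
table = text_between(sample, '<table id="props">', '</table>')
value = None
if table is not None:
    label_pos = table.find('Melting point')
    if label_pos != -1:
        value = text_between(table, '<td>', '</td>', label_pos)
```

Because the helper returns None when a marker is not found, a field that is absent on some pages can simply be skipped instead of raising an error.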

For the unreadable (or so) html data:

View the page source in Firefox (Cmd-U or Ctrl+U), then copy and paste it into a text editor and add a Return after the <br> and </p> tags. Things then start to become readable.

Do that for a few pages to work out a schema?

Or, if this is a public web page, share its URL (and where in the HTML the data are)?
(That way, some people may be able to share clues.)

Markus_Winter · December 7, 2017, 8:24am

A link or sample of the HTML might be helpful.

Mathias_Maes1 · December 7, 2017, 2:41pm

Well, HTML should be valid XML, so you can easily use the built-in XML classes.

DaveS · December 7, 2017, 2:50pm

Assuming it is well-formed HTML, which we all know is rarely the case.

Jean-Yves_Pochez · December 7, 2017, 2:59pm

And we know that the Xojo XML classes are not fun if there are errors in the source XML…
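To illustrate the point about sloppy markup, here is a small Python sketch (not Xojo, and not tied to the actual site): a strict XML parser rejects a fragment with unclosed tags, while a tolerant HTML parser still yields the table cells. The fragment and the `CellCollector` class are invented for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical fragment with typical HTML sloppiness: unclosed <td> and <br>.
sloppy = '<table><tr><td>Formula<td>C6H6<br></tr></table>'

# A strict XML parser gives up on this input.
try:
    ET.fromstring(sloppy)
    xml_ok = True
except ET.ParseError:
    xml_ok = False


class CellCollector(HTMLParser):
    """Collect the text of each <td>, tolerating missing closing tags."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            # A new <td> implicitly ends the previous one.
            self.cells.append('')
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ('td', 'tr', 'table'):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


collector = CellCollector()
collector.feed(sloppy)
collector.close()
cells = [c.strip() for c in collector.cells]
```

In Xojo the equivalent would probably be to clean the page up first (for example with Tidy, as mentioned above) before handing it to the XML classes.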

Kem_Tekinay · December 7, 2017, 3:01pm

Yes, a link or sample would be helpful. If it’s sufficiently representative, I might be able to come up with a regular expression for you that will work in a pinch.

shao_sean · December 7, 2017, 11:33pm

RegEx is the lazy person’s way out

DaveS · December 7, 2017, 11:35pm

Not when it’s the correct tool for the job, it isn’t.

KarenA · December 7, 2017, 11:39pm

Thanks all…

It was rather late last night and I forgot to DefineEncoding on the returned text, so the EOLs were not breaking lines and it all ran together, looking like an unintelligible mess… I kind of freaked out when I saw that!
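For anyone hitting the same thing outside Xojo, a rough Python analogue of the fix: decode the raw bytes with an explicit encoding, then split on line endings. The sample bytes are made up, and UTF-8 is an assumption; the real encoding should come from the response headers or the page's meta tag.

```python
# Hypothetical raw response body, as bytes, with Windows-style line endings.
raw = 'mp 5.5 \u00b0C\r\n<td>C6H6</td>\r\n'.encode('utf-8')

# Assumption: the server sends UTF-8. Decoding with the wrong encoding
# would garble non-ASCII characters such as the degree sign.
text = raw.decode('utf-8')

# splitlines() treats \r\n, \n and \r alike, so the page breaks into lines
# regardless of which EOL convention the server used.
lines = text.splitlines()
```

Once the text is split into lines, searching it line by line becomes much more manageable.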

I think I can deal with it now… If not, I’ll be back asking more specific questions.

karen