Extracting data from a web page?

I need to pull a number of pieces of information, some of them from tables, from a web page. Any given piece may or may not be present.

This is from a free, publicly accessible chemical database that offers no way of getting data back except as the HTML used to display the page… and I need to pull out information for a few thousand compounds… so it needs to be done in code… not manually using a browser and copy and paste…

But the string I get for the HTML of the page is such a mess that there is no obvious way to pull out what I need… It would be a different story if the text were line by line in the same order as on the displayed webpage…

Does anyone have a suggestion of how I might approach this?

Is there a free command line tool I can call from an Xojo app to help with this?


Perhaps try emailing them to see if there is some kind of API (they use something)… Otherwise, it’s just stripping the bits you need out of the HTML…

If there is no API, you may have to use Tidy from the MBS plugin to parse the HTML.

I do that for an image from a web page.

I checked what surrounds that image in the HTML code, then I fine-tuned the search (InStr, not even RegEx) and computed the image URL.
Then I download it.

I get errors sometimes, but I feel the errors come from the web server, since the image URL is correct (I put it in the Clipboard first for debugging, then I keep it there in case of error, so I can fetch the image manually).

It is doable only if you can search blindly for the HTML around that image (in your case, around the data you want) and the data is always at the same place in the page.
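The marker-search approach described above can be sketched roughly like this. It is written in Python rather than Xojo, purely as an illustration; the same logic should map onto a pair of InStr calls. The function name `extract_between` and the sample HTML are made up for the example and do not come from any real site.

```python
def extract_between(html, start_marker, end_marker, start_at=0):
    """Return the text between two markers, or None if either is missing."""
    begin = html.find(start_marker, start_at)
    if begin == -1:
        return None  # the data is not present on this page
    begin += len(start_marker)
    end = html.find(end_marker, begin)
    if end == -1:
        return None
    return html[begin:end]


# Hypothetical page fragment, invented for this example
sample = '<div class="structure"><img src="/images/cmpd123.png" alt="structure"></div>'

image_path = extract_between(sample, '<div class="structure"><img src="', '"')
missing = extract_between(sample, '<table id="properties">', '</table>')
```

Returning None when a marker is not found is one way to cope with pieces of data that may or may not be present. The approach breaks as soon as the site changes the text used as a marker.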

If the data are in an HTML <table>, you only have to search for the table, then fine-tune the search with further criteria.
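For the table case, here is a minimal sketch using Python's standard `html.parser` module, which tolerates sloppy HTML better than a strict XML parser. The class `TableExtractor`, the helper `lookup`, and the sample table are all invented for illustration; the real page will certainly be laid out differently, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every table as a list of rows, each row a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None
        self._cell = None

    def _flush_cell(self):
        # Store the cell collected so far (also covers unclosed <td> tags)
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._flush_cell()
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag == "tr" and self._row is not None:
            self._flush_cell()
            self.tables[-1].append(self._row)
            self._row = None


def lookup(rows, label):
    """Return the value next to a label, or None if the label is absent."""
    for row in rows:
        if len(row) >= 2 and row[0] == label:
            return row[1]
    return None


# Hypothetical property table, invented for this example
sample = """
<table>
  <tr><th>Property</th><th>Value</th></tr>
  <tr><td>Melting point</td><td>5.5 &deg;C</td></tr>
  <tr><td>Formula</td><td>C<sub>6</sub>H<sub>6</sub></td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(sample)
rows = parser.tables[0]
melting_point = lookup(rows, "Melting point")
boiling_point = lookup(rows, "Boiling point")  # not in the table
```

Looking values up by their label rather than by row position means a missing row simply yields None instead of shifting every later value.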

For the “unreadable” (or so) html data:

Get that HTML (cmd-U or ctrl+U) in Firefox, then copy and paste it into a text editor and add a Return after the <br> and </p> tags. Things then start to become readable.
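Adding those Returns can also be done in code instead of by hand. A small sketch in Python, assuming the only tags you care about are <br> (in any of its spellings) and </p>; the function name `add_line_breaks` and the sample fragment are made up for the example.

```python
import re


def add_line_breaks(html):
    """Insert a newline after each <br>, <br/>, <br /> or </p> tag
    that is not already followed by one."""
    return re.sub(r"(<br\s*/?>|</p>)(?!\r?\n)", r"\1\n", html, flags=re.IGNORECASE)


# Hypothetical fragment where everything runs together on one line
sample = "<p>Name: benzene<br>Formula: C6H6</p><p>State: liquid</p>"

readable = add_line_breaks(sample)
lines = readable.splitlines()
```

This only makes the source easier to read while working out where the data sits; it does not change what a browser would display.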

Do that for a few pages to work out the schema?

Or, if this is a public web page, share its URL (and where in the HTML the data are)?
(So that people can eventually share some clues.)

A link or sample of the HTML might be helpful.

Well, HTML should be valid XML, so you can easily use the built-in XML classes.

Assuming it is well-formed HTML, which we all know is rarely the case.

And we know that the Xojo XML classes are no fun if there are errors in the source XML…

Yes, a link or sample would be helpful. If it’s sufficiently representative, I might be able to come up with a regular expression for you that will work in a pinch.
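Without a sample, any pattern is guesswork, but here is the general shape such a regular expression might take, sketched in Python. It assumes the page stores each value in a two-cell row of the form label cell, value cell, with no nested tags inside the cells; the pattern name `ROW_PATTERN`, the function `extract_pairs`, and the sample rows are all hypothetical.

```python
import re

# Matches two adjacent cells: <td ...>label</td><td ...>value</td>
ROW_PATTERN = re.compile(
    r"<td[^>]*>\s*(?P<label>[^<]+?)\s*</td>\s*<td[^>]*>\s*(?P<value>[^<]*?)\s*</td>",
    re.IGNORECASE,
)


def extract_pairs(html):
    """Return a dict mapping each label cell to the value cell beside it."""
    return {m.group("label"): m.group("value") for m in ROW_PATTERN.finditer(html)}


# Hypothetical rows, invented for this example
sample = (
    '<tr><td class="k">Molecular weight</td><td>78.11</td></tr>'
    "<tr><td>Density</td><td> 0.876 </td></tr>"
)

pairs = extract_pairs(sample)
density = pairs.get("Density")
boiling_point = pairs.get("Boiling point")  # absent, so None
```

A pattern like this misaligns if a row has three or more cells or if a cell contains other tags, so it would need adjusting against real pages.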

RegEx is the lazy person’s way out :stuck_out_tongue:

Not when it’s the correct tool for the job, it isn’t.

Thanks all…

It was rather late last night and I forgot to DefineEncoding on the returned text, so the EOLs were not breaking lines and it all ran together, looking like an unintelligible mess… I kind of freaked out when I saw that!
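For anyone hitting the same thing outside Xojo, a rough Python analogue of that fix is to decode the downloaded bytes with an explicit encoding and normalise the line endings before looking at the text. This is only a sketch: the function name `decode_and_split` is made up, and UTF-8 is an assumption about what the server sends.

```python
def decode_and_split(raw_bytes, encoding="utf-8"):
    """Decode downloaded bytes and split them into lines,
    treating CRLF, lone CR and LF all as line endings."""
    text = raw_bytes.decode(encoding, errors="replace")
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.split("\n")


# Hypothetical response body with mixed line endings
raw = b"<html>\r\n<body>\r<p>caf\xc3\xa9</p>\n</body>"

lines = decode_and_split(raw)
```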

I think I can deal with it now… If not, I’ll be back asking more specific questions.

  • karen