I need to pull a number of pieces of information from a web page. Some of it is in tables, and any given piece may or may not be present.
This is from a free, publicly accessible chemical database that has no way of returning data other than the HTML used to display the page. I need to pull out information for a few thousand compounds, so it has to be done in code, not manually with a browser and copy and paste.
But the HTML string I get for the page is such a mess that there is no obvious way to pull out what I need. It would be a different story if the text were line by line in the same order as on the displayed web page…
Does anyone have a suggestion of how I might approach this?
Is there a free command line tool I can call from an Xojo app to help with this?
I check what is around that image (in the HTML code), then I fine-tune the search (InStr, not even RegEx) and compute the image URL.
Then I download it.
I get errors sometimes, but I think they come from the web server, since the image URL is correct (I put it on the Clipboard for debugging first, then keep it there in case of error, so I can fetch it manually).
This only works if you can search blindly for the HTML around that image (in your case, around the data you want) and the data is always in the same place.
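As an illustration of the plain substring search described above, here is a minimal sketch in Python rather than Xojo (str.find plays the role of InStr). The helper name extract_between, the sample HTML, and the example.org address are all made up for the example; the real markers depend on what the page source looks like.

```python
from urllib.parse import urljoin


def extract_between(html, start_marker, end_marker, start_at=0):
    """Return the text between two markers, or None if either is missing."""
    start = html.find(start_marker, start_at)
    if start == -1:
        return None
    start += len(start_marker)
    end = html.find(end_marker, start)
    if end == -1:
        return None
    return html[start:end]


# Hypothetical fragment standing in for the real page source.
sample = '<div class="structure"><img src="/images/12345.png" alt="structure"></div>'

# Find the image path by the text that surrounds it.
image_path = extract_between(sample, '<img src="', '"')

# Turn the relative path into a full URL (the base address is an assumption).
image_url = urljoin("https://example.org/compound/12345", image_path)

# A marker that is not on the page gives None instead of raising an error.
missing = extract_between(sample, "<table", ">")
```

Returning None for a missing marker is what makes this usable when a piece of data "may or may not be present" on a given page.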
If the data is in an HTML <table>, you only have to search for the table, then fine-tune the search with further criteria.
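If plain searching gets awkward for tables, a small parser is another option. The sketch below uses Python's standard html.parser module to collect the cells of every table row; the class name TableRows, the sample table, and the property names are invented for the example, and it assumes simple tables without nesting.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical table standing in for the real page source.
sample = """<table>
<tr><th>Property</th><th>Value</th></tr>
<tr><td>Molecular weight</td><td>180.16</td></tr>
<tr><td>Melting point</td><td>135</td></tr>
</table>"""

parser = TableRows()
parser.feed(sample)
parser.close()

# Skip the header row and build a name -> value lookup.
properties = {row[0]: row[1] for row in parser.rows[1:] if len(row) == 2}

# Missing entries come back as None rather than raising an error.
boiling_point = properties.get("Boiling point")
```

A script like this could be run from a Xojo app through a Shell call, assuming Python is installed on the machine.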
For the unreadable (or nearly so) HTML:
View the page source in Firefox (cmd-U or ctrl+U) *, then copy and paste it into a text editor and add a Return after each <br> and </p> tag. Things should then start to be readable.
Do that for a few pages to work out a pattern?
Or, if this is a public web page, share its URL (and where in the HTML the data is)?
(so some people can possibly share clues)
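The same clean-up can be done in code instead of by hand in a text editor. This is a minimal Python sketch, assuming the only goal is to make the source easier to read; the function name add_line_breaks is made up for the example.

```python
import re


def add_line_breaks(html):
    """Insert a newline after every <br>, <br/> and </p> tag that lacks one."""
    return re.sub(r"(<br\s*/?>|</p>)(?!\n)", r"\1\n", html, flags=re.IGNORECASE)


# Hypothetical run-together source.
sample = "<p>a<br>b</p><p>c</p>"
readable = add_line_breaks(sample)
```

The negative lookahead (?!\n) skips tags that are already followed by a line break, so running the function twice does not add blank lines.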
Yes, a link or sample would be helpful. If it’s sufficiently representative, I might be able to come up with a regular expression for you that will work in a pinch.
It was rather late last night and I forgot to DefineEncoding on the returned text, so the EOLs were not breaking lines and it all ran together, looking like an unintelligible mess… I kind of freaked out when I saw that!
I think I can deal with it now… If not, I’ll be back asking more specific questions.
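For anyone hitting the same run-together text outside Xojo, the rough equivalent in Python is to decode the downloaded bytes with an explicit encoding and then split on line endings. This is only a sketch: UTF-8 is an assumption about the server, and the function name to_lines and the sample bytes are made up.

```python
def to_lines(raw, encoding="utf-8"):
    """Decode downloaded bytes and split them into lines."""
    # errors="replace" keeps going if a byte does not fit the encoding.
    text = raw.decode(encoding, errors="replace")
    # splitlines() treats \r\n, \n and \r (among other separators) as line boundaries.
    return text.splitlines()


# Hypothetical response body with mixed line endings.
raw = b"<html>\r\n<body>\n<p>Aspirin</p>\r</body>\r\n</html>"
lines = to_lines(raw)
```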