Screen scraping

Has anyone worked with Xojo and screen scraping? If so, was the functionality 100% Xojo, or did you use another piece of software and run an interface of sorts? I am looking to grab data from a website and bring it back into a Xojo database.


Do you mean extracting image URLs from a page and storing them in a database?
Or extracting text data from a page and storing it in a database?

Either way: online, or offline (from a saved HTML file)?

All are feasible, 100% Xojo.

Text extraction. Online from a browser window… an example would be sports scores from a sports webpage.

Anyone that can point me in a general direction or concept — that nudge would be appreciated.

What you’re asking about is a complicated subject, and it’s the weekend, so you’ll get fewer responses. It depends largely on the site/page and how the data is delivered. If they’re just straight HTML pages (with an htm or html extension), then you’d use an HTTPSocket.
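The HTTPSocket approach can be sketched in a few lines (classic Xojo API; the URL here is just a placeholder):

dim sock as new HTTPSocket
' Synchronous Get: blocks for up to 30 seconds and returns the raw HTML as a string
dim html as string = sock.Get("http://example.com/scores.html", 30)
if html <> "" then
  ' hand the string off to your parsing code here
end if

From there on it is just string work.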

If the page requires JavaScript, you’re better off to remotely control a browser.

That said, oftentimes the data is copyrighted or requires a license for its use, so you should find out whether there are any legal repercussions for doing this. Scraping is generally frowned upon, as you are basically stealing someone else’s info and bandwidth (and yes, many providers still charge for outgoing bandwidth). You should also look and see if the site has an API that you can call into, as that would be the simplest (and most legal) way to do this.

Thanks Greg. I will do some investigation before I proceed.


Besides what Greg wrote, you have to:
a. Download the HTML page (as text, not rendered)
b. Use RegEx (for example) to reach the text you are seeking *
But to reach the text you are seeking, you have to have a way to identify it in the page!

In other words, you might be seeking the text inside a tag such as <a name="a-tag-name">Text you are seeking</a>.

So, after checking the legal status and the presence (or absence) of an API to get the data: download the HTML text, observe it, and find tag(s) that let you locate where your data sits in the HTML text.
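Step (b) could look something like this with Xojo’s RegEx class (the search pattern is hypothetical; yours depends entirely on the page’s actual markup):

dim re as new RegEx
re.SearchPattern = "<a name=""score"">(.+?)</a>"  ' made-up tag, adjust for the real page
dim match as RegExMatch = re.Search(html)
while match <> nil
  dim value as string = match.SubExpressionString(1) ' the captured text
  ' store value in your database here
  match = re.Search ' continue from the previous match
wend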

I hope this is clear.

  • I used a simple set of InStr() commands to achieve my needs.
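That InStr() approach might look something like the sketch below (the marker strings are made up; use whatever uniquely brackets your data on the real page):

dim startTag as string = "<td class=""score"">" ' hypothetical marker
dim endTag as string = "</td>"
dim p as integer = InStr(html, startTag)
if p > 0 then
  dim s as integer = p + Len(startTag)           ' first character after the marker
  dim q as integer = InStr(s, html, endTag)      ' closing marker, searched from s
  dim value as string = Mid(html, s, q - s)      ' the text between the two markers
end if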

Thanks Emile. Step B is crystal clear to me…
I have some background with Perl and regex. It is step A that I need to muddle through. I assume we are talking about capturing the target page as a string, then whittling through it to reach the text I desire.

Legal status is marked as copyright, and there is no API. I pay for access to the site, and this information is for my own use. I don’t plan on making this activity operate on the level of spam. I’m just looking for a way to periodically refresh the data via script, rather than the manual copy and paste of the table that I am doing now.

Off to play some more…

Depending on the website, you may find that the actual raw page source doesn’t easily contain all the scores or other things you are looking for, if some HTML tables are being populated via JavaScript. But using MBS, you can access the HTML of the page after that is done (as opposed to the original HTML loaded, which can be different).

For example, you can create a window with an HTMLViewer control and use LoadURL to load the webpage and let it process the JavaScript, then, depending on your OS, use different MBS extensions to the viewer to extract the net HTML source. Perhaps something like:

dim html as string

#if TargetMacOS
  ' MBS WebKit plugin: grab the rendered HTML from the WebView
  dim w as WebViewMBS = HTMLViewer1.WebViewMBS
  html = w.HTMLText()
#elseif TargetWindows
  ' MBS Win plugin: grab the rendered HTML from the IE-based viewer
  html = HTMLViewer1.IEHTMLTextMBS()
#endif
Then you can use your regex knowledge to extract the data you are looking to parse.

But screen scraping is far less desirable than finding a webservice for the data you want. Aside from copyright or usage restrictions, your code can break very easily if the site changes their construct and the “eye catchers” you seek during parsing change.

Also, the site may switch to a different way of displaying the data (either the protocol, or the “tags” / data flow in the HTML page).

I also know of a web site that encodes part of the URL (with an unknown encoding scheme; why?), which makes the job even harder.

If it is part of the URL instead of part of the contents, then it does not make screen scraping the data harder, but it may make knowing which URL to request harder. In my experience, there are three cases of URLs that may appear cryptic to users:

  1. Simple URL encoding to safeguard against “unsafe” characters, for example some punctuation symbols
  2. A need for uniqueness, met by including a UUID value
  3. Intentional obfuscation
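Case 1 is easy to recognize, because it round-trips through Xojo’s built-in encoding functions (classic Xojo globals; the sample string is made up):

dim raw as string = "name with spaces & symbols"
dim encoded as string = EncodeURLComponent(raw)   ' spaces and & become %20 and %26
dim decoded as string = DecodeURLComponent(encoded) ' back to the original text

If running DecodeURLComponent on the cryptic URL segment yields readable text, it was just case 1; if it stays gibberish, you are likely looking at case 2 or 3.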

But the rise in popularity of CAPTCHAs can probably be directly attributed to trying to reduce the ability to easily screen scrape sites. After all, CAPTCHA stands for “Completely Automated Public Turing test to tell Computers and Humans Apart,” at least according to its Wikipedia page.

My “why” question is because I think this is pointless: a screen shot is enough to get the image(s). In my case, my countermeasure is to create a download application to avoid doing it manually: it downloads the images in the background (once each month), and that is all.

The reason they do that? My deep feeling is answer 3: intentional obfuscation.
I cannot think why it would be option 1.
As for option 2: a date associated with a simple text string is more than enough to be unique (in that case).

The Daily Mirror (UK) uses a long number to make URLs unique (I think).

It is in the same vein as web sites that reject a right-click or even disallow copying text. But they cannot stop you from looking at the HTML as text…

I found another web site that blacklists your IP if you download too much (in MB) from them. Another limits you to 3 simultaneous file connections, etc.

Who said the internet was free?
(All of these have existed for many years, not just in 2018.)