Getting the HTML

Jym_Morton · November 29, 2015, 4:26am

I wrote a small program to scrape a website. Even with a minute between URL Gets, it ‘boots’ me somehow after 2 or 3 requests when I’m using HTTPSecureSocket. I have the connection type set to 3 and Secure checked. It’s not blocking my IP like it did when I tried 10 seconds between URL Gets. I know this because when I was going fast and got blocked, my browsers were blocked too. Now I can still pull the next URL in my browser, but not with the HTTPSecureSocket.

However, the HTMLViewer seems to be able to pull the URL. Is there a way to store the HTML from the HTMLViewer?

Beatrix_Willius · November 29, 2015, 5:56am

Some more details please:

What do you mean by “boots”?
Does this happen for all websites you want to scrape?
Why are you using secure sockets?
Have you tried to change the user agent?

Sure, you can get the HTML from an HTMLViewer. But loading data into the HTMLViewer makes things way more complicated.
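
On the user agent question above: a user agent is the string a client sends in the `User-Agent` request header to say what software is making the request, and some sites treat unfamiliar values differently from browser ones. Here is a minimal sketch of setting that header, written in Python with the standard library rather than Xojo. The URL and the browser-like string are made-up placeholders, not values from this thread, and whether changing the header helps with this particular site is untested.

```python
import urllib.request

# Hypothetical placeholder values for illustration; neither comes from the thread.
EXAMPLE_URL = "https://example.com/page/1"
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request object that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent=BROWSER_UA, timeout=30):
    """Download the page at `url` and return its HTML as text."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html(EXAMPLE_URL)` would perform the actual download; `build_request` is split out so the header can be inspected without touching the network.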

Michel_Bujardet · November 29, 2015, 7:20am

It looks as though you are using the synchronous mode of the HTTPSecureSocket. In asynchronous mode, you would not have to time things.

MBS has plugins to get the content of the HTMLViewer.

The scraping can also be done in JavaScript.

Jym_Morton · November 29, 2015, 3:25pm

All this worked in June without a timer: 1 hour got all the data I needed. The website then changed something, and now after I .Get(mylink) 2 times it returns something like “The site is down please try back later”, even though it’s not, as I can access it from my browser. A week ago, when I was running the script without a timer, I got this message and then couldn’t access the website from any browser on my computer, but could from my other computers, so I figured my IP was blocked. 2 hours later I had access to their site again.

I’m using a secure socket because the URL is https.
I have no clue what a user agent is.

I don’t do web stuff, so I don’t know what “synchronous mode of the HTTPSecureSocket” versus “asynchronous mode” means. I just need to get the data so I can write a few learning cases for classes, and if there’s something notable (which I doubt) I may attempt a scholarly article.

I have 3200 links in a database that I just want the data from.
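
For a list of links like this, the loop described so far (fetch, wait, fetch again, and notice when the “site is down” page comes back) can be sketched roughly as below. This is Python rather than Xojo, and only an illustration: `scrape_all`, `fetch`, and `BLOCK_MARKER` are hypothetical names, and matching on the text of the message is an assumption about what the block page contains.

```python
import time

# Text taken from the message quoted in this thread; assuming the block page
# always contains it is a guess, not something the site documents.
BLOCK_MARKER = "The site is down"


def scrape_all(urls, fetch, delay_seconds=60.0, sleep=time.sleep):
    """Fetch each URL in order, pausing between requests.

    `fetch` is any callable that takes a URL and returns the page HTML as text.
    Stops at the first response that looks like the block message.
    Returns (pages, remaining): a dict of url -> html for the pages that were
    saved, and the list of URLs still to be fetched (starting with the one
    that triggered the block message).
    """
    urls = list(urls)
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait between requests, not before the first
        html = fetch(url)
        if BLOCK_MARKER in html:
            # Do not store the block page as if it were real data.
            return pages, urls[index:]
        pages[url] = html
    return pages, []
```

Returning the unfetched links makes it possible to resume later from where the run stopped instead of starting again at the first link.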

Jym_Morton · November 29, 2015, 3:29pm

Oh, the HTMLViewer ran overnight, pulling every 10 minutes, and it only got 2 pages before it got the message “The site is down please try back later”. So I’m either doing something horribly wrong (which worked in June) or I don’t know. I know from the pages I do have that it says I’m using I.E. 8, which of course I’m not.