PLEASE DON'T SCRAPE THE WIKI

Norman_P · June 3, 2013, 3:59pm

We’ve asked this before and someone already scraped the entire NEW wiki last night.

PLEASE DON’T DO THIS

It makes the entire thing run very slowly for everyone, and IF you really want the whole thing locally we CAN provide you with a dump you can import.
You just have to ask.

Norman Palardy

Tim_Jones · June 4, 2013, 2:34am

Are you sure that it’s not a web crawler? Our KB gets hit 2x a month by Google and Bing and things slow to a crawl…

Massimo_Valle · June 4, 2013, 6:46am

Asking for a dump is not likely to happen if the scraper needs it immediately.
I’d suggest providing a weekly (or monthly) zipped dump of the wiki and publishing a link.

Norman_P · June 4, 2013, 1:47pm

Yeah we’re pretty sure

scott_boss · June 4, 2013, 2:02pm

They probably know who it is. They are smart like that.

Norman_P · June 12, 2013, 10:33pm

We blacklisted a couple of IPs for a few days and … no more scraping for them!

GarryPettet · June 21, 2013, 12:41pm

I’ve been thinking about generating a searchable offline docset for Dash for Xojo.

Could I possibly get a dump of the documentation wiki since you’ve offered? (No, it wasn’t me who was scraping; it just prompted me to ask!)

Norman_P · June 21, 2013, 3:42pm

In order to generate it for Dash you’ll want the HTML, not the wiki markup (or you have to run the entire wiki locally, which you can do, but then you’ll want to scrape your local wiki)

You could run your own if we give you a dump + our extensions + images

GarryPettet · June 21, 2013, 9:30pm

Been reading about creating a docset and it doesn’t look like it should take me too long. I’m happy to do it (I use Dash a lot for PHP and Python work and love it) but I’m just trying to figure out the best way to get the HTML. What do you reckon?

GarryPettet · June 21, 2013, 9:47pm

Looks like there’s an extension for MediaWiki called DumpHTML. If I could get a dump of your docs I could add this extension, dump the HTML and then write a little Xojo program to parse it and add it to a Dash docset database. Unless you want to give me the dump and I’ll just knock up a Xojo app…
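
As a rough illustration of that last step (parsing dumped HTML and adding it to a docset database), here is a minimal sketch in Python rather than Xojo. It assumes the Dash docset index is a SQLite database with a `searchIndex` table, as described in Dash's docset generation guide; the page contents, the file name `Listbox.html`, and the default entry type `Guide` are hypothetical placeholders, and a real docset would also need the surrounding folder structure and Info.plist.

```python
import sqlite3
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element of one HTML page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def page_title(html):
    """Return the stripped <title> text of an HTML document."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def build_index(conn, pages, entry_type="Guide"):
    """Add one searchIndex row per page; `pages` maps relative path -> HTML."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS searchIndex("
        "id INTEGER PRIMARY KEY, name TEXT, type TEXT, path TEXT)"
    )
    conn.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS anchor "
        "ON searchIndex (name, type, path)"
    )
    for path, html in pages.items():
        # INSERT OR IGNORE keeps reruns from duplicating rows.
        conn.execute(
            "INSERT OR IGNORE INTO searchIndex(name, type, path) "
            "VALUES (?, ?, ?)",
            (page_title(html), entry_type, path),
        )
    conn.commit()
```

In practice `pages` would be filled by reading the files that the HTML dump produces, and `conn` would point at the docset's index file instead of an in-memory database.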

Norman_P · June 21, 2013, 9:54pm

Garry, read yer private messages

Norman_P · June 22, 2013, 3:55am

OK I need a couple of guinea pigs to try out the extract & setup so you can run a local wiki.
Email me directly at norman@xojo.com and we’ll get the files & instructions to you.

Norman_P · April 28, 2017, 2:38pm

And someone did this yet again so the doc wiki was offline

Michel_Bujardet · April 28, 2017, 2:40pm

Confirmed. It went down.

Michel_Bujardet · April 28, 2017, 2:45pm

I thought about archive.org but it would not scrape several times in a row.

Tim_Parnell · April 28, 2017, 2:47pm

I would also assume archive.org knows how to properly scrape a website, wouldn’t slam it all at once, and would respect robots.txt
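
For reference, a minimal sketch of what "respect robots.txt and don't slam it all at once" could look like, in Python using the standard library's `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, and the example.com URLs are made up for illustration; the function only decides which URLs are allowed and how long to wait between them, and leaves the actual fetching to the caller. None of this replaces simply asking for a dump.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download the
# site's own file before requesting anything else.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def build_policy(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_urls(parser, user_agent, urls, default_delay=2.0, sleep=time.sleep):
    """Yield only the URLs robots.txt allows, pausing between them.

    Uses the site's Crawl-delay when one is given, otherwise
    `default_delay` seconds (an arbitrary fallback chosen for this sketch).
    """
    delay = parser.crawl_delay(user_agent) or default_delay
    first = True
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # disallowed by robots.txt, skip it
        if not first:
            sleep(delay)  # space requests out instead of bursting
        first = False
        yield url
```

A caller would loop over `polite_urls(...)` and fetch each yielded URL one at a time, so the pause always sits between consecutive requests.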

Don_VF · April 28, 2017, 2:49pm

Done a lot of data scraping before… but not sure what the purpose of scraping the Xojo docs would be. Unless they think it is the only way they can get an offline version or something.

archive.org changed the way they scrape; they normally do it over a long period of time now.

Bing is notorious for ravaging a server though.

Baidu is annoying, but doesn’t usually get too bad.

Google you can throttle if you have webmaster tools, but you can’t throttle them any other way. Well you can, but it isn’t great for SEO.

Norman_P · April 28, 2017, 3:18pm

Yeah I wish this was something like one of the bots from Google etc.
It’s not.