We’ve asked this before and someone already scraped the entire NEW wiki last night.
PLEASE DON’T DO THIS
It makes the entire thing run very slowly for everyone. IF you really want the whole thing locally, we CAN provide you with a dump you can import.
You just have to ask
Are you sure that it’s not a web crawler? Our KB gets hit 2x a month by Google and Bing and things slow to a crawl…
Asking for a dump is not likely to happen if the scraper needs it immediately.
I’d suggest providing a weekly (or monthly) zipped dump of the wiki and publishing a link.
They probably know who it is. They are smart like that.
We blacklisted a couple of IPs for a few days and … no more scraping for them!
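For anyone wondering how you might spot the offending addresses in the first place, here is a minimal sketch in Python. It assumes an access log where the client IP is the first whitespace-separated field (as in the common Apache/nginx formats); the function name `top_clients`, the threshold, and the sample lines are made up for illustration and are not how the Xojo server is actually monitored.

```python
from collections import Counter


def top_clients(log_lines, threshold):
    """Return (ip, request_count) pairs for clients at or above `threshold`.

    Assumption: the client IP is the first whitespace-separated field
    of each log line. Blank lines are skipped.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split(maxsplit=1)
        if fields:
            counts[fields[0]] += 1
    # most_common() sorts by count, highest first
    return [(ip, n) for ip, n in counts.most_common() if n >= threshold]


# Hypothetical sample data using documentation-reserved IP ranges.
sample = [
    '203.0.113.7 - - [10/Oct/2014:13:55:36 +0000] "GET /Page_A HTTP/1.1" 200 512',
    '203.0.113.7 - - [10/Oct/2014:13:55:36 +0000] "GET /Page_B HTTP/1.1" 200 734',
    '198.51.100.2 - - [10/Oct/2014:13:55:37 +0000] "GET /Page_A HTTP/1.1" 200 512',
    '203.0.113.7 - - [10/Oct/2014:13:55:37 +0000] "GET /Page_C HTTP/1.1" 200 901',
]

if __name__ == "__main__":
    for ip, n in top_clients(sample, threshold=2):
        print(ip, n)
```

A real setup would probably count per time window rather than over the whole log, so one busy afternoon doesn't look the same as a scraper hammering the site for ten minutes.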
I’ve been thinking about generating a searchable offline docset for Dash for Xojo.
Could I possibly get a dump of the documentation wiki, since you’ve offered? (No, it wasn’t me who was scraping; it just prompted me to ask!)
In order to generate it for Dash you’ll want the HTML, not the wiki markup (or you have to run the entire wiki locally, which you can do, but then you’ll want to scrape your local wiki).
You could run your own if we give you a dump + our extensions + images
Been reading about creating a docset and it doesn’t look like it should take me too long. I’m happy to do it (I use Dash a lot for PHP and Python work and love it) but I’m just trying to figure out the best way to get the HTML. What’d you reckon?
Looks like there’s an extension for MediaWiki called DumpHTML. If I could get a dump of your docs I could add this extension, dump the HTML and then write a little Xojo program to parse it and add it to a Dash docset database. Unless you want to give me the dump and I’ll just knock up a Xojo app…
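To make the "parse it and add it to a Dash docset database" step concrete, here is a rough sketch. It is in Python rather than Xojo, purely for illustration. It assumes the HTML has already been copied into `Contents/Resources/Documents` inside the docset folder, and it builds the `searchIndex` SQLite table that Dash's docset format expects in `Contents/Resources/docSet.dsidx`. Using each page's `<title>` as the entry name and "Guide" as the entry type are my assumptions; a real docset would want proper types (Class, Method, etc.) and also needs an `Info.plist`, which this sketch does not write.

```python
import os
import sqlite3
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element in a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_title(html_text):
    parser = TitleParser()
    parser.feed(html_text)
    return parser.title.strip()


def build_index(docset_root, entry_type="Guide"):
    """Index every .html file under Documents; returns the number of files processed."""
    resources = os.path.join(docset_root, "Contents", "Resources")
    documents = os.path.join(resources, "Documents")
    conn = sqlite3.connect(os.path.join(resources, "docSet.dsidx"))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS searchIndex"
        "(id INTEGER PRIMARY KEY, name TEXT, type TEXT, path TEXT)"
    )
    conn.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS anchor ON searchIndex (name, type, path)"
    )
    processed = 0
    for folder, _dirs, files in os.walk(documents):
        for filename in sorted(files):
            if not filename.endswith(".html"):
                continue
            full_path = os.path.join(folder, filename)
            with open(full_path, encoding="utf-8", errors="replace") as handle:
                # Fall back to the file name when a page has no <title>.
                name = page_title(handle.read()) or filename
            # Dash wants paths relative to the Documents folder.
            rel_path = os.path.relpath(full_path, documents).replace(os.sep, "/")
            conn.execute(
                "INSERT OR IGNORE INTO searchIndex(name, type, path) VALUES (?, ?, ?)",
                (name, entry_type, rel_path),
            )
            processed += 1
    conn.commit()
    conn.close()
    return processed
```

Calling `build_index("Xojo.docset")` (a hypothetical folder name) after dropping the HTML in place would give you something Dash can search, though the entries would all show up under one type until the parsing gets smarter.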
Gary read yer private messages
OK I need a couple guinea pigs to try out the extract & set up so you can run a local wiki
email me directly email@example.com and we’ll get the files & instructions to you
And someone did this yet again so the doc wiki was offline
I thought about archive.org, but it would not scrape several times in a row.
I would also assume archive.org knows how to properly scrape a website, wouldn’t slam it all at once, and would respect robots.txt
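For reference, "properly" here boils down to two things: check robots.txt before each request, and wait between requests. A minimal sketch using Python's standard `urllib.robotparser` is below. The function name `polite_fetch`, the user agent string, and the 5 second default delay are all assumptions of mine, and the `fetch` callable is a stand-in for whatever actually downloads a page (it is passed in so the sketch can be tried without touching a real server).

```python
import time
from urllib import robotparser


def polite_fetch(urls, robots_txt, fetch, user_agent="example-docs-bot",
                 default_delay=5.0, sleep=time.sleep):
    """Fetch only the URLs robots.txt allows, pausing between requests.

    robots_txt is the text of the site's robots.txt file.
    fetch is any callable taking a URL and returning the page body.
    """
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    # Use the site's Crawl-delay if it sets one, else our own default.
    delay = rules.crawl_delay(user_agent) or default_delay

    pages = {}
    first_request = True
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # the site asked bots to stay out of this path
        if not first_request:
            sleep(delay)  # never hit the server back to back
        pages[url] = fetch(url)
        first_request = False
    return pages


if __name__ == "__main__":
    # Hypothetical robots.txt and URLs; the lambda fakes the download.
    robots = "User-agent: *\nCrawl-delay: 2\nDisallow: /private/\n"
    wanted = [
        "https://docs.example.com/Window",
        "https://docs.example.com/private/drafts",
        "https://docs.example.com/Listbox",
    ]
    result = polite_fetch(wanted, robots, fetch=lambda u: "body of " + u)
    print(sorted(result))
```

Note that Crawl-delay is a non-standard robots.txt directive that not every crawler honors, so the default delay matters just as much as whatever the site asks for.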
Done a lot of data scraping before… but not sure what the purpose of scraping the Xojo docs would be. Unless they think it is the only way they can get an offline version or something.
archive.org changed the way they scrape; normally they do it over a long course of time now.
Bing is notorious for ravaging a server though.
Baidu is annoying, but doesn’t usually get too bad.
Google you can throttle if you have webmaster tools, but you can’t throttle them any other way. Well you can, but it isn’t great for SEO.
Yeah, I wish this was something like one of the bots from Google etc.