Using an HTTPSocket to search Google Scholar

I’ve been using an HTTPSocket to query Google Scholar for years with no problems. Lately, for a few days at a time, I’ve been unable to get this to work because Google Scholar sees something wrong with the query and thinks I’m a robot. This is bizarre, because I can still query Google Scholar if I use a Xojo HTMLViewer or a standalone browser like Safari or Firefox. So it’s not an IP ban; it’s specific to the way I’m querying with an HTTPSocket.

Here’s the sequence:

  1. Query
  2. 302 redirect from GS
  3. The redirect takes me to a page where I should fill out a CAPTCHA (which I can do in an HTMLViewer, but that doesn’t let subsequent HTTP requests work).

Note that the HTTPSocket User-Agent is set correctly (to Safari or Mozilla; it doesn’t matter which one I use).

If I do the same search in Firefox, using Live HTTP Headers to track the transactions, there is never a 302 redirect; the search is returned with 200 OK immediately.

So it seems to me that there’s something in the initial GET that I’m not setting properly (presumably a request header). I’ve tried playing with that, but no luck.

Does anyone have any idea of what their server is looking for in the GET that my app isn’t supplying? (I repeat, this HTTPSocket search has worked for years, and even now the block is intermittent, but it has happened twice in the last 2 weeks to me and several of my users.)

After a lot of work, I found that Google Scholar now requires a cookie on query submission. This is not in reply to a server request, but is required on submission. This is a change from previous behavior.

Browsers like Safari and Firefox supply this cookie automatically when the query is submitted. How do they know to do this? That is, how do they know a cookie is required, and how do they know what it is (the actual cookie string is not supplied in the HTML source of the submit page)? I’d like to know so if Google changes the cookie in the future my app can adjust dynamically. Can anyone explain this magic?

Here’s my final post about the resolution of the problem (in case anyone else runs into this). There is indeed a cookie (and I still don’t know how the browser knows to send it). But the real problem was something to do with the default RequestHeaders. When I do a ClearRequestHeaders before the HTTPSocket.get, Google Scholar sends the results of the query.

The cookie is provided to you in the previous response headers. Then as long as you are talking to that server, you’re supposed to send it back. I suspect you are getting a new cookie with each request now.
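
As a rough illustration of that round trip (written in Python rather than Xojo, and not tied to anything Google Scholar actually sends): the helper below takes the raw Set-Cookie values from one response and builds the Cookie header for the next request. The function name `cookie_header` and the sample cookie names are hypothetical.

```python
from http.cookies import SimpleCookie


def cookie_header(set_cookie_values):
    """Build a Cookie request header from a list of raw Set-Cookie values.

    Only the name=value pair of each cookie is kept; attributes such as
    Path or HttpOnly are instructions to the client and are not sent back.
    """
    jar = {}
    for raw in set_cookie_values:
        parsed = SimpleCookie()
        parsed.load(raw)
        for name, morsel in parsed.items():
            # A later Set-Cookie with the same name replaces the earlier value.
            jar[name] = morsel.value
    return "; ".join(f"{name}={value}" for name, value in jar.items())


if __name__ == "__main__":
    # Made-up cookies, purely for illustration.
    print(cookie_header(["SID=abc123; Path=/; HttpOnly", "PREF=x1; Path=/"]))
```

In HTTPSocket terms, the resulting string would be what gets set as the Cookie request header before the next GET. If the server never sends a Set-Cookie header, the list is empty and there is nothing to send back.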

Thanks for the response, but no, actually I’m not getting a cookie from Google Scholar at all. The initial request (with the default headers) -> 302 response (with no Set-Cookie in the header). When I ClearRequestHeaders before making the initial request, I now get a 200 response and the data.

So Google Scholar sees the default headers that Xojo generates as “wrong” in some way and indicative of a robot (which is the message it sends along with a CAPTCHA for me to confirm I’m human).

(The cookie issue turned out to be a red herring, as I tried to point out in my last post).

I’m happy to send you a typical query to Google Scholar and you can try yourself with a default HTTPSocket.get.

Actually, after working on this some more I’m more confused than ever. Greg, if you have the time, would you please simply make an HTTPSocket and take a look at the response to this request (searching for the word “test”):

HTTPSocket1.get "http://scholar.google.com/scholar?q=test&num=1&start=0&btnG=&as_sdt=1,21&as_sdtp="

No cookie is sent in reply, just a 302 that leads to a page asking if I’m a robot. Thanks if you can help, it’s driving me nuts.

I guess you can get the cookie by doing a GET request to https://scholar.google.com , and then use the cookies you get from there, to do a search query.
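
A minimal sketch of that two-step idea, in Python rather than Xojo and under stated assumptions: the opener keeps a cookie jar, so any cookie set by the first response is resent automatically on the second request. This is also roughly what a browser does behind the scenes. The function names are hypothetical, the query parameters are the ones from the request quoted earlier in the thread, and nothing here guarantees that Google Scholar will accept the request.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

BASE = "https://scholar.google.com"


def make_opener():
    """Return (opener, jar); the opener stores and resends cookies itself."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def search_url(query, num=1, start=0):
    """Build the search URL using the q, num and start parameters."""
    params = urllib.parse.urlencode({"q": query, "num": num, "start": start})
    return f"{BASE}/scholar?{params}"


def warm_up_then_search(query, timeout=30):
    """Visit the home page first to collect cookies, then run the search."""
    opener, jar = make_opener()
    opener.open(BASE + "/", timeout=timeout).close()  # step 1: collect cookies
    with opener.open(search_url(query), timeout=timeout) as response:
        return response.status, response.read()  # step 2: cookies sent back
```

Note that urllib follows 302 redirects on its own, so a robot check would show up as the content of the final page rather than as a 302 status.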

I really need to understand what’s going on (because Google Scholar keeps changing the rules, without notification). For example, why do I need a cookie for the first request? (I understand needing one for subsequent requests, but not for the first.) And why do I have to clear the RequestHeaders? (They seem innocuous to me, and obviously to Xojo.) And now I’m seeing that in some cases setting the User-Agent (correctly) causes an issue, too. I’m getting this to work intermittently, but I don’t see what the rules are.

It also points you to the Terms of Service. Google doesn’t offer an API, and they probably don’t want it used this way, so I doubt we can know what their detection strategy is. Maybe they purposefully serve the page at times even when they think it’s a bot, just to throw it off.

I had a YouTube downloader app working for a few years. Every 6 months or so I’d have to update it for some new trick to get the download going, but eventually it became too tricky and I gave up. They can be very crafty :)

Also, about ClearRequestHeaders, a new socket starts out with default headers which you can see by…

[code]Dim sock As New HTTPSocket
MsgBox sock.RequestHeaders.Source

' Displays:
' Accept: */*
' Accept-Language: en
' Content-length: 0[/code]

I was also getting 302s when trying to query Scholar with exactly the same request headers I see in Safari’s Web Inspector. ClearRequestHeaders didn’t make a difference, but commenting out the User-Agent did, and I got a 200. Not sure why Safari can use that User-Agent but in an HTTPSocket it’s flagged, or why having no User-Agent gets accepted. It’s a world of hurt.
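
When comparing what a browser sends with what the socket sends, a small diff can make the differences easier to spot than eyeballing two header dumps. The sketch below is in Python rather than Xojo; the function name `diff_headers` and the browser values are hypothetical, while the socket values are the defaults listed above. Header names are compared case-insensitively.

```python
def diff_headers(browser, socket):
    """Compare two header dicts, ignoring the case of header names.

    Returns the names the socket is missing, the names only the socket
    sends, and the names present in both but with different values.
    """
    b = {name.lower(): value for name, value in browser.items()}
    s = {name.lower(): value for name, value in socket.items()}
    return {
        "missing": sorted(b.keys() - s.keys()),
        "extra": sorted(s.keys() - b.keys()),
        "changed": sorted(k for k in b.keys() & s.keys() if b[k] != s[k]),
    }


if __name__ == "__main__":
    # Browser headers here are placeholders, not a real capture.
    browser = {"Accept": "text/html", "Accept-Language": "en",
               "User-Agent": "ExampleBrowser/1.0", "Cookie": "a=b"}
    socket = {"Accept": "*/*", "Accept-Language": "en", "Content-length": "0"}
    print(diff_headers(browser, socket))
```

A diff like this only covers header names and values. It would not reveal differences in header order or at the connection level, which might be one reason identical-looking headers are still treated differently.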

No, there is no API, unfortunately. GS doesn’t mind having apps access their data on an individual level (many apps do it, including mine). But they limit the frequency with which you can download (in my experience), and I try to comply with that.

I, too, have found that in some cases leaving out a user-agent works. In other cases, it doesn’t matter (and there’s no obvious reason why they’re treated differently).

I may post a link to a bare bones project that shows what happens.

Have you tried setting a user agent?

Yes, I’ve been doing that for years. Ironically, one Google Scholar search that my app generates works only if the user-agent is left empty! (Will Shank makes the same observation at the end of his comment above). Another search requires the user-agent. The user-agent setting is correct (I’ve tried several, including those for Safari and Firefox).

I’ve created a simple little project that demonstrates the problem.

It works when I include a cookie that I found with Firefox (using HTTP trace).

But how did Firefox determine what it was? I don’t see a cookie ever exchanged in the InternetHeaders. Any insights appreciated!

https://dl.dropboxusercontent.com/u/31064864/Google%20Scholar%20search.xojo_binary_project.zip

@Jonathan Ashwell, your link (the one to Dropbox) is broken. Can you please post a working link? I’m very interested to learn from your project, as I’m also stuck with a website that requires a cookie for proper login.

Thanks!