HTTPSocket gets nothing…

Often (usually with today’s files), I get the first image, but the two others (width=1500, height=1500) are left at 0 bytes in the Images (target) folder.
Most of the time, a second click on my application’s Download button… downloads the requested files correctly.

I narrowed the problem down to the download line (see below). I just added the If block and put a breakpoint on the Exit line (but the program does not drop into the debugger). I get the MsgBox and the empty file is not written to disk (nice).

Before the download loop, I have:

     Comics_Socket.Yield = True

And this is nice.

The relevant code is:

     Comics_Data = Comics_Socket.Get(URL_Image + "?width=1500", 30)

     // Downloading Error
     If Len(Comics_Data) < 1000 Then
       MsgBox "An error occurred" + EndOfLine + EndOfLine + _
         "No data sent for the -1500 image."
       Exit
     End If

BTW: When I set a month to download, usually, all files are downloaded.

In my download log text file, I save the HTTPHeaders. The relevant names:Values are:

Content Length --> Empty
Age --> 0
Download and Expires date --> OK
Cache control: max-age=7200

I am open to suggestions.

Sometimes, a site will block the download if the referer (https://en.wikipedia.org/wiki/HTTP_referer) is not correctly sent for the image download. This is to stop people scraping/linking to images on their site. You might need to work out the correct referrer and send that in the header of your download.

Julian:

Thanks. Good idea, I forgot it *.

But in this case, a second download attempt is always successful. Better, I am able to download a whole year “at once” (call them archives), and sometimes more than one year at once (but usually one year: time elapses until I realize… the year is complete, then I resume downloading with another year).

  • In a somewhat similar (but more open) application that I have not updated in years (still a Real Studio project?), I used the Referer. I really forgot about it.

Edit:
How do I set a Referer in the HTTPSocket command(s) ?

A quick Google search against either docs… and developer… returns nothing.

It might be that some cookie or tracking information is not set up on the first connection, and then things work on subsequent connections. I don’t really know how HTTPSocket works at the moment, as I’ve not spent any time playing with it, so I can’t really advise on that (someone else might be able to chime in). I don’t know if it internally saves state between calls; if it does, this might be the reason. It might be that the server at their end is set up to ignore the first connection to an image, to limit cross-site linking, if you haven’t been there for a while or haven’t initiated a connection to another type of content (for example, an html request would come before a request for an image). There are a lot of things that could be stopping this; without trial-and-error testing it would be hard for me to guess :slight_smile:

Your best bet to make sure it’s not a Xojo problem or your code is to try grabbing images from a different site and see if you encounter the same problem.

See http://developer.xojo.com/xojo-net-httpsocket$RequestHeader and set this before your call.
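The linked page covers RequestHeader on Xojo.Net.HTTPSocket. The synchronous Get call shown earlier in this thread looks like the classic HTTPSocket, where the counterpart is SetRequestHeader. A minimal sketch, assuming Comics_Socket is a classic HTTPSocket and that Page_URL (a hypothetical variable, not from the original code) holds the address of the html page the image URL was found in:

```xojo
// Assumption: Page_URL is the html page that links to the image (hypothetical name).
Comics_Socket.SetRequestHeader("Referer", Page_URL)

// The header set above is sent along with this request.
Comics_Data = Comics_Socket.Get(URL_Image + "?width=1500", 30)
```

With Xojo.Net.HTTPSocket, the same idea would be written through the RequestHeader property described on the linked page.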

Julian:

I can get the file(s) from Firefox without trouble. But the idea of the project is to avoid wasting time manually getting years of files (365 x 3 for each year).

“Your best bet to make sure its not a xojo problem or your code, is trying grabbing images from a different site and see if you encounter the same problem.”
I cannot do that. I get 3 files: the first one has the “base” URL, and the two others have a parameter (?height=integer / ?width=integer) appended to the base URL. I really do not know how to replicate that. The problem happens with these two larger files.

About Referer:
I found (at home) in my archives the last version (2009r5) of the project where I use Referer. In fact, I use it in a HTTPSocket SubClass. I will do the same in that new project.

Thank you for your help Julian.

BTW: the original URL is encrypted * even if the file itself is freely downloadable: strange method… As far as I understand from looking at the html file, the images are available on Facebook, Twitter, Pinterest, etc. Places where I do not go.

  • I was not able to understand the encryption scheme / method, and until recently (for 15 years) I saved all files by hand: think of the time wasted every single day of the weeks / months / years…
    Now I have more time to do other things :wink:

If this is a project that is just for your personal use, you can take a look at SiteSucker (macOS only) or the Firefox add-on called DownThemAll…

Thanks shao:

My project downloads the html file, searches it for the image URLs, then downloads them.

This is not just like a download manager. However, I do not know your two suggestions.

My other (and older) project is a download manager (more or less): I give it a text file of URLs and it downloads them.

You need to contact that comics site and get access to an API.

Do you mean a form parameter or a query_string at the end of the URL?

If you explain that a little clearer maybe with some examples we might get the correct call for you.

If you can/want, PM me the link of the comic (don’t really want to leave a trace for google) and I’ll see what I can see.

Try timeout 60 instead of 30

Thank you for your answers.

Julian:
The two “resized” strips use ?height=1500 and ?width=1500 as a URL suffix.
This kind of thing is common in image URLs on the internet. Sometimes, by changing the size in the URL, I get far larger images (vs stamp-sized images)…

At first, I was thinking: what did I forget to do?

Then, I watched carefully my BinaryStream lines used to save the image data. I added .Flush, followed by .Close.

Then I worked upward through my code until I added a check on the size of the downloaded data: if it is less than, say, 500 bytes, I consider the download bad. I will change that to compare the number of downloaded bytes with the size found in the HTTPSocket.GetHeaders …

That is why and how I know that when a trouble (no file) happens, it is because the file was not downloaded.
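The size check described above could be sketched as follows. It assumes the classic HTTPSocket, whose PageHeaders property holds the headers of the last response; the function name DownloadLooksComplete and the 500-byte floor are only illustrative. Since the log quoted earlier shows Content Length as empty, the sketch falls back to the minimum-size test alone when the header is missing.

```xojo
Function DownloadLooksComplete(Data As String, Headers As InternetHeaders) As Boolean
  // Illustrative floor: anything smaller is treated as a failed download.
  Const kMinimumBytes = 500
  If LenB(Data) < kMinimumBytes Then Return False

  // No headers available: only the minimum-size test can be applied.
  If Headers Is Nil Then Return True

  // If the server sent no Content-Length, accept the minimum-size test.
  Dim Expected As String = Headers.Value("Content-Length")
  If Expected = "" Then Return True

  // Otherwise the byte count must match what the server announced.
  Return LenB(Data) = Val(Expected)
End Function
```

It would be called right after the Get, for example as DownloadLooksComplete(Comics_Data, Comics_Socket.PageHeaders).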

Derk:

Try timeout 60 instead of 30
Will try, and I also will try 0 (wait until I get the data).

I made an app a few years back that would enable automated shopping to the Tesco.com grocery site which had a .net backend (and can be a pain to automate). I found that if I passed along the cookies and correct headers and referrers, there was little they could do in terms of stopping me from automating the process. It used to turn a 2h process into <10min.

It could be that they are rate limiting your request for the 1500 sized images after getting the first one. Perhaps a little delay between the gets or put the gets for the 1500’s into a loop with a delay so they retry until completed?

Ahh, I’ve just spotted this in your returned header:

Cache control: max-age=7200

Note that max-age is expressed in seconds, so 7200 means the response may be cached for two hours; it is not a delay in milliseconds. Still, try spacing out your 1500 gets, perhaps by 10 seconds. Even though the content is technically not the same, so it shouldn’t be cached, they might have something at their end stopping it. It’s worth a try.
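The retry-with-delay idea could be sketched like this, reusing the names from the code posted at the top of the thread. The attempt count and the 10-second pause are arbitrary values chosen for illustration, and the sketch assumes the loop runs inside a Thread so that the pause does not freeze the user interface.

```xojo
Const kMaxAttempts = 3      // arbitrary, for illustration
Const kDelayMs = 10000      // 10 seconds between attempts

Dim Attempt As Integer
For Attempt = 1 To kMaxAttempts
  Comics_Data = Comics_Socket.Get(URL_Image + "?width=1500", 30)

  // Same threshold as the original check: enough data means success.
  If LenB(Comics_Data) >= 1000 Then Exit For

  // Wait before the next attempt, but not after the last one.
  If Attempt < kMaxAttempts Then App.SleepCurrentThread(kDelayMs)
Next

If LenB(Comics_Data) < 1000 Then
  MsgBox "No data sent for the -1500 image after " + Str(kMaxAttempts) + " attempts."
End If
```

If the second click always succeeds, as reported, a single retry may already be enough; the delay only matters if the server is rate limiting.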

Hi Julian,

I seem to be in a good mood today.

After a project backup, I refactored my linear method for downloading these files. Now I call the same method three times, and I started to get troubles…

At last, I had a strange idea (sometimes I love strange ideas): change the method calling order.

Instead of calling it for 900 (standard), then 1500 (Width), then H1500 (Height), I totally reversed the order.
It is a bit early to say anything, but the simple tests I’ve made suggest this is the correct order… The largest file (around 3MB or more) is downloaded first, then the one under 1MB, and the smallest comes last.

In the called Method, I instantiate the HTTPSocket on entry and close it (HTTPSocket = Nil) before the end.
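A minimal sketch of that arrangement, assuming the classic HTTPSocket. The method name FetchImage is hypothetical, and the reversed order is only what is described above, not a confirmed fix.

```xojo
Function FetchImage(URL As String) As String
  // A fresh socket for every call, so no state is carried over between downloads.
  Dim Sock As New HTTPSocket
  Sock.Yield = True

  Dim Data As String = Sock.Get(URL, 30)

  // Close and release the socket before leaving the method.
  Sock.Close
  Sock = Nil
  Return Data
End Function

// In the calling code: largest file first, smallest last.
Dim HeightData As String = FetchImage(URL_Image + "?height=1500")
Dim WidthData As String = FetchImage(URL_Image + "?width=1500")
Dim BaseData As String = FetchImage(URL_Image)
```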

OK: I spoke too soon. I will do something else and resume this project debugging tomorrow or the day after. Or