HTTPSocket: Multiple pages on a site requiring user name and password

Using an older version (REAL Studio 2011 r1).

The goal is to download image files from a site that requires a user name and password. The user name and password are submitted from a form on the site’s main page.

I was able to figure out how to log in by submitting a form using the HTTPSocket, but after logging in, I am unsure of how to proceed. I have used the HTTPSocket for this sort of thing before, but the method I used with other sites doesn’t work with this site. With the other sites, I know the path to the files on the web server in advance and can point directly to them using: http://thewebserver/thefolders/thefile. The user name and password are sent each time.

When I try this method with this site, even when using a path I know to be valid, it instead returns the login page.

I’m not sure it makes a difference, but with this site, the path does not point directly to a file. Instead it’s being processed through a CGI script using a URL something like this:

With the site I am working with now, I am unable to figure out how they come up with some of the information that’s used in this path. I’m sure it’s not, but it seems random to me. However, I can get this information by parsing the HTML of the pages during navigation.

So, my idea was to:
1. Log in.
2. Go to the standardized URL that has all of the links available that week.
3. Search for the file name I need in the HTML of that page and grab the next URL pointing to where it can be found.
4. Go to that file’s URL. This page has the URL that will actually initiate the download.
5. Parse the page for this URL and go get the file.
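
The "search for the file name and grab the next URL" step could be sketched roughly as below. This is only an illustration: FindHrefAfter is a made-up helper name, and it assumes the links appear as href="..." attributes with double quotes somewhere after the file name in the page's HTML.

```xojo
Function FindHrefAfter(html As String, marker As String) As String
  // Hypothetical helper: returns the first href value found after marker,
  // or an empty string if the marker or a following href is not found.
  Dim markerPos As Integer = InStr(html, marker)
  If markerPos = 0 Then Return ""

  // Look for the next href=" that follows the marker.
  Dim hrefPos As Integer = InStr(markerPos, html, "href=""")
  If hrefPos = 0 Then Return ""

  // The URL runs from just after href=" up to the closing quote.
  Dim valueStart As Integer = hrefPos + Len("href=""")
  Dim valueEnd As Integer = InStr(valueStart, html, """")
  If valueEnd = 0 Then Return ""

  Return Mid(html, valueStart, valueEnd - valueStart)
End Function
```

If the site uses relative links, the returned value would still need the host name prepended before it can be requested.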

Now that I know how to submit the login form, how do I begin “navigating” to or returning the pages I need to parse afterward?

I think I could do this with an HTMLViewer, but this process needs to run without intervention. Too many pop-up script errors are generated using the HTMLViewer.

I also have CURLMBS, if that would be a better route. I didn’t get very far with it when trying with this site, but have used it with others and had no problems.

Thanks for any help.

They might be sending a cookie when you login, and expect you to send it back on each subsequent request.

Yes, a cookie, or a session ID that you must provide on the other pages.
Watch the page traffic with the developer tools in Firefox or Opera to discover what the pages ask for.

I do see that cookie information as well as a session ID are returned in headers.Source when I log in.

I will try to figure it out.


No luck so far. Anything obviously wrong in what I’m doing? Thanks.

In HeadersReceived I’ve added:

Dim i as Integer
For i = 0 To headers.NameCount("set-cookie") - 1
  Session_Cookies.Append headers.Value( i ) + "; "
Next

In the method I have:

HTTPSocketTrib1.Post( "" )
// This login form submission attempt works. Headers are received and include cookie info.
// Cookie info is saved to a string array (Session_Cookies) in HeadersReceived.
HTTPSocketTrib1.SetRequestHeader( "Cookie", Join(Session_Cookies(),"; ") )
HTTPSocketTrib1.Get( "" )

The only things I can think of to try are adding a colon after “Cookie” and removing the trailing semicolon and space from the end of the string:

 HTTPSocketTrib1.SetRequestHeader( "Cookie:", Join(Session_Cookies(),"; ") )

headers.Value(i) doesn’t return only Set-Cookie: headers. Use headers.Value("Set-Cookie", i) to do that.

Also, the contents of a Set-Cookie: header can include additional data such as expiration date, domain name, etc.; you must strip these out before sending it back. For example, in the header Set-Cookie: visitor=12345; expires=Thu, 25 Jul 2019 20:45:23 GMT; path=/, the cookie value is visitor=12345.
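
Putting both points together, the HeadersReceived handler could look something like the sketch below. It assumes Session_Cookies is the String array from the earlier post and that the name=value pair is everything before the first semicolon in each Set-Cookie header; NthField and Trim are the built-in string functions.

```xojo
Sub HeadersReceived(headers As InternetHeaders, httpStatus As Integer)
  // Collect only the name=value part of each Set-Cookie header.
  Dim i As Integer
  Dim pair As String
  For i = 0 To headers.NameCount("Set-Cookie") - 1
    // Value(name, index) returns the i-th header with that name.
    // Everything after the first ";" is attributes (expires, path, ...).
    pair = Trim(NthField(headers.Value("Set-Cookie", i), ";", 1))
    If pair <> "" Then Session_Cookies.Append pair
  Next
End Sub
```

With the example header above, this would store visitor=12345. Since the stored pairs no longer end in "; ", Join(Session_Cookies, "; ") should produce a clean value for the Cookie request header, and the header name is normally passed as "Cookie" without a colon.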

I tried storing the cookie information in a global array.

I found that the array is empty before and after the login form is posted.

It contains 9 elements when checked from HeadersReceived during the post.

I was able to get this working by moving the code to the PageReceived event and incrementing a counter there. The value of the counter determines which page is requested next, which takes the process through the navigation of several pages.
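
A minimal sketch of that counter approach, under some assumptions: the handler lives in an HTTPSocket subclass (so Me is the socket), Step_Counter is an Integer property, Session_Cookies is the String array filled in HeadersReceived, and kLinksPageURL, ExtractFilePageURL, ExtractDownloadURL and the output file name are placeholders, not the real site's paths or any built-in functions. Post and Get called without a timeout return immediately, so each request is started only after the previous page has arrived.

```xojo
Sub PageReceived(url As String, httpStatus As Integer, headers As InternetHeaders, content As String)
  // Step_Counter (assumed Integer property) records which page just arrived.
  Step_Counter = Step_Counter + 1

  // Send the stored session cookies back with whatever is requested next.
  Me.SetRequestHeader("Cookie", Join(Session_Cookies, "; "))

  Select Case Step_Counter
  Case 1
    // Login response received: request the weekly links page.
    Me.Get(kLinksPageURL) // placeholder constant
  Case 2
    // Links page received: find the page for the wanted file.
    Me.Get(ExtractFilePageURL(content)) // hypothetical parsing method
  Case 3
    // File page received: find the URL that starts the download.
    Me.Get(ExtractDownloadURL(content)) // hypothetical parsing method
  Case 4
    // The file itself arrived in content: write it to disk.
    Dim f As FolderItem = SpecialFolder.Desktop.Child("download.jpg") // placeholder name
    Dim bs As BinaryStream = BinaryStream.Create(f, True)
    bs.Write(content)
    bs.Close
  End Select
End Sub
```

Step_Counter would be reset to 0 just before the login form is posted, so that the first page to arrive is treated as the login response.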

The cookie tips were key to making this work.

This doesn’t seem optimal and I’m sure I’m missing something “simple,” but it’s working.

Thanks to both of you for the help on this!