Pausing Code

I’m writing a small app that uses Wget.exe to pull html files from our library website. There are about 20,000 pages.

I have Wget.exe as a FolderItem and use launch(parameters) to get the html files (1 at a time). Wget.exe opens in a command window, gets the file and writes it to disk, then closes the command window.

Problem:
Sometimes it takes a microsecond to get the file, sometimes it takes 15 seconds. So my question is: is there a way of just waiting until the command window is closed?

I don’t know that this will work either, but it’s an easy fix if it does
(not the code, just the gist)
Do
Check to see if the file exists yet
read new html file
If it contains </html> then loopControl = True
Loop until loopControl = True

Any thoughts would be appreciated. I don’t need help with the above code, I just want to know if that will work or if there’s a better technique, e.g. monitoring the command window. Sleeping the thread for 20 seconds on every pull doesn’t make a lot of sense to me.
TIA

Tight loops freeze the app.

Instead, use a Timer in Multiple mode. It’s basically the same code: on each tick, check the file, and if it contains </html>, stop the timer and call a method containing the code you have after the loop…
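
A rough sketch of that idea, written in Python rather than Xojo so it can be run as-is. The names `page_is_complete`, `poll_with_timer`, `on_done` and `max_ticks` are made up for illustration, and treating `</html>` as the "file is finished" marker is the assumption from the original post. Each tick does one quick check and then re-arms the timer, so nothing sits in a tight loop.

```python
import os
import threading

def page_is_complete(path):
    """Return True once the file exists and contains a closing </html> tag."""
    if not os.path.exists(path):
        return False
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return "</html>" in f.read().lower()

def poll_with_timer(path, on_done, interval=0.25, max_ticks=80):
    """Check `path` every `interval` seconds without blocking the caller.

    Calls on_done(True) when the page is complete, or on_done(False)
    after `max_ticks` checks so a failed download cannot poll forever.
    """
    def tick(remaining):
        if page_is_complete(path):
            on_done(True)          # stop polling and continue with the work
        elif remaining <= 1:
            on_done(False)         # give up
        else:
            timer = threading.Timer(interval, tick, args=(remaining - 1,))
            timer.daemon = True    # do not keep the process alive
            timer.start()
    tick(max_ticks)
```

The `max_ticks` limit is there because a page that never arrives would otherwise keep the timer running indefinitely.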

Is there a reason you can’t use HTTP sockets? They have events that indicate when the file transfer is complete, etc…

Polling using timers can be problematic, especially when there are proper event-driven objects to do this same thing.

Couldn’t get the Secure socket to work properly. I’d download 10 pages then I’d get about 20 blank ones, then I’d get barred from the site for 30 mins.
Michael said last year: https://forum.xojo.com/28253-getting-the-html but as you see I didn’t get an answer to synchronous vs. asynchronous, so I went to wget.exe

Well you have two choices.

  • Take the time to figure out why you are having trouble with the “right” way (sockets), and have a proper tool for the future
    -or-
  • Waste time figuring out why the “wrong” way doesn’t work…

Did you try to find the answer for yourself, or did you just give up when Michel didn’t explain it for you?

You have another choice too:
Ask the site owner for the data you’re trying to scrape.

To pause code in general, one would not use a Timer but threads and semaphores. The thread, when it knows it has to wait, would put itself dormant by suspending itself using the semaphore, and then, when the signal appears that tells the app that new data has arrived and the paused code should continue, it would wake it thru the semaphore again.
That way, the paused code doesn’t have to be split up into multiple methods and doesn’t need to remember state in variables outside of that code.

I thought I had written an article about it a few years ago, but can only find the sample code: http://files.tempel.org/RB/Thread%20Blocking%20Demo.rbp.zip
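
For readers without the demo project, here is a minimal sketch of the same pattern in Python (not Xojo, and not the code from the linked sample). The names `data_ready`, `worker` and `log` are invented for the example. The worker blocks on a semaphore that starts at zero, and whoever learns that the data has arrived releases it.

```python
import threading

def run_demo():
    data_ready = threading.Semaphore(0)  # count 0: the first acquire() blocks
    log = []

    def worker():
        log.append("waiting")
        data_ready.acquire()   # thread is dormant here, using no CPU
        log.append("resumed")  # same method continues, local state intact

    t = threading.Thread(target=worker)
    t.start()

    # ... elsewhere, when the signal arrives that new data is available:
    data_ready.release()       # wakes the waiting thread
    t.join()
    return log
```

Note that the worker's code stays in one method: everything after `acquire()` simply runs once the wait is over.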

If you only want to do one file at a time, using a blocking shell would work too.
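
As an illustration of what "blocking" means here, a small Python sketch (not Xojo): the call does not return until the child process has exited, so there is nothing to poll. `run_and_wait` is a made-up helper name, and the 60-second timeout is an arbitrary assumption.

```python
import subprocess

def run_and_wait(command, timeout=60):
    """Run `command` (a list of arguments) and block until it exits.

    Returns (exit_code, captured_stdout). Raises subprocess.TimeoutExpired
    if the process is still running after `timeout` seconds.
    """
    completed = subprocess.run(
        command,
        capture_output=True,  # collect output instead of opening a window
        text=True,
        timeout=timeout,
    )
    return completed.returncode, completed.stdout

# Hypothetical usage, assuming wget.exe is on the PATH:
#   code, _ = run_and_wait(["wget.exe", "-O", "page.html", "https://example.com/page"])
#   if code == 0: the file is fully written at this point
```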

[quote=260350:@Dave S]Well you have two choices.

  • Take the time to figure out why you are having trouble with the “right” way (sockets), and have a proper tool for the future
    -or-
  • Waste time figuring out why the “wrong” way doesn’t work…

Did you try to find the answer for yourself, or did you just give up when Michel didn’t explain it for you?[/quote]
Yes and it still made no sense to me. http://stackoverflow.com/questions/748175/asynchronous-vs-synchronous-execution-what-does-it-really-mean

I think you need to set the user agent of your socket. If it works with a browser and wget in quick succession, but fails via the socket, it’s probably because your socket looks like a bot (which it is), while wget does not. This is less about async vs sync than presenting yourself as a proper browser to the website.

[quote=260375:@Thomas Tempelmann]To pause code in general, one would not use a Timer but threads and semaphores. The thread, when it knows it has to wait, would put itself dormant by suspending itself using the semaphore, and then, when the signal appears that tells the app that new data has arrived and the paused code should continue, it would wake it thru the semaphore again.
That way, the paused code doesn’t have to be split up into multiple methods and doesn’t need to remember state in variables outside of that code.

I thought I had written an article about it a few years ago, but can only find the sample code: http://files.tempel.org/RB/Thread%20Blocking%20Demo.rbp.zip[/quote]

That’s what I was hoping would be the answer, as I saw Mutex and Semaphore in the LR, but the LR really doesn’t do a very good job of explaining them. It was what gave me the idea of the Do Loop. To me the Semaphore in the LR is not even monitoring a file.

This is one case where I would prefer a timer over a thread/semaphore. But a socket is better yet.

Is this what you are talking about? HTTPSecureSocket.SetRequestHeader

Yes.

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246

Does that look right? I didn’t write it, I just found it on UserAgentString.com and clicked on the Edge Browser as I have no issue looking at the library on Edge. I’m looking for the pages that are javascript enabled. I see from last year’s pull a lot of the pages I posted for the students said “We’re sorry, some parts of the RyeLib website don’t work properly without JavaScript enabled.”

Now the LR is a bit confusing. Does the code (assuming this is correct)

myHTTPSecureSocket.SetRequestHeader("User-Agent", "Mozilla/5.0 … ")

change it in the socket, so that I then just code GET(https://thelink)?
Is it a one-off, or do I set it before every GET?

TIA

If you set it on every request, then you’re sure it’s set.
And you won’t end up with several settings for it either.
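
The same habit, sketched in Python for illustration (not Xojo): build every request through one helper so the header is always present and only ever set in one place. `build_request` and `fetch` are invented names; the User-Agent string is the one quoted earlier in the thread.

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246"
)

def build_request(url):
    """Create a request that always carries the User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

def fetch(url, timeout=30):
    """Download one page; the header is applied on every call."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()
```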

Great thanks for the info.