What's wrong with this code?

[code]Dim URL As String = "https://www.nytimes.com/es/"

Socket1.SetRequestHeader("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36")
Dim pageData As String
pageData = Socket1.Get(URL, 10)[/code]

Why does pageData remain empty?

Rough guess: server redirection? What do ErrorCode and HTTPStatusCode say? Maybe use URLConnection instead?

Hey, Tomas

LastErrorCode = 102

HTTPStatusCode = 301

But why does Chrome load the page OK?

Because Chrome automatically follows redirects.

An HTTPSocket is NOT a web browser and does NOT automatically follow redirects, etc.
It grabs whatever the server replies with from that URL, and then YOU need to read that response and act accordingly, like a web browser would.

I understand, Norman… but the reply is empty…

No. It is not. The reply is 301. 301 means moved.

You should get a ResponseHeader("Location") when a 301 or 302 is returned; it holds the link to the actual location (URL).
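A minimal sketch of that check, assuming a URLConnection instead of the original Socket1, and assuming SendSync, HTTPStatusCode and ResponseHeader behave as in the URLConnection code posted in this thread. The variable names are made up for illustration, and the point is only to look at the status and headers before the body:

```xojo
// Sketch only: inspect the status code and headers before trusting the body.
Dim conn As New URLConnection
Dim body As String = conn.SendSync("GET", "https://www.nytimes.com/es/", 10)

Select Case conn.HTTPStatusCode
Case 301, 302
  // A redirect: the body may be empty, the target is in the Location header.
  Dim target As String = conn.ResponseHeader("Location")
  If target = "" Then
    System.DebugLog("Redirect without a Location header")
  Else
    System.DebugLog("Redirected to: " + target)
  End If
Case 200
  System.DebugLog("Got " + Str(Len(body)) + " characters")
Case Else
  System.DebugLog("Unexpected status: " + Str(conn.HTTPStatusCode))
End Select
```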

Which is absolutely okay. The HTTP status code 301 is the important part. With a 301 the body may be empty; that’s defined in RFC 2616.

should… not must…

[quote=455569:@Tomas Jakobs]Which is absolutely okay. The HTTP status code 301 is the important part. With a 301 the body may be empty; that’s defined in RFC 2616.

should… not must…[/quote]

Well, must, but trust me, some servers don’t have it… This one has it, as the browser can follow it.

@Tim Parnell : Ok, right… so the CONTENT is empty…

@Derk Jochems : The ‘location’ attribute points exactly to the URL I have already defined. (It’s like… moved to the same place?)

I still don’t get it… how does Chrome know where the ‘moved’ site is now? The only thing I’ve seen in the headers is the same URL…

The 301 response should have several headers, and one of those will say where it moved to.

That’s odd.

In Postman the result is 301.
The Location header has the value “https://www.nytimes.com/es/”.

I think this has something to do with the server requiring either cookies or some other data?

This exact URL, “https://www.nytimes.com/es/”, works 9 times out of 10 in Postman (with follow redirects set to off),
so it should do the same with Xojo.
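If the cookie guess is right, a 301 that points back at the same URL could mean the server handed out a cookie and wants to see it on the next request. That is only a hypothesis; nothing in this thread confirms the NYT works that way. A rough sketch of testing it, assuming the response carries a single Set-Cookie header and that ResponseHeader returns it as one string (neither is verified here):

```xojo
// Sketch only: replay the request with the cookie the 301 response handed out.
// Assumption: one Set-Cookie header, returned as a single string.
Dim pageURL As String = "https://www.nytimes.com/es/"

Dim firstTry As New URLConnection
Dim body As String = firstTry.SendSync("GET", pageURL, 10)

If firstTry.HTTPStatusCode = 301 Then
  Dim setCookie As String = firstTry.ResponseHeader("Set-Cookie")
  If setCookie <> "" Then
    // Keep only "name=value"; attributes such as Path or Expires follow the first ";".
    Dim cookiePair As String = Trim(NthField(setCookie, ";", 1))
    
    // Location reportedly points at the same URL, so ask for it again with the cookie.
    Dim secondTry As New URLConnection
    secondTry.RequestHeader("Cookie") = cookiePair
    body = secondTry.SendSync("GET", pageURL, 10)
    System.DebugLog("Second try status: " + Str(secondTry.HTTPStatusCode))
  End If
End If
```

If the second status is still 301, cookies are probably not what the server is checking.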

[code]Dim URL As String = "https://www.nytimes.com/es/"

Socket1.SetRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36")
Dim pageData As String
pageData = Socket1.Get(URL, 10)[/code]

Note the capitalized “User-Agent”

This has everything to do with The New York Times being copyrighted material, and them putting in some scraping and theft protection.

Perhaps.

If you set a breakpoint at pageData = Socket1.Get(url, 10),
you won’t see the data in the debugger until you step over that line, so he might not see data even though it’s actually there.

[code]Dim URL As String = "https://www.nytimes.com/es/"
Dim UserAgent As String = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"

Dim s1 As New URLConnection
s1.RequestHeader("User-Agent") = UserAgent

Dim pageData As String
pageData = s1.SendSync("GET", URL, 10)
Break // inspect pageData and s1.HTTPStatusCode in the debugger

If s1.HTTPStatusCode = 301 Then
  // A 301 carries the new address in the Location header
  If s1.ResponseHeader("Location") <> "" Then
    Dim NewURL As String = s1.ResponseHeader("Location")
    
    // Use the second connection for the follow-up request
    Dim s2 As New URLConnection
    s2.RequestHeader("User-Agent") = UserAgent
    
    Dim p2dat As String
    p2dat = s2.SendSync("GET", NewURL, 10)
    Break // inspect p2dat in the debugger
  End If
End If[/code]

FWIW, it’s not only the NYTimes… I’ve tried a couple of websites in different parts of the world, with the same result… (yeah, they are all newspapers and banking sites, so they probably care about intellectual property…) But Chrome can redirect…

I’ll repeat myself

[quote=455563:@Norman Palardy]Because Chrome automatically follows redirects.

An HTTPSocket is NOT a web browser and does NOT automatically follow redirects, etc.
It grabs whatever the server replies with from that URL, and then YOU need to read that response and act accordingly, like a web browser would.[/quote]

Yes. They do that on purpose. Website owners only have to respond if they want to, so many websites are designed to check for the facets of a browser. User-Agent is one. There are other ways.

You’re asking us to help you circumvent a specific server by guessing at its requirements.

@Norman Palardy : Since we’re repeating ourselves I’ll do it as well : I understand, Norman… but the reply is empty…

I am not comparing Xojo with Chrome. I am just trying to understand the logic, and where the data for the redirection comes from…

ok, Tim. This helps me understand better … “so many websites are designed to check for the facets of a browser. User-Agent is one. There are other ways.”

Servers return data in one of several ways:

  1. Headers that indicate whether things have moved, can’t be found, have been redirected, etc.
    A properly written web client, like Chrome, will know what to do with these and behave properly.

  2. The content, or “the reply”.
    For certain requests there will be headers but no reply.
    For some there will be headers and a reply.

A properly written client, like Chrome, knows the HTTP protocol and deals with all this accordingly.
For YOUR application to do what it wants, you will need to implement all that as well.
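To make "implement all that" a bit more concrete, here is a rough sketch of the redirect part only. FetchFollowingRedirects is a hypothetical helper name, not something Xojo ships. It assumes every Location header holds an absolute URL, and it uses only the URLConnection calls already posted in this thread. Depending on the Xojo version and platform, URLConnection may already follow redirects by itself, in which case the redirect branch simply never runs.

```xojo
// Hypothetical helper: fetch a URL and follow redirects by hand.
// Makes the first request plus at most maxHops follow-up requests.
// Assumes each Location header contains an absolute URL.
Function FetchFollowingRedirects(startURL As String, maxHops As Integer) As String
  Dim currentURL As String = startURL
  
  For hop As Integer = 0 To maxHops
    Dim conn As New URLConnection
    conn.RequestHeader("User-Agent") = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    Dim body As String = conn.SendSync("GET", currentURL, 10)
    
    Select Case conn.HTTPStatusCode
    Case 301, 302, 303, 307, 308
      // Redirect: the next address is in the Location header.
      Dim target As String = conn.ResponseHeader("Location")
      If target = "" Then Return body // nowhere to go, hand back whatever arrived
      currentURL = target
    Case Else
      Return body // not a redirect; the caller should still check for errors
    End Select
  Next
  
  Return "" // gave up: too many redirects
End Function
```

It could be called as `Dim html As String = FetchFollowingRedirects("https://www.nytimes.com/es/", 5)`. The hop limit is there because a server that keeps redirecting to the same URL, as described earlier in this thread, would otherwise loop forever.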

And you may still get empty replies and headers because, as Tim notes, the NYT may in fact be determining that this “client” is just someone trying to scrape data: it doesn’t send the right headers, it doesn’t support the capability examination the web site uses to tell whether you are indeed a legitimate web client, or it trips some other technique meant to stop scraping.

Without properly determining what it is the NYT is looking for, you’ll probably make no headway.
And until your code replies in the way the NYT expects, you’ll make no headway.

I’d start by reading the HTTP protocol docs and implementing as much of that as you need

further see https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol

Thanks Norman!