How do I grab content from a web page - Mac platform

Paul_Chance · December 5, 2023, 12:21am

Using just Xojo or Plug-ins (MonkeyBread, etc.) how would I go to a password protected web page - I have the username and password - which displays a page of text. I want to grab that page of text into a variable I would then parse into fields/records to add to an SQLite database (no need for multiuser)

Using a Mac database, Panorama, I have a command that takes the username, password, and URL destination as parameters and can load the text content into a variable. From there I can parse the data into appropriate fields (it has no formal format - like JSON or XML).

I’d like to do the same thing in Xojo.

Is that possible with the Xojo feature set, perhaps supplemented with 3rd party plug-ins?

I did search for this but didn’t find anything close.

Eric_Williams · December 5, 2023, 1:40am

Look into the command line tool ‘curl’. It allows you to download a URL into a file - ie, the contents of the page you’re interested in. Then you can parse the file and extract whatever you need.
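
A minimal sketch of how that might look from inside Xojo, using the built-in Shell class to run curl and capture its output directly rather than going through a file. The URL and the johnsmith:secret credentials are placeholders, and the --user option assumes the site uses HTTP Basic authentication, which is not confirmed at this point in the thread:

```xojo
' Assumption: placeholder URL and credentials, replace with your own.
' Assumption: the site accepts HTTP Basic authentication (curl --user).
Var sh As New Shell
sh.Execute("curl --silent --fail --user 'johnsmith:secret' 'http://www.example.com/securepage'")

If sh.ExitCode = 0 Then
  ' curl wrote the page to standard output, which Shell collects in Result
  Var pulledData As String = sh.Result
  ' parse pulledData into fields here
Else
  ' --fail makes curl return a non-zero exit code on HTTP errors such as 401
  System.DebugLog("curl failed with exit code " + sh.ExitCode.ToString)
End If
```

In the default synchronous mode, Execute blocks until curl finishes, so for slow pages this is better run from a thread than from a button's event handler.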

Andrew_Lambert · December 5, 2023, 3:05am

Add a URLConnection object to your window. Implement its AuthenticationRequested event to handle the username/password, and its ContentReceived event to receive the page:

Event AuthenticationRequested(realm As String, ByRef name As String, ByRef password As String) As Boolean
  name = "myUsername"
  password = "myPassword"
  Return True
End Event

Event ContentReceived(URL As String, HTTPStatus As Integer, content As String)
  ' the received page is in the content parameter
End Event

Then elsewhere in the window you can use the URLConnection.Send method to initiate the request:

URLConnection1.Send("GET", "http://www.example.com/page.html")

Christian_Schmitz · December 5, 2023, 8:38am

What kind of authentication do they use?

e.g. is it HTTP Basic Authentication or a login form?

You can automate both. For the login form, you may need to run JavaScript to fill it automatically.

Paul_Chance · December 5, 2023, 5:50pm

Eric and Andrew, thank you for those nudges in the right direction.

Christian, I don’t know about any special authentication. This is what the Panorama statement looks like:

url("http://www.somesite/securepage","USER","johnsmith:secret")

johnsmith is the username and secret is the password. So the statement only needs a URL, username, and password. I figured there was something similar in Xojo - but I expect it to take more lines of code and structure to achieve the same thing.

Looks like I need:

1. A connection
2. Authentication Requested
3. Content Received

Tim_Parnell · December 5, 2023, 6:09pm

I’m having trouble pulling up the website and documentation for that language. If you have a link or document, that could be of use. There are quite a few different ways to authenticate, and AuthenticationRequested only occurs for HTTP Basic Auth. I’m not sure that includes a “USER” parameter, so you may be looking at some other type.

Documentation for either Panorama’s url function or the actual page you’re trying to access would allow the community to be more helpful to you.

Paul_Chance · December 5, 2023, 11:41pm

Tim, I want to check with the Panorama (Mac Database) developer before I copy/paste his documentation somewhere else - could be a copyright issue. The point of posting the Panorama statement was to show that it just needed a URL, user, and PW. I believe there’s a "pulleddata = " in front of it so the pulled content ends up in the string variable, “pulleddata”, and I would parse it from there.

I figure there was some equivalent way to do that in Xojo - just pull text from a web page that needs a username/password if you went to it manually.

Tim_Parnell · December 6, 2023, 1:48pm

The problem is that there are quite a few different ways to authenticate. The “USER” parameter was raising some concern.

However, I was able to find the correct tool now knowing “Panorama (Mac Database)” and their documentation is freely accessible online. The relevant page is here url function. Thanks to the documentation, we now know the “USER” parameter is for Panorama and is not part of the authentication. The good news for you is that this means the authentication is likely HTTP Basic, so you’ll be able to use the AuthenticationRequested event.
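
If it does turn out to be HTTP Basic, a rough Xojo equivalent of the one-line Panorama call can be sketched by sending the credentials up front in an Authorization header and using the synchronous SendSync method, instead of waiting for the AuthenticationRequested event. The URL and the johnsmith:secret pair are the placeholders from earlier in the thread, and the 30-second timeout is an arbitrary choice:

```xojo
' Assumption: the server uses HTTP Basic authentication.
' Assumption: placeholder URL and credentials, replace with your own.
Var conn As New URLConnection

' Basic auth is "username:password", Base64-encoded (0 = no line wrapping)
conn.RequestHeader("Authorization") = "Basic " + EncodeBase64("johnsmith:secret", 0)

Try
  ' SendSync blocks until the response arrives or the timeout (seconds) expires
  Var pulledData As String = conn.SendSync("GET", "http://www.example.com/securepage", 30)
  If conn.HTTPStatusCode = 200 Then
    ' parse pulledData into fields here
  Else
    System.DebugLog("Server returned HTTP status " + conn.HTTPStatusCode.ToString)
  End If
Catch e As NetworkException
  System.DebugLog("Request failed: " + e.Message)
End Try
```

Note that Basic authentication only Base64-encodes the credentials, so over a plain http:// URL the password is effectively sent in the clear. Use https:// if the site supports it.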

As a gesture of good faith in apology for any undue concern, I’ve built you an example project for how to use the AuthenticationRequested event with HTTP Basic Auth. basic_auth.xojo_xml_project

Best wishes!

Douglas_Handy · December 6, 2023, 5:15pm

Be aware that this will pull the raw HTML contents, which may not include data which is populated by javascript on the page by the browser. In these cases, one option is to use an HTML viewer control to render the page then in the DocumentComplete event do something like this to grab the contents:

Var rawHtml As String = me.ExecuteJavaScriptSync("document.getElementsByTagName('html')[0].innerHTML;")

This typically yields the effective HTML once javascript has finished processing, and sometimes is needed to get data dynamically loaded via js instead of sent directly with the webpage source.

Paul_Chance · December 6, 2023, 5:56pm

Thank you Tim, I wasn’t sure if the Panorama documentation was “public” or not. I’ve downloaded your project and will look into it soon. Because I know Panorama, I usually build in that first - to get the concepts and details I might not have thought about clearer. Then implement the same thing in Xojo.

In fact, so as not to get off-topic, I’ll post another query about an attribute that was available in a (very) old rival to Xojo - FutureBasic (née ZBasic).

Douglas, thank you for the reminder about raw HTML vs “pure” data. I believe the commands I’m using will pull out just the data content, leaving the HTML behind. But if not, there are filters that will strip out the HTML tags.

Douglas_Handy · December 6, 2023, 6:08pm

This was not so much about HTML vs “pure” data as it is HTML vs rendered data. Some sites use javascript to populate data (especially tables) and in such cases, getting the raw page contents is not enough and you need a browser (or headless browser) to get the data. And then strip out the HTML etc.