Web robot?

In https://forum.xojo.com/10623-80-language-translation-class a member is apparently using Google Translate to power his SimTranslator class.

Google’s TOS aside, and setting aside the pretension of selling engineering that belongs to someone else, I find the concept intriguing. He pulled off driving a web site with a program that sends events, strings, and maybe clicks, takes the result meant for human eyes, and repackages it into a text field.

There are uses for web robots: fetching information that is not necessarily accessible through APIs.

I want to explore the concept. For now, I will experiment with HTTPSocket, which is meant to send and receive HTTP.

Any advice will be appreciated.

OK. I started work on the web robot. I use an HTTPSocket to open the page, and an HTMLViewer to see the result of the incoming HTML. At first glance, translate.google.com seems like it can be automated with keystrokes.
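
Concretely, the fetch-and-view part is only a few lines. A minimal sketch of the idea (untested; the temporary file is only passed so LoadPage can resolve relative links):

    Dim http As New HTTPSocket
    Dim f As FolderItem = GetTemporaryFolderItem
    // synchronous GET with a 30-second timeout
    Dim html As String = http.Get("http://translate.google.com/", 30)
    // show the raw result in the viewer
    HTMLViewer1.LoadPage(html, f)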

The field to translate has focus by default. A paste seems possible. Then Tab jumps between the popup menus, and Return clicks “Translate”.

I now have to figure out how to feed the keystrokes through the HTTPSocket :slight_smile:

Don’t use HTMLViewer to do website automation. It’s far too limited in Xojo and not even necessary for what you want to do. You don’t have access to the cookies or to the DOM of the website, and you can only execute JavaScript.

I use HTTPSocket for all website automation.

Install Fiddler, sniff the traffic of the website you want to automate and start coding it with HTTPSocket.
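
For example, once Fiddler shows the request the browser sends, you replay it by setting the same headers before the Get. A sketch; the header values here are placeholders, not the ones Fiddler will actually show you:

    Dim http As New HTTPSocket
    // copy the headers from the sniffed request
    http.SetRequestHeader("User-Agent", "Mozilla/5.0")
    http.SetRequestHeader("Referer", "http://translate.google.com/")
    Dim reply As String = http.Get("http://translate.google.com/", 30)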

Ask any specific questions you may have for web automation with HTTPSocket, and I will try to answer them!

Use HTTPSocket or HTTPSecureSocket to fetch the data. If the site has an API, use the API: APIs don’t change that much, and they are designed to be called by “bots” or programs. If there is no API, you can fetch the pages, then parse the source. The problem with the latter is that when the marketing department changes the page, your program breaks. Some companies change their website layout (even if it is only under the hood) specifically to break page scraping.
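
Fetching over SSL has the same call shape; a rough sketch, with a made-up URL:

    Dim https As New HTTPSecureSocket
    https.Secure = True  // use SSL
    Dim page As String = https.Get("https://www.example.com/some/page", 30)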

I just use it to visualize.

Should be useful. Thank you.

I understand. I just want to see the feasibility.

Maybe this little thing can be of some assistance in building your headers for the HTTP socket? This example is done with a TCPSocket.

source code RB2007R4:
TCPSocketTranslate.rbp
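
The gist of it is hand-building the request text. A rough sketch of the idea, not the actual project code:

    Dim sock As New TCPSocket
    Dim CRLF As String = Chr(13) + Chr(10)
    sock.Address = "translate.google.com"
    sock.Port = 80
    sock.Connect
    // in a real project the Write belongs in the Connected event,
    // and the reply is collected in DataAvailable via sock.ReadAll
    sock.Write("GET / HTTP/1.1" + CRLF _
      + "Host: translate.google.com" + CRLF _
      + "Connection: close" + CRLF + CRLF)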

Michel,

Not sure if this will be of any help or not, but I’ve done some web bot work using cURL’s command line interface, SHELL and RealStudio. This is just a preference thing, but I much preferred working with cURL than HTTPSocket.

Ben

Thank you :slight_smile:

[quote=76889:@Ben Scofield]Michel,

Not sure if this will be of any help or not, but I’ve done some web bot work using cURL’s command line interface, SHELL and RealStudio. This is just a preference thing, but I much preferred working with cURL than HTTPSocket.

Ben[/quote]

I started with the tool I am most comfortable with, but I am open to any solution that works. Basically, I want to see if it can be done :slight_smile:

This is quite interesting. I did not realize the query could be URL-encoded. This opens new perspectives.
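
So, if I follow, something like this should work. The endpoint and parameter names here are guesses for illustration; the traffic sniffer would show the real ones:

    Dim text As String = EncodeURLComponent("Hello world")
    Dim url As String = "http://translate.google.com/?sl=en&tl=fr&text=" + text
    Dim http As New HTTPSocket
    Dim reply As String = http.Get(url, 30)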

I’ve ripped some web pages with AutoIt (Windows-only, I think) in the past. Very powerful scripting stuff. Maybe the Mac has something similar?

The result I got back from the TCPSocket looked surprisingly easy. A simple parse gets the translation.
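
Something along these lines; the marker strings are invented, so substitute whatever actually surrounds the translation in the reply you get back:

    Dim http As New HTTPSocket
    Dim reply As String = http.Get("http://translate.google.com/", 30)
    Dim startTag As String = "<span id=""result"">"  // made-up marker
    Dim endTag As String = "</span>"
    Dim s As Integer = InStr(reply, startTag)
    If s > 0 Then
      s = s + Len(startTag)
      Dim e As Integer = InStr(s, reply, endTag)
      If e > s Then
        MsgBox(Mid(reply, s, e - s))  // the extracted translation
      End If
    End If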

[quote=76898:@Alain Bailleul]I’ve ripped some web pages with AutoIt (Windows-only, I think) in the past. Very powerful scripting stuff. Maybe the Mac has something similar?

[/quote]

Actually, I would love to build such a tool. It may not be awfully powerful, but a small scripting language and some parsing of the results would do. Back in the eighties, I had devised a bot in QuickBasic for the French Minitel, and I have some ideas about what can be involved. For some reason, I thought HTTP transactions were more complicated than line-oriented terminals.

[quote=76898:@Alain Bailleul]The result I got back from the TCPSocket looked surprisingly easy. A simple parse gets the translation.

[/quote]

I see. That makes the packaged $40 class even less justified.

QuickBasic! This brings back memories… As a young teenager I wrote my first memory optimization tool in QuickBasic so I could get the extra 5 KB I needed to play Police Quest 4 on my x86 :slight_smile:

OK, now I have a question: with HTMLViewer1.LoadPage(http.Get(url, 30), f) I get a page. Now, to simulate a user, I need to send keys to the server. It is not a URL. How do I do that?
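
My guess is that the socket equivalent of those keys is posting the form data the page would send. A minimal sketch with hypothetical field names (a sniffer would reveal the real ones):

    Dim http As New HTTPSocket
    Dim form As New Dictionary
    form.Value("text") = "Hello world"  // the "typed" text
    form.Value("sl") = "en"             // source language (guess)
    form.Value("tl") = "fr"             // target language (guess)
    http.SetFormData(form)
    Dim reply As String = http.Post("http://translate.google.com/", 30)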

I’ll say you can most definitely make a bot to walk a set of pages that have low dynamic content (since there’s no nice way, with an HTTPSocket, to execute on-page JavaScript that modifies the DOM, computes URLs, etc.).

I’ve done it using just the HTTPSocket to produce the internal documentation set from a private copy of the wiki (so we DON’T scrape the online one & cause performance issues).
But those pages have no dynamic content created or accessed using JavaScript; the URLs for images and references to other pages are all static tags in the HTML pages.

I understand. Maybe I will work from the project built by Alain Bailleul now.

the tricky parts are

  1. keeping track of the URLs you have previously visited so you don’t visit them over & over
  2. limiting things so you don’t go trying to visit URLs that would take you OFF the site you are scraping (a sketch follows below)

Oh, and if someone has a robots.txt file in place, you should honor that too.
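
A minimal sketch of that bookkeeping, with invented names (the real thing also needs a work queue and a robots.txt check):

    Dim http As New HTTPSocket
    Dim visited As New Dictionary                      // URLs already fetched
    Dim base As String = "http://wiki.example.com/"    // site to stay on

    // for each candidate URL pulled out of a fetched page:
    Dim url As String = base + "SomePage"
    If Not visited.HasKey(url) And Left(url, Len(base)) = base Then
      visited.Value(url) = True
      Dim page As String = http.Get(url, 30)
      // parse page here and queue any new same-site links
    End If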

I’ve got to interrupt normal “work type” conversation to say:

POLICE QUEST (Along with King’s Quest, et al) was AWESOME!!

That is all. :slight_smile:

Most of the Sierra quests were awesome. I actually downloaded Hero’s Quest I (my favorite series of theirs) a few months ago and played it in DOSBox. I probably got most of my typing skills from those typing quests.