Parse data from webpage in batch

Sergio_Ciordia · January 14, 2015, 7:31pm

Hi everybody,

I am a newbie in Xojo and I’m trying to build a simple app. I have several values in a listbox (for example: P02768, P02647 and P49840). For each value in the listbox, I would like to load this webpage in the background:

http://www.uniprot.org/uniprot/ [+ value]

Then I would like to get the source code of each results page and parse two values from it, “protein name” and “Gene”, for every entry in the batch, as in this example:

“…

Glycogen synthase kinase-3 alpha

Gene

GSK3A

Organism

Homo sapiens (Human)

Status

…”

Finally, I would like to add the obtained values (protein name and gene) to the listbox.

Is it possible to do this with Xojo? I managed it in FileMaker using the web viewer and a script, but I don’t know how to do it in Xojo. Could anyone help me, please? I am lost.

Thank you very much,
Sergio

Tim_Hare · January 14, 2015, 7:46pm

One approach would be to load the result into an XMLDocument and use XQL to locate the nodes you want based on their “class” property.

Kem_Tekinay · January 14, 2015, 8:23pm

The answer is, yes, it’s all possible, but you have to be more specific about which part confuses you. If you ask specific questions, you’ll most likely get more responses.

Aaron_Martinez · January 14, 2015, 8:36pm

To download the source code of a web page use a socket:

http://documentation.xojo.com/index.php/HTTPSocket.Get

Then save the resulting string in some format that suits your preferred way of parsing, such as the XMLDocument that Tim mentioned. For similar tasks I have used array = string.split(EndOfLine) to get an array that can be looped through and searched.
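A minimal sketch of that download step, under some assumptions: the listbox is named Listbox1 and holds the accession in column 0 (both names are hypothetical), and the page is served as UTF-8. Calling Get with a timeout makes it synchronous, which keeps the example short but blocks the UI while each page loads.

```xojo
' Sketch only: Listbox1 and its column layout are assumptions
Dim socket As New HTTPSocket

For row As Integer = 0 To Listbox1.ListCount - 1
  Dim accession As String = Listbox1.Cell(row, 0)
  Dim url As String = "http://www.uniprot.org/uniprot/" + accession

  ' With a timeout (in seconds), Get waits and returns the page body as a String
  Dim html As String = socket.Get(url, 30)
  html = DefineEncoding(html, Encodings.UTF8) ' assumption: UTF-8 content

  If html <> "" Then
    ' Normalize line endings, then split into lines that can be looped through
    Dim lines() As String = Split(ReplaceLineEndings(html, EndOfLine), EndOfLine)
    System.DebugLog(accession + ": " + Str(lines.Ubound + 1) + " lines")
  End If
Next
```

For a long list, the asynchronous form of Get (no timeout, handling the PageReceived event) would probably be kinder to the user interface.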

Also, there is a “Preview First 10” link in the download button that you might be able to manipulate to get multiple items at once in a format that is easier to work with.

http://www.uniprot.org/uniprot/?sort=&desc=&query=&fil=&limit=10&force=no&format=xml

Tim_Hare · January 14, 2015, 8:59pm

That’s a lot of data returned on each page. I wouldn’t use XML, just search for the unique terms “content-protein” and “content-gene” using InStr().

Sergio_Ciordia · January 15, 2015, 10:43pm

Thanks everybody,

In the end I followed the recommendations from Aaron and Tim and used HTTPSocket.Get and InStr() to build my app. To help other Xojo newbies, here is the method I used to parse the text from the HTML I obtained:

Method: ParseData (theText As String, theStartTag As String, theEndTag As String) As String

[code]
Dim theStartPos As Integer
Dim theEndPos As Integer
Dim theResult As String // stays empty if either tag is missing

theStartPos = InStr(theText, theStartTag)

If theStartPos > 0 Then
  // Move past the start tag, then look for the end tag from there
  theStartPos = theStartPos + Len(theStartTag)
  theEndPos = InStr(theStartPos, theText, theEndTag)

  // Only extract when the end tag was found, to avoid a negative length
  If theEndPos > 0 Then
    theResult = Mid(theText, theStartPos, theEndPos - theStartPos)
  End If
End If

Return theResult
[/code]
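A minimal usage sketch, assuming ParseData is declared to return a String. The HTML fragment and the tags passed in are hypothetical stand-ins, not the real UniProt markup:

```xojo
' Hypothetical fragment; the real page markup will differ
Dim sample As String = "<h2>Gene</h2><span>GSK3A</span><h2>Organism</h2>"

Dim gene As String = ParseData(sample, "<span>", "</span>")
' gene is now "GSK3A"

Dim missing As String = ParseData(sample, "<em>", "</em>")
' missing is "" because the start tag never occurs in sample
```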

What do you think about this? I adapted it from a function I had written for FileMaker.

Thanks again,
Sergio

Kem_Tekinay · January 16, 2015, 1:10am

FileMaker doesn’t have built-in regular expressions, although you can certainly add a plugin for that. Xojo does, and you should take advantage. I’d recommend this:

[code]
theStartTag = "\Q" + theStartTag.ReplaceAllB( "\E", "\\EE\Q" ) + "\E"
theEndTag = "\Q" + theEndTag.ReplaceAllB( "\E", "\\EE\Q" ) + "\E"

dim rx as new RegEx
rx.Options.Greedy = false
rx.SearchPattern = theStartTag + "([\s\S]*)" + theEndTag

dim theResult as string
dim match as RegExMatch = rx.Search( theText )
if match <> nil then
  theResult = match.SubExpressionString( 1 )
end if

return theResult
[/code]

The first two lines safely wrap theStartTag and theEndTag in “\Q” and “\E”. That tells the regex engine to take the text literally, even if it includes symbols that would otherwise have a special meaning, such as “*” or parentheses.
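To make the quoting step concrete, here is a small sketch with a made-up tag (the tag and the searched text are hypothetical). The ReplaceAllB call only matters when the tag itself contains “\E”, which would otherwise end the quoted section early: the replacement closes the quote, emits a literal “\E”, and reopens the quote.

```xojo
' Hypothetical tag containing regex metacharacters (parentheses and *)
Dim tag As String = "<b>(gene)*</b>"

Dim quoted As String = "\Q" + tag.ReplaceAllB("\E", "\\EE\Q") + "\E"
' quoted is now: \Q<b>(gene)*</b>\E

Dim rx As New RegEx
rx.SearchPattern = quoted

' The parentheses and asterisk are matched as plain characters,
' so this search should find the tag exactly as written
Dim m As RegExMatch = rx.Search("text <b>(gene)*</b> more")
If m <> Nil Then
  System.DebugLog("Matched: " + m.SubExpressionString(0))
End If
```

Without the quoting, the same pattern would treat “(gene)*” as an optional repeated group and would not match the literal parentheses.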

Sergio_Ciordia · January 16, 2015, 8:28pm

Thank you very much, Kem. That’s true, Xojo is very powerful and this is a good example. I have used your code and it works very well. I don’t fully understand the theStartTag and theEndTag lines at the beginning because I don’t know enough about regular expressions yet, but I am slowly getting there.

Thanks again,
Sergio