RegEx help needed

Alexander_van_der_Linden · March 24, 2014, 12:18pm

I have this HTML code:

<a class="d_link" href="/projecten/500641/Project-assistent.asp">Project assistent</a>

I need to extract the words “Project assistent” (the link text). The rest of the link, such as the href, varies, so I need a wildcard for that. I’m not very familiar with RegEx and I find the options confusing. Can anybody help?

Joshua_Woods · March 24, 2014, 12:24pm

You’d be best off waiting for Kem to answer since he’s the RegEx wizard around here, but this might get you out of trouble for now:

<[\w .="/-]+>([\w ]+)<[\w /]+>

The part within the parentheses is the submatch, and that’s the link text you’re after.

Use http://regexpal.com/ too. It’s great.
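
For anyone who wants to try that pattern from Xojo code, here is a minimal sketch. It assumes the pattern is meant to use single backslashes (\w), and the sample string is just the link from the first post. With that input the MsgBox should show the link text.

```xojo
' Sample input: the link from the first post (quotes doubled for Xojo).
Dim html As String = "<a class=""d_link"" href=""/projecten/500641/Project-assistent.asp"">Project assistent</a>"

Dim re As New RegEx
' Opening tag, then the captured link text, then the closing tag.
re.SearchPattern = "<[\w .=""/-]+>([\w ]+)<[\w /]+>"

Dim match As RegExMatch = re.Search(html)
If match <> Nil Then
  ' SubExpressionString(0) is the whole match, (1) is the first submatch.
  MsgBox match.SubExpressionString(1)
End If
```

Note that the character class in the submatch only allows word characters and spaces, so link texts containing punctuation would need a wider class.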

Bill_Gookin · March 24, 2014, 12:30pm

I don’t know about RegEx, but you could do it with XML.

dim xml as new XmlDocument
xml.LoadXml("<a class=""d_link"" href=""/projecten/500641/Project-assistent.asp"">Project assistent</a>")  
MsgBox xml.FirstChild.FirstChild.Value
MsgBox xml.FirstChild.GetAttribute("href")

The first MsgBox shows “Project assistent” (the first FirstChild gives you the “a” node, the second FirstChild gives you its text node, and the Value of that is “Project assistent”). The second shows “/projecten/500641/Project-assistent.asp”, in case you need that.

Note that I had to double-double quote the class and href values because of the way I input it.

Mike_Cotrone · March 24, 2014, 12:39pm

Alexander,
Here is a very quick demo to show you RegEx matching in Xojo code.

I used this pattern (there are many ways to skin this cat, pattern-wise):

(?<=/Project-assistent.asp">).+(?=</a>)

https://www.dropbox.com/s/vbp6x9sxvugzuzr/RegExHTML_Text.xojo_binary_project

HTH

Alexander_van_der_Linden · March 24, 2014, 12:44pm

Hi,

the problem is that the text “Project-assistent.asp” varies. It can be anything…

Mike_Cotrone · March 24, 2014, 12:44pm

Kem is still sleepin’ in Vegas probably

Alexander_van_der_Linden · March 24, 2014, 12:44pm

[quote=73737:@Bill Gookin]I don’t know about RegEx, but you could do it with XML.[/quote]

Can I load the whole HTML page or a piece of that to get the HREFs in this way?

Thanks

Mike_Cotrone · March 24, 2014, 12:46pm

[quote=73740:@Alexander van der Linden]Hi,

the problem is that the text “Project-assistent.asp” varies. It can be anything…[/quote]

Bill_Gookin · March 24, 2014, 12:53pm

[quote=73743:@Alexander van der Linden]Can I load the whole HTML page or a piece of that to get the HREFs in this way?

Thanks[/quote]
You should be able to load the whole HTML page, but then you’ll need to get to the correct tag. XQL will probably be your friend here. If you want to post your HTML, or a simplified version of it, I (or someone) can surely help.
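
As a rough sketch of the XQL idea, assuming the page (or the piece you load) is well-formed XML: the sample string below is a hypothetical, cleaned-up fragment, not the real page. Real-world HTML often isn't well-formed (an unescaped & in an href is enough to break it), in which case LoadXml raises an XmlException.

```xojo
' Hypothetical, well-formed sample; a real page may need cleaning first.
Dim source As String = "<table><tr><td><a class=""d_link"" href=""/projecten/500641/Project-assistent.asp"">Project assistent</a></td></tr>" _
  + "<tr><td><a class=""d_link"" href=""/projecten/500915/Aanpassing-ae-films.asp"">Aanpassing ae films</a></td></tr></table>"

Dim xml As New XmlDocument
Try
  xml.LoadXml(source)
Catch e As XmlException
  MsgBox "Not well-formed XML: " + e.Message
  Return
End Try

' Select every <a> element whose class attribute is d_link.
Dim links As XmlNodeList = xml.XQL("//a[@class='d_link']")
For i As Integer = 0 To links.Length - 1
  Dim link As XmlNode = links.Item(i)
  ' FirstChild of the link is its text node.
  MsgBox link.FirstChild.Value + " -> " + link.GetAttribute("href")
Next
```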

Alexander_van_der_Linden · March 24, 2014, 1:01pm

I’m trying to parse a website to obtain the links and their descriptive texts. The links are grouped in this fashion:

<a class="d_link" href="/projecten/500641/Project-assistent.asp">Project assistent</a></td>
<td width="90"><center><a href="?p=companyprofile&user=sgoossens" target="_blank"></a></center></td>
<td width="120" ><span class="d_tekst">utrecht</span></td>
<td width="65" ><span class="d_tekst">12:42</span></td>
<td width="120" ><font face="Verdana" color="#808080"><small><small>13 </small></small></font></td>
</tr>
<tr bgcolor="#F2F2F2" height="22" style="height:25px">
<td width="20"></td>
<td width="295" >

Then new blocks (hundreds) appear with the same type of links but different texts:

><a class="d_link" href="/projecten/500915/Aanpassing-ae-films.asp">Aanpassing ae films</a></td>
<td width="90"><center><a href="?p=companyprofile&user=keesvanvelzen" target="_blank"></a></center></td>
<td width="120" ><span class="d_tekst">-</span></td>
<td width="65" ><span class="d_tekst">12:42</span></td>
<td width="120" ><font face="Verdana" color="#808080"><small><small>11  &nbsp;(<font color="#008080">nieuw project</font>)</small></small></font></td>
</tr>
<tr bgcolor="#FFFFFF" height="22" style="height:25px">
<td width="20"></td>
<td width="295" >

So I want the ‘human readable’ texts in this manner. If done correctly, the first block should return:
‘Project assistent’
‘sgoossens’
‘12:42’
‘utrecht’
‘295’

Maybe I should just look for delimiters ‘>’ and ‘<’?
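
A minimal sketch of that delimiter idea in Xojo, under some assumptions: the sample string is a hypothetical, trimmed-down piece of one row, and the pattern simply grabs any text sitting between a ‘>’ and the next ‘<’ that starts with a non-whitespace character.

```xojo
' Hypothetical sample: a trimmed-down piece of one row.
Dim html As String = "<a class=""d_link"" href=""/projecten/500641/Project-assistent.asp"">Project assistent</a></td>" _
  + "<td width=""120""><span class=""d_tekst"">utrecht</span></td>" _
  + "<td width=""65""><span class=""d_tekst"">12:42</span></td>"

Dim re As New RegEx
' Capture a run of non-tag characters between ">" and "<",
' skipping runs that begin with whitespace.
re.SearchPattern = ">([^<>\s][^<>]*)<"

Dim texts() As String
Dim match As RegExMatch = re.Search(html)
While match <> Nil
  texts.Append match.SubExpressionString(1)
  match = re.Search ' continue searching after the previous match
Wend

MsgBox Join(texts, EndOfLine)
```

This only picks up text nodes. Values that live inside attributes, such as ‘sgoossens’ in the href or the ‘295’ width, would need their own patterns.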

Bill_Gookin · March 24, 2014, 1:11pm

This would be a lot easier if each link group is surrounded by something, like if it were in its own block, or in its own

or . Are they separated like that somehow? The examples you gave just have part of one row and then part of the next, and it’s hard to see how that final 295 is associated with the link in the previous row.

Alexander_van_der_Linden · March 24, 2014, 1:13pm

Ok guys… stop thinking (for this question that is ;-)). The site provides an XML download…

Thanks for your help! Much appreciated.

Kem_Tekinay · March 24, 2014, 1:53pm

Glad I could help…

Joshua_Woods · March 24, 2014, 2:00pm

You can still post a brilliant RegEx solution in case a future reader reading the thread needs it : )

I’d also be curious to see your solution.

Mike_Cotrone · March 24, 2014, 2:13pm

I’m still trying to build a “Kem”-style “If” conditional regular expression pattern, mainly because I’m not familiar with it.

Kem_Tekinay · March 24, 2014, 3:08pm

Since there is no rush, I’ll look at this in detail later, but at first blush, this doesn’t look like a case where regular expressions are a good fit.

Mike_Cotrone · March 24, 2014, 3:08pm

Darn… (as I am pulling my hair out trying to make one) hehe. I’m waiting on my flight, so it’s fun.