RegEx help needed

I have this HTML code.
<a class="d_link" href="/projecten/500641/Project-assistent.asp">Project assistent</a>

I need to extract the link text “Project assistent”. The href part of the link varies, so I need a wildcard for that. I’m not very familiar with RegEx and I find the options confusing. Anybody?

You’d be best off waiting for Kem to answer since he’s the RegEx wizard around here, but this might get you out of trouble for now:

<[\w .="/-]+>([\w ]+)<[\w /]+>

The part within the parentheses is the submatch, and that’s your link text.
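
In case it helps, here is a rough sketch of how a pattern like that could be run in Xojo with the RegEx class. The sample string is just the single link from the question and the variable names are mine, so treat it as a starting point rather than a tested solution.

```xojo
// Sample input: the single link from the question (quotes doubled for the string literal).
Dim source As String = "<a class=""d_link"" href=""/projecten/500641/Project-assistent.asp"">Project assistent</a>"

Dim re As New RegEx
re.SearchPattern = "<[\w .=""/-]+>([\w ]+)<[\w /]+>"

Dim match As RegExMatch = re.Search(source)
If match <> Nil Then
  // SubExpressionString(0) is the whole match; (1) is the first parenthesized group.
  MsgBox match.SubExpressionString(1)
End If
```

Note that the character classes only allow word characters and a few symbols, so a link containing other characters (an ampersand in the href, say) would not match without widening them.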

Use http://regexpal.com/ too. It’s great.

I don’t know about RegEx, but you could do it with XML.

dim xml as new XmlDocument
xml.LoadXml("<a class=""d_link"" href=""/projecten/500641/Project-assistent.asp"">Project assistent</a>")  
MsgBox xml.FirstChild.FirstChild.Value
MsgBox xml.FirstChild.GetAttribute("href")

The first MsgBox shows “Project assistent”: the first FirstChild gives you the “a” node, the second FirstChild gives you the text node, and the Value of that is “Project assistent”. The second MsgBox shows “/projecten/500641/Project-assistent.asp”, if you need that.

Note that I had to double the quotes around the class and href values because they sit inside a string literal.

Alexander,
Here is a very quick demo to show you the reg ex matching in Xojo code.

I used this pattern (there are many ways to skin this cat, pattern-wise):

(?<=/Project-assistent.asp">).+(?=</a>)

https://www.dropbox.com/s/vbp6x9sxvugzuzr/RegExHTML_Text.xojo_binary_project

HTH

Hi,

the problem is that the text “Project-assistent.asp” varies. It can be anything…

Kem is still sleepin’ in Vegas probably :slight_smile:

[quote=73737:@Bill Gookin]I don’t know about RegEx, but you could do it with XML.[/quote]

Can I load the whole HTML page or a piece of that to get the HREFs in this way?

Thanks

[quote=73740:@Alexander van der Linden]Hi,

the problem is that the text “Project-assistent.asp” varies. It can be anything…[/quote]

[quote=73743:@Alexander van der Linden]Can I load the whole HTML page or a piece of that to get the HREFs in this way?

Thanks[/quote]
You should be able to load the whole HTML page, but then you’ll need to get to the correct tag. XQL will probably be your friend here. If you want to post your HTML, or a simplified version of it, I (or someone) can surely help.
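
As a rough illustration of the XQL idea: the sketch below assumes the page (or the piece of it you load) is well-formed XML, which real-world HTML often is not. The sample markup and the variable names are made up for the example.

```xojo
// Made-up, well-formed sample. Real HTML (a bare & in an href, unclosed tags) may fail to load.
Dim pageSource As String = "<table><tr><td><a class=""d_link"" href=""/projecten/500641/Project-assistent.asp"">Project assistent</a></td></tr>" + _
  "<tr><td><a class=""d_link"" href=""/projecten/500915/Aanpassing-ae-films.asp"">Aanpassing ae films</a></td></tr></table>"

Try
  Dim xml As New XmlDocument
  xml.LoadXml(pageSource)

  // XPath-style query: every <a> element whose class attribute is d_link.
  Dim links As XmlNodeList = xml.XQL("//a[@class='d_link']")
  For i As Integer = 0 To links.Length - 1
    Dim link As XmlNode = links.Item(i)
    // FirstChild is the text node inside the link.
    MsgBox link.FirstChild.Value + " -> " + link.GetAttribute("href")
  Next
Catch err As XmlException
  MsgBox "The source is not well-formed XML: " + err.Message
End Try
```

If LoadXml raises an exception on the real page, the markup would need cleaning up first, or a different approach altogether.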

I’m trying to parse a website to obtain the links and their descriptive texts. The links are grouped in this fashion:

<a class="d_link" href="/projecten/500641/Project-assistent.asp">Project assistent</a></td>
<td width="90"><center><a href="?p=companyprofile&user=sgoossens" target="_blank"></a></center></td>
<td width="120" ><span class="d_tekst">utrecht</span></td>
<td width="65" ><span class="d_tekst">12:42</span></td>
<td width="120" ><font face="Verdana" color="#808080"><small><small>13 </small></small></font></td>
</tr>
<tr bgcolor="#F2F2F2" height="22" style="height:25px">
<td width="20"></td>
<td width="295" >

Then hundreds of new blocks appear with the same type of links but different texts:

<a class="d_link" href="/projecten/500915/Aanpassing-ae-films.asp">Aanpassing ae films</a></td>
<td width="90"><center><a href="?p=companyprofile&user=keesvanvelzen" target="_blank"></a></center></td>
<td width="120" ><span class="d_tekst">-</span></td>
<td width="65" ><span class="d_tekst">12:42</span></td>
<td width="120" ><font face="Verdana" color="#808080"><small><small>11  &nbsp;(<font color="#008080">nieuw project</font>)</small></small></font></td>
</tr>
<tr bgcolor="#FFFFFF" height="22" style="height:25px">
<td width="20"></td>
<td width="295" >

So I want the ‘human readable’ texts from this. If done correctly, the first block should return:
‘Project assistent’
‘sgoossens’
‘12:42’
‘utrecht’
‘295’

Maybe I should just look for delimiters ‘>’ and ‘<’?
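
For the link texts alone, one possible sketch (an assumption on my part, not a tested solution for the full page) is to let the href be anything up to the closing quote and then loop over every match. The sample string and variable names are made up, and the other fields (user, time, city) would each need a pattern of their own.

```xojo
// Made-up sample with two of the links; on the real page this would be the downloaded HTML.
Dim source As String = "<a class=""d_link"" href=""/projecten/500641/Project-assistent.asp"">Project assistent</a></td>" + _
  "<td width=""295"" ><a class=""d_link"" href=""/projecten/500915/Aanpassing-ae-films.asp"">Aanpassing ae films</a></td>"

Dim re As New RegEx
// [^""]* is the wildcard for the href; ([^<]+) captures the text up to the closing tag.
re.SearchPattern = "<a class=""d_link"" href=""[^""]*"">([^<]+)</a>"

Dim titles() As String
Dim match As RegExMatch = re.Search(source)
While match <> Nil
  titles.Append match.SubExpressionString(1)
  // Calling Search without arguments continues after the previous match.
  match = re.Search
Wend

MsgBox Join(titles, EndOfLine)
```

This relies on the class attribute always coming first and being exactly d_link, as in the two blocks shown; if the attribute order varies, the pattern would need loosening.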

This would be a lot easier if each link group were surrounded by something, like if it were in its own block or its own enclosing tag. Are they separated like that somehow? The examples you gave just have part of one row and then part of the next, and it’s hard to see how that final 295 is associated with the link in the previous row.

Ok guys… stop thinking (for this question that is ;-)). The site provides an XML download…

Thanks for your help! Much appreciated.

Glad I could help… :confused:

You can still post a brilliant RegEx solution in case a future reader reading the thread needs it : )

I’d also be curious to see your solution.

I’m still trying a “Kem”-style “If” conditional regular expression pattern, mainly since I’m not familiar with it. :slight_smile:

Since there is no rush, I’ll look at this in detail later, but at first blush, this doesn’t look like a case where regular expressions are a good fit.

Darn… (as I am pulling my hair out trying to make one) :slight_smile: hehe. Waiting on my flight, so it’s fun :slight_smile: