Regex: how to capture across the end of a line?

Jean-Yves_Pochez · October 30, 2015, 8:31pm

Hi,

Is it possible to capture a sequence with a regex without the match stopping at the end of a line?

I am searching for a pattern <table … > where the “…” spans multiple lines.
With the regex “<table (.*)” I can only extract <table … up to the end of the first line.
I cannot extract more than one line.

Thanks.

Tim_Parnell · October 30, 2015, 8:32pm

Turn Greedy on

Norman_P · October 30, 2015, 8:44pm

I’ll reiterate
Kem’s RegExRX already has a PILE of these kinds of examples built in and you can just use them
VERY worth the small amount he asks for it

Kem_Tekinay · October 30, 2015, 9:37pm

You should avoid .* where possible as it’s too general and it doesn’t capture EOL by default. A better choice might be [^>]*.

However, if you really must use .*, turn Greedy off and turn DotMatchesNewline on in the RegEx.Options. You can do the same by prefixing your pattern with the inline option switches, (?Us) in PCRE, which is more portable.
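A minimal sketch of the difference between the three approaches, written in Python's re module as a stand-in for Xojo's RegEx class (an assumption; the sample text is hypothetical). Python has no ungreedy switch, so the lazy quantifier .*? plays the role of turning Greedy off, and (?s) plays the role of DotMatchesNewline.

```python
import re

# Hypothetical sample: an opening <table ...> tag whose attributes span several lines.
text = '<table name="customers"\n       uuid="abc-123"\n       kind="data">\n<field name="id"/>\n</table>'

# 1) Plain .* stops at the first newline, because "." does not match "\n" by default.
single_line = re.search(r'<table (.*)', text).group(1)

# 2) A negated class crosses newlines and stops just before the first ">".
negated = re.search(r'<table ([^>]*)>', text).group(1)

# 3) Dot-matches-newline plus a lazy quantifier captures the same span as 2 here.
lazy_dotall = re.search(r'(?s)<table (.*?)>', text).group(1)

print(repr(single_line))
print(repr(negated))
print(negated == lazy_dotall)
```

Both 2 and 3 stop at the first “>”, so they only capture the opening tag, not the whole table.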

Jean-Yves_Pochez · October 31, 2015, 8:06am

[quote=225564:@Norman Palardy]I’ll reiterate
Kem’s RegExRX already has a PILE of these kinds of examples built in and you can just use them
VERY worth the small amount he asks for it[/quote]
I already own it!

Jean-Yves_Pochez · October 31, 2015, 8:15am

[quote=225576:@Kem Tekinay]You should avoid .* where possible as it’s too general and it doesn’t capture EOL by default. A better choice might be [^>]*.

However, if you really must use .*, turn Greedy off and turn DotMatchesNewline on in the RegEx.Options. You can do the same by prefixing your pattern with the inline option switches, (?Us) in PCRE, which is more portable.[/quote]

Thanks Kem, but [^>]* doesn’t work, as there are other tags between the <table and the </table>,
so it stops at the first “>” it encounters.
I tried <table [^</table>]* but it doesn’t work either…

Jean-Yves_Pochez · October 31, 2015, 8:20am

This is an example of the source text:

I want to extract the “name” attribute of the <table> tag, and the field attributes of the table. I don’t need the <table_extra> tag.
Is it doable with a regex, or would it be better to use XMLDocument and iterate?

Tim_Parnell · October 31, 2015, 9:02am

You could get clever with loops and things and get yourself a neat little parser that will extract key value pairs from each tag. I never liked Xojo’s XML tools so I have no comment on which would technically be better.

Someone always manages to code for free when I make an offer to create a solution per my consulting rate.
However, available for hire if you need - send me an email.

Beatrix_Willius · October 31, 2015, 9:35am

If you have the MBS plugins use the Tidy classes. Makes html much easier to parse.

Jean-Yves_Pochez · October 31, 2015, 11:22am

[quote=225661:@Tim Parnell]You could get clever with loops and things and get yourself a neat little parser that will extract key value pairs from each tag. I never liked Xojo’s XML tools so I have no comment on which would technically be better.

Someone always manages to code for free when I make an offer to create a solution per my consulting rate.
However, available for hire if you need - send me an email.[/quote]

Thanks, but I’m able to do this the “heavy” way.
I find regex can be elegant and would like to try it this time, or use XMLDocument if regex cannot handle such a search.
And Kem is such a regex guru that I cannot help asking here…

Scott_Griffitts · October 31, 2015, 2:03pm

Kem would probably have a more elegant solution, and without knowing how rigidly structured the data is, I’d do it like:

Find the next table and its name:

(?s)<table name="(.+?)".+?</table>

SubExpressionString(0) would give you the whole table and SubExpressionString(1) would give you the name.

Send SubExpressionString(0) to a “field finder” method to loop through the fields:

<field name="(.+?)" uuid="(.+?)" type="(.+?)" never_null="(.+?)"/>

Repeat for the next table.
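The two-step approach above can be sketched as follows. This is Python's re rather than Xojo's RegEx class, and the sample data and the helper name find_fields are hypothetical, since the real source text is not shown in the thread. group(0) and group(1) correspond to SubExpressionString(0) and SubExpressionString(1).

```python
import re

# Hypothetical sample data; attribute names follow the patterns above.
text = '''<table name="customers">
<field name="id" uuid="u1" type="int" never_null="1"/>
<field name="email" uuid="u2" type="text" never_null="0"/>
<table_extra note="ignored"/>
</table>
<table name="orders">
<field name="total" uuid="u3" type="float" never_null="1"/>
</table>'''

# Step 1: one match per table; (?s) lets "." cross line endings.
TABLE_RE = re.compile(r'(?s)<table name="(.+?)".+?</table>')
# Step 2: one match per field, assuming a fixed attribute order.
FIELD_RE = re.compile(r'<field name="(.+?)" uuid="(.+?)" type="(.+?)" never_null="(.+?)"/>')

def find_fields(table_text):
    """Return a list of (name, uuid, type, never_null) tuples for one table."""
    return FIELD_RE.findall(table_text)

# Map each table name to the fields found inside that table's text.
tables = {m.group(1): find_fields(m.group(0)) for m in TABLE_RE.finditer(text)}

for name, fields in tables.items():
    print(name, fields)
```

The <table_extra> tag is skipped on both levels: the table pattern requires a space after “<table”, and the field pattern only matches <field tags.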

Jean-Yves_Pochez · October 31, 2015, 5:58pm

Thanks Scott, this answers my question.
However, the order of the attributes is not guaranteed to be the same on each line.
Is there a trick to get the attributes anyway?
Should I search for one attribute at a time?
Or is there a good regex trick…?

And Kem, do you have a “more elegant” solution?

Scott_Griffitts · October 31, 2015, 6:28pm

Assuming there are always 4 attributes, you could do something like:

<field (.+?)="(.+?)" (.+?)="(.+?)" (.+?)="(.+?)" (.+?)="(.+?)"/>

Or you could loop through the field looking for any instances of

 (.+?)="(.+?)"

(Note there’s a space at the beginning of the pattern.)
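A small sketch of the looping variant, which does not depend on attribute order. It uses Python's re as a stand-in for Xojo's RegEx class, and the two field tags are hypothetical examples.

```python
import re

# Hypothetical field tags with the same attributes in different orders.
fields = [
    '<field name="id" uuid="u1" type="int" never_null="1"/>',
    '<field type="text" never_null="0" name="email" uuid="u2"/>',
]

# Leading space, then key="value"; both parts are captured lazily.
ATTR_RE = re.compile(r' (.+?)="(.+?)"')

# findall returns (key, value) pairs, which dict() turns into a lookup per field.
parsed = [dict(ATTR_RE.findall(tag)) for tag in fields]

for attrs in parsed:
    print(attrs["name"], attrs["type"], attrs["never_null"])
```

Looking attributes up by key afterwards means the pattern itself never needs to know which attribute comes first.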

Robert_Livingston · October 31, 2015, 9:58pm

<table[\w|\W|\r]*?>

\r assumes that carriage returns are the line delimiters. This might have to be adjusted.

The [\w|\W|\r] basically means everything, as \w and \W are the inverse of each other. Note that inside a character class the pipes | are literal characters rather than OR, and \W already covers \r, so [\w\W] alone matches the same set.

The question mark ? after the * makes the match lazy, so the search stops at the first > it finds.
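A short sketch of this character-class approach, using Python's re as a stand-in for Xojo's RegEx class and a hypothetical sample. It uses the shorter [\w\W] form and shows how the terminator after the lazy quantifier decides where the match ends.

```python
import re

# Hypothetical sample with a multi-line opening tag and trailing content.
text = '<table name="customers"\n       kind="data">\n<field name="id"/>\n</table>\n<p>after</p>'

# [\w\W] matches any character, including line endings, without a dot-all option.
# Lazy *? followed by ">" stops at the end of the opening tag.
opening_tag = re.search(r'<table[\w\W]*?>', text).group(0)

# The same lazy run followed by "</table>" extends to the end of the table.
whole_table = re.search(r'<table[\w\W]*?</table>', text).group(0)

print(repr(opening_tag))
print(repr(whole_table))
```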

Jean-Yves_Pochez · November 1, 2015, 3:32pm

[quote=225779:@Robert Livingston]<table[\w|\W|\r]*?>

\r assumes that carriage returns are the line delimiters. This might have to be adjusted.

The [\w|\W|\r] basically means everything, as \w and \W are the inverse of each other. Note that inside a character class the pipes | are literal characters rather than OR, and \W already covers \r, so [\w\W] alone matches the same set.

The question mark ? after the * makes the match lazy, so the search stops at the first > it finds.[/quote]

Thanks for the “elegant” solution, Robert.
It works with a small fix: <table[\w|\W|\r]*?</table> instead of <table[\w|\W|\r]*?></table>