Regex: how to capture across line endings?

Hi,

Is it possible to capture a sequence with a regex so that the match does not stop at the end of a line?

I am searching for a pattern <table … > where the “…” spans multiple lines.
With the regex “<table (.*)” I can only extract <table … up to the end of the first line.
I cannot extract more than one line.

Thanks.

Turn Greedy on

I’ll reiterate
Kem’s RegExRX already has a PILE of these kinds of examples built in and you can just use them.
VERY worth the small amount he asks for it.

You should avoid .* where possible as it’s too general and it doesn’t capture EOL by default. A better choice might be [^>]*.

However, if you really must use .*, turn Greedy off and turn DotMatchesNewline on in the RegEx.Options. You can do the same by prefixing your pattern with (?Us), which is more portable.
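
To make the difference concrete, here is a minimal sketch in Python rather than Xojo (that substitution is an assumption made for illustration; Python's re module treats the dot and negated classes the same way, but it has no (?U) switch, so laziness is written as .*? instead). The two-line tag is made-up sample data.

```python
import re

# Hypothetical tag spread over two lines, standing in for the poster's data.
text = '<table name="A"\n       uuid="123">\n<field name="x"/>'

# By default the dot does not match a newline, so the capture ends with line 1.
first_line_only = re.search(r'<table (.*)', text).group(1)

# A negated class has no such limit: it runs until the first ">".
negated_class = re.search(r'<table ([^>]*)', text).group(1)

# (?s) lets the dot match newlines; the lazy .*? stops at the first ">".
dotall_lazy = re.search(r'(?s)<table (.*?)>', text).group(1)

print(repr(first_line_only))
print(repr(negated_class))
print(repr(dotall_lazy))
```

The last two captures are identical here; they only diverge once the text you want contains a ">" of its own.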

[quote=225564:@Norman Palardy]I’ll reiterate
Kem’s RegExRX already has a PILE of these kinds of examples built in and you can just use them.
VERY worth the small amount he asks for it.[/quote]
I already own it ! :wink:

[quote=225576:@Kem Tekinay]You should avoid .* where possible as it’s too general and it doesn’t capture EOL by default. A better choice might be [^>]*.

However, if you really must use .*, turn Greedy off and turn DotMatchesNewline on in the RegEx.Options. You can do the same by prefixing your pattern with (?Us), which is more portable.[/quote]

Thanks Kem, but [^>]* doesn’t work, as there are other tags between the <table and the </table>,
so it stops at the first “>” it encounters.
I tried <table [^</table>]* but it doesn’t work either…
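
For anyone wondering why the second attempt fails: a bracket expression always stands for a single character, so [^</table>] means “one character that is none of < / t a b l e >”, not “anything except the string </table>”. A small sketch in Python (used here only as a stand-in for Xojo, with made-up sample text) shows where the match gives up, next to a lazy alternative:

```python
import re

# Hypothetical, shortened table spread over several lines.
text = '<table name="T1">\n  <field name="f1"/>\n</table>'

# The class excludes the letters t, a, b, l, e as well as < / >,
# so the match already ends at the "a" in "name".
char_class = re.search(r'<table [^</table>]*', text).group(0)

# With (?s) the lazy .*? runs up to the first literal "</table>".
whole_table = re.search(r'(?s)<table .*?</table>', text).group(0)

print(repr(char_class))
print(repr(whole_table))
```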

this is an example of the source text :

<table name="Paramtres" uuid="D884397B408846F0AF75BA2DE4BC577B">
  <field name="Etablissement" uuid="37EA579A54D54436B62E8D0A3CB13E62" type="10" limiting_length="80" never_null="true"/>
  <field name="Adresse" uuid="237FB0A17F154120A370DBF8A43D3ADD" type="10" never_null="true"/>
  <field name="NoLicence" uuid="3D24C1F70C0B4D7597F6323461F5D452" type="10" limiting_length="20" never_null="true"/>
  <field name="DpartTTexte" uuid="F6AB6B7E61854FBB81EF68C9604B414A" type="1" never_null="true"/>
  <field name="Ville" uuid="DE5BBB575481492B8EB0DEA2513D3DC5" type="10" limiting_length="20" never_null="true"/>
  <field name="Logo" uuid="0B0D72FB2A124993968F6E25DF38ED1A" type="12" never_null="true"/>
  <field name="PiedPage" uuid="AA13F062931E4516A75540763605966F" type="12" never_null="true"/>
  <table_extra visible="false">
    <editor_table_info>
      <color red="0" green="0" blue="0" alpha="0"/>
      <coordinates left="1063.5234375" top="31.87890625" width="114" height="211.8984375"/>
    </editor_table_info>
  </table_extra>
</table>

I want to extract the “name” attribute of the <table> tag, and the attributes of the table’s fields. I don’t need the <table_extra> tag.
Is it doable with a regex, or would I be better off using XmlDocument and iterating?
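
For comparison, the “XML document and iterate” route might look roughly like this. It is only a sketch: it uses Python’s xml.etree.ElementTree as a stand-in for Xojo’s XmlDocument (an assumption for illustration, the Xojo calls will differ), and the sample is abbreviated from the text above.

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the sample above (uuid attributes dropped for brevity).
xml_text = '''<table name="Paramtres" uuid="D884397B408846F0AF75BA2DE4BC577B">
  <field name="Ville" type="10" limiting_length="20" never_null="true"/>
  <field name="Logo" type="12" never_null="true"/>
  <table_extra visible="false"/>
</table>'''

table = ET.fromstring(xml_text)
table_name = table.get("name")

# findall("field") returns only the direct <field> children,
# so <table_extra> is skipped without any extra filtering.
fields = [dict(field.attrib) for field in table.findall("field")]

print(table_name)
for field in fields:
    print(field)
```

The attribute order does not matter with this approach, since each field comes back as a set of name/value pairs.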

You could get clever with loops and things and get yourself a neat little parser that will extract key value pairs from each tag. I never liked Xojo’s XML tools so I have no comment on which would technically be better.

Someone always manages to code for free when I make an offer to create a solution per my consulting rate.
However, available for hire if you need - send me an email.

If you have the MBS plugins use the Tidy classes. Makes html much easier to parse.

[quote=225661:@Tim Parnell]You could get clever with loops and things and get yourself a neat little parser that will extract key value pairs from each tag. I never liked Xojo’s XML tools so I have no comment on which would technically be better.

Someone always manages to code for free when I make an offer to create a solution per my consulting rate.
However, available for hire if you need - send me an email.[/quote]

Thanks, but I’m able to do this the “heavy” way.
I find regex can be elegant and would like to try it this time, or use XmlDocument if regex cannot handle such a search.
And Kem is such a regex guru that I cannot help asking here…

Kem would probably have a more elegant solution, and without knowing how rigidly structured the data is, I’d do it like:

  1. find the next table and its name:
(?s)<table name="(.+?)".+?</table>

subexpressionstring(0) would give you the whole table and subexpressionstring(1) would give you the name.

  2. send subexpressionstring(0) to a “field finder” method to loop through the fields:
<field name="(.+?)" uuid="(.+?)" type="(.+?)" never_null="(.+?)"/>
  3. repeat
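
The steps above can be sketched as follows. This is Python rather than Xojo (an assumption for illustration; group(0) and group(1) play the role of subexpressionstring(0) and subexpressionstring(1)), the helper names find_fields and parse_tables are hypothetical, and the sample data is made up with one field per line and a fixed attribute order.

```python
import re

# Step 1 pattern: whole table in group 0, table name in group 1.
TABLE_RE = re.compile(r'(?s)<table name="(.+?)".+?</table>')
# Step 2 pattern: assumes exactly these four attributes, in this order.
FIELD_RE = re.compile(
    r'<field name="(.+?)" uuid="(.+?)" type="(.+?)" never_null="(.+?)"/>'
)

def find_fields(table_text):
    """The "field finder": one tuple of attribute values per field."""
    return [m.groups() for m in FIELD_RE.finditer(table_text)]

def parse_tables(source):
    """Step 3: repeat for every table found in the source."""
    return {m.group(1): find_fields(m.group(0))
            for m in TABLE_RE.finditer(source)}

# Hypothetical sample data.
source = '''<table name="T1" uuid="AA">
<field name="f1" uuid="01" type="10" never_null="true"/>
<field name="f2" uuid="02" type="12" never_null="true"/>
</table>
<table name="T2" uuid="BB">
<field name="g1" uuid="03" type="1" never_null="true"/>
</table>'''

tables = parse_tables(source)
print(tables)
```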

Thanks Scott, this answers my question.
Only, the order of the attributes is not guaranteed to be the same on each line.
Is there a trick to get the attributes anyway?
Search for each attribute one at a time?
Or a good regex trick…?

And Kem, a “more elegant” solution?

Assuming there’s always 4 attributes you could do something like:

<field (.+?)="(.+?)" (.+?)="(.+?)" (.+?)="(.+?)" (.+?)="(.+?)"/>

or you could loop through the field looking for any instances of

 (.+?)="(.+?)"

(Note there’s a space at the beginning of the pattern)
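
A rough sketch of that loop, in Python rather than Xojo (an assumption for illustration; the helper name attributes is hypothetical). The first tag is taken from the sample earlier in the thread, the second is a made-up tag with the attributes in a different order. The tag is passed in without leading indentation, because the pattern anchors on the first space it sees.

```python
import re

# Note the leading space: it separates one attribute from the next.
ATTR_RE = re.compile(r' (.+?)="(.+?)"')

def attributes(tag):
    """Collect every name="value" pair of a single tag, in any order."""
    return {m.group(1): m.group(2) for m in ATTR_RE.finditer(tag)}

ville = attributes(
    '<field name="Ville" uuid="DE5BBB575481492B8EB0DEA2513D3DC5" '
    'type="10" limiting_length="20" never_null="true"/>'
)
# Made-up tag with a different attribute order and count.
logo = attributes('<field type="12" never_null="true" name="Logo"/>')

print(ville)
print(logo)
```

Because the result is keyed by attribute name, it no longer matters how many attributes a field has or in which order they appear.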

<table[\w|\W|\r]*?>

\r assumes that carriage returns are the line delimiters. This might have to be adjusted.

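The [\w|\W|\r] basically means everything, as \w and \W are the inverse of each other, so together they already match any character, line endings included. Note that inside square brackets the pipes – | – are literal characters rather than OR; they are harmless here but not needed.

The question mark ? makes the quantifier lazy, so the match stops at the first > it finds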
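
A quick way to check how this pattern behaves is a small sketch like the one below. It is Python rather than Xojo (an assumption for illustration) and the sample text is made up. It also shows that inside a bracket expression a | is just a literal pipe, so [\w\W] on its own matches the same text.

```python
import re

# Hypothetical sample using CR+LF line endings.
text = '<table name="A"\r\n  uuid="1">\r\n<field name="x"/>\r\n</table>'

# The pattern as posted: lazy, so it stops at the first ">".
opening_tag = re.search(r'<table[\w|\W|\r]*?>', text).group(0)

# \w together with \W already covers every character, newlines included.
without_pipes = re.search(r'<table[\w\W]*?>', text).group(0)

# Inside [...] the pipe is an ordinary character, not an OR.
pipe_is_literal = re.fullmatch(r'[a|b]', '|') is not None

print(repr(opening_tag))
print(opening_tag == without_pipes, pipe_is_literal)
```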

[quote=225779:@Robert Livingston]<table[\w|\W|\r]*?>

\r assumes that carriage returns are the line delimiters. This might have to be adjusted.

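The [\w|\W|\r] basically means everything, as \w and \W are the inverse of each other, so together they already match any character, line endings included. Note that inside square brackets the pipes – | – are literal characters rather than OR; they are harmless here but not needed.

The question mark ? makes the quantifier lazy, so the match stops at the first > it finds[/quote]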

Thanks for the “elegant” solution, Robert.
It works with a small fix: <table[\w|\W|\r]*?</table> instead of <table[\w|\W|\r]*?></table>