regex challenge for Kem ;)

Hi,
Below is an XML file that comes from an xlsx sharedStrings file.
I would like to extract all the text between the <t> and </t> tags.
The regex for that is easy (I found “<t[\w\W\r]*?>([\w\W\r]*?)</t>” to work nicely), but
when there is styled text, Excel splits it into multiple tags, one for each style run (the <r> and </r> tags),
so I would like to get the resulting text concatenated when it sits between <r> tags.
I did this in Xojo with two regex searches; I’m wondering whether it is possible to do it in a single regex match.
Then I thought: that’s a challenge for Kem!

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="1166" uniqueCount="175">
  <si> <t>RD_avant_pont(anse)</t> </si>
  <si> <t>mordu</t> </si>
  <si> <t>CAR</t> </si>
  <si> <t>gibelio</t> </si>
  <si>
    <r> <rPr> <sz val="10" /> <rFont val="Verdana" /> </rPr> <t xml:space="preserve"> Partie </t> </r>
    <r> <rPr> <sz val="10" /> <rFont val="Verdana" /> </rPr> <t>aval</t> </r>
    <r> <rPr> <sz val="10" /> <rFont val="Verdana" /> </rPr> <t xml:space="preserve"> Morte </t> </r>
    <r> <rPr> <sz val="10" /> <rFont val="Verdana" /> </rPr> <t>Cholet</t> </r>
  </si>
  <si> <t>BDL</t> </si>
  <si> <t>Proximit morte</t> </si>
  <si> <t>`</t> </si>
</sst>

I know this is not the answer you want to hear, but it is the correct one: don’t extract stuff from XML with regular expressions.

RegEx match open tags except XHTML self-contained tags

You should read Eugene Dakin’s book I Wish I Knew… XML

Eli is correct, regex is not the way to go.

I’d answer your question anyway as an academic exercise but the sample text doesn’t show the problem.

In the sample text, the text between the tags is split up.
There should not be four separate matches (“Partie”, “aval”, “Morte”, “Cholet”)
but a single one: “Partie aval Morte Cholet”.
The other <t> tags are at the first level; only these four are inside an <r> tag and should be matched together.
Is this possible with regex? A sort of conditional concatenation?
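
For reference, the two-pass idea mentioned in the first post can be sketched as follows. This is a minimal sketch in Python rather than Xojo, and it is not the poster's actual code: the function name `join_runs_regex`, both patterns, and the shortened `SAMPLE` string are assumptions made for illustration. The first pattern isolates each <si> item, the second collects every <t> inside it, and the pieces are joined. It is fragile in the usual way regex-on-XML is (no entity decoding, no handling of unusual tag spellings).

```python
import re

# Shortened, hypothetical stand-in for the sharedStrings sample above.
SAMPLE = (
    '<sst><si><t>mordu</t></si>'
    '<si><r><rPr><sz val="10" /></rPr><t xml:space="preserve"> Partie </t></r>'
    '<r><rPr><sz val="10" /></rPr><t>aval</t></r></si>'
    '<si><t>BDL</t></si></sst>'
)

# Pass 1: one match per shared-string item.
SI_RE = re.compile(r"<si>(.*?)</si>", re.DOTALL)
# Pass 2: every <t> element (with or without attributes) inside one item.
T_RE = re.compile(r"<t(?:\s[^>]*)?>(.*?)</t>", re.DOTALL)


def join_runs_regex(xml_text):
    """Return one string per <si>, concatenating all of its <t> pieces."""
    return ["".join(T_RE.findall(item)) for item in SI_RE.findall(xml_text)]


strings = join_runs_regex(SAMPLE)
```

With this input, `strings` holds three entries, and the two style runs of the second item come back as one string.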

[quote=251217:@Eli Ott]I know this is not the answer you want to hear, but it is the correct one: don’t extract stuff from XML with regular expressions.

RegEx match open tags except XHTML self-contained tags[/quote]
I’ve used XMLDocument for the other parts of this document;
I’m just wondering how to do some conditional regex on this little part.

Nothing comes to mind that would work reliably. Or at all. :)

ok - thanks anyway !

Maybe we should look into XPath queries?
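
To give an idea of what an XPath-style query could look like here, below is a minimal sketch in Python (not Xojo) using the standard-library `xml.etree.ElementTree`, which supports only a limited XPath subset. The function name `shared_strings_xpath`, the `m` namespace prefix, and the shortened `SAMPLE` are assumptions made for illustration. The idea is to visit each <si> and join the text of every <t> below it, however deeply it is nested in <r> runs.

```python
import xml.etree.ElementTree as ET

# Namespace of the sample file; the prefix "m" is an arbitrary choice.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

# Shortened, hypothetical stand-in for the sharedStrings sample above.
SAMPLE = (
    '<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    '<si><t>mordu</t></si>'
    '<si>'
    '<r><rPr><sz val="10" /></rPr><t xml:space="preserve"> Partie </t></r>'
    '<r><rPr><sz val="10" /></rPr><t>aval</t></r>'
    '<r><rPr><sz val="10" /></rPr><t xml:space="preserve"> Morte </t></r>'
    '<r><rPr><sz val="10" /></rPr><t>Cholet</t></r>'
    '</si>'
    '<si><t>BDL</t></si>'
    '</sst>'
)


def shared_strings_xpath(xml_text):
    """Return one string per <si>, joining the text of all nested <t> nodes."""
    root = ET.fromstring(xml_text)
    result = []
    for si in root.findall("m:si", NS):
        # ".//m:t" selects every <t> descendant, plain or inside an <r> run.
        pieces = [t.text or "" for t in si.findall(".//m:t", NS)]
        result.append("".join(pieces))
    return result


strings = shared_strings_xpath(SAMPLE)
```

Because the parser keeps the whitespace inside each <t>, the four style runs are returned as a single entry with their original spacing.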