Help with regex

I’d like to parse something like this:

  aaa={a, b, c},

The idea is to store both the field names and the values:

aaa -> a, b, c
bbb -> a,b,c

A simple Split is tricky because the fields are delimited by commas, which can also appear inside the field values. I have been trying with a regex, with no luck.
Any suggestions?


So the value will be surrounded by either quotes or curly braces, and you need to get whatever is between them?

rx.SearchPattern = "(?|\{([^}]*)\}|""([^""]*)"")"

Keep repeating that on the values until you get a nil match. The values will be in SubExpressionString( 1 ).
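As a rough illustration of the "keep searching until there is no match" loop, here is a minimal Python sketch (Python rather than Xojo, so it can be run anywhere). Python's re module has no branch-reset group (?|...), so this version uses one capture group per alternative and takes whichever one matched. The names VALUE_RX and extract_values and the sample string are made up for the example.

```python
import re

# Python's re module has no branch-reset group (?|...), so each alternative
# gets its own capture group and we take whichever one matched.
VALUE_RX = re.compile(r'\{([^}]*)\}|"([^"]*)"')

def extract_values(text):
    """Return the contents of every {...} or "..." run, in order."""
    values = []
    for m in VALUE_RX.finditer(text):
        braced, quoted = m.group(1), m.group(2)
        values.append(braced if braced is not None else quoted)
    return values

sample = 'aaa={a, b, c}, bbb="a,b,c"'  # hypothetical input
print(extract_values(sample))  # ['a, b, c', 'a,b,c']
```

Like the pattern above, this stops at the first closing brace, so it does not cope with nested braces.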

If you want to get key/value pairs:

rx.SearchPattern = "^([^=]+)=(.*)"

The key will be in SubExpressionString( 1 ) and the values in SubExpressionString( 2 ), but this assumes there won’t be newlines in either the key or the values.
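The key/value pattern can be tried out the same way. This is a Python sketch, not Xojo; it assumes one pair per line and turns on re.MULTILINE so that ^ matches at the start of every line (that flag is an assumption of this example, not something stated above). PAIR_RX, split_pairs and the sample text are hypothetical names and data.

```python
import re

# Same pattern as above; re.MULTILINE is an assumption of this sketch so
# that ^ anchors at the start of each line rather than only at the start
# of the whole text.
PAIR_RX = re.compile(r'^([^=]+)=(.*)', re.MULTILINE)

def split_pairs(text):
    """Map each key to its raw value (delimiters and trailing comma kept)."""
    pairs = {}
    for m in PAIR_RX.finditer(text):
        pairs[m.group(1).strip()] = m.group(2).strip()
    return pairs

sample = 'aaa={a, b, c},\nbbb="a,b,c"'  # hypothetical input, one pair per line
for key, raw in split_pairs(sample).items():
    print(key, '->', raw)
```

The values come back raw, still wrapped in their braces or quotes, so the first pattern would then be applied to each of them.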

Thank you Kem!

I will test and let you know.

Hi Kem, unfortunately it does not work, and that's my fault: in the OP I did not give the whole picture.
Actually, the real text I have to parse is a sequence of blocks like the following.

Within each block I need to save the pairs

aaa -> a, b, c
bbb -> a,b,c

I am currently using two regex:

regBlock.SearchPattern = "@(\w+)\{([^,]+)([^@]+)"

This one lets me access the content of each block, and the following one should let me save the info I need.

regFields.SearchPattern = "(\w+)\s*[=]\s*(.+)\ "

It seems to work but I suspect it’s not right yet. What do you think?


Does each block always terminate with a standalone curly brace?

Not necessarily (unfortunately), as it could be followed by a comma. That's why I stop only when I find the next @.

And the “@” doesn’t necessarily start a line either? What happens if one of the values contains “@”?

Each block necessarily starts with @. As for your second question, it's a good one. In the “language” (BibTeX), @ can also appear in a value, but only if it is enclosed in an extra pair of braces. For example:

aaa = {bob@ccc}  <-- invalid
aaa = "bob@ccc"  <-- invalid
aaa = {{bob@ccc}} <-- valid
aaa = "{bob@ccc}" <-- valid

Maybe too many rules to be handled with regex…

Not too many rules, per se, but that pattern is going to be a bear. There is no plugin or command-line tool you could use instead?
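One alternative to a single large pattern is a small hand-written scanner that counts brace depth. The sketch below is in Python (not Xojo) and handles only the cases discussed in this thread: values wrapped in {...} with nesting, or in "..." with braces allowed inside. It takes the field text of one block; parse_fields is a hypothetical name, and bare numbers or macros are deliberately left out of scope.

```python
def parse_fields(body):
    """Parse 'key = value' pairs from the field text of one block.

    Values are delimited by {...} (nesting allowed) or "..." (braces may
    nest inside the quotes); commas and @ inside a value are kept.
    """
    fields = {}
    i, n = 0, len(body)
    while i < n:
        eq = body.find('=', i)
        if eq == -1:
            break  # no more fields
        key = body[i:eq].strip(' \t\r\n,')
        j = eq + 1
        while j < n and body[j].isspace():
            j += 1
        if j >= n or body[j] not in '{"':
            break  # undelimited values are out of scope for this sketch
        opener = body[j]
        depth = 0
        k = j
        while k < n:
            c = body[k]
            if c == '{':
                depth += 1
            elif c == '}':
                depth -= 1
                if opener == '{' and depth == 0:
                    break  # matching close brace of the value
            elif c == '"' and opener == '"' and depth == 0 and k > j:
                break  # closing quote outside any braces
            k += 1
        if k >= n:
            break  # unterminated value
        fields[key] = body[j + 1:k]
        i = k + 1
    return fields

# Hypothetical input built from the examples in this thread.
print(parse_fields('aaa = {{bob@ccc}}, bbb = "a,{b},c", ccc = {x, {y}}'))
```

This does not check whether an @ is legally placed; it only keeps the value text intact so that commas and @ inside braces or quotes no longer end a field.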

TeX is awesome but damned hard to parse correctly if you’re not TeX.

Almost all of TeX’s syntactic properties can be changed on the fly, which makes TeX input hard to parse by anything but TeX itself. TeX is a macro- and token-based language: many commands, including most user-defined ones, are expanded on the fly until only unexpandable tokens remain, which are then executed.