Repeating a RegEx match

Daniel_Mclaughlan · October 13, 2014, 10:42pm

Hi everyone,

I’m trying to parse a SubRip (.srt) subtitle file using a Regular Expression.

For those not familiar with the format, a sample looks like this:

[quote]1
00:00:06,849 --> 00:00:07,740
Morning everyone
Finally time for my Ice Bucket Challenge
Test

2
00:00:07,740 --> 00:00:14,009
Sorry it took so long
Thank you for my nominations[/quote]

Each subtitle consists of a number, time stamps, and one or more lines.

I’ve been using RegExRX by Kem Tekinay (wonderful program; I’m learning more than I thought possible about Regular Expressions just by fiddling) and thought I had made great progress getting a RegEx that was working:

(?x) # free spacing mode
^(\d+)[\r\n|\r|\n] # start of line one or more digits, end of line
(\d{2}:\d{2}:\d{2},\d{3}) # digits in the format 00:00:00,000
(?:\x20-->\x20) # (non-capturing group) "-->" surrounded by a space on either side
(\d{2}:\d{2}:\d{2},\d{3})[\r\n|\r|\n] # digits in the format 00:00:00,000 followed by end of line
(^.+$)[\r\n|\r|\n] # start of line, one or more characters, end of line
(^.+$)[\r\n|\r|\n]? # start of line, one or more characters, end of line (optional)

However, while I was refactoring my code I discovered that it does not work for more than two subtitle lines (see the third line ‘Test’ I added above).

I think I can assume there will always be at least one line (otherwise there would be nothing to display, which is the same as not having a subtitle at all), so I need to repeat my second, optional match for however many further lines there are.

I know that I can use ‘+’ to repeat the last group one or more times, and I read that putting it outside the group will only return the last match, and sure enough that’s what I see. So it needs to go within the expression, but I have not been able to work out where.

Can anyone help?

Kem_Tekinay · October 14, 2014, 3:16am

First, I’m glad you like RegExRX.

Second (as an aside), almost anything within a character class (between square brackets) is considered a literal, and the class creates an “or” condition. [abc] means “match an ‘a’, a ‘b’, or a ‘c’”. When you wrote [\r\n|\r|\n], it will be interpreted as “match a return, a linefeed, a vertical bar, a return, a vertical bar, or a linefeed”. In that case, parentheses are what you wanted, as in (\r\n|\r|\n), or you could have used the token \R, which does exactly what you intended.
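
The difference between the character class and the alternation can be checked with a small sketch. It uses Python’s re module rather than Xojo’s RegEx class (an assumption made only so the example is runnable; the variable names are hypothetical), but character classes behave the same way in both:

```python
import re

# The character class from the original pattern: a set of single characters,
# so "|" is just another literal member, not an "or".
char_class = re.compile(r"[\r\n|\r|\n]")

# A group with alternation, which is what was intended.
alternation = re.compile(r"(?:\r\n|\r|\n)")

class_matches_bar = char_class.fullmatch("|") is not None
alternation_matches_bar = alternation.fullmatch("|") is not None

# A class only ever consumes one character, so it cannot match CR+LF as a unit.
class_matches_crlf = char_class.fullmatch("\r\n") is not None
alternation_matches_crlf = alternation.fullmatch("\r\n") is not None

print(class_matches_bar, alternation_matches_bar)    # True False
print(class_matches_crlf, alternation_matches_crlf)  # False True
```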

Moving on to the problem at hand, you can use a positive lookahead to end your pattern where the next marker begins or at the end of the document. For this to work, Greedy has to be turned off.

(?xU) # free spacing mode, Ungreedy
^(\d+)\R # start of line one or more digits, end of line
(\d{2}:\d{2}:\d{2},\d{3}) # digits in the format 00:00:00,000
(?:\x20-->\x20) # (non-capturing group) "-->" surrounded by a space on either side
(\d{2}:\d{2}:\d{2},\d{3})\R # digits in the format 00:00:00,000 followed by end of line
((?:(?:^.+)\R)+) # start of line, one or more characters, end of line, repeat
(?=\d+\R\d{2}:|\Z) # stop before the next marker

Kem_Tekinay · October 14, 2014, 3:25am

Bleh, too many non-capturing groups at the end there:

(?xU) # free spacing mode, Ungreedy
^(\d+)\R # start of line one or more digits, end of line
(\d{2}:\d{2}:\d{2},\d{3}) # digits in the format 00:00:00,000
(?:\x20-->\x20) # (non-capturing group) "-->" surrounded by a space on either side
(\d{2}:\d{2}:\d{2},\d{3})\R # digits in the format 00:00:00,000 followed by end of line
((?:^.+\R)+) # start of line, one or more characters, end of line, repeat
(?=\d+\R\d{2}:|\Z) # stop before the next marker
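
For anyone who wants to try the lazy-match-plus-lookahead idea outside RegExRX, here is a rough Python translation. It is a sketch under several assumptions, not the pattern above verbatim: Python’s re module has neither \R nor the (?U) flag, so line endings are normalised to \n first and the repeat is made lazy with +?. The lookahead also accepts an optional blank separator line, because the quoted sample shows one between entries. The names SRT_ENTRY, parse_srt and SAMPLE are hypothetical.

```python
import re

SRT_ENTRY = re.compile(
    r"""
    ^(\d+)\n                        # index line
    (\d{2}:\d{2}:\d{2},\d{3})       # start time, 00:00:00,000
    \x20-->\x20                     # " --> "
    (\d{2}:\d{2}:\d{2},\d{3})\n     # end time, then end of line
    ((?:^.+\n)+?)                   # one or more text lines, as few as possible
    (?=\n?\d+\n\d{2}:|\Z)           # stop before the next entry or at the end
    """,
    re.VERBOSE | re.MULTILINE,
)


def parse_srt(text):
    """Return (index, start, end, text_block) tuples for each entry."""
    # Normalise CRLF and bare CR to LF, standing in for \R.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # The pattern expects every text line to end with a newline.
    if not text.endswith("\n"):
        text += "\n"
    return [m.groups() for m in SRT_ENTRY.finditer(text)]


SAMPLE = (
    "1\n"
    "00:00:06,849 --> 00:00:07,740\n"
    "Morning everyone\n"
    "Finally time for my Ice Bucket Challenge\n"
    "Test\n"
    "\n"
    "2\n"
    "00:00:07,740 --> 00:00:14,009\n"
    "Sorry it took so long\n"
    "Thank you for my nominations"
)

for index, start, end, block in parse_srt(SAMPLE):
    print(index, start, end, block.splitlines())
```

All text lines of an entry land in the fourth group as one block, which is then split into lines after the match.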

Daniel_Mclaughlan · October 15, 2014, 4:44am

Thanks Kem,

That makes sense, using \R.

It’s not quite there yet. It’s matching all the lines of text but it’s not splitting them out into SubExpressions as I intended. In other words, SubExpression $4 holds: [quote]Morning\severyone
Finally\stime\sfor\smy\sIce\sBucket\sChallenge
Test
[/quote]

I can probably Split this line on the linefeeds in Xojo, but I’d rather the RegEx did this, if that makes sense?

Is it right that this line is a non-capturing group? I expected I would want to match and store the whole result. Or am I misunderstanding how that works?

Kem_Tekinay · October 15, 2014, 1:53pm

I know what you’re asking, but regular expressions don’t work that way. The subgroups are literally matched to the parentheses you can see, so there is no way to tell the engine, “keep creating subgroups until you’re out of matches”. I suppose you could create optional matches for the maximum number of lines you think there might be, but honestly, I’d use the pattern above and Split after the fact.
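
A short sketch of that behaviour, written in Python rather than Xojo so it can be run as-is (the variable names are hypothetical): a capturing group inside a repeat is overwritten on every pass, so only the last line survives, whereas capturing the whole run once and splitting it afterwards keeps every line.

```python
import re

text = "Morning everyone\nFinally time for my Ice Bucket Challenge\nTest\n"

# The capturing group sits inside the repeat, so each repetition
# overwrites group 1 and only the final line is kept.
repeated = re.match(r"(?:(.+)\n)+", text)
last_only = repeated.group(1)

# Capture the whole run in one group, then split after the fact.
whole = re.match(r"((?:.+\n)+)", text)
lines = whole.group(1).splitlines()

print(last_only)  # Test
print(lines)
```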

Daniel_Mclaughlan · October 15, 2014, 3:16pm

Thanks Kem, I’ll do it that way. I suppose it’s nice and neat in the sense that I know all the rows for that subtitle will be in that one SubExpression.

Also now that I think about it, in my original Xojo code I was concatenating the two subexpressions I was returning into one line for use in a row of a listbox. This approach avoids that - which could get confusing - and I only need to Split on EndOfLine and add a delimiter such as | to visually show the split.

Thank you very much for all your help. I have other files to parse which I expect will be more challenging, but I’m glad I’m on the right track, in no small part thanks to RegExRX.

Edwin_van_den_Akker · November 15, 2016, 11:06am

Hey Daniel.
Did you ever get it to work?
I was trying the same thing, but not using Regular Expressions. I was actually reading a file line by line, determining the format of each line, and storing the data in a class according to what kind of data I found.

I don’t really understand your approach. But somehow it seems more efficient.
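
For comparison, the line-by-line approach described above might look roughly like the following. This is only a sketch in Python, not anyone’s actual code: the Subtitle class and parse_lines function are hypothetical names, and it assumes entries are separated by a blank line, as in the sample at the top of the thread.

```python
from dataclasses import dataclass, field


@dataclass
class Subtitle:
    index: int
    start: str
    end: str
    lines: list = field(default_factory=list)


def parse_lines(text):
    """Walk the file one line at a time, deciding what each line is."""
    subtitles = []
    current = None        # entry currently collecting text lines
    pending_index = None  # index seen, time stamps not yet seen

    for raw in text.splitlines():
        line = raw.strip()
        if current is None:
            if pending_index is None:
                if line.isdecimal():
                    pending_index = int(line)       # index line
            elif " --> " in line:
                start, end = line.split(" --> ", 1)  # time stamp line
                current = Subtitle(pending_index, start, end)
                pending_index = None
        elif line == "":
            subtitles.append(current)                # blank line ends the entry
            current = None
        else:
            current.lines.append(line)               # subtitle text line

    if current is not None:
        subtitles.append(current)                    # last entry, no blank line
    return subtitles


sample = (
    "1\n00:00:06,849 --> 00:00:07,740\nMorning everyone\n"
    "Finally time for my Ice Bucket Challenge\nTest\n\n"
    "2\n00:00:07,740 --> 00:00:14,009\nSorry it took so long\n"
    "Thank you for my nominations\n"
)
result = parse_lines(sample)
for sub in result:
    print(sub.index, sub.start, sub.end, sub.lines)
```

The regular expression does the same classification in one pass over the whole text, while this version keeps the state explicit, which can be easier to extend when the file is malformed.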

Edwin_van_den_Akker · November 15, 2016, 11:29am

I’ve actually found a nice article that uses RegEx, too.
It’s a module written in JS. But it could be adapted to Xojo, I think.