RegEx help (again)

Hi Kem and other RegEx wizards - he’s baaaack!

I’ve been working through manually creating an expression (and with RegExRx) for dealing with file paths and can’t seem to get a handle on a substring handling expression that will do the following:

Given a substring of a parent folder
Display all entries in a list that are exactly one level below that folder

Example strings:

VL:c|306464768|1|646|149640|/Volumes/Argest/GoProMix/
VL:c|306464768|1|15364|149640|/Volumes/Argest/GoProMix/.DS_Store
VL:c|306464768|1|16213743536|149640|/Volumes/Argest/GoProMix/GOPR0226.mov
VL:c|324560896|1|1221811|158476|/Volumes/Argest/GoProMix/GOPR0226.mov.asd
VL:c|324562944|1|238|158477|/Volumes/Argest/GoProMix/GoProDjMix_01_11_2014 Project/
VL:c|324562944|1|6148|158477|/Volumes/Argest/GoProMix/GoProDjMix_01_11_2014 Project/.DS_Store
VL:c|324562944|1|102|158477|/Volumes/Argest/GoProMix/GoProDjMix_01_11_2014 Project/Ableton Project Info/
VL:c|324562944|1|459|158477|/Volumes/Argest/GoProMix/GoProDjMix_01_11_2014 Project/Ableton Project Info/Project8_1.cfg
VL:c|324562944|1|11499|158477|/Volumes/Argest/GoProMix/GoProDjMix_01_11_2014 Project/GoProDjMix_01_11_2014.als
VL:c|324562944|1|0|158477|/Volumes/Argest/GoProMix/GoProDjMix_01_11_2014 Project/Icon
VL:c|324562944|1|136|158477|/Volumes/Argest/GoProMix/GoProDjMix_01_11_2014 Project/Samples/
VL:c|324562944|1|6148|158477|/Volumes/Argest/GoProMix/GoProDjMix_01_11_2014 Project/Samples/.DS_Store
VL:c|324562944|1|68|158477|/Volumes/Argest/GoProMix/GoProDjMix_01_11_2014 Project/Samples/Recorded/
VL:c|324562944|1|881755162|158477|/Volumes/Argest/GoProMix/GoProDjMix_01_11_2014.aif
VL:c|325545984|1|3772193|158957|/Volumes/Argest/GoProMix/GoProDjMix_01_11_2014.aif.asd
VL:c|325550080|1|439195728|158959|/Volumes/Argest/GoProMix/GoProDjMix_01_11_2014.mov
VL:c|326039552|1|1442304044|159198|/Volumes/Argest/GoProMix/GoProDJMix_01_11_2014.wav
VL:c|327649280|1|3971227|159984|/Volumes/Argest/GoProMix/GoProDJMix_01_11_2014.wav.asd
VL:c|327655424|1|15985573296|159987|/Volumes/Argest/GoProMix/GP010226.mov
VL:c|345495552|1|1221203|168698|/Volumes/Argest/GoProMix/GP010226.mov.asd
VL:c|345497600|1|16020452784|168699|/Volumes/Argest/GoProMix/GP020226.mov
VL:c|363376640|1|1212815|177429|/Volumes/Argest/GoProMix/GP020226.mov.asd
VL:c|363378688|1|2921821236|177430|/Volumes/Argest/GoProMix/GP030226.mov
VL:c|366639104|1|344063|179022|/Volumes/Argest/GoProMix/GP030226.mov.asd
VL:c|366639104|1|204|179022|/Volumes/Argest/GoProMix/HQ_ConvertedMix/
VL:c|366639104|1|19582987691|179022|/Volumes/Argest/GoProMix/HQ_ConvertedMix/GOPR0226.mov
VL:c|388495360|1|19306366379|189694|/Volumes/Argest/GoProMix/HQ_ConvertedMix/GP010226.mov
VL:c|410042368|1|19346663851|200215|/Volumes/Argest/GoProMix/HQ_ConvertedMix/GP020226.mov
VL:c|431634432|1|3536776751|210758|/Volumes/Argest/GoProMix/HQ_ConvertedMix/GP030226.mov

For example - given the path “/Volumes/Argest/GoProMix/”, I would want a return of all of the files and folders in that folder while not including the files / folders that are 2+ levels deep. So that the response only includes:

VL:c|306464768|1|15364|149640|/Volumes/Argest/GoProMix/.DS_Store
VL:c|306464768|1|16213743536|149640|/Volumes/Argest/GoProMix/GOPR0226.mov
VL:c|324560896|1|1221811|158476|/Volumes/Argest/GoProMix/GOPR0226.mov.asd
VL:c|324562944|1|238|158477|/Volumes/Argest/GoProMix/GoProDjMix_01_11_2014 Project/
VL:c|324562944|1|881755162|158477|/Volumes/Argest/GoProMix/GoProDjMix_01_11_2014.aif
VL:c|325545984|1|3772193|158957|/Volumes/Argest/GoProMix/GoProDjMix_01_11_2014.aif.asd
VL:c|325550080|1|439195728|158959|/Volumes/Argest/GoProMix/GoProDjMix_01_11_2014.mov
VL:c|326039552|1|1442304044|159198|/Volumes/Argest/GoProMix/GoProDJMix_01_11_2014.wav
VL:c|327649280|1|3971227|159984|/Volumes/Argest/GoProMix/GoProDJMix_01_11_2014.wav.asd
VL:c|327655424|1|15985573296|159987|/Volumes/Argest/GoProMix/GP010226.mov
VL:c|345495552|1|1221203|168698|/Volumes/Argest/GoProMix/GP010226.mov.asd
VL:c|345497600|1|16020452784|168699|/Volumes/Argest/GoProMix/GP020226.mov
VL:c|363376640|1|1212815|177429|/Volumes/Argest/GoProMix/GP020226.mov.asd
VL:c|363378688|1|2921821236|177430|/Volumes/Argest/GoProMix/GP030226.mov
VL:c|366639104|1|344063|179022|/Volumes/Argest/GoProMix/GP030226.mov.asd
VL:c|366639104|1|204|179022|/Volumes/Argest/GoProMix/HQ_ConvertedMix/

Kem, I’m sure that it would help more than me, but if that could be described in relationship to how it would be created in RegEx Rx, it would be like teaching us to fish instead of giving us a fish :).

Tim

Here is what I came up with as a starting point. You can tweak as needed, especially the handling of the prefix:

^VL:c\|\d+\|\d+\|\d+\|\d+\|/Volumes/Argest/GoProMix/[^/\r\n]+(?:/?)$

This identifies exactly the lines you indicated. The point is to look for the path you need followed by a series of characters that are not a slash or EOL. Optionally, it can end with a slash (the last, non-capturing group).
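To make that concrete, here is a minimal Python sketch. It uses Python's `re` module rather than Xojo's RegEx class, so treat it as an illustration of the pattern only. The `SAMPLE` list holds a few lines copied from the listing above, and the name `direct_children` is made up for this example.

```python
import re

# A few lines copied from the sample listing above.
SAMPLE = [
    "VL:c|306464768|1|646|149640|/Volumes/Argest/GoProMix/",
    "VL:c|306464768|1|15364|149640|/Volumes/Argest/GoProMix/.DS_Store",
    "VL:c|324562944|1|238|158477|/Volumes/Argest/GoProMix/GoProDjMix_01_11_2014 Project/",
    "VL:c|324562944|1|6148|158477|/Volumes/Argest/GoProMix/GoProDjMix_01_11_2014 Project/.DS_Store",
    "VL:c|366639104|1|204|179022|/Volumes/Argest/GoProMix/HQ_ConvertedMix/",
    "VL:c|366639104|1|19582987691|179022|/Volumes/Argest/GoProMix/HQ_ConvertedMix/GOPR0226.mov",
]

# Prefix fields, then the parent path, then one path component with an optional trailing slash.
PATTERN = re.compile(
    r"^VL:c\|\d+\|\d+\|\d+\|\d+\|/Volumes/Argest/GoProMix/[^/\r\n]+(?:/?)$"
)

def direct_children(lines):
    """Return the lines whose path sits exactly one level below the parent."""
    return [line for line in lines if PATTERN.search(line)]

for line in direct_children(SAMPLE):
    print(line)
```

The parent folder's own entry is skipped because `[^/\r\n]+` needs at least one character after the parent path, and deeper entries are skipped because a second slash can never be followed by more characters before `$`.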

The biggest problem you’ll have when converting this type of string into a pattern is that you can’t control the characters that appear within it. For example, if you have a valid path like “/Volumes/HD/This|Or|That”, it will throw off your results since “|” means something to the RegEx engine.

The way around this is to convert your path to its hex equivalent instead. You can safely represent any character with \x{NNNN} where NNNN represents its hex code. The “|” would be \x{7C}, and the Xojo code to convert it would be pretty fast.

(Come to think of it, I should add this as an option to RegExRX.)
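As a rough illustration of the hex idea, the Python sketch below builds the `\x{NN}` form of a path. The helper name `pcre_hex_encode` is hypothetical. Python's own `re` module does not accept the `\x{...}` syntax, so the second half uses `re.escape`, which gives the same protection there; that swap is an assumption of this sketch, not part of the suggestion above.

```python
import re

def pcre_hex_encode(path):
    """Encode every character of path as a PCRE-style hex escape."""
    return "".join("\\x{%X}" % ord(ch) for ch in path)

# The "|" characters come out as \x{7C}, so the engine sees them as literals.
print(pcre_hex_encode("/Volumes/HD/This|Or|That/"))

# Python equivalent: re.escape() neutralises "|" and other regex tokens in the path.
parent = "/Volumes/HD/This|Or|That/"
pattern = re.compile(
    r"^VL:c\|\d+\|\d+\|\d+\|\d+\|" + re.escape(parent) + r"[^/\r\n]+(?:/?)$"
)
print(bool(pattern.search("VL:c|1|1|1|1|/Volumes/HD/This|Or|That/file.txt")))
```

Either way, the path is turned into something the engine reads as plain text before it is spliced into the larger pattern.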

As for the process, I started the pattern with the first line, escaped the “|” characters, and added “^” to indicate the start of the line. There was only one match (the first line) so I realized that the numbers were different per line and replaced those with “\d”. That matched every line.

Since there can’t be a slash character in a path other than to designate a folder, I added [^/]+ to indicate a series of characters that are not a slash. That worked fine, but matched too much since the EOL characters qualified. I modified that a bit to [^/\r\n]+ to exclude those too, and the results started to take shape.

Realizing that a subfolder would end with a slash, I added that optionally, followed by the end-of-line anchor, and there you have it.

I believe that you’ve hit my problem - the “|” symbols used for delimiters.

Also since I need to catch lines starting with VL:c and VL:i, and the third field is hex, not numeric, I’ve modified the expression to:

"""^VL:.\|\d+\|\.+\|\d+\|\d+\|" + theParentPath + "[^/\r\n]+(?:/?)$"""

RegEx Rx seems to like it, but I’m getting no match found with this as the Source text:

VL:c|0|1|1428|-1|/
VL:c|0|1|374|-1|/Volumes/
VL:c|0|1|1870|-1|/Volumes/WorkRAID/
VL:c|0|1|3400|-1|/Volumes/WorkRAID/66 GB/
VL:c|0|1|39940|-1|/Volumes/WorkRAID/66 GB/.DS_Store
VL:c|0|1|3366|-1|/Volumes/WorkRAID/66 GB/Music/
VL:c|0|1|39940|-1|/Volumes/WorkRAID/66 GB/Music/.DS_Store
VL:c|0|1|102|-1|/Volumes/WorkRAID/66 GB/Music/1000 Homo DJs/
VL:c|0|1|510|-1|/Volumes/WorkRAID/66 GB/Music/1000 Homo DJs/Supernaut _ Apathy/
VL:c|0|1|14512497|-1|/Volumes/WorkRAID/66 GB/Music/1000 Homo DJs/Supernaut _ Apathy/01 Supernaut.mp3
VL:c|15872|1|17291162|30|/Volumes/WorkRAID/66 GB/Music/1000 Homo DJs/Supernaut _ Apathy/02 Hey Asshole!.mp3
VL:c|35328|1|10076476|68|/Volumes/WorkRAID/66 GB/Music/1000 Homo DJs/Supernaut _ Apathy/03 Apathy.mp3
VL:c|46592|1|11909705|90|/Volumes/WorkRAID/66 GB/Music/1000 Homo DJs/Supernaut _ Apathy/04 Better Ways.mp3
VL:c|59904|1|7744|116|/Volumes/WorkRAID/66 GB/Music/1000 Homo DJs/Supernaut _ Apathy/AlbumArt_{E8988690-8796-42DB-AB5D-744D416B957C}_Large.jpg
VL:c|59904|1|2131|116|/Volumes/WorkRAID/66 GB/Music/1000 Homo DJs/Supernaut _ Apathy/AlbumArt_{E8988690-8796-42DB-AB5D-744D416B957C}_Small.jpg
VL:c|59904|1|2131|116|/Volumes/WorkRAID/66 GB/Music/1000 Homo DJs/Supernaut _ Apathy/AlbumArtSmall.jpg
VL:c|59904|1|9032742|116|/Volumes/WorkRAID/66 GB/Music/1000 Homo DJs/Supernaut _ Apathy/Apathy.mp3
VL:c|70144|1|10392198|136|/Volumes/WorkRAID/66 GB/Music/1000 Homo DJs/Supernaut _ Apathy/Better Ways.mp3
VL:c|81408|1|358|158|/Volumes/WorkRAID/66 GB/Music/1000 Homo DJs/Supernaut _ Apathy/desktop.ini
VL:c|81408|1|7744|158|/Volumes/WorkRAID/66 GB/Music/1000 Homo DJs/Supernaut _ Apathy/Folder.jpg
VL:c|81408|1|15863702|158|/Volumes/WorkRAID/66 GB/Music/1000 Homo DJs/Supernaut _ Apathy/Hey Asshole!.mp3
VL:c|99328|1|13104472|193|/Volumes/WorkRAID/66 GB/Music/1000 Homo DJs/Supernaut _ Apathy/Supernaut.mp3
VL:c|114176|1|102|222|/Volumes/WorkRAID/66 GB/Music/Adriano Banchieri/
VL:c|114176|1|340|222|/Volumes/WorkRAID/66 GB/Music/Adriano Banchieri/Empire Brass_Gabrieli/
VL:c|114176|1|7658|222|/Volumes/WorkRAID/66 GB/Music/Adriano Banchieri/Empire Brass_Gabrieli/AlbumArt_{591231B2-1DA5-4CE0-B9F9-E7E3D046EAF3}_Large.jpg
VL:c|114176|1|2190|222|/Volumes/WorkRAID/66 GB/Music/Adriano Banchieri/Empire Brass_Gabrieli/AlbumArt_{591231B2-1DA5-4CE0-B9F9-E7E3D046EAF3}_Small.jpg
VL:c|114176|1|2190|222|/Volumes/WorkRAID/66 GB/Music/Adriano Banchieri/Empire Brass_Gabrieli/AlbumArtSmall.jpg
VL:c|114176|1|1531239|222|/Volumes/WorkRAID/66 GB/Music/Adriano Banchieri/Empire Brass_Gabrieli/Concerto Primo.mp3
VL:c|115712|1|2483539|225|/Volumes/WorkRAID/66 GB/Music/Adriano Banchieri/Empire Brass_Gabrieli/Concerto Secondo.mp3
VL:c|118272|1|1851391|230|/Volumes/WorkRAID/66 GB/Music/Adriano Banchieri/Empire Brass_Gabrieli/Concerto Terzo.mp3
VL:c|120320|1|353|234|/Volumes/WorkRAID/66 GB/Music/Adriano Banchieri/Empire Brass_Gabrieli/desktop.ini
VL:c|120320|1|7658|234|/Volumes/WorkRAID/66 GB/Music/Adriano Banchieri/Empire Brass_Gabrieli/Folder.jpg
VL:c|120320|1|102|234|/Volumes/WorkRAID/66 GB/Music/Al Jarreau/
VL:c|120320|1|68|234|/Volumes/WorkRAID/66 GB/Music/Al Jarreau/Best Of Al Jarreau/
VL:c|120320|1|102|234|/Volumes/WorkRAID/66 GB/Music/Alan Hovhaness/
VL:c|120320|1|102|234|/Volumes/WorkRAID/66 GB/Music/Alan Hovhaness/Music for Organ, Brass & Percu/
VL:c|120320|1|5742050|234|/Volumes/WorkRAID/66 GB/Music/Alan Hovhaness/Music for Organ, Brass & Percu/The Prayer of Saint Gregory.mp3
VL:c|126976|1|102|247|/Volumes/WorkRAID/66 GB/Music/Albert Collins_Robert Cray_Johnny Copeland/
VL:c|126976|1|68|247|/Volumes/WorkRAID/66 GB/Music/Albert Collins_Robert Cray_Johnny Copeland/Showdown!/
VL:c|126976|1|102|247|/Volumes/WorkRAID/66 GB/Music/Alison Krauss/
VL:c|126976|1|272|247|/Volumes/WorkRAID/66 GB/Music/Alison Krauss/O-Brother Where Art Thou_/
VL:c|126976|1|9592|247|/Volumes/WorkRAID/66 GB/Music/Alison Krauss/O-Brother Where Art Thou_/AlbumArt_{C6B5C3D0-5F6E-4970-8CA7-8458897D1102}_Large.jpg
VL:c|126976|1|2406|247|/Volumes/WorkRAID/66 GB/Music/Alison Krauss/O-Brother Where Art Thou_/AlbumArt_{C6B5C3D0-5F6E-4970-8CA7-8458897D1102}_Small.jpg
VL:c|126976|1|2406|247|/Volumes/WorkRAID/66 GB/Music/Alison Krauss/O-Brother Where Art Thou_/AlbumArtSmall.jpg
VL:c|126976|1|370|247|/Volumes/WorkRAID/66 GB/Music/Alison Krauss/O-Brother Where Art Thou_/desktop.ini
VL:c|126976|1|2820527|247|/Volumes/WorkRAID/66 GB/Music/Alison Krauss/O-Brother Where Art Thou_/Down To The River To Pray.mp3
VL:c|130048|1|9592|253|/Volumes/WorkRAID/66 GB/Music/Alison Krauss/O-Brother Where Art Thou_/Folder.jpg
VL:c|130048|1|102|253|/Volumes/WorkRAID/66 GB/Music/Alison Krauss & Gillian Welch/
VL:c|130048|1|272|253|/Volumes/WorkRAID/66 GB/Music/Alison Krauss & Gillian Welch/O-Brother Where Art Thou_/
VL:c|130048|1|9592|253|/Volumes/WorkRAID/66 GB/Music/Alison Krauss & Gillian Welch/O-Brother Where Art Thou_/AlbumArt_{C6B5C3D0-5F6E-4970-8CA7-8458897D1102}_Large.jpg
VL:c|130048|1|2406|253|/Volumes/WorkRAID/66 GB/Music/Alison Krauss & Gillian Welch/O-Brother Where Art Thou_/AlbumArt_{C6B5C3D0-5F6E-4970-8CA7-8458897D1102}_Small.jpg
VL:c|130048|1|2406|253|/Volumes/WorkRAID/66 GB/Music/Alison Krauss & Gillian Welch/O-Brother Where Art Thou_/AlbumArtSmall.jpg
VL:c|130048|1|370|253|/Volumes/WorkRAID/66 GB/Music/Alison Krauss & Gillian Welch/O-Brother Where Art Thou_/desktop.ini
VL:c|130048|1|9592|253|/Volumes/WorkRAID/66 GB/Music/Alison Krauss & Gillian Welch/O-Brother Where Art Thou_/Folder.jpg
VL:c|130048|1|3810673|253|/Volumes/WorkRAID/66 GB/Music/Alison Krauss & Gillian Welch/O-Brother Where Art Thou_/I'll Fly Away.mp3

BTW - I also learned how to seriously cripple RegEx Rx :) - 125MB of Source Text.

I can’t edit that, but the real expression is:

"""^VL:.\|\d+\|.+\|\d+\|\d+\|" + theParentPath + "[^/\r\n]+(?:/?)$"""
My retype added a “\” before the “.” for field 3.

Also, for the test, I’m passing the literal “/Volumes/” as theParentPath.

Your pattern has quotes, literally, at the start and end. Also, note that the 4th digit group has a negative sign. I’ve modified the pattern thusly:

^VL:.\|-?\d+\|[[:xdigit:]]+\|-?\d+\|-?\d+\|/Volumes/[^/\r\n]+(?:/?)$

[[:xdigit:]]+ means “a series of hex digits”. You could also do that as [0-9A-F]+. There was one match in your sample text, as expected.
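A small Python sketch of why those two changes matter, using four lines from the sample text. Python's `re` module has no `[[:xdigit:]]` class, so the sketch assumes `[0-9A-Fa-f]+` as a stand-in; the names `OLD` and `NEW` are just labels for this example.

```python
import re

# First four lines of the sample text; the fifth field is -1.
LINES = [
    "VL:c|0|1|1428|-1|/",
    "VL:c|0|1|374|-1|/Volumes/",
    "VL:c|0|1|1870|-1|/Volumes/WorkRAID/",
    "VL:c|0|1|3400|-1|/Volumes/WorkRAID/66 GB/",
]

# Earlier pattern: plain \d+ cannot match the "-1" field.
OLD = re.compile(r"^VL:.\|\d+\|.+\|\d+\|\d+\|/Volumes/[^/\r\n]+(?:/?)$")

# Revised pattern: optional minus sign, and a hex-digit class for field 3.
NEW = re.compile(
    r"^VL:.\|-?\d+\|[0-9A-Fa-f]+\|-?\d+\|-?\d+\|/Volumes/[^/\r\n]+(?:/?)$"
)

old_hits = [line for line in LINES if OLD.search(line)]
new_hits = [line for line in LINES if NEW.search(line)]
print(old_hits)
print(new_hits)
```

With these four lines, only the `/Volumes/WorkRAID/` entry is a direct child of `/Volumes/`, and only the revised pattern finds it.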

Sorry, my quotes were for passing through a shell to egrep.

The handling of the ‘-’ sign and [:xdigit:] fixed it within RegEx Rx, but if I pass that through to egrep, I get no lines matched.

egrep -a "^VL:.\|-?\d+\|[[:xdigit:]]+\|-?\d+\|-?\d+\|/Volumes/[^/\r\n]+(?:/?)$" CatFile.out

CatFile.out is the complete file I’m parsing. I’m trying to not load up 100MB+ files into memory for parsing from within the Xojo app and using egrep via the Shell to just return the matching lines (a MUCH smaller subset).

The good news is that the test in RegEx Rx only takes 274ms while my prior convolutions were taking 100’s of seconds.

Still more - if I extend the search literal to include a subfolder:

^VL:.\|-?\d+\|[[:xdigit:]]+\|-?\d+\|-?\d+\|/Volumes/WorkRAID[^/\n\r]+(?:/?)$
I get no matches - even in RegEx Rx.

In the second version, you need the trailing “/” after the subfolder.

I wonder if you have to double up the backslashes for egrep? Also, it might not like that bracket expression.

"^VL:.\\|-?\\d+\\|[0-9A-F]+\\|-?\\d+\\|-?\\d+\\|/Volumes/[^/\\r\\n]+(?:/?)$"

As a test first, see if you get any matches from the simple pattern:

“\d”

If this matches the letter “d”, you have to double up the backslashes.

Single escapes are fine. If I use this:

"^VL:c\|-?\d+\|[[:xdigit:]]+\|\d+\|\d+\|/Volumes/WorkRAID/BRU_Storage/"

egrep provides the subset that is in the BRU_Storage folder. If I extend that to only get the level below that, I get no matches:

"^VL:c\|-?\d+\|[[:xdigit:]]+\|\d+\|\d+\|/Volumes/WorkRAID/BRU_Storage/[^/\r\n]+(?:/?)$"
I also then get no matches in RegEx Rx (without the quotes).
Going back to simply

^VL:c\|\d+\|[[:xdigit:]]+\|\d+\|-?\d+\|/Volumes/[^/\r\n]+(?:/?)$
Works in RegEx Rx, but fails in egrep.

Okay, taking a step back, the following works “sort of” with egrep:

^VL:c\|\d+\|[[:xdigit:]]+\|\d+\|-?\d+\|/Volumes/WorkRAID/66 GB/Music/[^/\r\n]+(?:/?)$
I do get matches, but far fewer than I expect (and than should match). RegEx Rx finds 97 matches (correct), but egrep somehow only returns 14, and there are missing matches between the ones returned. Also, egrep doesn’t find files that don’t have a trailing “/”.

BTW - for clarification, only the 5th field can contain a negative number and I’m seeing the same results on OS X and Linux…

Can you post that data set so I can test against it? I mean a subset with BRU_Storage.

Also, did you try substituting [0-9A-F]?

Never mind about the bracket expression. I just tested and it’s fine.

BTW, did you make sure your file has the proper EOL characters in them?

Finally, since you are only concerned about individual lines anyway, have you considered using a TextInputStream and matching against each line as you read them in individually?

I started there and that was even slower doing it a ReadLine at a time…

However, I’ve just reimplemented the process using this code:

[code]
// Xojo string literals do not treat "\" as an escape, so the pattern is written as-is.
theExp.SearchPattern = "^VL:.\|\d+\|[[:xdigit:]]+\|\d+\|-?\d+\|" + theParentPath + "[^/\r\n]+(?:/?)$"
theExp.Options.Greedy = False

tis = TextInputStream.Open(theFile)
While Not tis.EOF
  theResults = tis.ReadLine // one catalog entry per line
  theMatch = theExp.Search(theResults)
  If theMatch <> Nil Then // Nil means this line is not a direct child
    Listbox1.AddRow(theMatch.SubExpressionString(0))
  End If
Wend
tis.Close
[/code]
And it is quite quick. Plus, the PCRE library doesn’t seem to have any issues with the search pattern here:

^VL:.\|\d+\|[[:xdigit:]]+\|\d+\|-?\d+\|" + theParentPath + "[^/\r\n]+(?:/?)$
and I am now getting the proper results.
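For readers who want to try the same line-at-a-time approach outside Xojo, here is a hedged Python sketch. The function name `iter_children` is hypothetical, `[0-9A-Fa-f]+` stands in for `[[:xdigit:]]+` (which Python's `re` lacks), and `re.escape` is used on the parent path as an assumption of this sketch. An `io.StringIO` object stands in for the real catalog file.

```python
import io
import re

PREFIX = r"^VL:.\|-?\d+\|[0-9A-Fa-f]+\|-?\d+\|-?\d+\|"

def iter_children(lines, parent_path):
    """Yield entries exactly one level below parent_path, one line at a time."""
    pattern = re.compile(PREFIX + re.escape(parent_path) + r"[^/\r\n]+(?:/?)$")
    for line in lines:
        line = line.rstrip("\r\n")  # drop the EOL so $ anchors at the end of the path
        if pattern.search(line):
            yield line

# io.StringIO stands in for an open file; an object from open() iterates the same way.
listing = io.StringIO(
    "VL:c|0|1|374|-1|/Volumes/\n"
    "VL:c|0|1|1870|-1|/Volumes/WorkRAID/\n"
    "VL:c|0|1|3400|-1|/Volumes/WorkRAID/66 GB/\n"
    "VL:c|0|1|39940|-1|/Volumes/WorkRAID/66 GB/.DS_Store\n"
)
for hit in iter_children(listing, "/Volumes/WorkRAID/"):
    print(hit)
```

Because the function is a generator over any iterable of lines, only one line is held in memory at a time, which is the same property that makes the ReadLine loop workable on 100MB+ files.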

Is it possible that older RS (circa 2007) may have had issues with RegEx performance? That was the last time that I tried doing this internally within the RS code.

Thanks for sticking with me.

There is still an issue with performance, but only when doing repeated matches against the same text. You are doing it against a single line at a time, so it will be quite quick.

Of note, the MBS version doesn’t suffer from that performance problem.

You are going to run into trouble if theParentPath has characters that are regex tokens like “[” or “(”. I recommend code like this:

  // Replace every character of the path with its \x{NN} hex escape
  dim chars() as string = theParentPath.Split( "" )
  for i as integer = 0 to chars.Ubound
    dim codepoint as integer = chars( i ).Asc
    chars( i ) = "\x{" + Hex( codepoint ) + "}"
  next i
  dim encodedPath as string = join( chars, "" )

  theExp.SearchPattern = "^VL:.\|\d+\|[[:xdigit:]]+\|\d+\|-?\d+\|" + encodedPath + "[^/\r\n]+(?:/?)$"

Excellent point and I already encode the path just for the special character situations.

Hmmm, I’ll compare the current Xojo RegEx against the RegExMBS times and report back.