RegEx Help

Hi Folks (and especially Kem),

I’ve been trying to come up with a regex filter to take a line of text and pull out the fields, but there may be whitespace in the last field (the filename). The original text looks like this:

i 12K [1] -rw-rw-rw- 1 Administ root 20451742 May 7 2012 F:\50 GB MP3 Folder\Chicago\At Carnegie Hall\01 In The Country (LP Version).mp3
i 22840K [1] -rw-rw-rw- 1 Administ root 10250194 May 7 2012 F:\50 GB MP3 Folder\Chicago\At Carnegie Hall\02 Fancy Colours (LP Version).mp3

I would like to convert it to 10 fields:

i | 12K | [1] | -rw-rw-rw- | 1 | Administ | root | 20451742 | May 7 2012 | F:\50 GB MP3 Folder\Chicago\At Carnegie Hall\01 In The Country (LP Version).mp3

Other considerations

  • the character position at which any field starts may shift
  • the number of spaces may differ
  • the path component may contain any allowed character, including many special characters and multibyte characters

Thoughts?

Kem - an example of creating such a filter with RegExRX would be cool.

This looks like output from ls, so we know what each column will be. The pattern would be something like this:

^(\w)\x20+(\S+)\x20+(\S+)\x20+(\S+)\x20+(\d+)\x20+(\S+)\x20+(\S+)\x20+(\d+)\x20+(\w+\x20+\d{1,2}\x20+(?:\d{4}|\d{2}:\d{2}))\x20+([^\r\n]+)

I’m not sure how I’d provide that example, other than to say that I built the pattern in RegExRX one field at a time, checking the output as I went.

Note that “\x20” denotes a space. You could certainly use a space instead, but I prefer this as I often use free-space mode and spaces don’t count in that mode.
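To illustrate why “\x20” matters in free-space mode, here is a small sketch in Python, whose re.VERBOSE flag behaves like free-space mode for this purpose. It is only an illustration, not the RegExRX workflow: the two-field pattern and the names compact, verbose and bare are made up for the example.

```python
import re

# Compact form: a literal space works the same as \x20 here.
compact = re.compile(r"(\S+) +(\d+)")

# Free-spacing form: under re.VERBOSE, unescaped whitespace in the pattern
# is ignored and "#" starts a comment, so the space is written as \x20.
verbose = re.compile(
    r"""
    (\S+)    # group name column
    \x20+    # one or more spaces
    (\d+)    # size in bytes
    """,
    re.VERBOSE,
)

text = "root   20451742"
print(compact.match(text).groups())
print(verbose.match(text).groups())

# With a bare space, free-spacing mode drops the separator entirely,
# so this pattern is really (root)(\d+).
bare = re.compile(r"(root) (\d+)", re.VERBOSE)
print(bare.match("root 20451742"))
print(bare.match("root20451742").groups())
```

The first two patterns return the same pair of groups. The last one fails on the spaced input and only matches when the space is missing, which is the trap that writing “\x20” avoids.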

“\S” means any character that isn’t whitespace.

“\w” means any word character (a letter, a digit, or an underscore).

“\d” means any digit.

The fields in each line will be in SubExpressionString( 1 ) through ( 10 ).
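For anyone who wants to try the pattern outside Xojo, here is a rough Python sketch of the same idea. It assumes Python's re module treats the pattern the same way for this input. The names LS_LINE and parse_line are invented for the example, and m.groups() stands in for SubExpressionString( 1 ) through ( 10 ).

```python
import re

# The ls-style pattern from this thread, written as Python raw strings.
# re.MULTILINE lets ^ match at the start of every line in a longer text.
LS_LINE = re.compile(
    r"^(\w)\x20+(\S+)\x20+(\S+)\x20+(\S+)\x20+(\d+)\x20+(\S+)\x20+(\S+)"
    r"\x20+(\d+)\x20+(\w+\x20+\d{1,2}\x20+(?:\d{4}|\d{2}:\d{2}))"
    r"\x20+([^\r\n]+)",
    re.MULTILINE,
)

# First sample line from the original question.
sample = (
    r"i 12K [1] -rw-rw-rw- 1 Administ root 20451742 May 7 2012 "
    r"F:\50 GB MP3 Folder\Chicago\At Carnegie Hall"
    r"\01 In The Country (LP Version).mp3"
)


def parse_line(line):
    """Return the 10 captured fields as a list, or None if nothing matches."""
    m = LS_LINE.search(line)
    return list(m.groups()) if m else None


fields = parse_line(sample)
for number, value in enumerate(fields, start=1):
    print(number, value)
```

Because the last group is “[^\r\n]+”, everything after the date lands in field 10, so spaces and multibyte characters in the filename are kept as-is. To process a whole listing, LS_LINE.finditer(text) would yield one match per line.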

(I adjusted the pattern to account for lines where the time is given instead of the year. That’s this part: “(?:\d{4}|\d{2}:\d{2})”.)
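The year-or-time part can be checked on its own. The sketch below, in Python, tests just the date column against a few hand-made values. The “14:35” timestamp is an assumed example of what ls prints for recent files; it does not come from the original listing.

```python
import re

# Date column only: month name, day, then a 4-digit year or HH:MM.
DATE_FIELD = re.compile(r"\w+\x20+\d{1,2}\x20+(?:\d{4}|\d{2}:\d{2})")

# fullmatch requires the whole string to fit the pattern.
for stamp in ("May 7 2012", "May  7 14:35", "May 7 12"):
    match = DATE_FIELD.fullmatch(stamp)
    print(repr(stamp), "->", "ok" if match else "no match")
```

The group is non-capturing (“?:”), so it adds the alternative without shifting the numbering of the 10 fields.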

Thanks, Kem. I just went step by step with RegExRX and finally got that result. I was trying to get too granular with the 8th and 9th fields.