More RegEx help

I would like to extract lines from a huge list that looks mostly like this:

35/00 DZTPROMAEBKVF ENCLOSURE SERVICES FAILURE
35/01 DZTPROMAEBKVF UNSUPPORTED ENCLOSURE FUNCTION
35/02 DZTPROMAEBKVF ENCLOSURE SERVICES UNAVAILABLE
35/03 DZTPROMAEBKVF ENCLOSURE SERVICES TRANSFER FAILURE
35/04 DZTPROMAEBKVF ENCLOSURE SERVICES TRANSFER REFUSED
35/05 DZT ROMAEBKVF ENCLOSURE SERVICES CHECKSUM ERROR
ASC/ . . . . .
ASCQ DZTPROMAEBKVF Description
3B/00 T SEQUENTIAL POSITIONING ERROR
3B/01 T TAPE POSITION ERROR AT BEGINNING-OF-MEDIUM
3B/02 T TAPE POSITION ERROR AT END-OF-MEDIUM
3B/03 TAPE OR ELECTRONIC VERTICAL FORMS UNIT NOT READY
3B/04 SLEW FAILURE
3B/05 PAPER JAM
3B/06 FAILED TO SENSE TOP-OF-FORM
3B/07 FAILED TO SENSE BOTTOM-OF-FORM
3B/08 T REPOSITION ERROR
3B/09 READ PAST END OF MEDIUM
3B/0A READ PAST BEGINNING OF MEDIUM
3B/0B POSITION PAST END OF MEDIUM
3B/0C T POSITION PAST BEGINNING OF MEDIUM
3B/0D DZT ROM BK MEDIUM DESTINATION ELEMENT FULL
3B/0E DZT ROM BK MEDIUM SOURCE ELEMENT EMPTY

I want only the lines where the T and M fields are populated, so that the above would parse down to:

35/01 DZTPROMAEBKVF UNSUPPORTED ENCLOSURE FUNCTION
35/02 DZTPROMAEBKVF ENCLOSURE SERVICES UNAVAILABLE
35/03 DZTPROMAEBKVF ENCLOSURE SERVICES TRANSFER FAILURE
35/04 DZTPROMAEBKVF ENCLOSURE SERVICES TRANSFER REFUSED
35/05 DZT ROMAEBKVF ENCLOSURE SERVICES CHECKSUM ERROR
3B/00 T SEQUENTIAL POSITIONING ERROR
3B/01 T TAPE POSITION ERROR AT BEGINNING-OF-MEDIUM
3B/02 T TAPE POSITION ERROR AT END-OF-MEDIUM
3B/08 T REPOSITION ERROR
3B/0C T POSITION PAST BEGINNING OF MEDIUM
3B/0D DZT ROM BK MEDIUM DESTINATION ELEMENT FULL
3B/0E DZT ROM BK MEDIUM SOURCE ELEMENT EMPTY

I’ve tried the obvious pattern, “^\d\d/\d\d\s+…T…M…\s+.*”, but it seems to stop parsing at the odd lines (lines 7 and 8 above), and there are lots more odd lines … Also, I’d like to get lines where either the T or the M is set, not only those where both are.

Thanks,
Tim

There is a disconnect here. Your pattern looks for lines that start with two digits, but some of these lines start with digit-letter.

Edit: I’m assuming that the lines can start with hex digits/hex digits, so I’ll craft the pattern that way.
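To illustrate the mismatch, here is a small sketch in Python (used only for illustration; the thread itself is about a PCRE pattern). It assumes `[0-9A-Fa-f]` as a stand-in for the hex-digit class and compares it with `\d\d/\d\d` on three prefixes taken from the listing.

```python
import re

# Prefixes copied from the sample listing in the question.
codes = ["35/00", "3B/00", "3B/0A"]

# The original attempt: decimal digits only.
digits_only = re.compile(r"^\d\d/\d\d")
# Hex digits on both sides of the slash.
hex_pairs = re.compile(r"^[0-9A-Fa-f]{2}/[0-9A-Fa-f]{2}")

digit_hits = [c for c in codes if digits_only.match(c)]
hex_hits = [c for c in codes if hex_pairs.match(c)]

print(digit_hits)
print(hex_hits)
```

Only the all-decimal prefix survives the `\d` version, while the hex version accepts all three.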

[quote=374389:@Kem Tekinay]There is a disconnect here. Your pattern looks for lines that start with two digits, but some of these lines start with digit-letter.

Edit: I’m assuming that the lines can start with hex digits/hex digits, so I’ll craft the pattern that way.[/quote]
D’oh! I copied and pasted from a different section of the document than where I was looking. Glad I did, otherwise the HEX stuff would have been a mystery!

Are the T and M always in the same relative position? In other words, the column number for M is always the same as is the column number of T?

Yes - the format is always like that. ASC/ASCQ in hex, 2 spaces, list of 13 flags (and I only want the T and M entries), 2 spaces, and ending with the plain text.

That didn’t answer my question :) but I saw the answer after I posted. “M”, for example, is in one column on some lines but another on others.

This will identify lines that have either M or T, not both:

(?x)

# starts with hex*2/hex*2
^[[:xdigit:]]{2} / [[:xdigit:]]{2}
\x20\x20

# next thirteen can contain T or M, not both
(?=.{0,12}[MT]) # lookahead to make sure M or T in the next 13
(?| 
 ([DZTPROAEBKVF\x20]{13}) # no M
 | # or
 ([DZPROMAEBKVF\x20]{13}) # no T
)

# rest of the line
\x20\x20
.*

Note: This assumes the codes given are the only ones allowed.
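For anyone who wants to try the "one of T or M, not both" idea outside a PCRE engine, here is a rough Python sketch. Python's `re` module has neither POSIX classes like `[[:xdigit:]]` nor the `(?|` branch-reset group, so this version swaps in `[0-9A-Fa-f]` and a plain non-capturing group. It assumes the 13-column flag field described earlier in the thread (order `DZTPROMAEBKVF`) with two-space separators. The `make_line` helper and the M-only entry are made up purely for the demonstration.

```python
import re

FLAG_ORDER = "DZTPROMAEBKVF"  # the 13 device columns from the listing header

def make_line(code, flags, text):
    """Build a hypothetical fixed-width line: code, 2 spaces, 13 flag columns, 2 spaces, text."""
    field = "".join(ch if ch in flags else " " for ch in FLAG_ORDER)
    return f"{code}  {field}  {text}"

# T or M somewhere in the 13-column flag field, but not both.
EXACTLY_ONE = re.compile(
    r"""
    ^[0-9A-Fa-f]{2}/[0-9A-Fa-f]{2}   # hex ASC, slash, hex ASCQ
    \x20\x20                         # two-space separator
    (?=.{0,12}[MT])                  # peek: at least one M or T in the 13 columns
    (?: [DZTPROAEBKVF\x20]{13}       # field with no M
      | [DZPROMAEBKVF\x20]{13} )     # or field with no T
    \x20\x20 .*                      # separator and description
    """,
    re.VERBOSE,
)

t_only = make_line("3B/00", "T", "SEQUENTIAL POSITIONING ERROR")
both = make_line("35/01", "DZTPROMAEBKVF", "UNSUPPORTED ENCLOSURE FUNCTION")
neither = make_line("3B/04", "", "SLEW FAILURE")
m_only = make_line("3B/0D", "M", "HYPOTHETICAL M-ONLY ENTRY")  # flags invented for the demo

results = [bool(EXACTLY_ONE.match(s)) for s in (t_only, both, neither, m_only)]
print(results)
```

The two alternatives do the "not both" work: each one simply leaves one of the two letters out of its character class.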

I need both or either, so …

^[[:xdigit:]]{2}/[[:xdigit:]]{2}\x20\x20(?=.{0,12}[MT]).{13}\x20\x20.*

Basically, the SCSI ASC/ASCQ codes are standardized, but the T10 committee’s list covers all 13 device categories in one huge listing. I just need the tape (T) and medium changer (M) entries from the thousands of lines.

Thanks. It was the look-ahead that I didn’t grok.
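For later readers puzzling over the same thing: a lookahead is zero-width, so it checks what follows without moving the match position. A small Python sketch of the difference, assuming the fixed-width layout from this thread (code, two spaces, 13 flag columns, two spaces, text) and `[0-9A-Fa-f]` in place of `[[:xdigit:]]`; the variable names are just for illustration.

```python
import re

PREFIX = r"^[0-9A-Fa-f]{2}/[0-9A-Fa-f]{2}\x20\x20"

# Peek only: after the lookahead, the match position is still at the first flag column.
peek_only = re.compile(PREFIX + r"(?=.{0,12}[MT])\x20\x20.*")
# Peek, then consume the 13 flag columns so the second separator lines up.
peek_then_consume = re.compile(PREFIX + r"(?=.{0,12}[MT]).{13}\x20\x20.*")

flagged = "35/01  DZTPROMAEBKVF  UNSUPPORTED ENCLOSURE FUNCTION"
unflagged = "3B/05  " + " " * 13 + "  PAPER JAM"  # no device flags set

peek_only_hit = peek_only.match(flagged) is not None
consume_hit = peek_then_consume.match(flagged) is not None
unflagged_hit = peek_then_consume.match(unflagged) is not None

print(peek_only_hit, consume_hit, unflagged_hit)
```

In the peek-only version the two-space separator is tested against the `D` in the first flag column, so the flagged line is rejected even though the lookahead succeeded.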

Why make this so complicated?

  1. Load the lines into an array.
  2. Split each line on double spaces.
  3. Clean up the pieces with Trim.
  4. Examine the result for T and M lines.

I guess it’s all a matter of perspective. I believe your solution to be more complicated, but I would modify it to use Mid to grab the codes instead of splitting on double-spaces.
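A rough sketch of the fixed-column idea, written in Python rather than Xojo, with string slicing standing in for Mid. It assumes exact column alignment (5-character code, two spaces, 13 flag columns in the order DZTPROMAEBKVF, two spaces, text), so T sits in the third flag column and M in the seventh. The function names and the sample text are hypothetical.

```python
HEX_DIGITS = set("0123456789ABCDEFabcdef")

def has_code_prefix(line):
    """True when the line starts with hex pair, slash, hex pair, then two spaces."""
    return (
        len(line) >= 7
        and line[2] == "/"
        and line[5:7] == "  "
        and all(ch in HEX_DIGITS for ch in line[0:2] + line[3:5])
    )

def select_tape_and_changer(text):
    """Keep lines whose T column or M column is set (either or both)."""
    kept = []
    for line in text.splitlines():
        if not has_code_prefix(line):
            continue  # skips header lines such as the "ASCQ ..." row
        flags = line[7:20]  # the 13 flag columns
        if flags[2:3] == "T" or flags[6:7] == "M":
            kept.append(line)
    return kept

# Hypothetical sample with exact fixed-width columns.
sample = "\n".join([
    "35/01  DZTPROMAEBKVF  UNSUPPORTED ENCLOSURE FUNCTION",
    "ASCQ  DZTPROMAEBKVF  Description",
    "3B/04  " + " " * 13 + "  SLEW FAILURE",
    "3B/08  " + "  T".ljust(13) + "  REPOSITION ERROR",
])
selected = select_tape_and_changer(sample)
print(selected)
```

Slicing avoids the stray empty pieces that a double-space split produces when most flag columns are blank.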

It was a one-shot operation, not something that would be done many times.