Regex: Find if any part of a pattern does not match

Robert_Livingston · July 28, 2020, 7:05pm

I am looking for errors in a pattern. I want to find the string if any part of it does not match.

The “correct” pattern I am dealing with is: x#\d\d\d \d \w{11}:
An example of a correct pattern would be: x#104 2 Spous_Partn:

In my case, “any part” means anything past the x#, so I want to find every case where something after the x# does not match. All the patterns, erroneous or otherwise, start with x#.

Examples would be:
x#104 a Spous_Partn:
x#104 2 Spous_Partn:
x#10a 2 Spous_Partn:
x#1042 Spous_Partn:
x#104 2 Spous_Part :
x#104 2 Spous Partn:
x#1B4 2 Spous_Partn:

These are all erroneous because they violate the acceptable pattern in one or another location in the string.
Basically I am trying to set up some kind of an OR logic that detects a mismatch in any of the locations. I would like to do this without using a programming language to set up multiple sequential searches checking each position individually.

Norman_Palardy · July 28, 2020, 7:10pm

a regex could match or not

not matching would be all you needed to look for

other than that I’m not clear what else you might be looking for ?

DerkJ · July 28, 2020, 7:10pm

We need @Kem_Tekinay here…

Robert_Livingston · July 28, 2020, 7:28pm

x#\d\d\d \d \w{11}:

This will match. But I am looking through a 1000-page document for any cases where there is an x# followed by 18 characters (\d\d\d \d \w{11}:) that has a mismatch anywhere along that pattern.

I think Tekinay is jumping out of an airplane and might not return to the forum.

Kem_Tekinay · July 28, 2020, 8:02pm

Not quite yet. Maybe not at all, stay tuned.

Anyway, time for verbs:

^x#\d{3} \d \w{11}:$(*SKIP)(*FAIL)|^.+$

This will look for the matching pattern and, if found, will tell the engine to “skip” its internal pointer to that position (no backtracking), and “fail” to match. Thus a match is actually a fail, and it will start looking again at the next position.

If it does not match the first part, it will go to the “or” and match the second part, which is just, “match this line”.
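For readers whose engine lacks PCRE verbs (Python's standard `re` module, for example, does not support `(*SKIP)(*FAIL)`), the same whole-line behavior can be approximated by filtering lines. This is only a sketch of the effect, not the verb mechanism itself, and the names `GOOD_LINE` and `bad_lines` are made up for illustration.

```python
import re

# The "good" shape from the first post:
# x#, 3 digits, space, 1 digit, space, 11 word characters, colon.
GOOD_LINE = re.compile(r"x#\d{3} \d \w{11}:")

def bad_lines(text):
    """Return every non-empty line that is not exactly one well-formed entry.

    Mirrors the effect of ^GOOD$(*SKIP)(*FAIL)|^.+$ : good lines are
    skipped, any other non-empty line is reported whole.
    """
    return [line for line in text.splitlines()
            if line and not GOOD_LINE.fullmatch(line)]

sample = "x#104 2 Spous_Partn:\nx#104 a Spous_Partn:\nx#1042 Spous_Partn:\n"
flagged = bad_lines(sample)
```

As with the verb-based pattern, this assumes every line is supposed to consist of the entry and nothing else.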

Norman_Palardy · July 28, 2020, 8:10pm

two reg execs ?
one to grab the x# + 18 characters x#.{18} or something
one to check this matches the pattern you first gave
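A minimal sketch of that two-pass idea in Python, assuming 18 characters follow the x# (3 digits, space, digit, space, 11 word characters, colon). The names `CANDIDATE`, `GOOD`, and `find_bad_candidates` are hypothetical, and `{0,18}` is used instead of a fixed width so that entries cut short by the end of a line are still picked up.

```python
import re

# Pass 1: grab x# plus up to 18 following characters (never crossing a newline).
CANDIDATE = re.compile(r"x#.{0,18}")
# Pass 2: the well-formed shape from the first post.
GOOD = re.compile(r"x#\d{3} \d \w{11}:")

def find_bad_candidates(text):
    """Return each x# candidate that does not match the good shape exactly."""
    return [m.group() for m in CANDIDATE.finditer(text)
            if not GOOD.fullmatch(m.group())]

doc = (
    "intro x#104 2 Spous_Partn: trailing\n"
    "x#10a 2 Spous_Partn:\n"
    "plain line\n"
    "x#1B4 2 Sp"
)
bad = find_bad_candidates(doc)
```

Lines without any x# are never candidates, so they are not reported.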

Robert_Livingston · July 29, 2020, 2:32am

This is how I interpret Tekinay’s pattern.

It is adding some additional constraints to the problem presented.
( a ). All lines are supposed to have the basic pattern
( b ). There is no text expected after the basic pattern.
( c ). There is no text expected before the basic pattern.

Accepting these additional restrictions it works! Only BAD patterns get selected. Very elegant.

But let’s say that it is acceptable to have a variable amount of text after the pattern? If the pattern is OK you do not want those lines to be flagged. So we can modify the Tekinay pattern. Get rid of the first $. [As best I can tell, the second $ is superfluous and it does not matter whether it is present or not.]

^x#\d{3} \d \w{11}:(*SKIP)(*FAIL)|^.+$

This slight variant is successful in removing the prohibition on text after the pattern. Those lines will not be flagged as bad unless the leading pattern itself is bad. The restriction ( b ) has been removed. It is not perfect in the sense that now it is not the bad pattern that is selected but rather the entire line that starts with the bad pattern. Actually, for my purposes, that is OK.

Let’s say it is acceptable to have text also before the pattern. Perhaps we can just get rid of the initial ^ in the pattern.

x#\d{3} \d \w{11}:(*SKIP)(*FAIL)|^.+$

For reasons that are not clear to me, that does not work. However a slight modification does make this work.

.*x#\d{3} \d \w{11}:(*SKIP)(*FAIL)|^.+$

This variant is successful in removing the prohibition on text before the pattern. The restriction ( c ) has been removed. The entire line that contains the bad pattern, rather than just the pattern, is flagged but, again, that is OK for me.
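One likely explanation for why the unanchored variant fails: the engine tries every alternative at the first start position before moving on. At the start of a line with leading text, the x# branch cannot match, but ^.+$ can, so the whole line is taken before the engine ever reaches the x#. The sketch below shows that ordering with Python's standard `re` module, with the verbs left out (Python's `re` does not support them) and capture groups added only to show which branch won.

```python
import re

GOOD = r"x#\d{3} \d \w{11}:"
line = "note x#104 2 Spous_Partn:"

# Without a leading .* the first branch cannot start at position 0,
# so the second branch (^.+$) matches the whole line there.
unanchored = re.search(rf"({GOOD})|(^.+$)", line, re.MULTILINE)

# With a leading .* the first branch can start at position 0 and wins.
prefixed = re.search(rf"(.*{GOOD})|(^.+$)", line, re.MULTILINE)

first_branch_wins_unanchored = unanchored.group(1) is not None
first_branch_wins_prefixed = prefixed.group(1) is not None
```

In the verb-based patterns the first branch winning is what triggers the skip, which would explain why the .* prefix is needed.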

My remaining problem is the constraint ( a ). My text contains many lines that do not have and are not expected to have the pattern at all. Unfortunately, these all get flagged.

There is a somewhat kludgy solution which is to assume that if there is a problem it is only in one location and then look for this with a very long “OR” Regex pattern. You just look for cases where there is a conflict in a specific location.

(x#\D\d\d \d \w{11}:)|(x#\d\D\d \d \w{11}:)|(x#\d\d\D \d \w{11}:)|(x#\d\d\d[^ ]\d \w{11}:)|(x#\d\d\d \D \w{11}:)|(x#\d\d\d \d[^ ]\w{11}:)|(x#\d\d\d \d \W\w{10}:)|(x#\d\d\d \d \w\W\w{9}:)|(x#\d\d\d \d \w{2}\W\w{8}:)|(x#\d\d\d \d \w{3}\W\w{7}:)etc.

This works except it will not flag the situation where there is more than one error in the pattern.

Finally, I stumbled on this which is still fairly long and kludgy but works. You just sequentially look for an error in each of the 18 locations of the pattern.

(x#\D.{17})|(x#.\D.{16})|(x#.{2}\D.{15})|(x#.{3}[^ ].{14})|(x#.{4}\D.{13})|(x#.{5}[^ ].{12})|(x#.{6}\W.{11})|(x#.{7}\W.{10})|(x#.{8}\W.{9}) etc.
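Writing all 18 alternatives by hand is error-prone, so here is a sketch that generates the same kind of pattern in Python from a list of negated classes, one per position. The list `NEGATED` and the function `build_mismatch_pattern` are illustrative names, and using `[^:]` for the final colon position is an assumption, since the original list stops at "etc.".

```python
import re

# Negated class for each of the 18 positions after "x#":
# 3 digits, space, digit, space, 11 word characters, colon.
NEGATED = [r"\D"] * 3 + ["[^ ]", r"\D", "[^ ]"] + [r"\W"] * 11 + ["[^:]"]

def build_mismatch_pattern():
    """Build x#.{i}NEG.{17-i} for every position i and join them with |."""
    width = len(NEGATED)  # 18 positions
    alternatives = [
        "x#.{%d}%s.{%d}" % (i, neg, width - 1 - i)
        for i, neg in enumerate(NEGATED)
    ]
    return re.compile("|".join("(?:%s)" % alt for alt in alternatives))

MISMATCH = build_mismatch_pattern()

good = MISMATCH.search("x#104 2 Spous_Partn:")
double_error = MISMATCH.search("x#1B4 a Spous_Partn:")
```

Each alternative needs a full 18 characters after the x#, so an entry that is cut short at the very end of the text can slip through.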

Norman_Palardy · July 29, 2020, 3:24am

someone else suggested maybe

^.*x#\d{3} \d \w{11}:.*$(*SKIP)(*FAIL)|^.+$

Robert_Livingston · July 29, 2020, 4:10am

That runs into the ( a ) problem. All lines without the pattern end up being selected. My actual use case is searching for what is presumably a small number of errors in a huge document.

Markus_Winter · July 31, 2020, 4:29am

Well, as you assume that x# is always there then you could simply use

^.*(x#\d{3} \d \w{11}:).*$(*SKIP)(*FAIL)|x#.+$

(unless there are lots of unrelated x# in the text)

Robert_Livingston · July 31, 2020, 10:20am

Thanks, Markus. Your suggestion basically works and is “neater”.

.*(x#\d{3} \d \w{11}:).*$(*SKIP)(*FAIL)|x#.{18}

I amended it slightly by putting the {18} at the end which results in only the “bad” part being selected rather than the entire line. I don’t think that the ^ at the beginning is useful so I dropped that.

In my use case, there is never more than one of these patterns in any given line. (although there could be zero)
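In engines without `(*SKIP)(*FAIL)`, a negative lookahead gives a similar result. This is a different technique from the one used above, sketched here with Python's standard `re` module; `BAD` and `find_bad` are illustrative names. It checks each x# on its own rather than skipping whole lines, so it also copes with lines holding more than one entry.

```python
import re

# Match x# only when it is NOT followed by the well-formed 18 characters,
# then take up to 18 characters so the offending part is what gets selected.
BAD = re.compile(r"x#(?!\d{3} \d \w{11}:).{0,18}")

def find_bad(text):
    """Return the malformed x# entries found in text."""
    return [m.group() for m in BAD.finditer(text)]

doc = (
    "ok x#104 2 Spous_Partn: rest\n"
    "no marker here\n"
    "see x#1B4 2 Spous_Partn: rest\n"
)
bad = find_bad(doc)
```

Lines that contain no x# at all are left alone, which sidesteps the ( a ) problem.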

Markus_Winter · July 31, 2020, 10:43am

Actually the ^ should speed it up slightly, as the search is restricted to lines, which prevents searching across lines (Kem can correct me if I’m wrong).