Regex Search Patterns

Bill_Martini · December 8, 2014, 1:28am

For years I’ve treated regex the same way I treat Ebola: I’ve kept my distance.

I now find myself needing to use it for many areas of a project. I’ve pieced together several find and replace functions from things I’ve found on the internet.

I would think what I’m looking for is asked frequently, but I can’t seem to find anything in searches; I’m possibly not asking Google properly. I’m hoping there is a regex person who can point me in the right direction.

I need to search a string for a given phrase and match it in four ways: as an exact match of a word, contained within a word, at the start of a word, or at the end of a word. The matched word needs to be retrieved for each of the searches.

I’ve tried to tinker with \b but haven’t discovered how to do anything but an exact match, and I’m not sure I’m doing that properly. The nuances of regex are an art, and I’m finger painting.

Thanks for any help.

Jason_King · December 8, 2014, 1:38am

Check out RegExRX by @Kem Tekinay. It will allow you to experiment with patterns to get your program working just right.

Mike_Cotrone · December 8, 2014, 1:41am

Yes, +1 for RegExRX; I use it all of the time. It’s a great tool.

Bill_Martini · December 8, 2014, 2:25am

Thanks Mike and Jason. I downloaded the trial RegExRx and it looks like a great product for assembling patterns, if you know what you are doing already.

I thought I could figure out what I needed easily enough, so before I came to the forum for help I wrote a small app that does, in a very meager way, what RegExRX does. I’ve done a lot of experimenting in both my app and RegExRX, but all I have is a long list of what doesn’t work and no understanding of why. This is standard text-editor find-and-replace fodder that is found in so many applications; I’d think the internet would be overflowing with examples…

Kem_Tekinay · December 8, 2014, 2:38am

A great resource is

http://regular-expressions.info

In your case, let’s say you want to find any word that contains “dog”. This pattern will do the trick:

\b\S*dog\S*

This means: start at a word boundary, followed by any number of non-whitespace characters, then “dog”, then any number of non-whitespace characters.

It won’t matter if “dog” is the entire word, at the start, end, or somewhere in the middle.
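For anyone who wants to sanity-check the pattern outside Xojo, here is a minimal Python sketch. Python's `re` module is a different engine from the one Xojo uses, so treat this as an illustration of the pattern rather than of Xojo's RegEx class; the sample sentence is made up.

```python
import re

# Kem's pattern, with "dog" as the example search word.
PATTERN = re.compile(r"\b\S*dog\S*")

# Hypothetical sample text covering all four cases:
# whole word, end of word, start of word, middle of word.
text = "The dog chased a bulldog past the doghouse; underdogs watched the cat."

# findall returns each run of non-whitespace characters that contains "dog".
matches = PATTERN.findall(text)
print(matches)  # ['dog', 'bulldog', 'doghouse;', 'underdogs']
```

Note that \S also matches punctuation, so the trailing semicolon comes back as part of "doghouse;". If that matters, the match would need trimming afterwards.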

The problem is, you don’t know what the user will enter so you don’t want to use their input verbatim in case they use a character that means something to the regex engine. I recommend this code:

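dim searchWord as string = theUsersInput
dim chars() as string = searchWord.Split( "" ) // one element per character
for i as integer = 0 to chars.Ubound
  select case chars( i )
  case "a" to "z", "0" to "9"
    // plain letters and digits are safe to use as-is
  else
    // anything else becomes a hex escape so the engine treats it literally
    chars( i ) = "\x{" + hex( chars( i ).Asc ) + "}"
  end select
next

// match any whole word that contains the escaped input
dim pattern as string = "\b\S*" + join( chars, "" ) + "\S*"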
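The same idea can be sketched in Python for comparison. This is not a port of the Xojo code: it swaps the hex-escape loop for Python's built-in `re.escape`, which backslash-escapes special characters instead. The function name `build_contains_pattern` and the sample input are hypothetical.

```python
import re

def build_contains_pattern(user_input: str) -> str:
    """Wrap escaped user input so it matches any whole word containing it."""
    # re.escape neutralizes characters such as + . * ( ) that would
    # otherwise mean something to the regex engine.
    return r"\b\S*" + re.escape(user_input) + r"\S*"

# "c++" would be an invalid or misleading pattern if used verbatim.
pattern = build_contains_pattern("c++")
found = re.findall(pattern, "I write c++ and objc++ code")
print(found)  # ['c++', 'objc++']
```

Either approach gets the same result: the user's text is matched literally, whatever characters it contains.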

Jason_King · December 8, 2014, 2:55am

You can also implement a very basic find function like this:

Sub FindNext(source as textarea, searchKey as string, restartFrombeginning as Boolean = False)
  dim txt as String = source.Text
  static lastPosition as Integer = 0
  static lastSearchKey as String
  if lastSearchKey <> searchKey or restartFrombeginning then
    lastPosition = 0
  end if
  lastPosition = txt.InStr(lastPosition+1,searchKey)
  if lastPosition > 0 then
    source.SelStart = lastPosition-1
    source.SelLength = searchKey.Len
  end if
  lastSearchKey = searchKey
End Sub

In a button you can call it like:

FindNext(InputTextArea,FindField.text)
Where InputTextArea is the name of your TextArea, and FindField is the name of the field a user enters a search key into. A replace function would operate in a similar manner.

Bill_Martini · December 8, 2014, 3:28am

Thanks Kem, \S is what I was missing, not sure how I overlooked that, but I tried so many different combinations of things and just couldn’t get traction. I’ve spent quite a few hours on that site already along with some others. Still couldn’t find what I was looking for there, but did get a lot of other bits I needed elsewhere.

I’m sure if I knew more about the subject your app would have been very helpful, but plugging in garbage and getting no results is kind of like the monkeys and typewriters conundrum.

And thanks Jason, I did my initial code in pure RB but the file sizes were too large to be efficient. One file might be just fine, but I’m dealing with dozens of files at a minimum and even shaving a couple of seconds off the process is worth the effort.

Kem_Tekinay · December 8, 2014, 4:07am

In case you hadn’t seen it, RegExRX has an insert menu with all the tokens you can use and an explanation of what each does. It’s the arrow just above the pattern field on the left.

Bill_Martini · December 11, 2014, 6:30pm

Hey Kem,

I have another issue I can’t seem to resolve.

I have a function to clean up a text file by stripping 2 or more spaces and all non-printables, and I need to add stripping of 3 or more consecutive carriage returns. Single and double \r need to be permitted. \s\s\s works fine for cases where there are an even number of \r’s, but strips all if odd. Again, I’ve tried as many combinations as I can think of, and Google hasn’t provided an answer either.

What’s the trick?

Scott_Griffitts · December 11, 2014, 6:37pm

search for \r\r+ replace with \r\r

Scott_Griffitts · December 11, 2014, 6:38pm

or presumably more efficient, search for \r\r\r+ replace with \r\r

Bill_Martini · December 11, 2014, 7:00pm

Thanks Scott,

I’ve tried \r and \s. \s is what I really need, as this function is meant to be used on imported files where I have no control over the content, and it also covers the additional occurrences of line feeds and such.

My replace string is empty, so I need to catch 3 or more \s and replace with an empty string. I’m sure there is a replace pattern that can do this, but I’m only about 48 hours into regex and still trying to solidify the basics in my mind.

Scott_Griffitts · December 11, 2014, 7:10pm

\s is whitespace - could be spaces, tabs, end of line characters

\r is an end of line, a subset of whitespace

sounds like you want to do certain things with certain types of whitespace.

replace 3 or more consecutive end of lines with 2 end of lines (Note: usually best to replace the line endings in your text after you import it):

rg.searchpattern = "\r\r\r+"
rg.replacementpattern = "\r\r"

replace 2 or more spaces with one space:

rg.searchpattern = "  +" ' <--- 2 spaces followed by a plus sign
rg.replacementpattern = " " ' <--- one space
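As a quick check of the two replacements above, here is a minimal Python sketch. It uses Python's `re.sub` rather than Xojo's RegEx class, and the function name `tidy` and the sample string are made up for illustration.

```python
import re

def tidy(text: str) -> str:
    # Three or more consecutive carriage returns become exactly two.
    text = re.sub(r"\r\r\r+", "\r\r", text)
    # Two or more consecutive spaces become a single space.
    text = re.sub(r"  +", " ", text)
    return text

# Five carriage returns (an odd count) and three spaces in one sample.
cleaned = tidy("one   two\r\r\r\r\rthree\r\rfour")
print(repr(cleaned))  # 'one two\r\rthree\r\rfour'
```

Because the replacement puts two carriage returns back, the result is the same whether the original run had an odd or an even count.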

Scott_Griffitts · December 11, 2014, 7:14pm

and if you do really want to replace 3 or more consecutive instances of whitespace with nothing then:

rg.searchpattern = "\s\s\s+"
rg.replacementpattern = ""

Kem_Tekinay · December 11, 2014, 7:15pm

Scott is correct. If they have to be treated differently, you need a different function. His space replacement is fine. For EOL, though, I’d do this:

rx.SearchPattern = "(\R){3,}"
rx.ReplacementPattern = "$1$1"

That will cover all forms of the EOL delimiter.
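Python's `re` module has no \R token, so the following sketch only approximates the pattern above for the three common delimiters (CRLF, lone CR, lone LF). The names `EOL_RUN` and `limit_blank_lines` are hypothetical, and the explicit alternation is an assumption standing in for \R.

```python
import re

# Stand-in for \R: CRLF, a CR not followed by LF, or LF.
# The lookahead keeps a CRLF pair from being counted as two line breaks.
EOL_RUN = re.compile(r"(\r\n|\r(?!\n)|\n){3,}")

def limit_blank_lines(text: str) -> str:
    # Group 1 holds the last line break matched in the run,
    # so the whole run collapses to two copies of it.
    return EOL_RUN.sub(r"\1\1", text)

print(repr(limit_blank_lines("a\r\n\r\n\r\n\r\nb")))  # 'a\r\n\r\nb'
print(repr(limit_blank_lines("a\n\n\nb")))            # 'a\n\nb'
```

Two CRLF pairs in a row are left alone, since they count as two line breaks rather than four characters.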

Bill_Martini · December 11, 2014, 7:49pm

Once more thanks Scott and Kem,

I phrased my question wrong. In my limited testing on varied text documents, \s is what I need to use. I’m sure I’ll find a document that breaks that in the future, so \R is noted.

Scott, \s\s\s+ has the odd/even glitch: it only retains two carriage returns if the number is even. An odd number of carriage returns strips them all.

As I tried to reformulate my question, it seemed that I was seeking a conditional solution. Using Kem’s RegExRX app I came up with this, which seems to work.

(?(?=\s\s\s)\s|)

Is there a pitfall I’m not aware of? I’ll give \R a whirl to see if I get equal or better results.
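To see what that conditional does when the replacement is empty, here is a rough Python sketch. Python's `re` does not support lookahead conditions of the form (?(?=...)...), so this uses a plain lookahead, \s(?=\s\s), which should remove the same characters: any whitespace character that still has two more whitespace characters after it. The names are hypothetical.

```python
import re

# Matches one whitespace character only when two more follow it,
# so the last two characters of any whitespace run survive.
TRIM_RUN = re.compile(r"\s(?=\s\s)")

def trim_whitespace_runs(text: str) -> str:
    return TRIM_RUN.sub("", text)

print(repr(trim_whitespace_runs("a\r\r\rb")))      # 'a\r\rb'  (odd count)
print(repr(trim_whitespace_runs("a\r\r\r\rb")))    # 'a\r\rb'  (even count)
print(repr(trim_whitespace_runs("end. \r\rNext"))) # 'end.\r\rNext'
```

One thing the sketch makes visible: because \s covers spaces and tabs as well as line endings, mixed runs are trimmed too, and a run of three spaces is left as two rather than one.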

Kem_Tekinay · December 11, 2014, 7:54pm

Can you post your complete code? Something isn’t making sense.

Bill_Martini · December 11, 2014, 8:25pm

Kem,

Yeah, you’re right! That segment worked in your app alone, but did nothing when added to my search string.

This is my current search pattern, without the needed CR stripping. I’ve added stripping of characters above 127 and am currently evaluating it for issues. The function is meant to visually clean up a text document without destroying content.

"  +|[\x00-\x09\x0B\x0C\x0E-\x1F\x80-\xFF]" (the first two characters are spaces)

If possible I’d like to keep this as one function with the replacement string empty, which is what I have now. \R\R\R+ works great for conditions where there are even-numbered CRs. There is probably no distinguishable difference in doing it in two searches, but I’d like to keep it to a single search if possible (I understand I have no valid reason for that requirement, other than simplicity, which now seems irrelevant).
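For reference, the cleanup pattern quoted in this post behaves like this in a minimal Python sketch (Python's `re`, not Xojo's engine; the names `CLEANUP` and `clean` and the sample string are made up). The replacement is the empty string, as described.

```python
import re

# Two or more spaces, or any control character except LF (0x0A) and
# CR (0x0D), or any character in the range 0x80-0xFF.
CLEANUP = re.compile(r"  +|[\x00-\x09\x0B\x0C\x0E-\x1F\x80-\xFF]")

def clean(text: str) -> str:
    # Every match is removed outright (empty replacement).
    return CLEANUP.sub("", text)

result = clean("a  b\x00c\td\x85e\r\nf g")
print(repr(result))  # 'abcde\r\nf g'
```

Two details show up in the output: a run of two or more spaces is deleted entirely instead of being reduced to one space, and tabs (0x09) are stripped along with the other control characters.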

scott_boss · December 21, 2014, 2:21am

I have been doing RegEx in Perl since the early 90s (I hate to admit that). I know RegEx, but I still use RegExRX to create my regex strings these days. Not that I can’t do it, but RegExRX just does it much faster and easier, and I can verify the regex strings before applying them to my code (Perl, Xojo, PHP, etc.). It is a gem in my toolbox and worth every penny.