Newline in regex replacement pattern

Scott_Griffitts · March 25, 2022, 11:02pm

I frequently use \n in both Xojo and BBEdit search and replacement patterns without issue. However, on the rare occasions when I use \n in a replacement pattern, save the text to a file, and then open the file in BBEdit, every place I used \n shows the dreaded upside-down question mark, indicating an encoding issue. (Fortunately, BBEdit has a normalize line endings function that cleans it all up easily.) I’ve found that using EndOfLine instead of \n in the replacement pattern does not cause the issue. So my question is: is EndOfLine the one and only correct way of doing this, is there another regex character I should be using, or is the encoding getting messed up somewhere? The encoding, as far as I can tell, is UTF-8 from start to finish. The file is read into Xojo as UTF-8, and BBEdit reports that the file written from Xojo is UTF-8.

Kem_Tekinay · March 25, 2022, 11:33pm

What is the specific replacement pattern you’re using?

Rick_Araujo · March 25, 2022, 11:33pm

It would be great to see an example of content C that, when replaced at EOL by fragment F, ends up as final content FC containing a “dreaded upside down question mark”. Please show C, F and FC.

Scott_Griffitts · March 26, 2022, 12:37am

This is a snippet from a long list of text cleaning. In this code (other than the ReplaceLineEndings line) I used \n instead of EndOfLine and got the encoding issue. Changing it to EndOfLine made it go away.

x = x.ReplaceLineEndings(EndOfLine)

'lots of similar but non-endofline-related code here

x = find_replace("<br .+?>", "<br />" + endofline, x)
x = find_replace("<br>", "<br />" + endofline, x)
x = find_replace("\s+<br />", "<br />", x)
x = find_replace("</p>", "</p>" + endofline, x)
x = find_replace("</p>\s+", "</p>" + endofline + endofline, x)
x = find_replace(endofline + endofline + "+",  endofline + endofline, x)
x = find_replace("<br />" + endofline + endofline + "+", "<br />" + endofline, x)
x = find_replace("<br />\s+", "<br />" + endofline, x)

find_replace is a simple regex function:

Public Function find_replace(f As String, r As String, x As String) As String
  ' cleaner_rg is presumably a RegEx instance defined elsewhere (not shown in this post)
  cleaner_rg.Options.ReplaceAllMatches = True
  
  cleaner_rg.SearchPattern = f
  cleaner_rg.ReplacementPattern = r
  
  Return cleaner_rg.Replace(x)
End Function

Scott_Griffitts · March 26, 2022, 12:43am

As an example of what does and doesn’t cause encoding issues in BBEdit for me:

'good
x = find_replace("</p>", "</p>" + endofline, x)

'bad
x = find_replace("</p>", "</p>\n", x)

Kem_Tekinay · March 26, 2022, 3:20pm

What happens if you do this on the source before replacement?

x = x.ReplaceLineEndings( &uA )

I’m wondering if some of your sources already have Returns in them, and that’s what you’re seeing.

BBEdit will let you examine the character codes of the gremlins too.
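The same check can be roughed out from inside Xojo before the text is ever saved. This is only a sketch: `CountLineEndingChars` is a hypothetical helper name, not something from the thread, and it simply counts how many CR (0D) and LF (0A) characters a string holds.

```xojo
' Hypothetical helper (not from the thread): report how many CR and LF
' characters a string contains, so stray CRs can be spotted early.
Public Function CountLineEndingChars(s As String) As String
  ' Removing a character and comparing lengths gives its number of occurrences.
  Var crCount As Integer = s.Length - s.ReplaceAll(Chr(13), "").Length
  Var lfCount As Integer = s.Length - s.ReplaceAll(Chr(10), "").Length
  
  Return "CR (0D): " + crCount.ToString + ", LF (0A): " + lfCount.ToString
End Function
```

Calling `System.DebugLog(CountLineEndingChars(x))` before and after the replacements would show whether the CRs are already in the source or only appear afterwards.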

Scott_Griffitts · March 26, 2022, 4:02pm

Looking at the problem characters in BBEdit it says they are Hex: 0D.

Edit: I thought x = x.ReplaceLineEndings( &uA ) had fixed the problem, but I was testing it in debug mode. The app interacts with BBEdit via AppleScripts, and one of those sent everything back to the working, compiled (not debug) code.

So it looks like my use of \n in the replacement pattern is adding 0D characters instead of the 0A characters that I apparently want.

Rick_Araujo · March 26, 2022, 7:22pm

If you add EOLs in Windows, you get pairs of <0D><0A> “\r\n”.

If you remove EOL “\n” in Unix-like systems such as Linux or macOS, you may get residual <0D> from Windows.

I usually normalize them all to \n if it’s not necessary to exchange with external Windows apps.

I also learned the hard way that Xojo destroys the integrity of strings pasted into constants. It does not store them as is; I found, for example, that it removes all the linefeeds (\n) and converts them to carriage returns (\r), causing trouble with many features that expect \n there, such as SQL expressions. So I was forced to spend extra machine cycles converting constants to variables to reconstruct the real constant values. You may be affected by this bug (they probably call it a feature) too if you use constants the way I do.
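The normalization described in this post can be sketched as follows. `kQuery` is a hypothetical String constant used only for illustration; `ReplaceLineEndings` is the same method used earlier in the thread.

```xojo
' kQuery is a hypothetical String constant; it is not from the thread.
Var sql As String = kQuery

' Force every line ending (CR, CRLF or LF) to a single LF.
sql = sql.ReplaceLineEndings(Chr(10))
```

Doing this once, at the point where the constant is first read, means the rest of the code only ever sees LF.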

Kem_Tekinay · March 26, 2022, 7:49pm

My theory is that there were already CRs in your text and you were adding a LF in front of them. Since LF+CR isn’t a valid EOL sequence, BBEdit was showing you the stray CR characters.

Try a pattern like this instead:

<br />(?![\x20\t]*\R)
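The theory above can be tested in isolation with a small sketch. The input string is an assumption made up for the test: it deliberately contains a stray CR right after the closing tag. The hex dump then shows which bytes actually follow `</p>` once `\n` has been used in the replacement pattern.

```xojo
' Assumed test input (not from the thread): a CR already follows </p>.
Var source As String = "one</p>" + Chr(13) + "two"

Var rx As New RegEx
rx.Options.ReplaceAllMatches = True
rx.SearchPattern = "</p>"
rx.ReplacementPattern = "</p>\n"

Var result As String = rx.Replace(source)

' Inspect the bytes: an 0A followed by 0D after </p> would support the
' LF+CR theory; a different sequence would point elsewhere.
System.DebugLog(EncodeHex(result, True))
```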

Robert_Livingston · March 26, 2022, 9:19pm

This is a side note, but a related issue is that if you (on a Mac) set the initial content of a TextArea in the IDE to multiple lines, the EOL character will be Chr(13). If, while the application is running, you erase the content of the TextArea and type in exactly the same content that you entered in the IDE, the EOL character will now be Chr(10). User beware.

Scott_Griffitts · March 27, 2022, 12:47am

Well, that kind of works in that there is a return character, but it adds part of the pattern before it. Ex:

<br />some text here

becomes:

<br />(?![ 	]*R)
some text here

That got me to try this before doing any replacements:

x = x.replaceall(chr(13), chr(10))

That didn’t work.

In the end, using EndOfLine instead of \n isn’t a huge deal for me, but I just thought this seemed odd.

Kem_Tekinay · March 27, 2022, 12:50am

I was giving you the search pattern, not the replacement.
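For illustration, here is how that pattern might be plugged into the find_replace function posted earlier in the thread. The replacement string is an assumption modeled on the earlier snippets.

```xojo
' The negative lookahead (?![\x20\t]*\R) makes the pattern match "<br />" only
' when it is NOT already followed by optional spaces/tabs and a line break,
' so a line ending is added only where one is missing.
x = find_replace("<br />(?![\x20\t]*\R)", "<br />" + EndOfLine, x)
```

Because `\R` matches any line-break sequence (CR, LF or CRLF), a `<br />` that already sits in front of a stray CR is left alone rather than having a second line ending added to it.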

Scott_Griffitts · March 27, 2022, 1:02am

Well that was silly of me. Yes, that does seem to work. Now I’ll play with that and see if I can use it to clean everything first then start replacing.