RegEx and Quoted Strings

I have a “poor man’s” Syntax Highlighter… it’s kind of “brute force”, so it is good only for a few thousand characters at most (which is fine for this project).
What it does is apply multiple RegEx statements to set colors,
and it does so in a “priority” order:

' 1) Reset all the text
highlight "*",me.TextColor
' 2) Highlight designated "Keywords"
highlight "(?i)(?:^|(?<= ))("+keywordList+")(?:(?= )|$)",zColorKeywords
' 3) Numbers
highlight "[-+]?[0-9]*\\.?[0-9]+",zColorNumbers
' 4) Mathematic Operators
highlight "(<=|>=|!=|=|>|<|\\\\+|-|\\*|\\^|%|/)",zColorOperators
' 5) Quoted strings (this will reset numbers, keywords INSIDE quotes)
highlight "'.*?'",zColorStrings // single quotes
highlight "\"+chrb(34)+".*?\"+chrb(34),zColorStrings // double quotes
' 6) LINE comments
highlight "(\\/{2}(.*))$",zColorLineComments
' 7) BLOCK Comments
highlight "\\/\\*[\\s\\S]*?\\*\\/",zColorBlockComments

As you can see, there is no “logic”: each RegEx might override the results from a previous one (especially #5).
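A rough sketch of that override behaviour, written in Python rather than Xojo so it can be run anywhere. The `highlight_all` helper, the rule list and the color names are made up for illustration; they only mimic steps 1, 5 and 6 above.

```python
import re

def highlight_all(text, rules):
    """Apply (pattern, color) rules in order; later rules overwrite earlier ones."""
    colors = ["text"] * len(text)              # step 1: reset everything
    for pattern, color in rules:
        for m in re.finditer(pattern, text, re.MULTILINE):
            for i in range(m.start(), m.end()):
                colors[i] = color              # no logic, just overwrite
    return colors

# Hypothetical rules standing in for steps 5 and 6
rules = [
    (r"'.*?'", "string"),     # single quotes
    (r'".*?"', "string"),     # double quotes
    (r"//.*$", "comment"),    # line comments
]

sample = 'url = "http://example.com" // real comment'
colors = highlight_all(sample, rules)

# The // inside the quoted URL is treated as the start of a comment
print(colors[sample.index("//")])
```

With these rules the comment color starts at the `//` inside the URL, which is exactly the situation the question below is about.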

But here is the problem… Comments are // for line and /* */ for block,
but they ONLY count if they are NOT inside single- or double-quoted strings.
Can the RegEx in #6 and #7 be expanded to check that condition?

This pattern shows a technique as applied to #6. In short, it first attempts to match a single-quoted or double-quoted string. If it can match that, it forces the engine to move past it with (*SKIP) and fail the match with (*FAIL). The engine resumes matching at that point, effectively skipping all such quoted strings until it finds the comment.

('|")((?!\\g1).)*\\g1(*SKIP)(*FAIL)|(/{2}(.*))$
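For anyone who wants to experiment outside Xojo: Python's built-in `re` module does not support `(*SKIP)(*FAIL)`, so the sketch below swaps in a different but related technique. It lets the first alternative consume the quoted string and then simply ignores any match where the comment group is empty. The function name and sample text are assumptions for illustration.

```python
import re

# Alternative 1 swallows a whole quoted string; alternative 2 (group 2) is the comment.
LINE_COMMENT = re.compile(r"""(['"])(?:(?!\1).)*\1|(//.*)$""", re.MULTILINE)

def find_line_comments(text):
    """Return the // comments that are not inside a quoted string."""
    return [m.group(2) for m in LINE_COMMENT.finditer(text)
            if m.group(2) is not None]   # skip matches that were only a string

sample = 'a = "http://x" // first\nb = \'no // here\'\nc = 1 // second'
print(find_line_comments(sample))
```

The effect is the same as skipping: once a quoted string has been matched, the scan resumes after its closing quote, so a `//` inside it is never seen as a comment.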

[quote=329874:@Kem Tekinay]('|")((?!\\g1).)*\\g1(*SKIP)(*FAIL)|(/{2}(.*))$[/quote]
so for block it would be

('|")((?!\\g1).)*\\g1(*SKIP)(*FAIL)|(\\/\\*[\\s\\S]*?\\*\\/)
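A quick way to try the block-comment idea, again sketched in Python, whose `re` module has no `(*SKIP)(*FAIL)`; the quoted string is matched as a throwaway alternative instead. The sketch also compares a version with a trailing `$` on the comment alternative against one without, since a block comment can end in the middle of a line. All names here are hypothetical.

```python
import re

STRING = r"""(['"])(?:(?!\1).)*\1"""      # quoted string to be ignored

WITHOUT_ANCHOR = re.compile(STRING + r"|(/\*[\s\S]*?\*/)")
WITH_ANCHOR = re.compile(STRING + r"|(/\*[\s\S]*?\*/)$", re.MULTILINE)

def find_block_comments(pattern, text):
    """Return block comments found outside quoted strings."""
    return [m.group(2) for m in pattern.finditer(text)
            if m.group(2) is not None]

sample = 's = "/* not a comment */" /* real\ncomment */ x = 1'

print(find_block_comments(WITHOUT_ANCHOR, sample))
print(find_block_comments(WITH_ANCHOR, sample))
```

In this sample the anchored version finds nothing, because the only real comment is followed by more code on the same line.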

Am I escaping more than I should?

I haven’t tried it, but it certainly looks right.

Yes, I think you are escaping more than you need. It generally doesn’t matter as long as the character you’re escaping doesn’t have a meaning to the regex engine and the regex engine in question doesn’t specifically disallow it. (PCRE as compiled into Xojo doesn’t care.) You only need to escape these characters (off the top of my head, so I may have missed something):

()[]{}*+?|^$.\\
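As a small check of the "extra escaping is harmless" point, here is a sketch using Python's `re` module rather than PCRE, so it only shows that the same holds there. It compares the block-comment pattern with and without escaped slashes; the sample text is invented.

```python
import re

over_escaped = re.compile(r"\/\*[\s\S]*?\*\/")   # slashes escaped, as in step 7
minimal = re.compile(r"/\*[\s\S]*?\*/")          # only the asterisks escaped

sample = "x = 1 /* note */ y = 2"

found_over = over_escaped.findall(sample)
found_min = minimal.findall(sample)
print(found_over == found_min)
```

Both patterns find the same comment, because `/` has no special meaning to the engine and escaping it changes nothing.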

Thanks… I will let you know how it turns out… :slight_smile:

I am not sure if this is considered improper on this forum (because it is more a Regex question than an answer to Dave’s question), but I have a question about the Tekinay Regex string.

Why isn’t this (seemingly simpler) formulation satisfactory?

('|").*?\\g1(*SKIP)(*FAIL)|(/{2}(.*))$

Basically, why is this required

('|")((?!\\g1).)*\\g1

instead of just

('|").*?\\g1

Not inappropriate at all… RegEx is used extensively within Xojo.
It is just something I have a difficult time wrapping my head around…

As a general rule, you should avoid the .* structure in favor of what you’re truly trying to match. In this case, for example, that pattern should work for a single-line quote (which may be appropriate here) but not for a multi-line one. You’d have to turn on the right regex switch so . could match end-of-line or use some other technique like [\\s\\S]*.

To sum up, you’re right, that would probably work just fine since this is a fairly simple pattern, but as a practice, you should avoid .* and its implications. In this case, we want to match anything that’s not the opening quote character, so that’s how I wrote it.
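The single-line versus multi-line point can be seen in a short Python sketch (Python's `re` here, not Xojo's PCRE, though `.` behaves the same way by default in both). The pattern names and the `first` helper are made up for the example.

```python
import re

lazy = r"""(['"]).*?\1"""                  # the simpler formulation
tempered = r"""(['"])(?:(?!\1).)*\1"""     # anything that is not the opening quote
any_char = r"""(['"])[\s\S]*?\1"""         # lets the string span lines

one_line = "say 'hi there' now"
two_lines = "say 'hi\nthere' now"

def first(pattern, text, flags=0):
    """Return the first match as text, or None if nothing matches."""
    m = re.search(pattern, text, flags)
    return m.group(0) if m else None

print(first(lazy, one_line), first(tempered, one_line))   # same result
print(first(lazy, two_lines))                             # . stops at the line break
print(first(lazy, two_lines, re.DOTALL))                  # switch turned on
print(first(any_char, two_lines))                         # [\s\S] needs no switch
```

On a single line the two formulations agree. Neither crosses a line break by default, since both rely on `.`; only the dot-matches-newline switch or `[\s\S]` changes that.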

FYI… Kem… I haven’t had a chance to get back to this particular issue… my Beta testers found other things for me to focus on first :slight_smile: