a favor? RegEx information [Kem?]

I am in the process of writing a transpiler to convert BASIC-like syntax to Swift for iOS/macOS.
Part of that project is a lexical tokenizer, which will break up a line or block of code into “tokens”, each with an assigned “type”.

Would you be so kind as to look at the following list of RegEx statements and tell me if they could be improved upon? Currently they all seem to work, but I was hoping to make them as efficient as possible :slight_smile:


//set up tokens
zAddPATTERN pEOL        ,"(\\n|\\r)+"
zAddPATTERN pWhitespace ,"(\\s|\\n|\\r)+"
zAddPATTERN pComment    ,"//([^\\n\\r]*)"
zAddPATTERN pComment    ,"'([^\\n\\r]*)"
zAddPATTERN pkeyword, "\\b("+kw+")\\b"
zAddPATTERN pFunct  , "\\b("+fn+")\\b"

zAddPATTERN pOperator   ,"(\\,|\\\\|\\|\\||&&|\\||&|==|<=|>=|!=|=|>|<|\\+|-|\\*|\\^|%|/)"
zAddPATTERN pOpenParend ,"[(]"
zAddPATTERN pCloseParend,"[)]"
zAddPATTERN pFloat      ,"[0-9]*[.][0-9]+|[0-9]+[.][0-9]*"
zAddPATTERN pInteger    ,"[0-9]+"
zAddPATTERN pString     ,"[""]([^""]*)[""]"  
zAddPATTERN pDatatype   ,"string|integer|boolean|double|single"
zAddPATTERN pIdentifier ,"[a-z_][a-z0-9_]*"
zAddPATTERN pCharacter  ,"(?|`|~|!|@|#|$|[|]|{|}|;|:|.)"

The “pXXXX” is a string literal that indicates the token type assigned to whatever matches the pattern.
If you don’t have time or resources, no worries… :slight_smile:

All patterns should be case insensitive…
Note that for pKeyword and pFunct the variables kw and fn are a constructed value matching
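For readers following along, here is a minimal sketch of how an ordered pattern list like the one above can drive a tokenizer. It is written in Python with the standard re module standing in for Xojo's engine, so it is an illustration and not the code from this project. The KW and FN lists, the token names, and the tokenize function are all hypothetical.

```python
import re

# Hypothetical keyword and function lists; the real kw and fn values
# are not shown in the thread.
KW = "if|then|else|end|dim|as"
FN = "len|mid|abs"

# Order matters: earlier alternatives win, so FLOAT precedes INTEGER and
# KEYWORD/FUNCT/DATATYPE precede IDENTIFIER.
TOKEN_SPEC = [
    ("EOL",        r"(?:\r\n|[\r\n])+"),
    ("WHITESPACE", r"[ \t]+"),
    ("COMMENT",    r"(?://|')[^\r\n]*"),
    ("STRING",     r'"[^"]*"'),
    ("KEYWORD",    rf"\b(?:{KW})\b"),
    ("FUNCT",      rf"\b(?:{FN})\b"),
    ("DATATYPE",   r"\b(?:string|integer|boolean|double|single)\b"),
    ("FLOAT",      r"\d*\.\d+|\d+\.\d*"),
    ("INTEGER",    r"\d+"),
    ("IDENTIFIER", r"[a-z_][a-z0-9_]*"),
    ("OPERATOR",   r"\|\||&&|==|<=|>=|!=|[,\\|&=<>+\-*^%/]"),
    ("PAREN",      r"[()]"),
    ("CHARACTER",  r"."),
]

# One combined pattern with a named group per token type.
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.IGNORECASE,
)

def tokenize(source):
    """Return (type, text) pairs, dropping horizontal whitespace."""
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "WHITESPACE":
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

Calling tokenize('Dim x As Integer = 42 // answer') yields keyword, identifier, keyword, datatype, operator, integer, and comment tokens in that order. Because the string pattern consumes a whole literal in one step, a // inside quotes never reaches the comment pattern.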

Will this be through the PCRE engine or something else?

whatever XOJO is using

That’s it. I have thoughts that I’ll share when I’m back at my desk.

Are you learning Portuguese? :smiley: Kem is really the best guy to help with “walking” the code using regex. My approach would be old school: reading bytes into a “word” and identifying it along the way, while maintaining a context (what was the last token, or whatever my proposed language needs to track to output the correct tokens).

Rick… I’m sorry… but not sure any of that made a lick of sense.
I know Kem is the best for this question… and actually this was supposed to have been a private message to him :slight_smile:

My code works just fine… I just wanted his input on potentially fine-tuning those RegEx statements.

I prefer public conversations anyway.

and so it is :slight_smile:

You said “a favor?” and for one moment, in a context where I had answered another question about the same thing, I read it in Portuguese, where it translates to “Do you agree?” :smiley: Funny, no?

That’s why we need to put more context into the topic titles.

A shortcut for EOL is

\\R

which is the same as \\r\\n|[\\r\\n]
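A small sketch of why the two-character alternative comes first in that expansion. Python's re module has no \R shortcut, so the long form is spelled out here; the EOL and NAIVE names are only for illustration.

```python
import re

# CRLF is listed first so that "\r\n" is consumed as one line break.
EOL = re.compile(r"\r\n|[\r\n]")

# A plain character class treats CR and LF as two separate breaks.
NAIVE = re.compile(r"[\r\n]")

text = "one\r\ntwo\nthree\rfour"
lines = EOL.split(text)
naive_lines = NAIVE.split(text)
```

With the ordered alternation, lines holds the four words. The naive version produces an extra empty string between "one" and "two".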

A \\s means “any whitespace”, and that includes EOL.
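To see the effect, here is a short Python sketch (the pattern names are made up). It also shows one common way to match only horizontal whitespace, should the tokenizer need EOL as a separate token: a negated class that excludes non-whitespace and the two line-break characters.

```python
import re

ANY_WS = re.compile(r"\s+")                # includes \r and \n
HORIZONTAL_WS = re.compile(r"[^\S\r\n]+")  # whitespace except CR and LF

source = "a = 1  \r\n  b = 2"

any_runs = ANY_WS.findall(source)
horizontal_runs = HORIZONTAL_WS.findall(source)
```

Here any_runs contains the run that spans the line break, while none of the horizontal_runs entries contain a CR or LF.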

To match a comment, you could just do //.* since, unless you change it, the dot does not match an EOL character. Your pattern will grab the EOL too. But since you want to avoid something like // in a string, I suggest a pattern like this:

"[^"]*"(*SKIP)(*FAIL)|(//|'|\\bREM\\b).*

That will skip over any strings and grab any type of comment (at least, a Xojo-style comment).

You can use that technique to skip over any strings.
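The (*SKIP)(*FAIL) verbs are PCRE features. For engines without them, a swapped-in equivalent is to let the first alternative consume the string and keep only matches where the capture group for the comment is set. Below is a sketch of that substitute in Python; find_comments is a hypothetical helper, not something from Xojo.

```python
import re

# Alternative 1 swallows a whole string literal without capturing it.
# Alternative 2 captures a comment up to (not including) the line break.
STRING_OR_COMMENT = re.compile(
    r'"[^"]*"' r"|((?://|'|\bREM\b)[^\r\n]*)",
    re.IGNORECASE,
)

def find_comments(line):
    """Return comments in line, ignoring comment markers inside strings."""
    return [
        m.group(1)
        for m in STRING_OR_COMMENT.finditer(line)
        if m.group(1) is not None
    ]
```

For example, find_comments('print "a // b" // real') returns only the trailing comment, because the quoted text is consumed before the comment alternative gets a chance to see it.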

When matching a single character, like an open paren, you don’t need to put square brackets around it.

You don’t need to put a backslash before every symbol, just the ones RegEx would consider a token. For example, the vertical bar or asterisk would need a backslash, but not a comma.
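One way to avoid hand-escaping altogether is to keep the operators as a plain list and let the library escape them. This is a Python sketch under that assumption; the OPERATORS list is copied from the pOperator pattern, and re.escape is Python's helper, not something Xojo provides.

```python
import re

OPERATORS = [",", "\\", "||", "&&", "|", "&", "==", "<=", ">=", "!=",
             "=", ">", "<", "+", "-", "*", "^", "%", "/"]

# Longest operators first so "<=" is tried before "<" and "||" before "|".
# re.escape handles the backslashes for the metacharacters.
OPERATOR = re.compile(
    "|".join(re.escape(op) for op in sorted(OPERATORS, key=len, reverse=True))
)

found = OPERATOR.findall("a<=b||c")
```

Sorting by length matters in an alternation because the first alternative that matches wins, regardless of whether a longer one would also match.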

Consider this for data type:

\\b(string|integer|boolean|double|single)\\b
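A quick Python illustration of what the word boundaries buy (the BARE and BOUNDED names are only for this sketch): without them, the pattern also fires inside longer identifiers.

```python
import re

BARE = re.compile(r"string|integer|boolean|double|single", re.IGNORECASE)
BOUNDED = re.compile(r"\b(string|integer|boolean|double|single)\b", re.IGNORECASE)

bare_hit = BARE.search("doubleClick")        # matches the "double" prefix
bounded_hit = BOUNDED.search("doubleClick")  # no match
real_hit = BOUNDED.search("Dim s As String")
```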

I don’t think your pCharacter pattern will work at all because the tokens are not escaped.

A shortcut for a digit is \\d, so an integer is simply \\d+

The pattern for identifier ignores Unicode characters. Consider:

\\b[\\pL_][\\pL_\\d]*

Thanks… some questions for my education :slight_smile:

[quote=466832:@Kem Tekinay]
I don’t think your pCharacter pattern will work at all because the tokens are not escaped.

A shortcut for integer is simply \\d

The pattern for identifier ignores Unicode characters. Consider:

\\b[\\pL_][\\pL_\\d]* [/quote]

pCharacter “seems” to work… but you are suggesting escaping each one?
what the heck does the above pattern mean/do? compared to the one I currently have?

fyi… most of those I got from other sources :slight_smile:

if \d can be used for integer… would ^\d*\.?\d* be better for float?
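One thing worth checking before adopting that float pattern: every part of it is optional, so it can succeed without consuming anything. A small Python sketch comparing it with the original pFloat pattern rewritten with \d (the LOOSE and STRICT names are only for illustration):

```python
import re

LOOSE = re.compile(r"^\d*\.?\d*")
STRICT = re.compile(r"\d*\.\d+|\d+\.\d*")

empty_hit = LOOSE.match("abc")        # succeeds with an empty match
loose_int = LOOSE.fullmatch("42")     # plain integers match too
strict_int = STRICT.fullmatch("42")   # no decimal point, no match
```

The strict form requires a decimal point and at least one digit on one side of it, so "3.14", ".5", and "5." match while "42" and a lone "." do not.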

got one to match HEX numbers? [0xFE3a5]?

The pattern is this:

(?|`|~|!|@|#|$|[|]|{|}|;|:|.)

That will match, for example, the end of a line, but not a literal “$”, because the “$” was not escaped. It will match a bar, but not the square brackets, because the bar is between the square brackets, and they are not escaped. Finally, it will match any single character, including letters and numbers, because it includes an unescaped dot.

I think what you want is this:

[`~!@#$\\[\\]{};:.]
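The difference is easy to check. In this Python sketch the original alternation is reproduced without its leading "?", since Python's re module rejects "(?|"; that change is an assumption made only so the pattern compiles here.

```python
import re

# Original alternation, minus the leading "?" (see note above).
LOOSE = re.compile(r"(`|~|!|@|#|$|[|]|{|}|;|:|.)")

# Character class version: every symbol inside is taken literally,
# apart from the escaped square brackets.
CHARACTER = re.compile(r"[`~!@#$\[\]{};:.]")

loose_letter = LOOSE.fullmatch("x")       # matches through the bare dot
class_letter = CHARACTER.fullmatch("x")   # no match
```

Inside a character class the dot and dollar sign lose their special meaning, which is why only the brackets need a backslash.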

The \p construct is Unicode-aware and is followed by a script or property. \pL means any letter in any language.
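To illustrate the difference, here is a Python sketch. Python's re module does not support \pL, so this uses the close approximation [^\W\d] ("a word character that is not a digit"), which covers letters and underscore but is slightly broader than \pL for a few unusual characters. The pattern names are made up for the example.

```python
import re

# Approximation of \b[\pL_][\pL_\d]* using Python's Unicode-aware \w.
IDENTIFIER = re.compile(r"\b[^\W\d]\w*")

# The original ASCII-only identifier pattern.
ASCII_ONLY = re.compile(r"[a-z_][a-z0-9_]*", re.IGNORECASE)

unicode_names = IDENTIFIER.findall("größe = café_2 + 9lives")
ascii_prefix = ASCII_ONLY.match("größe").group()
```

The ASCII pattern stops at the first accented letter, while the Unicode-aware one keeps the whole name and still refuses to start an identifier on a digit.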

Hex:

[[:xdigit:]]+
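On its own, that class also matches plain decimal numbers and words like "face", so a tokenizer will probably want to require the prefix. A Python sketch assuming the 0x prefix from the example in the question; Python's re module has no [[:xdigit:]], so the class is written out.

```python
import re

# Assumes hex literals always start with "0x"; [0-9a-f] with IGNORECASE
# stands in for [[:xdigit:]].
HEX = re.compile(r"0x[0-9a-f]+", re.IGNORECASE)

m = HEX.fullmatch("0xFE3a5")
value = int(m.group(), 16)
```

In an ordered pattern list this would need to come before the integer pattern, otherwise the leading 0 is taken as an integer first.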