I am in the process of writing a transpiler to convert BASIC-like syntax to Swift for iOS/macOS.
Part of that project is a lexical tokenizer, which will break up a line or block of code into “tokens”, each with an assigned “type”.
Would you be so kind as to look at the following list of RegEx statements and tell me if they could be improved upon? Currently they all seem to work, but I was hoping to make them as efficient as possible.
Are you learning Portuguese? Kem is really the best guy to help with “walking” the code using regex. My approach would be old school: reading bytes into a “word” and identifying it along the way, while maintaining a context (what was the last token, or whatever else my proposed language needs in order to output the correct tokens).
Rick… I’m sorry… but I’m not sure any of that made a lick of sense.
I know Kem is the best for this question… and actually this was supposed to have been a private message to him.
My code works just fine… I just wanted his input on potentially fine-tuning those RegEx statements.
You said “a favor?” and for one moment, since I had just answered another question about the same thing, I read it in Portuguese, where it translates to “Do you agree?” Funny, no?
A \s means “any whitespace”, and that includes EOL.
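To see the point about \s, here is a minimal sketch in Python (chosen only because it is easy to run; Xojo uses PCRE, but \s behaves the same way in this case). The sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample: a space, a tab, and an EOL between letters.
sample = "a b\tc\nd"

# \s matches every whitespace character, the newline included.
all_whitespace = re.findall(r"\s", sample)

# A class such as [ \t] matches only spaces and tabs, so it stops at the EOL.
horizontal_only = re.findall(r"[ \t]", sample)

print(all_whitespace)
print(horizontal_only)
```

If a token pattern should never run past the end of a line, a class like the second one is one way to keep it there.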
To match a comment, you could just use //.* since, unless you change that setting, the dot does not match an EOL character. Your pattern will grab the EOL too. But since you want to avoid something like // in a string, I suggest a pattern like this:
"[^"]*"(*SKIP)(*FAIL)|(//|'|\bREM\b).*
That will skip over any strings and grab any type of comment (at least, a Xojo-style comment).
You can use that technique to skip over any strings.
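For anyone who wants to try the idea outside Xojo, here is a rough sketch in Python. Python's built-in re module does not support the (*SKIP)(*FAIL) verbs, so this swaps in a different technique with the same effect: match strings first, and keep only the matches where the comment group took part. The function name, the sample lines, and the case-insensitive flag are my assumptions, not part of the original pattern.

```python
import re

# Left alternative: a quoted string, matched only so it can be thrown away.
# Right alternative (group 1): a //, ' or REM comment running to end of line.
# IGNORECASE is an assumption, on the grounds that BASIC keywords are
# usually case-insensitive.
COMMENT = re.compile(r'''"[^"]*"|((?://|'|\bREM\b).*)''', re.IGNORECASE)

def find_comment(line):
    """Return the comment part of a line, or None if there is none."""
    for match in COMMENT.finditer(line):
        if match.group(1) is not None:  # group 1 only fires on a comment
            return match.group(1)
    return None

print(find_comment('url = "http://example.com" // real comment'))
print(find_comment('msg = "it\'s fine" \' trailing note'))
print(find_comment("REMOVE = 1"))
```

The // and the apostrophe inside the strings are consumed by the left alternative, so they never get a chance to start a comment. REMOVE is left alone because of the \b after REM.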
When matching a single character, like an open paren, you don’t need to put square brackets around it.
You don’t need to put a backslash before every symbol, just the ones RegEx would consider a token. For example, the vertical bar or asterisk would need a backslash, but not a comma.
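A quick way to check the escaping rules is to try them on a throwaway string. This is only a sketch in Python, with a made-up sample, but the rules shown are the same in PCRE.

```python
import re

sample = "f(a,b) | c*d"

# A single literal character needs no square brackets: \( and [(] are equivalent.
paren_escaped = re.findall(r"\(", sample)
paren_in_class = re.findall(r"[(]", sample)

# A comma has no special meaning, so it needs no backslash.
commas = re.findall(r",", sample)

# The bar and the asterisk are metacharacters and must be escaped.
bars = re.findall(r"\|", sample)
stars = re.findall(r"\*", sample)

# Left unescaped, a bar means "empty OR empty", which matches the empty
# string at every position instead of matching a literal bar.
unescaped_bar = re.findall(r"|", "a|b")

print(paren_escaped, paren_in_class, commas, bars, stars, unescaped_bar)
```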
Consider this for data type:
\b(string|integer|boolean|double|single)\b
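The \b on each side is what keeps the pattern from firing inside longer words. A small Python sketch to illustrate; the helper name, the sample lines, and the case-insensitive flag are assumptions on my part.

```python
import re

# IGNORECASE is an assumption, since BASIC-style keywords are usually
# case-insensitive.
DATA_TYPE = re.compile(r"\b(string|integer|boolean|double|single)\b", re.IGNORECASE)

def data_types(code):
    """Return every data type keyword found in a piece of code."""
    return [match.group(1) for match in DATA_TYPE.finditer(code)]

print(data_types("Dim count As Integer"))
print(data_types("Dim doubled As Double"))
print(data_types("singleton strings"))
```

In the second line only the real keyword is reported: "doubled" starts with "double", but the trailing \b fails because another word character follows.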
I don’t think your pCharacter pattern will work at all because the tokens are not escaped.
A shortcut for integer is simply \d
The pattern for identifier ignores Unicode characters. Consider:
\b[\pL_][\pL_\d]*
[quote=466832:@Kem Tekinay]
I don’t think your pCharacter pattern will work at all because the tokens are not escaped.
A shortcut for integer is simply \d
The pattern for identifier ignores Unicode characters. Consider:
\b[\pL_][\pL_\d]*
[/quote]
pCharacter “seems” to work… but you are suggesting escaping each one?
And what the heck does the above pattern mean/do, compared to the one I currently have?
That will match, for example, an EOL but not the “$” because it was not escaped. It will match a bar, but not the square brackets, because the bar is between the square brackets, and they are not escaped. Finally, it will match any single character, including letters and numbers, because it includes an unescaped dot.
I think what you want is this:
[`~!@#$\[\]{};:.]
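Here is a small Python sketch showing how that character class behaves. The constant name just mirrors pCharacter and the sample line is invented.

```python
import re

# Inside a character class only the square brackets need a backslash here.
# The dot and the dollar sign are literal inside [...].
P_CHARACTER = re.compile(r"[`~!@#$\[\]{};:.]")

sample = "a[1].b$ = {x};"
found = P_CHARACTER.findall(sample)

print(found)
print(P_CHARACTER.fullmatch("a"))  # letters no longer match
```

Because the dot now sits inside the class, it matches only a literal period rather than any character.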
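The \p construct is Unicode-aware and is followed by a script or property. \pL means any letter in any language. So the identifier pattern above reads: a word boundary, then a letter or underscore, then any number of letters, underscores, or digits.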
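To illustrate the difference, here is a rough Python sketch. Python's built-in re module does not support \pL, so this uses [^\W\d] ("a word character that is not a digit") as a stand-in for [\pL_], and \w for the rest. That is close to, but not exactly, the PCRE pattern, since Python's \w also accepts digits from other scripts. The ASCII-only pattern is a guess at what a typical non-Unicode identifier pattern looks like, not the one from the original post.

```python
import re

# Stand-in for \b[\pL_][\pL_\d]* (an approximation, see the note above).
UNICODE_IDENTIFIER = re.compile(r"\b[^\W\d]\w*")

# Hypothetical ASCII-only identifier pattern, for comparison.
ASCII_IDENTIFIER = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*")

sample = "Dim größe As Integer, 9lives, _tmp"

unicode_aware = UNICODE_IDENTIFIER.findall(sample)
ascii_only = ASCII_IDENTIFIER.findall("größe")

print(unicode_aware)
print(ascii_only)
```

The Unicode-aware version keeps "größe" in one piece and skips "9lives" because an identifier cannot start with a digit, while the ASCII-only version stops at the first non-ASCII letter.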