Trouble with tokens

I am using the Parser which can be downloaded from http://www.charcoaldesign.co.uk/source/realbasic.

I am not sure whether this is a deliberate limitation, a bug, or something wrong with my code, but I had been adding tokens to the parser fine until I found out that the number of tokens I can create has a fairly small limit. If I add fewer tokens, all of them are added successfully. Of course there is a limit to everything, but this seems like a strange limitation to me; the limit is around 47 tokens. Here is the code I used to attempt to add the tokens:

  eof = false
  position = 1
  
  Tokenizer1 = new Tokenizer
  Parser1 = new Parser
  
  //add tokenizer1's event handlers
  AddHandler Tokenizer1.EOF, AddressOf EOF
  AddHandler Tokenizer1.TokenMatched, AddressOf TokenMatched
  AddHandler Tokenizer1.UnexpectedToken, AddressOf UnexpectedToken
  //add parser1's event handlers
  AddHandler Parser1.FetchToken, AddressOf FetchToken
  AddHandler Parser1.Finished, AddressOf Finished
  AddHandler Parser1.Reduce, AddressOf Reduce
  
  //setup tokens
  Tokenizer1.AddTokenType "end_line","\\r",""
  //not compiled
  Tokenizer1.AddTokenType "whitespace","[ \\n\\r]",""
  Tokenizer1.AddTokenType "comment","//([^\\n\\r]*)","\\1"
  //end
  Tokenizer1.AddTokenType "end","end if",""
  Tokenizer1.AddTokenType "end","wend",""
  Tokenizer1.AddTokenType "end","end while",""
  Tokenizer1.AddTokenType "end","next",""
  Tokenizer1.AddTokenType "end","end for",""
  Tokenizer1.AddTokenType "end","end select",""
  Tokenizer1.AddTokenType "end","end",""
  //unwanted (this comment is only relevant to me - just saying that I will be removing this in the future)
  Tokenizer1.AddTokenType "dim","dim",""
  Tokenizer1.AddTokenType "as","as",""
  Tokenizer1.AddTokenType "print","print",""
  //select case
  //works fine if I comment any three tokens out but I DON'T want to have to limit myself
  'Tokenizer1.AddTokenType "case else","case else",""
  'Tokenizer1.AddTokenType "case","case",""
  'Tokenizer1.AddTokenType "select","select case",""
  Tokenizer1.AddTokenType "case","case",""
  //break
  Tokenizer1.AddTokenType "break","exit",""
  Tokenizer1.AddTokenType "continue","continue",""
  Tokenizer1.AddTokenType "return","return",""
  //if
  Tokenizer1.AddTokenType "else if","elseif",""
  Tokenizer1.AddTokenType "else if","else if",""
  Tokenizer1.AddTokenType "if","if",""
  Tokenizer1.AddTokenType "then","then",""
  Tokenizer1.AddTokenType "else","else",""
  //for
  Tokenizer1.AddTokenType "for","for",""
  Tokenizer1.AddTokenType "step","step",""
  Tokenizer1.AddTokenType "to","to",""
  Tokenizer1.AddTokenType "downto","downto",""
  //while
  Tokenizer1.AddTokenType "while","while",""
  //boolean
  Tokenizer1.AddTokenType "true","true",""
  Tokenizer1.AddTokenType "false","false",""
  //maths operators
  Tokenizer1.AddTokenType "+","[+]",""
  Tokenizer1.AddTokenType "-","[-]",""
  Tokenizer1.AddTokenType "*","[*]",""
  Tokenizer1.AddTokenType "/","[/]",""
  //comparision operators
  Tokenizer1.AddTokenType "<=","<=",""
  Tokenizer1.AddTokenType ">=",">=",""
  Tokenizer1.AddTokenType "<>","<>",""
  Tokenizer1.AddTokenType ">","[>]",""
  Tokenizer1.AddTokenType "<","[<]",""
  Tokenizer1.AddTokenType "==","==",""
  Tokenizer1.AddTokenType "=","[=]",""
  //
  Tokenizer1.AddTokenType "(","[(]",""
  Tokenizer1.AddTokenType ")","[)]",""
  Tokenizer1.AddTokenType "integer_literal","[0-9]+"
  Tokenizer1.AddTokenType "real_literal","[0-9]*[.][0-9]+|[0-9]+[.][0-9]*"
  Tokenizer1.AddTokenType "string_literal","[""]([^""]*)[""]","\\1"
  Tokenizer1.AddTokenType "type","string|integer|boolean"
  Tokenizer1.AddTokenType "identifier","[a-z_][a-z0-9_]*"
  //set up operands
  Parser1.AddStructureType "operand","identifier",1000
  Parser1.AddStructureType "operand","string_literal",1000
  Parser1.AddStructureType "operand","real_literal",1000
  Parser1.AddStructureType "operand","integer_literal",1000
  Parser1.AddStructureType "operand","expression",1003
  Parser1.AddStructureType "operand","( operand )",1001
  //set up prefix expressions
  Parser1.AddStructureType "expression","- operand",1002
  //set up infix expressions
  Parser1.AddStructureType "expression","operand + operand",2
  Parser1.AddStructureType "expression","operand - operand",2
  Parser1.AddStructureType "expression","operand * operand",3
  Parser1.AddStructureType "expression","operand / operand",3
  Parser1.AddStructureType "expression","operand <= operand",5
  Parser1.AddStructureType "expression","operand >= operand",5
  Parser1.AddStructureType "expression","operand <> operand",6
  Parser1.AddStructureType "expression","operand == operand",6,false
  Parser1.AddStructureType "expression","operand < operand",4
  Parser1.AddStructureType "expression","operand > operand",4
  Parser1.AddStructureType "assignment","operand = operand",1,false
  //declarations
  Parser1.AddStructureType "declaration","dim operand as type",-1
  Parser1.AddStructureType "declaration","dim operand as type = operand",-1
  //statements
  Parser1.AddStructureType "statement","print operand",-1
  Parser1.AddStructureType "statement","if operand then program end",-1
  Parser1.AddStructureType "statement","if operand then program else program end",-1
  //loops
  Parser1.AddStructureType "loop","for assignment to operand program next",-1
  Parser1.AddStructureType "loop","for assignment downto operand program next",-1
  //program
  Parser1.AddStructureType "program","declaration",-1
  Parser1.AddStructureType "program","statement",-1
  Parser1.AddStructureType "program","loop",-1
  Parser1.AddStructureType "program","program program",0
  
  Parser1.Parse

There is more code in this method, but I tried to reduce it down to only what is (as far as I can tell) relevant to the problem.

Thanks

Is there a reason you are trying to use that parser instead of XojoScript?

I’ve used this class before and never had a size limit problem. I tested your code by duplicating Window1 and pasting your code into the Open event, and also commented out the lines starting “Tokenizer1 =” through to all the AddHandlers.

It’s been a long time but it appears the order added matters, or maybe it’s the pattern. Move the 3 lines you marked to the end of the AddTokenType list and it appears to work. Also note there’s a duplicate “case” token that might be interfering.

I think maybe the problem isn’t in the Tokenizer but in the Parser matching tokens. I refactored a non-Event based version of these classes to better see and understand how the Shift-Reduce algorithm works which helped but I never got it discriminating between minus and negate. Try deconstructing the code with lots of breakpoints and find where it’s mismatching a pattern.
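To illustrate why the order the tokens are added in can matter: a first-match tokenizer tries its patterns in registration order, so a shorter keyword listed before a longer one that shares its prefix (say "case" before "case else") shadows the longer one. A quick sketch in Python rather than Xojo, since the effect is language-independent; `tokenize` here is a stand-in for demonstration, not the CharcoalDesign class:

```python
import re

def tokenize(source, rules):
    """Minimal first-match tokenizer: at each position, try the rules in
    the order they were added; the first regex that matches wins."""
    tokens, pos = [], 0
    while pos < len(source):
        for name, pattern in rules:
            m = re.match(pattern, source[pos:])
            if m and m.group(0):
                tokens.append((name, m.group(0)))
                pos += len(m.group(0))
                break
        else:
            raise ValueError("no rule matches at position %d" % pos)
    return tokens

# "case" listed before the longer keyword: "case else" never gets a chance
rules_bad = [("ws", r" +"), ("case", r"case"),
             ("identifier", r"[a-z_][a-z0-9_]*")]
# longer keyword listed first: "case else" matches as one token
rules_good = [("ws", r" +"), ("case_else", r"case else"),
              ("case", r"case"), ("identifier", r"[a-z_][a-z0-9_]*")]

print(tokenize("case else", rules_bad))   # [('case', 'case'), ('ws', ' '), ('identifier', 'else')]
print(tokenize("case else", rules_good))  # [('case_else', 'case else')]
```

Whether the CharcoalDesign Tokenizer behaves exactly like this internally is an assumption, but it would explain why moving those lines around changes which tokens get matched.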

[quote=92116:@Will Shank]I’ve used this class before and never had a size limit problem. I tested your code by duplicating Window1 and pasting your code into the Open event, and also commented out the lines starting “Tokenizer1 =” through to all the AddHandlers.

It’s been a long time but it appears the order added matters, or maybe it’s the pattern. Move the 3 lines you marked to the end of the AddTokenType list and it appears to work. Also note there’s a duplicate “case” token that might be interfering.

I think maybe the problem isn’t in the Tokenizer but in the Parser matching tokens. I refactored a non-Event based version of these classes to better see and understand how the Shift-Reduce algorithm works which helped but I never got it discriminating between minus and negate. Try deconstructing the code with lots of breakpoints and find where it’s mismatching a pattern.[/quote]
Thanks. Where am I adding breakpoints?

Yes. As far as I am aware, XojoScript does not give you a list of parsed tokens. I need a lot of power and flexibility over the parsing.

XojoScript parses and compiles Xojo code. What are you trying to parse?

My own programming language. It is then converted to JavaScript.

[quote=92116:@Will Shank]I’ve used this class before and never had a size limit problem. I tested your code by duplicating Window1 and pasting your code into the Open event, and also commented out the lines starting “Tokenizer1 =” through to all the AddHandlers.

It’s been a long time but it appears the order added matters, or maybe it’s the pattern. Move the 3 lines you marked to the end of the AddTokenType list and it appears to work. Also note there’s a duplicate “case” token that might be interfering.

I think maybe the problem isn’t in the Tokenizer but in the Parser matching tokens. I refactored a non-Event based version of these classes to better see and understand how the Shift-Reduce algorithm works which helped but I never got it discriminating between minus and negate. Try deconstructing the code with lots of breakpoints and find where it’s mismatching a pattern.[/quote]
I don’t understand how you changed the code to get it to ‘appear to’ work? Thanks

I’m going by the printout of pattern (or is it just token?) matches below the sample code, where it shows comment, new_line, identifier, etc. With those 3 lines you indicated as the problem commented out, the first few lines listed match the code. With those 3 lines uncommented, the printout skips several things; I was mostly focused on the identifier (i) that should come after finding “for”. Moving those 3 lines to the bottom makes the printout look more normal (I mean identifiers appear again), but I didn’t check it all the way through for every part of the sample code.

As for placing breakpoints, that’ll be a challenge. I was thinking you’d start by establishing the first pattern that the parser misses. Then you’ll need to investigate the code and place breakpoints, maybe write special-case breaks, to narrow down when the parser should be identifying those patterns. I think the i in “for i” is the first thing being missed with your original code. I’d try to get a break at the point in the parser where it should be matching that i (but isn’t) and then investigate the variables at play to see why it’s not happening.
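While hunting for the mismatch, one pattern-level cause worth ruling out: keyword regexes without word boundaries will swallow the start of an identifier (for example, a bare `for` pattern matches the first three letters of `format`). A quick check, sketched in Python for illustration; `first_match` is a hypothetical helper, not part of the library:

```python
import re

def first_match(source, rules):
    """Return the name and text of the first listed rule that matches
    at the start of source, mimicking an in-order regex tokenizer."""
    for name, pattern in rules:
        m = re.match(pattern, source)
        if m:
            return (name, m.group(0))
    return None

# Without a word boundary, the "for" keyword swallows the start of "format"
unbounded = [("for", r"for"), ("identifier", r"[a-z_][a-z0-9_]*")]
print(first_match("format = 1", unbounded))  # ('for', 'for')

# With \b the keyword only matches a whole word, so the identifier wins
bounded = [("for", r"for\b"), ("identifier", r"[a-z_][a-z0-9_]*")]
print(first_match("format = 1", bounded))    # ('identifier', 'format')
```

If the real Tokenizer matches in the same first-match style, adding `\b` (or an equivalent boundary) after each keyword pattern would be worth trying.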

I have really not been able to fix this problem still! :frowning:

Could I use some kind of program to generate the tokens for me, so that they are valid?

Wow! You have developed your Xojo skills Oliver ! :slight_smile: Kudos!

Thanks. This was a while ago and I am no expert with the parser. Not at all.

I have already played around with it enough to realise my potential to create a programming language. However I cannot get round this issue.

If you wish, you can subscribe to my blog:
http://blog.powermodegames.com

Thanks for the compliment anyway. :slight_smile:

For sure! :slight_smile:

Thanks

I’ve never had a problem with the Tokenizer nor the Parser (save adding proper structures to distinguish negation vs subtraction).

The order of items added and duplicate items might make a difference. It’s difficult to diagnose. Can you post a project and describe the problem?

[quote=163466:@Will Shank]I’ve never had a problem with the Tokenizer nor the Parser (save adding proper structures to distinguish negation vs subtraction).

The order of items added and duplicate items might make a difference. It’s difficult to diagnose. Can you post a project and describe the problem?[/quote]
I may post a project. Basically, the problem is that identifiers are being passed off as ‘end_line’.

Thanks

[quote=163466:@Will Shank]I’ve never had a problem with the Tokenizer nor the Parser (save adding proper structures to distinguish negation vs subtraction).

The order of items added and duplicate items might make a difference. It’s difficult to diagnose. Can you post a project and describe the problem?[/quote]
I don’t understand the structures. Is there something I can read upon to learn about that?

Thanks

It’s a Shift-Reduce Parser with precedence (and single look-ahead if I remember correctly).

The charcoal design parser is event based: when you feed in source, the tokenizer passes tokens to the parser as it finds them (through events), and the parser produces its matches as it finds them (through events). I find it easier to follow without events, so the source is passed to the Tokenizer, which returns an array of all Tokens; that array is then passed to the parser, which returns a full AST (abstract syntax tree), basically the tokens linked up hierarchically.

The way the parser works is that it ‘shifts’ in a token and then scans its list of structures for a match. If there’s no match, it shifts in the next token and checks again. When a match is found, it ‘reduces’ the matching tokens into a single one. It’s a little more complicated than that, because it’s also testing the precedence of structures. I forget the details, but I think that’s what the look-ahead token is for: it matches 2 structures and reduces the one with higher precedence, or something like that.

My terminology is rusty, maybe someone else can clarify better.
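The shift-reduce loop described above can be sketched roughly like this (Python for illustration; greatly simplified, with no precedence or look-ahead, so it is not a faithful model of the CharcoalDesign class):

```python
def shift_reduce(tokens, rules):
    """Bare-bones shift-reduce sketch: shift one token onto the stack,
    then reduce while any rule's right-hand side matches the stack top."""
    stack = []
    for token in tokens:
        stack.append(token)               # shift
        changed = True
        while changed:                    # reduce greedily
            changed = False
            for lhs, rhs in rules:
                if len(stack) >= len(rhs) and stack[-len(rhs):] == list(rhs):
                    del stack[-len(rhs):]
                    stack.append(lhs)     # reduce rhs -> lhs
                    changed = True
                    break
    return stack

# A tiny grammar in the spirit of the AddStructureType calls above
grammar = [
    ("operand", ("integer_literal",)),
    ("expression", ("operand", "+", "operand")),
]

print(shift_reduce(["integer_literal", "+", "integer_literal"], grammar))
# ['expression']
```

The real class defers some reductions using precedence and the look-ahead token instead of reducing greedily like this, which is exactly the part this sketch leaves out.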

[quote=163473:@Will Shank]ed in source the tokenizer passes tokens to the parser as it finds them (through events) and the parser produces it’s matches as it finds them (through events). I find it easier to follow without events so the source is passed to the Tokenizer and it returns an array of all Tokens, then that array is passed to the parser which returns a full AST (abstract syntax tree), basically the tokens linked up hierarchically.

The way the parser works is it ‘shifts’ in a token and then scans it’s list of structures for a match. If there’s no match it shifts in the next token and checks again. When a match is found it ‘Reduces’ those mat[/quote]
Thanks.