LAPG ( Lexical analyzer and parser generator ) Evgeny Gryaznov, 2002-06, inspirer@inbox.ru 1. Introduction Lapg is the combined lexical analyzer and parser generator, which converts a description for a context-free LALR1 grammar into source file to parse the grammar. The generated parser accepts zero-terminated text (C/C++) or string (in C#, Javascript etc.) breaks it into tokens and applies given rules to reduce the input to the main non-terminal symbol. Lapg can be used in a wide range of tasks. It may be a simple string parser or a complicated compiler like C++. The official homepage is http://gryaznov.net/lapg/ SourceForge: http://sourceforge.net/projects/lapg/ LAPG is distributed under the General Public License. 2. Invoking By default lapg assumes that the grammar is stored in file with name syntax while the output name depends on target language and usually begins with word parse. usage: lapg [OPTIONS]... [inputfile [outputfile]] The content of output file is fully depends on the input, lapg has no command line options which affects the generation process. The only thing you can do from command line is to force generation of debug information. It is very important for detecting lalr1 conflicts and is discussed later. lapg -d [inputfile [outputfile]] OR lapg --debug [inputfile [outputfile]] In this mode lapg will generate file 'debug' in the current dirrectory. use 'lapg --help' for more details 3. Supported languages 3.1 C++ (default) For each parser engine you want to use in your program you need to have a class. Lapg takes a parser-description file and generates a separate .cpp file, which contains definition for the function: int yourclass::parse(); 3.2 C# In the case of C# language, the generated .cs file contains one class. You can define additional members for this class in grammar file. 3.3 Javascript (from Pawel Chmielowski) Generates class in separate .js file. 3.4 Java Generates java class in specified package (use .namespace directive). 3.5 Other languages. You can easily add a new language to lapg. You need to add language specific template and (optionaly) default syntax file, then modify srcgen.cpp to add one line about your language to 'langs' static array. Recompile lapg and enjoy. Template samples are in files: templ_cpp, templ_cs, or can be obtained by invoking 'lapg --verbose'. 4. Syntax of parser-description file Comments begin from the '#' symbol and end at the end of line. Spaces, tabs and new-lines are used as separators. 4.1 Lexical conventions: Identificator: is either a C-like identifier or some string in single quotes (ex: '::=') Type: is some string in parentheses (ex: (char*) ) Cmd: is one of the following 1. the rest of line after the \ 2. the whole line, beginning with two tabs. (I like this) 3. text in curly braces, which DOES NOT contain any curly brace. Two Cmds following one by one, will be concatenated. RegExp: is perl-like regular expression in // The following metacharacters have their special meaning: . Match any character (except newline) | Alternation () Grouping [] Character class [^..] not from Character class \ Quote the next character The following escaped characters have their special meaning: \a \b \f \n \r \t \v The following standart qualifiers are implemented: * Match 0 or more times + Match 1 or more times ? Match 1 or 0 times Examples: /"([^"\\]|\\.)*"/ - C-like string /[a-zA-Z_][a-zA-Z_0-9]*/ - C-like identifier 4.2 The grammar-description file consists of four sections. 1) Directives Here you can control the process of code generation. 2) Vocabulary This section contains declarations, which define terminal symbols, their text representations, and semantic actions. 3) Attributes This section describes L-attributes in your grammar. 4) LALR(1) Grammar This section contains one or more grammar rules. 4.3.1 Section #1: Directives This is the full list of target language independent directives. Integer value: maxtoken - maximum length of one token stack - maximum size of the parsing stack String value: class - name of the parsing engine getsym - code for getting a next symbol and putting it to the 'chr' variable errorprefix - this string is inserted after the first parenthesis in call of error function positioning - can be: none, line, full or offset namespace - namespace name to place class in lang - can be: c++, cs, js lexemend - can be: on, off (default) 4.3.2 Section #2: Vocabulary In this section you must define at least one terminal (lexical element). Each terminal definition has the following notation: id typeopt ':' regexp intopt cmdopt OR id typeopt ':' Here: id - identificator, the name of a new terminal type (optional) - language specific, type of data associated with the terminal regexp - regular expression int (optional, default: 0) - signed integer, priority, used to resolve conflicts between terminals cmd (optional) - semantic actions In more details: All regexps are used to make a finite state machine, which reads the input and breaks it into tokens. 'intopt' is used to resolve finite state machine conflicts, this is the priority. For example if you have token /abc/ and token /[a-z]+/ you will have a conflict. To resolve it correctly you must increase the priority of the first lexem, or decrease the priority of the last one. So the definition will look like this: id : /[a-z]+/ -1 abc : /abc/ Semantic actions is a target-language-dependent code, which takes lexem text representation ('token' variable) and associates the data with the terminal. Also this code can modify the type of current lexem, current position, or just do his own work. The following metacharacters are available in these actions: @ Data associated with a new lexem, is equal to zero by default. (language-specific) $id Number, which corresponds the 'id' lexem $@ Current lexem position (structure) $$ variable, which contains the number of the current lexem. You can change it to add some context-sensitivity in your parser. Ex: type: id : /[a-zA-Z_]+/ -1 { if( istype(token) ) $$ = $type; } A predefined lexem 'eoi' has zero number and special meaning, but you can write additional regular expressions for it, and break your data processing at specified tokens. If you define 'error' lexem, it cannot have regular expression, this is predefined lexem, which is used in error recovering (see 5.) You cannot define 'input' lexem because it is predefined as nonterminal. You can have several declarations for a terminal, but you must have the only semantic action for it. Ex: const : /const|__const__/ { go(); } is equivalent to const : /const/ const : /__const__/ { go(); } But this is not a constraint, you can use the following pattern if you want to distinguish several views of a terminal: term : /view1/ { action1(); } term2 : /view2/ { action2(); $$ = $term; } Examples: id(char*): /[a-zA-Z_][a-zA-Z_0-9]*|'[^']+'/ \ @ = _strdup(token); regexp(char*): /\/([^\/\\]|\\.)*\// \ @ = strstrip(token); '::=': /::=/ '[': /\[/ ']': /\]/ 4.3.3 Section #3: Attributes By attribute we mean the data associated with the symbol. In this section you can add L-attributes in your grammar. By default, all attributes in your grammar are synthesized; it means that they depend only on their children. So when you writing rule like this: a :: = b c { action } ; the resulting value of 'a' is a function of 'b' and 'c', i.e. action is equivalent to "a = action(b,c)". In this section you can define heritable attributes. Each attribute will depend also on the left siblings of its symbol. This section is declarative. It is used only for static checks and type control. If you have declared that a symbol has an attribute, lapg checks, whether all entries of this symbol has attributes. Syntax notation is the following: attrib_section : [ attribute-comma-sep-listopt ] attribute : id -> id id Example #1. [ b ] This means, that each entry of 'b' symbol must be preceded by action. This action must produce attribute for 'b'. c ::= a d { $$ = $a; } b ; Now in any rule for 'b' we can use metacharacter $# to access the 'a' symbol. c ::= a b; Wrong. No attribute for b specified. Example #2. [ type -> newvar_list ] definition ::= type newvar_listopt ';' ; newvar_list ::= newvar { addvar( $newvar, $# ); } | newvar_list ',' newvar { addvar( $newvar, $# ); } ; In the attributes section we declared that in our grammar all newvar_list entries are either the leftmost symbol in the rule with newvar_list in the left part, or preceded by the 'type' symbol, therefore, $# symbol refers to 'type' attribute. 4.3.4 Section #4: LALR(1) Grammar In this section you must declare at least one rule for input nonterminal. Syntax notation is the following: lalr_section: ruledecl_list ruledecl: header '::=' rightpart_or_list ';' header '::' rightpart_eq_list ';' header '[' rightpart_eq_list ']' prioritydef header: identificator typeopt rightpart_or_list: rightpart rightpart_or_list '|' rightpart rightpart_eq_list: '=' rightpart rightpart_eq_list '=' rightpart rightpart: rightelem_list rule_priorityopt cmdopt rightelem: cmdopt identificator rule_priority: << identificator prioritydef: % [left|right|nonassoc] identificator_list Restrictions: Symbols ending with opt and terminals cannot be used in header (i.e. in the left part of the rule). Semantics: When lapg encounters a symbol ending with opt, it defines the symbol automatically using the following rules: symbolopt ::= symbol | ; You can define semantic actions at any place in the rule. Actions at the end of the rule will be performed when this rule will be applied. All semantic actions you define become a part of parse function, therefore do not use 'chr', 'token' and 'lapg_*' variables. The following metacharacters are available in these actions: $$ value associated with new nonterminal. $$ = $0 by default. (language-specific: in C/C++ has void* type) $0,$1,$2,$3 value associated with i-th in the right part of rule (language-specific: in C/C++ has void* type) $name#num value associated with num-th entry of the name in the rule (zero based). It has the type, which was associated with this symbol (or default type otherwise). See: examples $name is equivalent to $name#0 $# L-attribute. @$ position of newly created lexem (by default: @$ = @0; ) @0,@1,... position of i-th symbol in the right part of rule Examples: term(int) ::= integer; # by default $term = $integer; term(int) ::= ':' integer { $term = $integer }; # by default $term = $0 term(int) ::= ':' integer { $$ = $1 }; # by default $term = $0 attrib ::= id '->' id { printf( "%s -> %s", $id#0, $id#1 ); } ; multiplication ::= simple | multiplication '*' simple { printf( "%i = %i*%i", $multiplication#0, $multiplication#1, $simple ); } ; 5. Errors handling mechanism By default, when the parser encounters an unexpected input, it calls error function with string "syntax error ..." and exits. But if you need, lapg can skip some part of input an continue analysis. First of all, you define 'error' lexem in terminal description section: error: The definition cannot have regular expression, it turns on mechanism of error recovering. This terminal can be used in grammar in place of skipped part of input. For example: expression_statement :: = expression ';' = error ';' { printf( "expression skipped\n" ); } = ';' ; If syntax error is encountered while waiting for expression_statement, input will be skipped up to ';' and the second rule will be applied. 6. Operator precedence Works like in Bison. I don't wont to retell about it, you can find the information in the following articles: http://www.gnu.org/software/bison/manual/html_node/Precedence.html http://www.gnu.org/software/bison/manual/html_node/Contextual-Precedence.html exception: '%prec' in bison is '<<' in lapg