In this example I specify a parse for a basic maths language. This function takes as input some mathematical expression and outputs an instance of `mpc_ast_t`.
Parser Combinators are structures that encode how to parse a particular language. They can be combined using a number of intuitive operators to create new parsers of ever increasing complexity. With these, complex grammars and languages can be processed easily.
The trick behind Parser Combinators is the observation that by structuring the library in a particular way one can make building parser combinators look like writing a grammar itself. Therefore instead of describing _how to parse a language_, a user must only specify _the language itself_, and the computer will work out how to parse it ... as if by magic!
The Parser Combinator type in _mpc_ is `mpc_parser_t`. This encodes a function that attempts to parse some string and, if successful, returns a pointer to some data. Otherwise it returns some error. A parser can be run using `mpc_parse`.
This function returns `true` on success and `false` on failure. It takes as input some parser `p`, some input string `s`, and some filename `f`. It outputs into `r` the result of the parse - which is either a pointer to some data object, or an error. The type `mpc_result_t` is a union type defined as follows.
All the following functions return basic parsers. All of those parsers return strings with the character(s) they manage to match. They have the following functionality.
Combinators are functions that take one or more parsers and return a new parser. These combinators work independent of what exactly those input parsers return on success. In languages such as Haskell ensuring you don't ferry one type of data into a parser requiring a different type of data is done by the compiler. But in C we don't have that luxury. So it is at the discretion of the programmer to ensure that he or she deals correctly with the outputs of different parser types.
A second annoyance in C is that of manual memory management. Some parsers might get half-way and then fail. This means they need to clean up any partial data that has been collected in the parse. In Haskell this is handled by the Garbage Collector, but in C these combinators will need to take _destructor_ functions as input, which say how clean up any partial data that has been collected.
Returns a parser with the following behaviour. If parser `a` succeeds, the it fails. If parser `a` fails, then it succeeds and returns `NULL` (or the result of lift function `lf`). Destructor `da` is used to destroy the result of `a` on success.
Attempts to run `a` zero or more times. If this fails then it succeeds and returns `NULL` (or the result of `lf`). If there is at least one success, results of `a` are combined using fold function `f`. See the _Function Types_ section for more details.
Attempts run `a` exactly `n` times. If this fails, the result of fold function `f` is destructed with `da`, and it returns `NULL` (or the result of lift function `lf`). Result of `a` are combined using fold function `f`.
Attempts to run `a`. Then attempts to run `b`. If `b` fails it destructs the result of `a` using `da`. If both succeed it returns the result of `a` and `b` combined using the fold function `f`. Otherwise it returns an error.
Attempts to run `n` parsers in sequence, returning the fold of the results using fold function `f`. First parsers must be specified, followed by destructors for each parser minus the final one. These are used in case of partial success. For example: `mpc_and(3, mpcf_astrfold, mpc_char('a'), mpc_char('b'), mpc_char('c'), free, free);` would attempt to match `'a'` followed by `'b'` followed by `'c'`, and if successful would concatenate them using `mpcf_astrfold`.
The combinator functions take a number of special function types as function pointers. Here is a short explanation of those types are how they are expected to behave.
This takes in some pointer to data and outputs some new or modified pointer to data, ensuring to free and old data no longer required. The `apply_to` variation takes in an extra pointer to some data such as state of the system.
This takes two pointers to data and must output some new combined pointer to data, ensuring to free and old data no longer required. When used with the `many`, `many1` and `count` functions this initially takes in `NULL` for it's first argument and following that takes in for it's first argument whatever was previously returned by the function itself. In this way users have a chance to build some initial data structure before populating it with whatever is passed as the second argument.
This function returns some data value when called. It can be used to create _empty_ versions of data types when certain combinators have no known default value to return.
This will construct a parser called `name` which can then be used by others, including itself. Any parser created using `mpc_new` is said to be _retained_. This means it will behave slightly differently to a normal parser. For example when deleting a parser that includes a _retained_ parser, the _retained_ parser it will not be deleted along with it. To delete a retained parser `mpc_delete` must be used on it directly.
This assigns the contents of parser `a` to `p`, and frees and memory used by `a`. With this technique parsers can now reference each other, as well as themselves, without trouble.
A final step is required. Parsers that reference each other must all be undefined before they are deleted. It is important to do any undefining before deletion. The reason for this is that to delete a parser it must look at each sub-parser that is used by it. If any of these have already been deleted a segfault is unavoidable.
To ease the task of undefining and then deleting parsers `mpc_cleanup` can be used. It takes `n` parsers as input, and undefines them all, before deleting them all.
*`mpc_parser_t* mpc_start(mpc_parser_t* a);` Matches the start of input an `a`
*`mpc_parser_t* mpc_end(mpc_parser_t* a, mpc_dtor_t da);` Matches `a` followed by the end of input
*`mpc_parser_t* mpc_enclose(mpc_parser_t* a, mpc_dtor_t da);` Matches the start of input, `a` and then the end of input
*`mpc_parser_t* mpc_strip(mpc_parser_t* a);` Matches `a` striping any surrounding whitespace
*`mpc_parser_t* mpc_tok(mpc_parser_t* a);` Matches `a` and strips any trailing whitespace
*`mpc_parser_t* mpc_sym(const char* s);` Matches string `s` and strips any trailing whitespace
*`mpc_parser_t* mpc_total(mpc_parser_t* a, mpc_dtor_t da);` Matches the whitespace stripped `a`, enclosed in the start and end of input
*`mpc_parser_t* mpc_between(mpc_parser_t* a, mpc_dtor_t ad, const char* o, const char* c);` Matches `a` between strings `o` and `c`
*`mpc_parser_t* mpc_parens(mpc_parser_t* a, mpc_dtor_t ad);` Matches `a` between `"("` and `")"`
*`mpc_parser_t* mpc_braces(mpc_parser_t* a, mpc_dtor_t ad);` Matches `a` between `"<"` and `">"`
*`mpc_parser_t* mpc_brackets(mpc_parser_t* a, mpc_dtor_t ad);` Matches `a` between `"{"` and `"}"`
*`mpc_parser_t* mpc_squares(mpc_parser_t* a, mpc_dtor_t ad);` Matches `a` between `"["` and `"]"`
*`mpc_parser_t* mpc_tok_between(mpc_parser_t* a, mpc_dtor_t ad, const char* o, const char* c);` Matches `a` between `o` and `c`, where `o` and `c` have their trailing whitespace striped.
*`mpc_parser_t* mpc_tok_parens(mpc_parser_t* a, mpc_dtor_t ad);` Matches `a` between trailing whitespace stripped `"("` and `")"`
*`mpc_parser_t* mpc_tok_braces(mpc_parser_t* a, mpc_dtor_t ad);` Matches `a` between trailing whitespace stripped `"<"` and `">"`
*`mpc_parser_t* mpc_tok_brackets(mpc_parser_t* a, mpc_dtor_t ad);` Matches `a` between trailing whitespace stripped `"{"` and `"}"`
*`mpc_parser_t* mpc_tok_squares(mpc_parser_t* a, mpc_dtor_t ad);` Matches `a` between trailing whitespace stripped `"["` and `"]"`
*`mpc_val_t* mpcf_fst_free(mpc_val_t* x, mpc_val_t* y);` Returns `x` and frees `y`
*`mpc_val_t* mpcf_snd_free(mpc_val_t* x, mpc_val_t* y);` Returns `y` and frees `x`
*`mpc_val_t* mpcf_freefold(mpc_val_t* t, mpc_val_t* x);` Returns `NULL` and frees `x`
*`mpc_val_t* mpcf_strfold(mpc_val_t* t, mpc_val_t* x);` Concatenates `t` and `x` and returns result
*`mpc_val_t* mpcf_afst(int n, mpc_val_t** xs);` Returns first argument
*`mpc_val_t* mpcf_asnd(int n, mpc_val_t** xs);` Returns second argument
*`mpc_val_t* mpcf_atrd(int n, mpc_val_t** xs);` Returns third argument
*`mpc_val_t* mpcf_astrfold(int n, mpc_val_t** xs);` Concatenates and returns all input strings
*`mpc_val_t* mpcf_between_free(int n, mpc_val_t** xs);` Frees first and third argument and returns second
*`mpc_val_t* mpcf_maths(int n, mpc_val_t** xs);` Examines second argument as string to see which operator it is, then operators on first and third argument as if they are `int*`.
Passing around all these function pointers might seem clumsy, but having parsers be type-generic is important as it lets users define their own syntax tree types as well as perform specific house-keeping or data processing in the parsing phase. For example we can specify a simple maths grammar that computes the result of the expression as it goes along.
Even with all that has been shown above, specifying parts of text can be a tedious task requiring many lines of code. So _mpc_ provides a simple regular expression matcher.
This returns a parser that will attempt to match the given regular expression pattern, and return the matched string on success. It does not have support for groups and match objects, but should be sufficient for simple tasks.
A cute thing about this is that it uses previous parts of the library to parse the user input string - and because _mpc_ is type generic, the parser spits out a `mpc_parser_t` directly! It even uses many of the combinator functions as fold functions! This is a great case study in learning how to use _mpc_, so those curious are encouraged to find it in the source code.
Abstract Syntax Tree
--------------------
For those that really do not care what data they get out a basic abstract syntax tree type `mpc_ast_t` has been included. Along with this are included some combinator functions which work specifically on this type. They reside under `mpca_*` and you will notice they do not require fold functions or destructors to be specified.
Doing things via this method means that all the data processing must take place after the parsing - but to many this will be preferable. It also allows for one more trick...
If all the fold and destructor functions are implicit then the user can simply specify the grammar in some nice way and the system can try to build an AST for them from this alone.
This can be used to do exactly that. It takes in some grammar, as well as a list of named parsers - and outputs a parser that does exactly what is specified.