# chicken-lexgen Lexer and parser combinators in Chicken Scheme ## Description `lexgen` is a lexer generator comprised in its core of only five small procedures that can be combined to form pattern matchers. A pattern matcher procedure takes an input stream, and returns a new stream advanced by the pattern. A stream is defined as a list that contains a list of characters consumed by the pattern matcher, and a list of characters not yet consumed. E.g., the list ((#\a) (#\b #\c #\d #\e)) represents a stream that contains the consumed character a, and the unconsumed characters b c d e. A pattern matcher has the form of a procedure that takes a success continuation, which is invoked when the pattern matches and the stream is advanced, an error continuation, which is invoked when the pattern does not match, and an input stream. ## Library Procedures Every combinator procedure in this library returns a procedure that takes in a success continuation, error continuation and input stream as arguments. ### Basic procedures (seq MATCHER1 MATCHER2) => MATCHER `seq` builds a matcher that matches a sequence of patterns. (bar MATCHER1 MATCHER2) => MATCHER `bar` matches either of two patterns. It's analogous to patterns separated by `|` in traditional regular expressions. (star MATCHER) => MATCHER `star` is an implementation of the Kleene closure. It is analogous to `*` in traditional regular expressions. ### Token procedure (tok TOKEN PROC) => MATCHER Procedure `tok` builds pattern matchers based on character comparison operations. It is intended for matching input sequences that are SRFI-127 lazy streams. For each stream given, `tok` applies the procedure `PROC` to the given token `TOKEN` and an input character. If the procedure returns a true value, that value is prepended to the list of consumed elements, and the input character is removed from the stream of input elements. (char CHAR) => MATCHER Matches a single character. (set CHAR-SET) => MATCHER Matches any of a SRFI-14 set of characters. (range CHAR CHAR) => MATCHER Matches a range of characters. Analogous to character class `[]`. (lit STRING) => MATCHER Matches a literal string `s`. ### Convenience procedures These procedures are built from the basic procedures and are provided for convenience. (try PROC) => PROC Converts a binary predicate procedure to a binary procedure that returns its right argument when the predicate is true, and false otherwise. (lst MATCHER-LIST) => MATCHER Constructs a matcher for the sequence of matchers in `MATCHER-LIST`. (pass) => MATCHER This matcher returns without consuming any input. (pos MATCHER) => MATCHER Positive closure. Analogous to `+`. (opt MATCHER) => MATCHER Optional pattern. Analogous to `?`. (bind F P) => MATCHER Given a rule `P` and function `F`, returns a matcher that first applies `P` to the input stream, then applies `F` to the returned list of consumed tokens, and returns the result and the remainder of the input stream. Note: this combinator will signal failure if the input stream is empty. (bind* F P) => MATCHER The same as `bind`, but will signal success if the input stream is empty. (rebind F G P) => MATCHER Given a rule `P` and procedures `F` and `G`, returns a matcher that first applies `F` to the input stream, then applies `P` to the resulting stream, then applies `G` to the resulting list of consumed elements and returns the result along with the remainder of the input stream. Note: this combinator will signal failure if the input stream is empty. (rebind* F G P) => MATCHER The same as `rebind`, but will signal success if the input stream is empty. (drop P) => MATCHER Given a rule `P`, returns a matcher that always returns an empty list of consumed tokens when `P` succeeds. ### Lexer procedure (lex MATCHER ERROR STRING) => CHAR-LIST `lex` takes a pattern and a string, turns the string into a list of streams (containing one stream), applies the pattern, and returns the first possible match. Argument `ERROR` is a single-argument procedure called when the pattern does not match anything. ## Examples ### A pattern to match floating point numbers ```scheme ;; A pattern to match floating point numbers. ;; "-"?(([0-9]+(\\.[0-9]+)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)? (define numpat (let* ((digit (range #\0 #\9)) (digits (pos digit)) (fraction (seq (char #\.) digits)) (significand (bar (seq digits (opt fraction)) fraction)) (exp (seq (set "eE") (seq (opt (set "+-")) digits))) (sign (opt (char #\-)))) (seq sign (seq significand (opt exp))))) (define (err s) (print "lexical error on stream: " s) (list)) (lex numpat err "-123.45e-6") ``` ### A pattern to match floating point numbers and construct user-defined lexer state ```scheme (define (collect cs) (let loop ((cs cs) (ax (list))) (cond ((null? cs) `(,(list->string ax))) ((atom? (car cs)) (loop (cdr cs) (cons (car cs) ax))) (else (cons (list->string ax) cs))))) (define (make-exp x) (or (and (pair? x) (let ((x1 (collect x))) (list `(exp . ,x1)))) x)) (define (make-significand x) (or (and (pair? x) (let ((x1 (collect x))) (cons `(significand ,(car x1)) (cdr x1)))) x)) (define (make-sign x) (or (and (pair? x) (let ((x1 (collect x))) (cons `(sign ,(car x1)) (cdr x1)))) x)) (define (check s) (lambda (s1) (if (null? s1) (err s) s1))) (define bnumpat (let* ((digit (range #\0 #\9)) (digits (star digit)) (fraction (seq (char #\.) digits)) (significand (bar (seq digits (opt fraction)) fraction)) (exp (seq (set "eE") (seq (opt (set "+-")) digits))) (sign (opt (char #\-)) ) (pat (seq (bind make-sign sign) (seq (bind make-significand significand) (bind make-exp (opt exp)))))) pat)) (define (num-parser s) (car (lex bnumpat err s))) (num-parser "-123.45e-6") ``` ## Version History * 8.2 Removed yasos dependency [thanks to Noel Cragg] * 8.1 Ported to CHICKEN 5 and yasos collections interface * 7.1 Bug fix in bind* [thanks to Peter Bex] * 7.0 Added bind* and rebind* variants of bind and rebind [thanks to Peter Bex] * 6.1-6.2 Corrected behavior of the tok combinator so that the failure continuation is invoked upon end-of-input [thanks to Chris Salch] * 6.0 Using utf8 for char operations * 5.2 Ensure test script returns proper exit status * 5.0-5.1 Added error continuation to the matcher interface and eliminated multiple stream matching * 4.0 Implemented typeclass interface for abstracting over input sequences * 3.8 Added procedure `star*` (greedy Kleene closure matching) * 3.6 Added procedure redo [thanks to Christian Kellermann] * 3.5 Bug fixes in bind [reported by Peter Bex] * 3.3 Bug fixes in stream comparison * 3.2 Improved input stream comparison procedures * 3.1 Added rebind combinator and stream-unfold procedure * 3.0 Added an extension mechanism for input streams of different types (to be elaborated and documented in subsequent versions). * 2.6 Added bind and drop combinators * 2.5 The seq combinator checks whether the first parser in the sequence has failed * 2.4 Added (require-library srfi-1); using lset<= instead of equal? in star * 2.3 Bug fix in procedure range; added procedure cps-table * 2.2 Bug fix in procedure star * 2.1 Added procedure lst * 2.0 Core procedures rewritten in continuation-passing style * 1.5 Using (require-extension srfi-1) * 1.4 Ported to Chicken 4 * 1.2 Added procedures try and tok (supersedes pred) * 1.0 Initial release ## License Based on the [SML lexer generator](http://www.standarddeviance.com/projects/combinators/combinators.html) by Thant Tessman. > > Copyright 2009-2021 Ivan Raikov. > > > This program is free software: you can redistribute it and/or modify > it under the terms of the GNU General Public License as published by > the Free Software Foundation, either version 3 of the License, or > (at your option) any later version. > > This program is distributed in the hope that it will be useful, but > WITHOUT ANY WARRANTY; without even the implied warranty of > MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > General Public License for more details. > > A full copy of the GPL license can be found at > . >