Here are some alternative implementations of uri-generic, using various parsing libraries. They all passed the full testsuite at the time they were written. The original uri-generic implementation (based on matchable) is a bit hard to read, but it is by far the most efficient. For readability I prefer the irregex one, based on composable Scheme Regular Expressions. It is about average in performance compared to the other implementations and it will hopefully improve when irregex is optimized. Here are some notes I made after benchmarking and implementing each version: (benchmarks on an iBook G4, using Chicken 4.6.2 with irregex 0.8.2) The +/- behind the LOC indicates that some parts are implemented using irregexes. If those parts would be converted to get rid of the irregex dependency, the lines of code would change (generally increase) ============================================================================== Current main implementation: based on matchable [[1058 LOC]] #;1> ,t (let loop ((i 0)) (unless (> i 10000) (uri-reference "http://call-with-current-continuation.org/foo/bar?mooh#qux") (loop (add1 i)))) 7.055s CPU time, 0.747s GC time (major), 1320215 mutations, 85/16096 GCs (major/minor) NOTES: Implementation is a bit hard to read & understand even if you know how it works. It is also the largest implementation (in lines of code). But it's faaaaaast! Turns out that converting a string to a list of chars once and then using an optimizing macro to match the list against expected chars is a very efficient way to do parsing :) ------------------------------------------------------------------------------ Implementation based on abnf [[790+/- LOC]] #;1> ,t (let loop ((i 0)) (unless (> i 10000) (uri-reference "http://call-with-current-continuation.org/foo/bar?mooh#qux") (loop (add1 i)))) 26.32s CPU time, 5.039s GC time (major), 570140 mutations, 411/67768 GCs (major/minor) NOTES: It always generates all possible matches. For example, an abnf for "a*" would generate ("a" "aa" "aaa") when passed "aaa" as input. I think this is the major reason it's so slow. ------------------------------------------------------------------------------ irregex-based implementation [[695 LOC]] #;1> ,t (let loop ((i 0)) (unless (> i 10000) (uri-reference "http://call-with-current-continuation.org/foo/bar?mooh#qux") (loop (add1 i)))) 7.659s CPU time, 0.76s GC time (major), 600143 mutations, 61/15838 GCs (major/minor) NOTES: Very concise but still very readable syntax when using named matches. I especially like that built-in regexes like hex-digit and such are all quoted, whereas custom bits are unquoted. It's an immediate indication that the definition is to be found elsewhere in the code (AFAIK, this is not extendable so you can't make libraries of regexes in the same way; at that point everything of the library becomes unquoted too). The 0.8.1 version of irregex which shipped with Chickens up till 4.6.1 was extremely slow (this benchmark took over 50 seconds to run), but with irregex 0.8.2, the irregex implementation performs similarly to the matchable implementation. ------------------------------------------------------------------------------ packrat-based implementation [[842+/- LOC]] #;1> ,t (let loop ((i 0)) (unless (> i 10000) (uri-reference "http://call-with-current-continuation.org/foo/bar?mooh#qux") (loop (add1 i)))) 53.339s CPU time, 15.325s GC time (major), 3100393 mutations, 1547/340859 GCs (major/minor) NOTES: It is extremely annoying that the macro requires you to provide a body expression. There is no automatic way to let it just resolve to whatever the lower nonterminals resolve to, nor to combine code from a sequence of expressions so you have to spell it out every time. In general, the macro just feels clumsy and you have to keep the specification next to your code all the time because it feels so unnatural (requiring, for my feeling, superfluous sets of parentheses). The macro does not allow escaping to Scheme quickly to produce a dynamically generated parser in-line. You can define procedures outside of the macro which can be called as "nonterminals" but you can't, for example, parameterize a parser on-the-fly. The macro is much more like plain old BNF or normal yacc. This means that for simple things like optional arguments you need to make a separate rule containing both the argument and an empty sequence. Quite annoying, after you've worked with any of the above parser libs where that's no problem. Perhaps I just misunderstood how to use the (/ ...) rule, but I couldn't get it to work. Possibly things are better when you just use the combinators instead, not using the macro. The advantage of the macro is that the resulting ruleset is quite readable. Furthermore, this parser has the same disadvantage that abnf has, it generates all possible parses and hence has quite a bit of overhead. ========================================================================= Other notes that people interested in Scheme parsing might want to read (by zbigniew): http://3e8.org/zb/eggs/lexing.txt