Here are some alternative implementations of uri-generic, using various
parsing libraries. They all passed the full testsuite at the time they
were written. Occasionally they are updated to make them work again with
the latest testsuite.

The original uri-generic implementation (based on matchable) is a bit
hard to read, but it is by far the most efficient. For readability I
prefer the irregex one, based on composable Scheme Regular Expressions.
Its performance is about average compared to the other implementations,
and it will hopefully improve when irregex is optimized.

Here are some notes I made after benchmarking and implementing each
version (benchmarks on a Lenovo X230, Chicken 4.11.0, which ships w/
irregex 0.9.4).

The benchmark can be done like this:

  (load "uri-generic.irregex.so") ;; Use the compiled version!
  (load "../uri-generic.import.scm")
  (import uri-generic)
  (time (let loop ((i 0))
          (unless (> i 10000)
            (uri-reference "http://call-with-current-continuation.org/foo/bar?mooh#qux")
            (loop (add1 i)))))

You can compile with:

  for i in uri-generic.*.scm; do echo Compiling $i; csc -s -O3 $i; done

Test with "csi -s test.scm" after changing it for the tested
implementation.

The +/- behind the LOC indicates that some parts are implemented using
irregexes. If those parts were converted to get rid of the irregex
dependency, the lines of code would change (generally increase).

==============================================================================

Current main implementation: based on matchable [[1284 LOC]]

0.276s CPU time, 0.036s GC time (major), 1320148/7492 mutations
(total/tracked), 10/1653 GCs (major/minor)

NOTES: The implementation is a bit hard to read and understand, even if
you know how it works. It is also the largest implementation (in lines of
code). But it's faaaaaast!
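A minimal, hypothetical sketch of the matching style (not the actual
uri-generic code; parse-scheme is an invented helper):

```scheme
;; Sketch: convert the string to a char list once, then let matchable's
;; optimizing `match' macro dispatch on the expected chars.
(use matchable)  ; Chicken 4 style; the egg also works on Chicken 5

;; Consume a simplified "<scheme>:" prefix from a char list, returning
;; the scheme as a string consed onto the remaining chars, or #f.
(define (parse-scheme chars)
  (let loop ((lst chars) (acc '()))
    (match lst
      ((#\: . rest)                    ; found the delimiter: done
       (cons (list->string (reverse acc)) rest))
      (((? char-alphabetic? c) . rest) ; accumulate scheme chars
       (loop rest (cons c acc)))
      (_ #f))))                        ; anything else: no match

(parse-scheme (string->list "http://example.org"))
;; => ("http" #\/ #\/ #\e ...)
```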
Turns out that converting a string to a list of chars once, and then
using an optimizing macro to match the list against expected chars, is a
very efficient way to do parsing :)

------------------------------------------------------------------------------

Implementation based on abnf [[966 LOC]]

1.392s CPU time, 0.112s GC time (major), 2980315/5741 mutations
(total/tracked), 27/11011 GCs (major/minor)

NOTES: abnf has undergone a few major API changes, which meant this
alternative implementation required a massive overhaul. I'd be hesitant
to rely on a parsing egg which changes so radically. On the other hand,
it may have stabilised by now. The API is pretty easy to use, and the
resulting parsers are rather readable.

Originally abnf appeared to be faster than comparse, but we were
comparing apples and oranges: the abnf implementation of
uri-decode-string was still using the irregex-based version, which turns
out to be quite the bottleneck.

------------------------------------------------------------------------------

Implementation based on comparse [[893 LOC]]

0.84s CPU time, 0.116s GC time (major), 1870231/55987 mutations
(total/tracked), 43/6724 GCs (major/minor)

NOTES: Reasonably easy to use, and the resulting parser is pretty
readable. It used to be very slow, but performance has been worked on a
lot, so now it's more acceptable.

In many respects it is very similar to abnf; obviously I prefer the
license of comparse, but overall abnf is more readable. The learning
curve of the current version of abnf is slightly steeper than that of
comparse (the old version being quite hard to use), but once you "get"
it, it's easy enough. This may just be a documentation issue, combined
with the confusion of having to deal with the APIs of two eggs
(lexgen/typeclass), rather than an inherent problem.
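To give a flavour of the comparse style, here is a hedged sketch using
combinator names from the comparse egg; the grammar is a simplified
stand-in, not the real uri-generic one:

```scheme
(use comparse srfi-14)  ; Chicken 4 style imports

;; scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )   (RFC 3986)
(define uri-scheme
  (sequence* ((first (in char-set:letter))
              (rest  (zero-or-more
                      (any-of (in char-set:letter+digit)
                              (is #\+) (is #\-) (is #\.)))))
    (result (list->string (cons first rest)))))

;; Parses a prefix of the input and yields the scheme as a string:
(parse uri-scheme "http://example.org")  ;; => "http"
```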
------------------------------------------------------------------------------

irregex-based implementation [[859 LOC]]

0.896s CPU time, 0.024s GC time (major), 600077/2761 mutations
(total/tracked), 24/3076 GCs (major/minor)

NOTES: Very concise, but still very readable syntax when using named
matches. I especially like that built-in regexes like hex-digit and such
are all quoted, whereas custom bits are unquoted. It's an immediate
indication that the definition is to be found elsewhere in the code.
(AFAIK, this is not extensible, so you can't make libraries of regexes in
the same way; at that point everything of the library becomes unquoted
too.)

The 0.8.1 version of irregex, which shipped with Chickens up till 4.6.1,
was extremely slow (this benchmark took over 50 seconds to run), but with
irregex 0.8.2 the irregex implementation performs similarly to the
matchable implementation (though it's still slightly slower).

------------------------------------------------------------------------------

packrat-based implementation [[997 LOC]]

1.972s CPU time, 0.296s GC time (major), 3110328/513298 mutations
(total/tracked), 273/10605 GCs (major/minor)

NOTES: In compiled mode it's the slowest of the bunch, and it's extremely
clumsy. Do not use, avoid at all costs, etc. Why? Many reasons:

It is extremely annoying that the macro requires you to provide a body
expression. There is no automatic way to let it just resolve to whatever
the lower nonterminals resolve to, nor to combine code from a sequence of
expressions, so you have to spell it out every time.

In general, the macro just feels clumsy, and you have to keep the
specification next to your code all the time because it feels so
unnatural (requiring, to my feeling, superfluous sets of parentheses).

The macro does not allow escaping to Scheme quickly to produce a
dynamically generated parser in-line.
You can define procedures outside of the macro which can be called as
"nonterminals", but you can't, for example, parameterize a parser
on-the-fly. The macro is much more like plain old BNF or normal yacc.
This means that for simple things like optional arguments you need to
make a separate rule containing both the argument and an empty sequence.
Quite annoying after you've worked with any of the above parser libs,
where that's no problem. Perhaps I just misunderstood how to use the
(/ ...) rule, but I couldn't get it to work.

Debugging is pretty terrible, due to the definitions being wrapped in one
large macro.

Because everything is wrapped up in one macro, you can't easily reuse
subparsers unless you rip them out and put them in their own macro. This
makes it completely uncomposable. Possibly things are better when you
just use the combinators directly, skipping the macro.

The advantage of the macro is that the resulting ruleset is quite
readable.

------------------------------------------------------------------------------

prcc-based implementation [[923 LOC]]

33.26s CPU time, 1.72s GC time (major), 63413798/28125899 mutations
(total/tracked), 919/263952 GCs (major/minor)

NOTES: At first this looks like a very limited library, but once you
start working with it you realise that all the important combinators are
there. It's a very clean API if you ignore all the aliases. It only
supplies what's needed, and you can easily build additional combinators
on top of it. The documentation was a bit lacking, but a dive into the
source code was enough to figure out most things (and I've now extended
the docs a bit with that knowledge).

I used the comparse implementation as a basis, which was easy enough to
convert to prcc. I found a couple of bugs, which were fixed within hours
by the very responsive maintainer, even though the egg was from 2012 and
had only had one release!

A real disadvantage is that parse errors will always be written to the
current output port.
There seems to be no way to supply custom error handling. The error
handling itself is pretty decent in that it keeps track of the input
position and knows what was expected when the error occurred.
Unfortunately, for more complex parsers this doesn't work perfectly:

  (parse-string "qux" (sel (str "foo") (str "bar")))

results in an error that says "bar was expected". On the other hand, this
would be easy to customize by using "act" with a custom failure handler.

It's obviously too slow for real-world use, probably because it uses
srfi-41, which is not known to be efficient at all. It could probably be
converted to lazy-seq, which should speed things up, but I haven't done
any serious performance comparisons.

=========================================================================

Other notes that people interested in Scheme parsing might want to read
(by zbigniew): http://3e8.org/zb/eggs/lexing.txt