# xj — HTML to JSON
`xj` is a Unix filter that reads XML (or permissively parses
HTML) and outputs JSON. Perfect for piping directly into [jq], [gron]
or [json2tsv].
## Usage
    wget -qO- https://stedolan.github.io/jq/ | xj | jq '..|select(.title?)[][]'
## Description
I put it together but it's just a tiny bit of glue code that uses
the [HTML parser][html] and the [output combinators][fmt] both made by
Alex Shinn.

This is just an early release and there's a pretty big bug currently:
tabs in the input document are not being escaped properly and
will cause jq to crash. Hoping to fix that in a future release.
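
To illustrate why an unescaped tab is a problem: JSON does not allow raw control characters inside string literals, so a tab has to be written as `\t`. The sketch below uses Python's standard `json` module as a stand-in strict parser; it is not `xj` or jq, and the sample string is made up for illustration.

```python
import json

# A JSON string literal that contains a raw (unescaped) tab character.
raw = '"a\tb"'

try:
    json.loads(raw)
    raw_ok = True
except json.JSONDecodeError:
    # A strict parser rejects control characters inside strings.
    raw_ok = False

# A conforming serializer writes the tab as the two characters backslash-t.
escaped = json.dumps("a\tb")

print(raw_ok)   # False
print(escaped)  # "a\tb"
```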
[jq]: https://stedolan.github.io/jq/
[gron]: https://github.com/tomnomnom/gron
[json2tsv]: https://codemadness.org/json2tsv.html
[html]: http://wiki.call-cc.org/eggref/5/html-parser
[fmt]: http://synthcode.com/scheme/fmt/
## Formal Semantics
Elements are objects with one key, the element name, and the value is
an array with the children of the element, or an empty array if there
aren't any. (This is to disambiguate elements from text data.)

Iff there are any attributes, an attribute object is listed first among
the children, disambiguated from the other children by having a "@"
key. The attributes are not in a list; they can be accessed directly.

In XML, an element can have several children with the same name, which
in turn can have children of their own. The same isn't true for
attributes, which is why they can have simpler semantics.
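
As a rough illustration of the shape described above, here is a minimal Python sketch that maps a small XML fragment onto the same structure. It is not how `xj` is implemented (that is Scheme glue code), and a few details are assumptions: that the "@" key holds a plain name-to-value object, that whitespace-only text is dropped, and the helper name `to_xj_shape` is hypothetical.

```python
import json
import xml.etree.ElementTree as ET

def to_xj_shape(elem):
    """Return {name: [children...]} for one element (hypothetical helper)."""
    children = []
    if elem.attrib:
        # Assumed layout: attributes go first, under a single "@" key.
        children.append({"@": dict(elem.attrib)})
    if elem.text and elem.text.strip():
        children.append(elem.text)
    for child in elem:
        children.append(to_xj_shape(child))
        # Text that follows a child element belongs to the parent.
        if child.tail and child.tail.strip():
            children.append(child.tail)
    # An element without children still gets an (empty) array.
    return {elem.tag: children}

doc = ET.fromstring('<p class="intro">Hello <b>world</b><br/></p>')
result = to_xj_shape(doc)
print(json.dumps(result))
# {"p": [{"@": {"class": "intro"}}, "Hello ", {"b": ["world"]}, {"br": []}]}

# Attributes are a plain object, so they can be read directly by name.
print(result["p"][0]["@"]["class"])  # intro
```

Because text is a bare string and every element is a one-key object, a consumer can tell the two apart by type alone, which is the disambiguation the empty array is there to preserve.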
## Building
Get the source with `git clone https://idiomdrottning.org/xj`. To
build it on Debian and derivatives, run:

    apt install chicken-bin
    chicken-install fmt html-parser srfi-1 utf8
    csc -O5 xj.scm

Remove the `-O5` when you're hacking.