Introduction

Ugarit is a backup/archival system based around content-addressable storage.

News

Development priorities are: Performance, better error handling, and fixing bugs! After I've cleaned house a little, I'll be focussing on replicated backend storage (ticket [f1f2ce8cdc]), as I now have a cluster of storage devices at home.

About Ugarit

What's content-addressable storage?

Traditional backup systems work by storing copies of your files somewhere. Perhaps they go onto tapes, or perhaps they're in archive files written to disk. They will either be full dumps, containing a complete copy of your files, or incrementals or differentials, which only contain files that have been modified since some point. This saves making repeated copies of unchanging files, but it means that to do a full restore, you need to start by extracting the last full dump and then applying one or more incrementals, or the latest differential, to get the latest state. Not only do differentials and incrementals let you save space, they also give you a history - you can restore up to a previous point in time, which is invaluable if the file you want to restore was deleted a few backup cycles ago!

This technology was developed when the best storage technology for backups was magnetic tape, because each dump is written sequentially (and restores are largely sequential, unless you're skipping bits to pull out specific files). However, these days, random-access media such as magnetic disks and SSDs are cheap enough to compete with magnetic tape for long-term bulk storage (especially when one considers the cost of a tape drive or two). And having fast random access means we can take advantage of different storage techniques.

A content-addressable store is a key-value store, except that the keys are always computed from the values. When a given object is stored, it is hashed, and the hash is used as the key. This means you can never store the same object twice; the second time, you'll get the same hash, see the object is already present, and re-use the existing copy. Therefore, you get deduplication of your data for free.

But, I hear you ask, how do you find things again, if you can't choose the keys? When an object is stored, you need to record the key so you can find it again later. In Ugarit, everything is stored in a tree-like directory structure. Files are uploaded and their hashes obtained, and then a directory object is constructed containing a list of the files in the directory, listing the key of the Ugarit object storing the contents of each file. This directory object itself has a hash, which is stored inside the directory entry in the parent directory, and so on up to the root. The root of a tree stored in a Ugarit vault has no parent directory to contain it, so at that point, we store the key of the root in a named "tag" that we can look up by name when we want it. Therefore, everything in a Ugarit vault can be found by starting with a named tag and retrieving the object whose key it contains, then finding keys inside that object and looking up the objects they refer to, until we find the object we want.

When you use Ugarit to back up your filesystem, it uploads a complete snapshot of every file in the filesystem, like a full dump. But because the vault is content-addressed, it automatically avoids uploading anything it already has a copy of, so all we upload is an incremental dump - yet in the vault, it looks like a full dump, and so can be restored on its own without having to restore a chain of incrementals.

Also, the same storage can be shared between multiple systems that all back up to it - and the incremental upload algorithm means that any files shared between the servers will only need to be uploaded once. If you back up a complete server, then go and back up another that is running the same distribution, all the files in /bin and so on that are already in the storage will not need to be backed up again; the system will automatically spot that they're already there, and not upload them again.

As well as storing backups of filesystems, Ugarit can also be used as the primary storage for read-only files, such as music and photos. The principle is exactly the same; the only difference is in how the files are organised - rather than as a directory structure, the files are referenced from metadata objects that specify information about the file (so it can be found) and a reference to the contents. Sets of metadata objects are pointed to by tags as well, so they can also be found.
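To make the idea concrete, here's a minimal sketch of a content-addressable store in Chicken Scheme - an in-memory hash table standing in for a real backend, and a stand-in content-hash procedure standing in for Tiger or SHA-256. It's purely illustrative, not Ugarit's actual API:

(use srfi-69) ; hash tables

;; Stand-in for a real cryptographic content hash such as Tiger-192 or SHA-256.
(define (content-hash block)
  (number->string (string-hash block) 16))

;; The "storage": a mapping from keys (hashes) to blocks.
(define store (make-hash-table))

;; Storing a block returns its key; storing the same block twice re-uses the existing copy.
(define (put-block! block)
  (let ((key (content-hash block)))
    (unless (hash-table-exists? store key)
      (hash-table-set! store key block))
    key))

;; Retrieving a block is a simple lookup by key.
(define (get-block key)
  (hash-table-ref store key))

Everything except the tags is layered on top of this simple put/get interface; tags are the named, mutable exception.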

So what's that mean in practice?

Backups

You can run Ugarit to back up any number of filesystems to a shared storage area (known as a vault), and on every backup, Ugarit will only upload files or parts of files that aren't already in the vault - be they from the previous snapshot, earlier snapshots, snapshots of entirely unrelated filesystems, and so on. Every time you do a snapshot, Ugarit builds a complete directory tree of the snapshot in the vault - but it reuses any parts of files, whole files, or entire directories that already exist anywhere in the vault, and only uploads what doesn't already exist. The support for parts of files means that, in many cases, gigantic files like database tables and virtual disks for virtual machines will not need to be uploaded entirely every time they change, as only the changed sections will be identified and uploaded.

Because a complete directory tree exists in the vault for any snapshot, the extraction algorithm is incredibly simple - and, therefore, incredibly reliable and fast. Simple, reliable, and fast are just what you need when you're trying to reconstruct the filesystem of a live server. Also, it means that you can do lots of small snapshots. If you run a snapshot every hour, then only a megabyte or two might have changed in your filesystem, so you only upload a megabyte or two - yet you end up with a complete history of your filesystem at hourly intervals in the vault.

Conventional backup systems usually store a full backup and then incrementals to their archives, meaning that doing a restore involves reading the full backup and then reading and applying every incremental since - so to do a restore, you have to download *every version* of the filesystem you've ever uploaded, or you have to do periodic full backups (even though most of your filesystem won't have changed since the last full backup) to reduce the number of incrementals required for a restore. Better results are had from systems that use a special backup server to look after the archive storage, which accepts incremental backups and applies them to the snapshot it keeps, in order to maintain a most-recent snapshot that can be downloaded in a single run; but such systems then restrict you to using dedicated servers as your archive stores, ruling out cheaply scalable solutions like Amazon S3, or just backing up to a removable USB or eSATA disk you attach to your system whenever you do a backup. And dedicated backup servers are complex pieces of software; can you rely on something complex for the fundamental foundation of your data security system?

Archives

You can also use Ugarit as the primary storage for read-only files. You do this by creating an archive in the vault, and importing batches of files into it along with their metadata (arbitrary attributes, such as "author", "creation date" or "subject"). Just as you can keep snapshots of multiple systems in a Ugarit vault, you can also keep multiple separate archives, each identified by a named tag. However, as it's all within the same vault, the usual de-duplication rules apply. The same file may be in multiple archives, with different metadata in each, as the file contents and metadata are stored separately (and associated only within the context of each archive). And, of course, the same file may appear in snapshots and in archives; perhaps a file was originally downloaded into your home directory, where it was backed up into Ugarit snapshots, and then you imported it into your archive. The archive import would not have had to re-upload the file, as its contents would have already been found in the vault, so all that needs to be uploaded is the metadata.

Although we have mainly spoken of storing files in archives, the objects in archives can be files or directories full of files, as well. This is useful for storing MacOS-style files that are actually directories, or for archiving things like completed projects for clients, which can be entire directory structures.

System Requirements

Ugarit should run on any POSIX-compliant system that can run [http://www.call-with-current-continuation.org/|Chicken Scheme]. It stores and restores all the file attributes reported by the stat system call - POSIX mode permissions, UID, GID, mtime, and optionally atime and ctime (although the ctime cannot be restored due to POSIX restrictions). Ugarit will store files, directories, device and character special files, symlinks, and FIFOs. Support for extended filesystem attributes - ACLs, alternative streams, forks and other metadata - is possible, thanks to the extensible directory entry format; support for such metadata will be added as required.

Currently, only local filesystem-based vault storage backends are complete; these are suitable for backing up to a removable hard disk or a filesystem shared via NFS or other protocols. A backend can also be accessed via an SSH tunnel, so any remote server you can install Ugarit on can host the backend for a remote vault. The next backends to be implemented will be one for Amazon S3, and an SFTP backend for storing vaults anywhere you can ssh to. Other backends will be implemented on demand; a vault can, in principle, be stored on anything that can store files by name, report on whether a file already exists, and efficiently download a file by name. This rules out magnetic tapes, due to their requirement for sequential access.

Although we need to trust that a backend won't lose data (for now), we don't need to trust the backend not to snoop on us, as Ugarit optionally encrypts everything sent to the vault.

Terminology

A Ugarit backend is the software module that handles backend storage. An actual storage area - managed by a backend - is called a storage, and is used to implement a vault. Currently, every storage is a valid vault, but the planned future introduction of a distributed storage backend will enable multiple storages (which are not, themselves, valid vaults, as they only contain some subset of the information required) to be combined into an aggregate storage, which then holds the actual vault. Note that the contents of a storage are purely a set of blocks, plus a series of named tags containing references to them; the storage does not know the details of encryption and hashing, so cannot make any sense of its contents.

For example, if you use the recommended "splitlog" filesystem backend, your vault might be /mnt/bigdisk on the server prometheus. The backend (which is compiled along with the other filesystem backends in the backend-fs binary) must be installed on prometheus, and Ugarit clients all over the place may then use it via ssh to prometheus. However, even with the filesystem backends, the actual storage might not be on prometheus where the backend runs - /mnt/bigdisk might be an NFS mount, or a mount from a storage-area network. This ability to delegate via SSH is particularly useful with the "cache" backend, which reduces latency by storing a cache of what blocks exist in a backend, thereby making it quicker to identify already-stored files; a cluster of servers all sharing the same vault might all use SSH tunnels to access an instance of the "cache" backend on one of them (using some local disk to store the cache), which proxies the actual vault storage to a vault on the other end of a high-latency Internet link, again via an SSH tunnel.

A vault is where Ugarit stores backups (as chains of snapshots) and archives (as chains of archive imports). Backups and archives are identified by tags, which are the top-level named entry points into a vault. A vault is built on top of a storage, along with a choice of hash function, compression algorithm, and encryption that are used to map the logical world of snapshots and archive imports onto the physical world of blocks stored in the storage.

A snapshot is a copy of a filesystem tree in the vault, with a header block that gives some metadata about it. A backup consists of a number of snapshots of a given filesystem. An archive import is a set of filesystem trees, each along with metadata about it. Whereas a backup is organised around a series of timed snapshots, an archive is organised around the metadata; the filesystem trees in the archive are identified by their properties.

So what, exactly, is in a vault?

A Ugarit vault contains a load of blocks, each up to a maximum size (usually 1MiB, although some backends might impose smaller limits). Each block is identified by the hash of its contents; this is how Ugarit avoids ever uploading the same data twice - it checks whether the data to be uploaded already exists in the vault by looking up the hash. The contents of the blocks are compressed and then encrypted before upload.

Every file uploaded is, unless it's small enough to fit in a single block, chopped into blocks, and each block uploaded. This way, the entire contents of your filesystem can be uploaded - or, at least, only the parts of it that aren't already there! The blocks are then tied together to create a snapshot: blocks full of the hashes of the data blocks are uploaded, and directory blocks are uploaded listing the names and attributes of files in directories, along with the hashes of the blocks that contain the files' contents. Even the blocks that contain lists of hashes of other blocks are subject to checking for pre-existence in the vault; if only a few MiB of your hundred-GiB filesystem has changed, then even the index blocks and directory blocks are re-used from previous snapshots.

Once uploaded, a block in the vault is never again changed. After all, if its contents changed, its hash would change, so it would no longer be the same block! However, every block has a reference count, tracking the number of index blocks that refer to it. This means that the vault knows which blocks are shared between multiple snapshots (or shared *within* a snapshot - if a filesystem has more than one copy of the same file, still only one copy is uploaded), so that if a given snapshot is deleted, the blocks that only that snapshot is using can be deleted to free up space, without corrupting other snapshots by deleting blocks they share. Keep in mind, however, that not all storage backends support this - there are certain advantages to being an append-only vault. For a start, you can't delete something by accident! The supplied fs and sqlite backends support deletion, while the splitlog backend does not yet. However, the actual snapshot deletion command in the user interface hasn't been implemented yet either, so it's a moot point for now...

Finally, the vault contains objects called tags. Unlike the blocks, the tags' contents can change, and they have meaningful names rather than being identified by hash. Tags identify the top-level blocks of snapshots within the system, from which (by following the chain of hashes down through the index blocks) the entire contents of a snapshot may be found. Unless you happen to have recorded the hash of a snapshot somewhere, the tags are where you find snapshots when you want to do a restore.

Whenever a snapshot is taken, as soon as Ugarit has uploaded all the files, directories, and index blocks required, it looks up the tag you have identified as the target of the snapshot. If the tag already exists, then the snapshot it currently points to is recorded in the new snapshot as the "previous snapshot"; then the snapshot header, containing the previous snapshot hash along with the date and time and any comments you provide for the snapshot, is uploaded (as another block, identified by its hash). The tag is then updated to point to the new snapshot. This way, each tag actually identifies a chronological chain of snapshots.
Normally, you would use a tag to identify a filesystem being backed up; you'd keep snapshotting the filesystem to the same tag, resulting in all the snapshots of that filesystem hanging from the tag. But if you want to remember any particular snapshot (perhaps the snapshot you take before a big upgrade or other risky operation), you can duplicate the tag, in effect 'forking' the chain of snapshots much like a branch in a version control system.

Archive imports cause the creation of one or more archive metadata blocks, each of which lists the hashes of files or filesystem trees in the archive, along with their metadata. Each import then has a single archive import block pointing to the sequence of metadata blocks, and pointing to the previous archive import block in that archive. The same filesystem tree can be imported more than once to the same archive, and the "latest" metadata always wins. Generally, you should create lots of small archives for different categories of things - such as one for music, one for photos, and so on. You might well create separate archives for the music collections of different people in your household, unless they overlap, and another for Christmas music so it doesn't crop up in random shuffle play! It's easy to merge archives if you over-compartmentalise them, but harder to split an archive if you find it too cluttered with unrelated things.

I've spoken of archive imports and backup snapshots each having a "previous" reference to the last import or snapshot in the chain, but it's actually more complex than that: they have an arbitrary list of zero or more previous objects. As such, it's possible for several imports or snapshots to have the same "previous", known as a "fork", and it's possible to have an import or snapshot that merges multiple previous ones.

Forking is handy if you want to basically duplicate an archive, creating two new archives with the same contents to begin with, but each then capable of diverging thereafter. You might do this to keep the state of an archive before doing a big import, so you can go back to the original state if you regret the import, for instance. Forking a backup tag is a more unusual operation, but also useful. Perhaps you have a server running many stateful services, and the hardware becomes overloaded, so you clone the basic setup onto another server, and run half of the services on the original and half on the new one; if you fork the backup tag of the original server to create a backup tag for the new server, then both servers' snapshot histories will share the original state.

Merging is most useful for archives; you might merge several archives into one, as mentioned. And, of course, you can merge backup tags as well. If your earlier splitting of one server into two doesn't work out (perhaps your workload reduces, or you can now afford a single, more powerful, server to handle everything in one place), you might rsync the service state from the two servers back onto the new server, so it's all merged in the new server's filesystem. To preserve this in the snapshot history, you can merge the two backup tags of the two servers to create a backup tag for the single new server, which will accurately reflect the history of the filesystem.
Also, tags might fork by accident. I plan to introduce a distributed storage backend, which will replicate blocks and tags across multiple storages to create a single virtual storage to build a vault on top of. In the event of the network of actual storages suffering a failure, it may be that snapshots and imports are only applied to some of the storages - and then subsequent snapshots and imports only get applied to some other subset of the storages. When the network is repaired and all the storages are again visible, they will have divergent, inconsistent states for their tags, and the distributed storage system will resolve the situation by keeping the majority state as the state of the tag on all the backends, but preserving any other states by creating new tags, with the original name plus a suffix. These can then be merged to "heal" the conflict.

Using Ugarit

Installation

Install [http://www.call-with-current-continuation.org/|Chicken Scheme] using their [http://wiki.call-cc.org/man/4/Getting%20started|installation instructions]. Ugarit can then be installed by typing (as root):
chicken-install ugarit
See the [http://wiki.call-cc.org/manual/Extensions#chicken-install-reference|chicken-install manual] for details if you have any trouble, or wish to install into your home directory.

Setting up a vault

Firstly, you need to know the vault identifier for the place you'll be storing your vaults. This depends on your backend. The vault identifier is actually the command line used to invoke the backend for a particular vault; communication with the vault is via standard input and output, which is how it's easy to tunnel via ssh.

Local filesystem backends

These backends use the local filesystem to store the vaults. Of course, the "local filesystem" on a given server might be an NFS mount or mounted from a storage-area network.

Logfile backend

The logfile backend works much like the original Venti system. It's append-only - you won't be able to delete old snapshots from a logfile vault, even when I implement deletion. It stores the vault in two sets of files: one is a log of data blocks, split at a specified maximum size, and the other is the metadata - an sqlite database used to track the location of blocks in the log files, the contents of tags, and a count of the logs so a filename can be chosen for a new one.

To set up a new logfile vault, just choose where to put the two parts. Ideally, put the metadata file on a different physical disk from the log directory, to reduce seeking. If you only have one disk, you can put the metadata file in the log directory ("metadata" is a good name). You can then refer to it using the following vault identifier: "backend-fs splitlog ...log directory... ...metadata file..."
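For example (the paths here are hypothetical), a splitlog vault kept on one big disk might be identified as: "backend-fs splitlog /mnt/bigdisk/ugarit-log /mnt/bigdisk/ugarit-log/metadata"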

SQLite backend

The sqlite backend works a bit like a [http://www.fossil-scm.org/|Fossil] repository; the storage is implemented as a single file, which is actually an SQLite database containing blocks as blobs, along with tags and configuration data in their own tables. It supports unlinking objects, and the use of a single file to store everything is convenient; but storing everything in a single randomly-accessed file is slightly riskier than the simple structure of an append-only log file, as it is less tolerant of corruption, which can easily render the entire storage unusable. Also, that one file can get very large. SQLite has internal limits on the size of a database, but they're quite large - you'll probably hit a size limit at around 140 terabytes. To set up an SQLite storage, just choose a place to put the file. I usually use an extension of .vault; note that SQLite will create temporary files alongside it with extra extensions, too. Then refer to it with the following vault identifier: "backend-sqlite ...path to vault file..."
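For example, with a hypothetical path: "backend-sqlite /mnt/bigdisk/ugarit.vault"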

Filesystem backend

The filesystem backend creates vaults by storing each block or tag in its own file, in a directory. To keep the objects-per-directory count down, it'll split the files into subdirectories. Because of this, it uses a stupendous number of inodes (more than the filesystem being backed up). Only use it if you don't mind that; splitlog is much more efficient. To set up a new filesystem-backend vault, just create an empty directory that Ugarit will have write access to when it runs. It will probably run as root in order to be able to access the contents of files that aren't world-readable (although that's up to you), so unless you access your storage via ssh or sudo to use another user to run the backend under, be careful of NFS mounts that have maproot=nobody set! You can then refer to it using the following vault identifier: "backend-fs fs ...path to directory..."

Proxying backends

These backends wrap another vault identifier which the actual storage task is delegated to, but add some value along the way.

SSH tunnelling

It's easy to access a vault stored on a remote server. The caveat is that the backend then needs to be installed on the remote server! Since vaults are accessed by running the supplied command and talking to it via stdin and stdout, the vault identifier need only be: "ssh ...hostname... '...remote vault identifier...'"
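For example (hostname and paths hypothetical), a splitlog vault on the server prometheus could be reached with: "ssh prometheus 'backend-fs splitlog /mnt/bigdisk/ugarit-log /mnt/bigdisk/ugarit-log/metadata'"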

Cache backend

The cache backend is used to cache a list of which blocks exist in the proxied backend, so that it can answer queries as to the existence of a block rapidly, even when the proxied backend is on the end of a high-latency link (eg, the Internet). This should speed up snapshots, as existing files are identified by asking the backend if the vault already has them. The cache backend works by storing the cache in a local sqlite file. Given a place for it to store that file, usage is simple: "backend-cache ...path to cachefile... '...proxied vault identifier...'" The cache file will be automatically created if it doesn't already exist, so make sure there's write access to the containing directory.

- WARNING - WARNING - WARNING - WARNING - WARNING - WARNING -

If you use a cache on a vault shared between servers, make sure that you either:

 * Never delete things from the vault, or
 * Make sure all access to the vault is via the same cache

If a block is deleted from a vault, and a cache on that vault is not aware of the deletion (as it did not go "through" the caching proxy), then the cache will record that the block exists in the vault when it does not. This means that if a snapshot is made through the cache that would use that block, it will be assumed that the block already exists in the vault when it does not. Therefore, the block will not be uploaded, and a dangling reference will result!

Some setups which *are* safe:

 * A single server using a vault via a cache, not sharing it with anyone else.
 * A pool of servers using a vault via the same cache.
 * A pool of servers using a vault via one or more caches, and maybe some not via the cache, where nothing is ever deleted from the vault.
 * A pool of servers using a vault via one cache, and maybe some not via the cache, where deletions are only performed on servers using the cache, so the cache is always aware.
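For example (paths hypothetical), a cache stored under /var/ugarit proxying a vault on a slow remote mount might be: "backend-cache /var/ugarit/block-cache 'backend-sqlite /mnt/remote-vault/ugarit.vault'" - an ssh-tunnelled identifier can be substituted for the proxied part in just the same way.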

Writing a ugarit.conf

ugarit.conf should look something like this:

(storage "...vault identifier...")
(hash tiger "...salt...")
[double-check]
[(compression [deflate|lzma])]
[(encryption aes ...key...)]
[(file-cache "...path to file cache...")]
[(rule ...)]

The hash line chooses a hash algorithm. Currently Tiger-192 (tiger), SHA-256 (sha256), SHA-384 (sha384) and SHA-512 (sha512) are supported; if you omit the line then Tiger will still be used, but it will be a simple hash of the block with the block type appended, which reveals to attackers what blocks you have (as the hash is of the unencrypted block, and the hash is not encrypted). This is useful for development and testing, or for use with trusted vaults, but not advised for vaults that attackers may snoop at. Providing a salt string produces a hash function that hashes the block, the type of block, and the salt string, producing hashes that attackers who can snoop the vault cannot use to find known blocks (see the "Security model" section below for more details). I would recommend that you create a salt string from a secure entropy source, such as:
dd if=/dev/random bs=1 count=64 | base64 -w 0
Whichever hash function you use, you will need to install the required Chicken egg with one of the following commands:
chicken-install -s tiger-hash  # for tiger
chicken-install -s sha2        # for the SHA hashes
double-check, if present, causes Ugarit to perform extra internal consistency checks during backups, which will detect bugs but may slow things down.

lzma is the recommended compression option for low-bandwidth backends or when space is tight, but it's very slow to compress; deflate or no compression at all are better for fast local vaults. To have no compression at all, just remove the (compression ...) line entirely. Likewise, to use compression, you need to install a Chicken egg:
chicken-install -s z3       # for deflate
chicken-install -s lzma     # for lzma
WARNING: The lzma egg is currently rather difficult to install, and needs rewriting to fix this problem.

The (encryption ...) line may likewise be omitted to have no encryption; the only currently supported algorithm is aes (in CBC mode), with a key given in hex, as a passphrase (hashed to get a key), or as a passphrase read from the terminal on every run. The key may be 16, 24, or 32 bytes, for 128-bit, 192-bit or 256-bit AES. To specify a hex key, just supply it as a string, like so:
(encryption aes "00112233445566778899AABBCCDDEEFF")
...for 128-bit AES,
(encryption aes "00112233445566778899AABBCCDDEEFF0011223344556677")
...for 192-bit AES, or
(encryption aes "00112233445566778899AABBCCDDEEFF00112233445566778899AABBCCDDEEFF")
...for 256-bit AES. Alternatively, you can provide a passphrase, and specify how large a key you want it turned into, like so:
(encryption aes ([16|24|32] "We three kings of Orient are, one in a taxi one in a car, one on a scooter honking his hooter and smoking a fat cigar. Oh, star of wonder, star of light; star with royal dynamite"))
I would recommend that you generate a long passphrase from a secure entropy source, such as:
dd if=/dev/random bs=1 count=64 | base64 -w 0
Finally, the extra-paranoid can request that Ugarit prompt for a passphrase on every run and hash it into a key of the specified length, like so:
(encryption aes ([16|24|32] prompt))
(Note the lack of quotes around prompt, distinguishing it from a passphrase.)

Please read the "Security model" section below for details on the implications of different encryption setups. Again, as encryption is an optional feature, to use it you must install the appropriate Chicken egg:
chicken-install -s aes
A file cache, if enabled, significantly speeds up subsequent snapshots of a filesystem tree. The file cache is a file (which Ugarit will create if it doesn't already exist) mapping filenames to (mtime,size,hash) tuples; as it scans the filesystem, if it finds a file in the cache and the mtime and size have not changed, it will assume it is already stored under the specified hash. This saves it from having to read the entire file to hash it and then check if the hash is present in the vault. In other words, if only a few files have changed since the last snapshot, then snapshotting a directory tree becomes an O(N) operation, where N is the number of files, rather than an O(M) operation, where M is the total size of files involved. Putting all of that together, a complete ugarit.conf might look like this:
(storage "ssh ugarit@spiderman 'backend-fs splitlog /mnt/ugarit-data /mnt/ugarit-metadata/metadata'")
(hash tiger "i3HO7JeLCSa6Wa55uqTRqp4jppUYbXoxme7YpcHPnuoA+11ez9iOIA6B6eBIhZ0MbdLvvFZZWnRgJAzY8K2JBQ")
(encryption aes (32 "FN9m34J4bbD3vhPqh6+4BjjXDSPYpuyskJX73T1t60PP0rPdC3AxlrjVn4YDyaFSbx5WRAn4JBr7SBn2PLyxJw"))
(compression lzma)
(file-cache "/var/ugarit/cache")
Be careful to put a set of parentheses around each configuration entry. White space isn't significant, so feel free to indent things and wrap them over lines if you want. Keep copies of this file safe - you'll need it to do extractions! Print a copy out and lock it in your fire safe! Ok, currently, you might be able to recreate it if you remember where you put the storage, but encryption keys and hash salts are harder to remember...

Your first backup

Think of a tag to identify the filesystem you're backing up. If it's /home on the server gandalf, you might call it gandalf-home. If it's the entire filesystem of the server bilbo, you might just call it bilbo. Then from your shell, run (as root):
# ugarit snapshot ...ugarit.conf... [-c] [-a] ...tag... ...path to root of filesystem...
For example, if we have a ugarit.conf in the current directory:
# ugarit snapshot ugarit.conf -c localhost-etc /etc
Specify the -c flag if you want to store ctimes in the vault; since it's impossible to restore ctimes when extracting from a vault, doing this is useful only for informational purposes, so it's not done by default. Similarly, atimes aren't stored in the vault unless you specify -a, because otherwise, there will be a lot of directory blocks uploaded on every snapshot, as the atime of every file will have been changed by the previous snapshot - so with -a specified, on every snapshot, every directory in your filesystem will be uploaded! Ugarit will happily restore atimes if they are found in a vault; their storage is made optional simply because uploading them is costly and rarely useful.

Exploring the vault

Now you have a backup, you can explore the contents of the vault. This need not be done as root, as long as you can read ugarit.conf; however, if you want to extract files, run it as root so the uids and gids can be set.
$ ugarit explore ugarit.conf
This will put you into an interactive shell exploring a virtual filesystem. The root directory contains an entry for every tag; if you type ls you should see your tag listed, and within that tag, you'll find a list of snapshots, in descending date order, with a special entry current for the most recent snapshot. Within a snapshot, you'll find the root directory of your snapshot under contents, and the details of the snapshot itself in properties.sexpr, and will be able to cd into subdirectories, and so on:
> ls
localhost-etc/ 
> cd localhost-etc
/localhost-etc> ls
current/ 
2015-06-12 22:49:34/ 
2015-06-12 22:49:25/ 
/localhost-etc> cd current
/localhost-etc/current> ls
log.sexpr 
properties.sexpr 
contents/ 
/localhost-etc/current> cat properties.sexpr
((previous . "a140e6dbe0a7a38f8b8c381323997c23e51a39e2593afb61")
 (mtime . 1434102574.0)
 (contents . "34eccf1f5141187e4209cfa354fdea749a0c3c1c4682ec86")
 (stats (blocks-stored . 12)
  (bytes-stored . 16889)
  (blocks-skipped . 50)
  (bytes-skipped . 6567341)
  (file-cache-hits . 0)
  (file-cache-bytes . 0))
 (log . "b2a920f962c12848352f33cf32941e5313bcc5f209219c1a")
 (hostname . "ahe")
 (source-path . "/etc")
 (notes)
 (files . 112)
 (size . 6563588))
/localhost-etc/current> cd contents
/localhost-etc/current/contents> ls
zoneinfo 
vconsole.conf 
udev/ 
tmpfiles.d/ 
systemd/ 
sysctl.d/ 
sudoers.tmp~ 
sudoers 
subuid 
subgid 
static 
ssl/ 
ssh/ 
shells 
shadow- 
shadow 
services 
samba/ 
rpc 
resolvconf.conf 
resolv.conf 
-- Press q then enter to stop or enter for more...
q
/localhost-etc/current/contents> ls -ll resolv.conf
-rw-r--r--     0     0 [2015-05-23 23:22:41] 78B/-: resolv.conf
key: #f
contents: "e33ea1394cd2a67fe6caab9af99f66a4a1cc50e8929d3550"
size: 78
ctime: 1432419761.0
As well as exploring around, you can also extract files or directories (or entire snapshots) by using the get command. Ugarit will do its best to restore the metadata of files, subject to the rights of the user you run it as. Type help to get help in the interactive shell. The interactive shell supports command-line editing, history and tab completion for your convenience.
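For example, extracting the resolv.conf file examined above might look something like this (the exact argument syntax of get is an assumption here; type help in the shell to check it):

/localhost-etc/current/contents> get resolv.conf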

Extracting things directly

As well as using the interactive explore mode, it is also possible to directly extract something from the vault, given a path within the explore-mode virtual filesystem. For example, a README.txt file stored under a tag called Test could be extracted with the following command:
$ ugarit extract ugarit.conf /Test/current/contents/README.txt

Forking tags

As mentioned above, you can fork a tag, creating two tags that refer to the same snapshot and its history but that can then have their own subsequent history of snapshots applied to each independently, with the following command:
$ ugarit fork ...ugarit.conf... ...existing tag... ...new tag...

Merging tags

And you can also merge two or more tags into one. It's possible to merge a bunch of tags to make an entirely new tag, or you can merge a tag into an existing tag, by having the "output" tag also be one of the "input" tags. The command to do this is:
$ ugarit merge ...ugarit.conf... ...output tag... ...input tags...
For instance, to import your classical music collection into your main musical collection, you might do:
$ ugarit merge ugarit.conf my-music my-music classical-music
Or if you want to create a new all-music archive from the archives bobs-music and petes-music, you might do:
$ ugarit merge ugarit.conf all-music bobs-music petes-music

Archive operations

Importing

To import some files into an archive, you must create a manifest file listing them, and their metadata. The manifest can also list metadata for the import as a whole, perhaps naming the source of the files, or the reason for importing them. The metadata for a file (or an import) is a series of named properties. The value of a property can be any Scheme value, written in Scheme syntax (with strings double-quoted unless they are to be interpreted as symbols), but strings and numbers are the most useful types. You can use whatever names you like for properties in metadata, but there are some that the system applies automatically, and an informal standard of sorts, which is documented in [docs/archive-schema.wiki].

You can produce a manifest file by hand, or use the Ugarit Manifest Maker to produce one for you. You do this by installing it like so:
$ chicken-install ugarit-manifest-maker
And then running it, giving it any number of file and directory names on the command line. When given directories, it will recursively scan them to find all the files contained therein and put them in the manifest; it will not put directories themselves in the manifest, although it is perfectly legal for you to do so when writing a manifest by hand. This is because the manifest maker can't do much useful analysis on a directory to suggest default metadata for it (so there isn't much point), and it's far more useful for it to make it easy for you to import a large number of individual files by referencing the directory containing them. The manifest is sent to standard output, so you need to redirect it to a file, like so:
$ ugarit-manifest-maker ~/music > music.manifest
You can specify command-line options, as well. -e PATTERN or --exclude=PATTERN introduces a glob pattern for files to exclude from the manifest, and -D KEY=VALUE or --define=KEY=VALUE provides a property to be added to every file in the manifest (as opposed to an import property, which is part of the metadata of the overall import). Note that VALUE must be double-quoted if it's a string, as per Scheme value syntax. One might use this like so:
$ ugarit-manifest-maker -e '*.txt' -D rating=5 ~/favourite-music > music.manifest
The manifest maker simplifies the writing of manifests for files, by listing the files in manifest format along with useful metadata extracted from the filename and the file itself. For supported file types (currently, MP3 and OGG music files), it will even look inside the file to extract metadata. The manifest file it generates will contain lots of comments mentioning things it couldn't automatically analyse (such as unknown OGG/ID3 tags, or unknown types of files); and for metadata properties it thinks might be relevant but can't automatically provide, it suggests them with an empty property declaration, commented out. The idea is that, after generating a manifest, you read it by hand in a text editor to attempt to improve it.

The format of a manifest file

Manifest files have a relatively simple format. They are based on Scheme s-expressions, so can contain comments. From any semicolon (not in a string or otherwise quoted) to the end of the line is a comment, and #; in front of something comments out that something. Import metadata properties are specified like so:
(KEY = VALUE)
...where, as usual, VALUE must be double-quoted if it's a string. Files to import, with their metadata, are specified like so:
(object "PATH OF FILE TO IMPORT"
  (KEY = VALUE)
  (KEY = VALUE)...
)
The closing parenthesis need not be on a line of its own; it's conventionally placed after the closing parenthesis of the final property. Ugarit, when importing the files in the manifest, will add the following properties if they are not already specified:
import-path
The path the file was imported from
dc:format
A guess at the file's MIME type, based on the extension
mtime
The file's modification time (as the number of seconds since the UNIX epoch)
ctime
The file's change time (as the number of seconds since the UNIX epoch)
filename
The name of the file, stripped of any directory components, and including the extension.
The following properties are placed in the import metadata, automatically:
hostname
The hostname the import was performed on.
manifest-path
The path to the manifest file used for the import.
mtime
The time (in seconds since the UNIX epoch) at which the import was committed.
stats
A Scheme alist of statistics about the import (number of files/blocks uploaded, etc).
So, to wrap that all up, here's a sample import manifest file:

(notes = "A bunch of old CDs I've finally ripped")

(object "/home/alaric/newrip/track01.mp3"
  (filename = "track01.mp3")
  (dc:format = "audio/mpeg")
  (dc:publisher = "Go! Beat Records")
  (dc:created = "1994")
  (dc:contributor = "Portishead")
  (dc:subject = "Trip-Hop")
  (superset:size = 1)
  (superset:index = 1)
  (set:title = "Dummy")
  (set:size = 11)
  (set:index = 1)
  (dc:creator = "Portishead")
  (dc:title = "Wandering Star")
  (mtime = 1428962299.0)
  (ctime = 1428962299.0)
  (file-size = 4703055))

;;... and so on, for ten more MP3s on this CD, then several other CDs...

Actually importing a manifest

Well, when you finally have a manifest file, importing it is easy:
$ ugarit import ...ugarit.conf... ...archive tag... ...manifest file...
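For instance, assuming the argument order shown above, importing the music manifest generated earlier into an archive under a hypothetical tag called music might look like:

$ ugarit import ugarit.conf music music.manifest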

How do I change the metadata of an already-imported file?

That's easy; the "current" metadata of a file is the metadata of its most recent. Just import the file again, in a new manifest, with new metadata, and it will overwrite the old. However, the old metadata is still preserved in the archive's history; tags forked from the archive tag before the second import will still see the original state of the archive, by design.

Exploring

Archives are visible in the explore interface. For instance, an import of some music I did looks like this:
> ls
localhost-etc/ <tag>
archive-tag/ <tag>
> cd archive-tag
/archive-tag> ls
history/ <archive-history>
/archive-tag> cd history
/archive-tag/history> ls
2015-06-12 22:53:13/ <import>
/archive-tag/history> cd 2015-06-12 22:53:13
/archive-tag/history/2015-06-12 22:53:13> ls
log.sexpr <file>
properties.sexpr <inline>
manifest/ <import-manifest>
/archive-tag/history/2015-06-12 22:53:13> cat properties.sexpr
((stats (blocks-stored . 2046)
        (bytes-stored . 1815317503)
        (blocks-skipped . 9)
        (bytes-skipped . 8388608)
        (file-cache-hits . 0)
        (file-cache-bytes . 0))
 (log . "b2a920f962c12848352f33cf32941e5313bcc5f209219c1a")
 (mtime . 1434135993.0)
 (contents . "fcdd5b996914fdcac1e8a6cfbc67663e08f6eaf0cc952e21")
 (hostname . "ahe")
 (notes . "A bunch of music, imported as a demo")
 (manifest-path . "/home/alaric/tmp/test.manifest"))
/archive-tag/history/2015-06-12 22:53:13> cd manifest
/archive-tag/history/2015-06-12 22:53:13/manifest> ls
1d4269099189234eefeb80b95370eaf280730cf4d591004d:03 The Lemon Song.mp3 <file>
7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382:04 Dazed and Confused.mp3 <file>
64092fa12c2800dda474b41e5ebe8c948f39a59ee91c120b:09 How Many More Times.mp3 <file>
1d79148d1e1e8947c50b44cf2d5690588787af328e82eeef:2-07 Going to California.mp3 <file>
e3685148d0d12213074a9fdb94a00e05282aeabe77fa60d5:1-01 You Shook Me.mp3 <file>
d73904f371af8d7ca2af1076881230f2dc1c2cf82416880a:03 Strangers.mp3 <file>
9c5a0efb7d397180a1e8d42356d8f04c6c26a83d3b05d34a:09 Uptight.mp3 <file>
01a069aec2e731e18fcdd4ecb0e424f346a2f0e16910f5e9:07 Numb.mp3 <file>
7ea1ab7fbd525c40e21d6dd25130e8c70289ad56c09375b0:08 She.mp3 <file>
009dacd8f3185b7caeb47050002e584ab86d08cf9e9aceec:1-03 Communication Breakdown.mp3 <file>
26d264d629e22709f664ed891741f690900d45cd4fd44326:1-03 Dazed and Confused.mp3 <file>
d879761195faf08e4e95a5a2398ea6eefb79920710bfeab6:1-10 Band Introduction _ How Many More Times.mp3 <file>
83244601db42677d110fc8522c6a3cbbc1f22966a779f876:06 All My Love.mp3 <file>
5eebee9a2ad79d04e4f69e9e2a92c4e0a8d5f21e670f89da:07 Tangerine.mp3 <file>
dd6f1203b5973ecd00d2c0cee18087030490230727591746:2-08 That's the Way.mp3 <file>
c0acea15aa27a6dd1bcaff1c13d4f3d741a40a46abeca3fc:04 The Crunge.mp3 <file>
ea7727ad07c6c82e5c9c7218ee1b059cd78264c131c1438d:1-02 I Can't Quit You Baby.mp3 <file>
10fda5f46b8f505ca965bcaf12252eedf5ab44514236f892:14 F.O.D..mp3 <file>
a99ca9af5a83bde1c676c388dc273051defa88756df26e95:1-03 Good Times Bad Times.mp3 <file>
b5d7cfe9808c7fc0dedbd656d44e4c56159cbd3c2ed963bb:1-15 Stairway to Heaven.mp3 <file>
79c87e3c49ffdac175c95aae071f63d3a9efdf2ddb84998c:08.Batmilk.ogg <file>
-- Press q then enter to stop or enter for more...
q
/archive-tag/history/2015-06-12 22:53:13/manifest> ls -ll 7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382:04 Dazed and Confused.mp3
-r--------     -     - [2015-04-13 21:46:39] -/-: 7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382:04 Dazed and Confused.mp3
key: #f
contents: "7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382"
import-path: "/home/alaric/archive/sorted-music/Led Zeppelin/Led Zeppelin/04 Dazed and Confused.mp3"
filename: "04 Dazed and Confused.mp3"
dc:format: "audio/mpeg"
dc:publisher: "Atlantic"
dc:subject: "Classic Rock"
dc:title: "Dazed and Confused"
dc:creator: "Led Zeppelin"
dc:created: "1982"
dc:contributor: "Led Zeppelin"
set:title: "Led Zeppelin"
set:index: 4
set:size: 9
superset:index: 1
superset:size: 1
ctime: 1428957999.0
file-size: 15448903

Searching

However, the explore interface to an archive is far from pleasant. You need to go to the correct import, find your file by name, and then identify it with a big long name composed of its hash and the original filename in order to see its properties and extract it. I hope to add property-based searching to explore mode in future (which is why you need to go into a history directory within the archive directory; other ways of exploring the archive will appear alongside it). This will be particularly useful when the explore-mode virtual filesystem is mounted over 9P! However, even that interface, being constrained to look like a filesystem, will be limited. The ugarit command-line tool provides a very powerful search interface that exposes the full power of the archive metadata.

Metadata filters

Files (and directories) in an archive can be searched for using "metadata filters", which are descriptions of what you're looking for that the computer can understand. They are represented as Scheme s-expressions, and can be made up of the following components:
#t
This filter matches everything. It's not very useful.
#f
This filter matches nothing. It's not very useful.
(and FILTER FILTER...)
This filter matches files for which all of the inner filters match.
(or FILTER FILTER...)
This filter matches files for which any of the inner filters match.
(not FILTER)
This filter matches files which do not match the inner filter.
(= ($ PROP) VALUE)
This filter matches files which have the given PROPerty equal to that VALUE in their metadata.
(= key HASH)
This filter matches the file with the given hash.
(= ($import PROP) VALUE)
This filter matches files which have the given PROPerty equal to that VALUE in the metadata of the import that last imported them.
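These can be nested to build up more precise queries. For instance, this filter (using property names and values from the examples below) matches files credited to Led Zeppelin that are not tagged with the Classic Rock subject:

(and (or (= ($ dc:creator) "Led Zeppelin")
         (= ($ dc:contributor) "Led Zeppelin"))
     (not (= ($ dc:subject) "Classic Rock")))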

Searching an archive

For a start, you can search for files matching a given metadata filter in a given archive. This is done with:
$ ugarit search ...ugarit.conf... ...archive tag... ...filter...
For instance, let's look for music by Led Zeppelin:
$ ugarit search ugarit.conf music '(or
   (= ($ dc:creator) "Led Zeppelin")
   (= ($ dc:contributor) "Led Zeppelin"))'
The result looks like the explore-mode view of an archive manifest, listing each file's hash followed by its title and extension:

7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382:04 Dazed and Confused.mp3
834a1619a59835e0c27b22801e3c829b40be583dadd19770:2-08 No Quarter.mp3
9e8bc4954838bd9c671f275eb48595089257185750d63894:1-12 I Can't Quit You Baby.mp3
6742b3bebcdd9cae5ec5403c585935403fa74d16ed076cf2:02 Friends (1).mp3
07d161f4bd684e283f7f2cf26e0b732157a8e95ef66939c3:05 Carouselambra.mp3
[...]

What of all our lovely metadata? You can view that by adding the word "verbose" to the end of the command line; this final argument selects an alternate output format:
$ ugarit search ugarit.conf music '(or
   (= ($ dc:creator) "Led Zeppelin")
   (= ($ dc:contributor) "Led Zeppelin"))' verbose
Now the output looks like:

object a444ff6ef807b080b536155f58d246d633cab4a0eabef5bf
(ctime = 1428958660.0)
(dc:contributor = "Led Zeppelin")
(dc:created = "2008")
(dc:creator = "Led Zeppelin")
[... all the usual file properties omitted ...]
import a43f7a7268ee8b18381c20d7573add5dbf8781f81377279c
(stats = ((blocks-stored . 2046) (bytes-stored . 1815317503) (blocks-skipped . 9) (bytes-skipped . 8388608) (file-cache-hits . 0) (file-cache-bytes . 0)))
(log = "b2a920f962c12848352f33cf32941e5313bcc5f209219c1a")
[... all the usual import properties omitted ...]
object b4cadf48b2c07ccf0303fc4064b292cb222980b0d4223641
(ctime = 1428958673.0)
(dc:contributor = "Led Zeppelin")
(dc:created = "2008")
(dc:creator = "Led Zeppelin")
(dc:creator = "Jimmy Page/John Paul Jones/Robert Plant")
[...and so on...]

As you can see, it lists the hash of each file, its metadata, the hash of the import that last imported it, and the metadata of that import. That's quite verbose, so you'd probably be wanting to take that as input to another program to do something nicer with it. But it's laid out for human reading, not for machine parsing. Thankfully, we have other formats for that: alist and alist-with-imports. Try this:
$ ugarit search ugarit.conf music '(or
   (= ($ dc:creator) "Led Zeppelin")
   (= ($ dc:contributor) "Led Zeppelin"))' alist
This outputs one Scheme s-expression list per match, the first element of which is the hash as a string, the rest of which is an alist of properties:

("7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382"
 (ctime . 1428957999.0)
 (dc:contributor . "Led Zeppelin")
 (dc:created . "1982")
 (dc:creator . "Led Zeppelin")
 [... elided file properties ...]
 (superset:index . 1)
 (superset:size . 1))
("77c960d09eb21ed72e434ddcde0bd3781a4f3d6ee7a6eb66"
 (ctime . 1428958981.0)
 (dc:contributor . "Led Zeppelin")
 [...]
$ ugarit search ugarit.conf music '(or
   (= ($ dc:creator) "Led Zeppelin")
   (= ($ dc:contributor) "Led Zeppelin"))' alist-with-imports
This outputs one s-expression per match, with four elements. The first is the key string, the second is an alist of file properties, the third is the import's hash, and the last is an alist containing the import's properties. It looks like:

("64fa08a0080aee6ef501c408fd44dfcc634cfcafd8006fc4"
 ((ctime . 1428958683.0)
  (dc:contributor . "Led Zeppelin")
  (dc:created . "2008")
  (dc:creator . "Led Zeppelin")
  [... elided file properties ...]
  (superset:index . 1)
  (superset:size . 1))
 "a43f7a7268ee8b18381c20d7573add5dbf8781f81377279c"
 ((stats (blocks-stored . 2046)
         (bytes-stored . 1815317503)
  [... elided manifest properties ...]
  (manifest-path . "test.manifest")))
("4cd56f916a63399b252976e842dcae0b87f058b5a60c93a4"
 ((ctime . 1428958437.0)
  (dc:contributor . "Led Zeppelin")
  [...]

And finally, you might just want to get the hashes of matching files (which are particularly useful for extraction operations, which we'll come to next). To do this, specify a format of "keys", which outputs one line per match, containing just the hash:
$ ugarit search ugarit.conf music '(or
   (= ($ dc:creator) "Led Zeppelin")
   (= ($ dc:contributor) "Led Zeppelin"))' keys
ce6f6484337de772de9313038cb25d1b16e28028136cc291
6af5c664cbfa1acb22a377e97aee35d94c0fc003d239dd0c
92e91e79b384478b5aab31bf1b2ff9e25e7e2c4b48575185
6ddb9a41d4968468a904f05ecf7e0e73d2c7c7ad76bc394b
a074dddcef67cd93d92c6ffce845894aa56594674023f6e1
4f65f735bbb00a6fda4bc887b370b3160f55e5e07ec37ffa
97cc8b8ba70c39387fc08ef62311b751aea4340d636eb421
72358dbe3eb60da42eadcf6de325b2a6686f4e17ea41fa60
[...]

However, to write filter expressions, you need to know what properties you have available to search on. You might remember, or go for standard properties, or look at existing files in verbose mode to find some; but you can also just ask Ugarit what properties it has in an archive, like so:
$ ugarit search-props ...ugarit.conf... ...archive tag...
You can even ask what properties are available for files matching an existing filter:
$ ugarit search-props ...ugarit.conf... ...archive tag... ...filter...
This is useful if you're interested in further narrowing down a filter, and so only care about properties that files already matching that filter have. For a bunch of music files imported with the Ugarit Manifest Maker, you can expect to see something like this:

ctime dc:contributor dc:created dc:creator dc:format dc:publisher dc:subject dc:title file-size filename import-path mtime set:index set:size set:title superset:index superset:size

Now you know what properties to search on; next you'll be wanting to know what values to look for. Again, Ugarit has a command to query the available values of any given property:
$ ugarit search-values ...ugarit.conf... ...archive tag... ...property...
And you can limit that just to files matching a given filter:
$ ugarit search-values ...ugarit.conf... ...archive tag... ...filter... ...property...
The resulting list of values is ordered by popularity, so the most widely-used values will be listed first. Let's see what genres of music were in my sample of music files I imported:
$ ugarit search-values test.conf archive-tag dc:subject
The result is:

Classic Rock
Alternative & Punk
Electronic
Trip-Hop

Ok, let's now use a filter to find out what artists (dc:creator) I have that made Trip-Hop music (what even IS that?):
$ ugarit search-values test.conf archive-tag \
    '(= ($ dc:subject) "Trip-Hop")' \
    dc:creator
The result is:

Portishead

Ah, OK, now I know what "Trip-Hop" is.

Extracting

All this searching is lovely, but what it gets us, in the end, is a bunch of file hashes. Perhaps we might want to actually play some music, or look at a photo, or something. To do that, we need to extract from the archive. We've already seen the contents of an archive in the explore mode virtual filesystem, so we could go into the archive history, find the import, go into the manifest, pick the file out there, and use get to extract it, but that would be yucky. Thankfully, we have a command-line interface to get things from archives, in one of two ways. Firstly, we can extract a file (or a directory tree) from an archive, out into the local filesystem:
$ ugarit archive-extract ...ugarit.conf... ...archive tag... ...hash... ...target...
The "target" is the name to give it in the local filesystem. We could pull out that Led Zeppelin song from our search results above, like so:
$ ugarit archive-extract test.conf archive-tag \
    ce6f6484337de772de9313038cb25d1b16e28028136cc291 foo.mp3
We now have a foo.mp3 file in the current directory. However, sometimes it would be nicer to have it streamed to standard output, which can be done like so:
$ ugarit archive-stream ...ugarit.conf... ...archive tag... ...hash...
This lets us write a command such as:
$ ugarit archive-stream test.conf archive-tag \
    ce6f6484337de772de9313038cb25d1b16e28028136cc291 | mpg123 -
...to play it in real time.

Storage administration

Each backend offers a number of administrative commands for administering the storage underlying vaults. These are accessible via the ugarit-storage-admin command line interface. To use it, run it with the following command:
$ ugarit-storage-admin '...vault identifier...'
The available commands differ between backends, but all backends support the info and help commands, which give basic information about the vault, and list all available commands, respectively. Some offer a stats command that examines the vault state to give interesting statistics, but which may be a time-consuming operation.
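For example, with a hypothetical sqlite storage:

$ ugarit-storage-admin 'backend-sqlite /mnt/bigdisk/ugarit.vault'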

Administering splitlog storages

The splitlog backend offers a wide selection of administrative commands. See the help command on a splitlog vault for details. The following commands are available:
help
List the available commands.
info
List some basic information about the storage.
stats
Examine the metadata to provide overall statistics about the archive. This may be a time-consuming operation on large storages.
set-block-size! BYTES
Sets the block size to the given number of bytes. This will affect new blocks written to the storage, and leave existing blocks untouched, even if they are larger than the new block size.
set-max-logfile-size! BYTES
Sets the size at which a log file is finished and a new one started (likewise, existing log files will be untouched; this will only affect new log files)
set-commit-interval! UPDATES
Sets the commit interval: the number of updates between automatic syncs of the storage state to disk. Lowering this harms performance when writing to the storage, but reduces the number of recent block writes that can be lost in a crash.
write-protect!
Disables updating of the storage.
write-unprotect!
Re-enables updating of the storage.
reindex!
Reindex the storage, rebuilding the block and tag state from the contents of the log. If the metadata file is damaged or lost, reindexing can rebuild it (although any configuration changes made via other admin commands will need manually repeating as they are not logged).
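As a sketch (with the same caveats as above: the storage string and the exact invocation are illustrative, not definitive), tuning and then write-protecting a splitlog storage might look like:
$ ugarit-storage-admin 'backend-fs splitlog /srv/vault/data /srv/vault/metadata' set-block-size! 1048576
$ ugarit-storage-admin 'backend-fs splitlog /srv/vault/data /srv/vault/metadata' set-max-logfile-size! 1073741824
$ ugarit-storage-admin 'backend-fs splitlog /srv/vault/data /srv/vault/metadata' write-protect!
The numbers are just examples: a 1MiB block size and a 1GiB log file limit.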

Administering sqlite storages

The sqlite backend has a similar administrative interface to the splitlog backend, except that it does not have log files, so lacks the set-max-logfile-size! and reindex! commands.

Administering cache storages

The cache backend provides a minimalistic interface:
help
List the available commands.
info
List some basic information about the storage.
stats
Report on how many entries are in the cache.
clear!
Clears the cache, dropping all the entries in it.

.ugarit files

By default, Ugarit will vault everything it finds in the filesystem tree you tell it to snapshot. However, this might not always be desired, so we provide the facility to override this with .ugarit files, or with global rules in your .conf file.

Note: The syntax of these files is provisional, as I want to experiment with usability, and the current syntax is ugly. So please don't be surprised if the format changes in incompatible ways in subsequent versions!

In quick summary, if you want to ignore all files or directories matching a glob in the current directory and below, put the following in a .ugarit file in that directory:
(* (glob "*~") exclude)
You can write quite complex expressions as well as just globs. The full set of rules is:

* (glob "pattern") matches files and directories whose names match the glob pattern
* (name "name") matches files and directories with exactly that name (useful for files called *...)
* (modified-within number seconds) matches files and directories modified within the given number of seconds
* (modified-within number minutes) matches files and directories modified within the given number of minutes
* (modified-within number hours) matches files and directories modified within the given number of hours
* (modified-within number days) matches files and directories modified within the given number of days
* (not rule) matches files and directories that do not match the given rule
* (and rule rule...) matches files and directories that match all the given rules
* (or rule rule...) matches files and directories that match any of the given rules

Also, you can override a previous exclusion with an explicit include in a lower-level directory:
(* (glob "*~") include)
You can bind rules to specific directories, rather than to "this directory and all beneath it", by specifying an absolute or relative path instead of the `*`:
("/etc" (name "passwd") exclude)
If you use a relative path, it's taken relative to the directory of the .ugarit file. You can also put some rules in your .conf file, although relative paths are illegal there, by adding lines of this form to the file:
(rule * (glob "*~") exclude)
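Putting these together, a .ugarit file might look something like the following sketch (the glob patterns and the "cache" directory name are made up purely for illustration):
(* (or (glob "*~") (glob "*.tmp")) exclude)
("cache" (not (modified-within 7 days)) exclude)
The first rule excludes editor backups and temporary files in this directory and below; the second, bound to the relative path "cache", excludes anything under that subdirectory that hasn't been modified in the last week.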

Questions and Answers

What happens if a snapshot is interrupted?

Nothing! Whatever blocks have been uploaded will remain in the vault, but the snapshot is only added to the tag once the entire filesystem has been snapshotted. So just start the snapshot again. Any files that have already been uploaded will not need to be uploaded again, so the second snapshot should proceed quickly to the point where it failed before, and continue from there. Unless the vault ends up with a partially-uploaded corrupted block due to being interrupted during upload, you'll be fine. The filesystem backend has been written to avoid this by writing the block to a file with the wrong name, then renaming it to the correct name once it's entirely uploaded.

Actually, there is *one* caveat: blocks that were uploaded but never made it into a finished snapshot will be marked as "referenced", yet there's no snapshot to delete to un-reference them, so they'll never be removed when you delete snapshots. (Not that snapshot deletion is implemented yet, mind.) If this becomes a problem for people, we could write a "garbage collect" tool that regenerates the reference counts in a vault, so that unused blocks (with a zero refcount) get unlinked.

Should I share a single large vault between all my filesystems?

I think so. Using a single large vault means that blocks shared between servers - eg, software installed from packages and that sort of thing - will only ever need to be uploaded once, saving storage space and upload bandwidth. However, do not share a vault between servers that do not mutually trust each other, as they can all update the same tags, so can meddle with each other's snapshots - and read each other's snapshots.

CAVEAT

It's not currently safe to have multiple concurrent snapshots to the same splitlog backend; this will soon be fixed, however.

Security model

I have designed and implemented Ugarit to be able to handle cases where the actual vault storage is not entirely trusted. However, security involves tradeoffs, and Ugarit is configurable in ways that affect its resistance to different kinds of attacks. Here I will list different kinds of attack and explain how Ugarit can deal with them, and how you need to configure it to gain that protection.

Vault snoopers

This might be somebody who can intercept Ugarit's communication with the vault at any point, or who can read the vault itself at their leisure. Ugarit's splitlog backend creates files with "rw-------" permissions out of the box to try and prevent this. This is a pain for people who want to share vaults between UIDs, but we can add a configuration option to override this if that becomes a problem.

Reading your data

If you enable encryption, then all the blocks sent to the vault are encrypted using a secret key stored in your Ugarit configuration file. As long as that configuration file is kept safe, and the AES algorithm is secure, then attackers who can snoop the vault cannot decode your data blocks. Enabling compression will also help, as the blocks are compressed before encrypting, which is thought to make cryptographic analysis harder.

Recommendations: Use compression and encryption when there is a risk of vault snooping. Keep your Ugarit configuration file safe using UNIX file permissions (make it readable only by root), and maybe store it on a removable device that's only plugged in when required. Alternatively, use the "prompt" passphrase option, and be prompted for a passphrase every time you run Ugarit, so it isn't stored on disk anywhere.

Looking for known hashes

A block is identified by the hash of its content (before compression and encryption). If an attacker was trying to find people who own a particular file (perhaps a piece of subversive literature), they could search Ugarit vaults for its hash. However, Ugarit has the option to "key" the hash with a "salt" stored in the Ugarit configuration file. This means that the hashes used are actually a hash of the block's contents *and* the salt you supply. If you do this with a random salt that you keep secret, then attackers can't check your vault for known content just by comparing the hashes.

Recommendations: Provide a secret string to your hash function in your Ugarit configuration file. Keep the Ugarit configuration file safe, as per the advice in the previous point.
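As a rough sketch (the full configuration file syntax, and the set of available hash algorithms, are covered in the configuration section earlier in this document, so treat the exact line below as illustrative; the salt is a placeholder and should be a long random secret of your own), keying the hash just means supplying a salt string alongside your chosen hash algorithm in the configuration file:
(hash tiger "replace-this-with-a-long-random-secret-salt")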

Vault modifiers

These folks can modify Ugarit's writes into the vault, its reads back from the vault, or can modify the vault itself at their leisure.

Modifying an encrypted block without knowing the encryption key can at worst be a denial of service, corrupting the block in an unknown way. An attacker who knows the encryption key could replace a block with valid-seeming but incorrect content. In the worst case, this could exploit a bug in the decompression engine, causing a crash or even an exploit of the Ugarit process itself (thereby gaining the powers of a process inspector, as documented below). We can but hope that the decompression engine is robust. Exploits of the decryption engine, or other parts of Ugarit, are less likely due to the nature of the operations performed upon them.

However, if a block is modified, then when Ugarit reads it back, the hash will no longer match the hash Ugarit requested, which will be detected and an error reported. The hash is checked after decryption and decompression, so this check does not protect us against exploits of the decompression engine. This protection is only afforded when the hash Ugarit asks for is not tampered with. Most hashes are obtained from within other blocks, which are therefore safe unless that block has been tampered with; the nature of the hash tree conveys the trust in the hashes up to the root. The root hashes are stored in the vault as "tags", which a vault modifier could alter at will. Therefore, the tags cannot be trusted if somebody might modify the vault. This is why Ugarit prints out the snapshot hash and the root directory hash after performing a snapshot, so you can record them securely outside of the vault.

The most likely threat posed by vault modifiers is that they could simply corrupt or delete all of your vault, without needing to know any encryption keys.

Recommendations: Secure your vaults against modifiers, by whatever means possible. If vault modifiers are still a potential threat, write down a log of your root directory hashes from each snapshot, and keep it safe. When extracting your backups, use the ls -ll command in the interface to check the "contents" hash of your snapshots, and check they match the root directory hash you expect.

Process inspectors

These folks can attach debuggers or similar tools to running processes, such as Ugarit itself. Ugarit backend processes only see encrypted data, so people who can attach to that process gain the powers of vault snoopers and modifiers, and the same conditions apply. People who can attach to the Ugarit process itself, however, will see the original unencrypted content of your filesystem, and will have full access to the encryption keys and hashing keys stored in your Ugarit configuration. When Ugarit is running with sufficient permissions to restore backups, they will be able to intercept and modify the data as it comes out, and probably gain total write access to your entire filesystem in the process.

Recommendations: Ensure that Ugarit does not run under the same user ID as untrusted software. In many cases it will need to run as root in order to gain unfettered access to read the filesystems it is backing up, or to restore the ownership of files. However, when all the files it backs up are world-readable, it could run as an untrusted user for backups, and where file ownership is trivially reconstructible, it can do restores as a limited user, too.

Attackers in the source filesystem

These folks create files that Ugarit will back up one day. By having write access to your filesystem, they already have some level of power, and standard Unix security practices such as storage quotas should be used to control them. They may be people with logins on your box, or more subtly, people who can cause servers to write files; somebody who sends an email to your mailserver will probably cause that message to be written to queue files, as will people who can upload files via any means.

Such attackers might use up your available storage by creating large files. This creates a problem in the actual filesystem, but that problem can be fixed by deleting the files. If those files get stored into Ugarit, then they are a part of that snapshot. If you are using a backend that supports deletion, then (when I implement snapshot deletion in the user interface) you could delete that entire snapshot to recover the wasted space, but that is a rather serious operation.

More insidiously, such attackers might attempt to abuse a hash collision in order to fool the vault. If they have a way of creating a file that, for instance, has the same hash as your shadow password file, then Ugarit will think that it already has that file when it attempts to snapshot it, and store a reference to the existing file. If that snapshot is restored, then they will receive a copy of your shadow password file. Similarly, if they can predict a future hash of your shadow password file, and create a shadow password file of their own (perhaps one giving them a root account with a known password) with that hash, they can then wait for the real shadow password file to have that hash. If the system is later restored from that snapshot, then their chosen content will appear in the shadow password file. However, doing this requires a very fundamental break of the hash function being used.

Recommendations: Think carefully about who has write access to your filesystems, directly or indirectly via a network service that stores received data to disk. Enforce quotas where appropriate, and consider not backing up "queue directories" where untrusted content might appear; migrate incoming content that passes acceptance tests to an area that is backed up. If necessary, the queue might be backed up to a non-snapshotting system, such as rsyncing to another server, so that any excessive files that appear in there are removed from the backup in due course, while still affording protection.

Acknowledgements

The Ugarit implementation contained herein is the work of Alaric Snell-Pym and Christian Kellermann, with advice, ideas, encouragement and guidance from many.

The original idea came from Venti, a content-addressed storage system from Plan 9. Venti is usable directly by user applications, and is also integrated with the Fossil filesystem to support snapshotting the status of a Fossil filesystem. Fossil allows references to either be to a block number on the Fossil partition or to a Venti key; so when a filesystem has been snapshotted, all it now contains is a "root directory" pointer into the Venti archive, and any files modified thereafter are copied-on-write into Fossil where they may be modified until the next snapshot. We're nowhere near that exciting yet, but using FUSE, we might be able to do something similar, which might be fun. However, Venti inspired me when I read about it years ago; it showed me how elegant content-addressed storage is. Finding out that the Git version control system used the same basic tricks really just confirmed this for me.

Also, I'd like to tip my hat to Duplicity. With the changing economics of storage presented by services like Amazon S3 and rsync.net, I looked to Duplicity as it provided both SFTP and S3 backends. However, it worked in terms of full and incremental backups, a model that I think made sense for magnetic tapes, but loses out to content-addressed snapshots when you have random-access media. Duplicity inspired me by its adoption of multiple backends, the very backends I want to use, but I still hungered for a content-addressed snapshot store.

I'd also like to tip my hat to Box Backup. I've only used it a little, because it requires a special server to manage the storage (and I want to get my backups *off* of my servers), but it also inspires me with directions I'd like to take Ugarit. It's much more aware of real-time access to random-access storage than Duplicity, and has a very interesting continuous background incremental backup mode, moving away from the tape-based paradigm of backups as something you do on a special day of the week, like some kind of religious observance. I hope the author Ben, who is a good friend of mine, won't mind me plundering his source code for details on how to request real-time notification of changes from the filesystem, and how to read and write extended attributes!

Moving on from the world of backup, I'd like to thank the Chicken Team for producing Chicken Scheme. Felix and the community at #chicken on Freenode have particularly inspired me with their can-do attitudes to combining programming-language elegance and pragmatic engineering - two things many would think un-unitable enemies. Of course, they didn't do it all themselves - R5RS Scheme and the SRFIs provided a solid foundation to build on, and there's a cast of many more in the Chicken community, working on other bits of Chicken or just egging everyone on. And I can't not thank Henry Baker for writing the seminal paper on the technique Chicken uses to implement full tail-calling Scheme with cheap continuations on top of C; Henry already had my admiration for his work on combining elegance and pragmatism in linear logic. Why doesn't he return my calls? I even sent flowers.

A special thanks should go to Christian Kellermann for porting Ugarit to use Chicken 4 modules, too, which was otherwise a big bottleneck to development, as I was stuck on Chicken 3 for some time! And to Andy Bennett for many insightful conversations about future directions.
Thanks to the early adopters who brought me useful feedback, too! And I'd like to thank my wife for putting up with me spending several evenings and weekends and holiday days working on this thing...

Version history

* 2.0: Archival mode [dae5e21ffc], and to support its integration into Ugarit, implemented typed tags [08bf026f5a], displaying tag types in the VFS [30054df0b6], refactoring the Ugarit internals [5fa161239c], made the storage of logs in the vault better [68bb75789f], made it possible to view logs from within the VFS [4e3673e0fe], supported hidden tags [cf5ef4691c], recording configuration information in the vault (and providing instant notification if your vault hashing/encryption setup is incorrect, thanks to a clever idea by Andy Bennett) [0500d282fc], rearranged how local caching is handled [b5911d321a], and added support for the history of a snapshot or archive tag to have arbitrary branches and merges [a987e28fef], which (as a side-effect) improved the performance of running "ls" in long snapshot histories [fcf8bc942a]. Also added an sqlite backend [8719dfb84f], which makes testing easier but is useful in its own right as it's fully-featured and crash-safe, while storing the vault in a single file; and improved the appearance of the explore mode ls command, as the VFS layout has become more complex with the new log/properties views and all the archive mode stuff.
* 1.0.9: More humane display of sizes in explore's directory listings, using low-level I/O to reduce CPU usage. Myriad small bug fixes and some internal structural improvements.
* 1.0.8: Bug fixes to work with the latest chicken master, and increased unit test coverage to test stuff that wasn't working due to chicken bugs. Looking good!
* 1.0.7: Fixed bug with directory rules (errors arose when files were skipped). I need to improve the test suite coverage of high-level components to stop this happening!
* 1.0.6: Fixed missing features from v1.0.5 due to a fluffed merge (whoops), added tracking of directory sizes (files+bytes) in the vault on snapshot and the use of this information to display overall percentage completion when extracting. Directory sizes can be seen in the explore interface when doing "ls -l" or "ls -ll".
* 1.0.5: Changed the VFS layout slightly, making the existence of snapshot objects explicit (when you go into a tag, then go into a snapshot, you now need to go into "contents" to see the actual file tree; the snapshot object itself now exists as a node in the tree). Added traverse-vault-* functions to the core API, and tests for same, and used traverse-vault-node to drive the cd and get functions in the interactive explore mode (speeding them up in the process!). Added "extract" command. Added a progress reporting callback facility for snapshots and extractions, and used it to provide progress reporting in the front-end, every 60 seconds or so by default, not at all with -q, and every time something happens with -v. Added tab completion in explore mode.
* 1.0.4: Resurrected support for compression and encryption and SHA2 hashes, which had been broken by the failure of the autoload egg to continue to work as it used to. Tidying up error and ^C handling somewhat.
* 1.0.3: Installed sqlite busy handlers to retry when the database is locked due to concurrent access (affects backend-fs, backend-cache, and the file cache), and gained an EXCLUSIVE lock when locking a tag in backend-fs; I'm not clear if it's necessary, but it can't hurt. BUGFIX: Logging of messages from storage backends wasn't happening correctly in the Ugarit core, leading to errors when the cache backend (which logs an info message at close time) was closed and the log message had nowhere to go.
* 1.0.2: Made the file cache also commit periodically, rather than on every write, in order to improve performance. Counting blocks and bytes uploaded / reused, and file cache bytes as well as hits; reporting same in snapshot UI and logging same to snapshot metadata. Switched to the posix-extras egg and ditched our own posixextras.scm wrappers. Used the parley egg in the ugarit explore CLI for line editing. Added logging infrastructure, recording of snapshot logs in the snapshot. Added recovery from extraction errors. Listed lock state of tags in explore mode. Backend protocol v2 introduced (retaining v1 for compatibility) allowing for an error on backend startup, and logging nonfatal errors, warnings, and info on startup and all protocol calls. Added ugarit-archive-admin command line interface to backend-specific administrative interfaces. Configuration of the splitlog backend (write protection, adjusting block size and logfile size limit and commit interval) is now possible via the admin interface. The admin interface also permits rebuilding the metadata index of a splitlog vault with the reindex! admin command. BUGFIX: Made file cache check the file hashes it finds in the cache actually exist in the vault, to protect against the case where a crash of some kind has caused unflushed changes to be lost; the file cache may well have committed changes that the backend hasn't, leading to references to nonexistent blocks. Note that we assume that vaults are sequentially safe, eg if the final indirect block of a large file made it, all the partial blocks must have made it too. BUGFIX: Added an explicit flush! command to the backend protocol, and put explicit flushes at critical points in higher layers (backend-cache, the vault abstraction in the Ugarit core, and when tagging a snapshot) so that we ensure the blocks we point at are flushed before committing references to them in the backend-cache or file caches, or into tags, to ensure crash safety. BUGFIX: Made the splitlog backend never exceed the file size limit (except when passed blocks that, plus a header, are larger than it), rather than letting a partial block hang over the 'end'. BUGFIX: Fixed tag locking, which was broken all over the place. Concurrent snapshots to the same tag should now block for one another, although why you'd want to *do* that is questionable. BUGFIX: Fixed generation of non-keyed hashes, which was incorrectly appending the type to the hash without an outer hash. This breaks backwards compatibility, but nobody was using the old algorithm, right? I'll introduce it as an option if required.
* 1.0.1: Consistency check on read blocks by default. Removed warning about deletions from backend-cache; we need a new mechanism to report warnings from backends to the user. Made backend-cache and backend-fs/splitlog commit periodically rather than after every insert, which should speed up snapshotting a lot, and reused the prepared statements rather than re-preparing them all the time. BUGFIX: splitlog backend now creates log files with "rw-------" rather than "rwx------" permissions; and all sqlite databases (splitlog metadata, cache file, and file-cache file) are created with "rw-------" rather than "rw-r--r--".
* 1.0: Migrated from gdbm to sqlite for metadata storage, removing the GPL taint. Unit test suite. backend-cache made into a separate backend binary. Removed backend-log. BUGFIX: file caching uses mtime *and* size now, rather than just mtime. Error handling so we skip objects that we cannot do something with, and proceed to try the rest of the operation.
* 0.8: Decoupling backends from the core and into separate binaries, accessed via standard input and output, so they can be run over SSH tunnels and other such magic.
* 0.7: File cache support, sorting of directories so they're archived in canonical order, autoloading of hash/encryption/compression modules so they're not required dependencies any more.
* 0.6: .ugarit support.
* 0.5: Keyed hashing so attackers can't tell what blocks you have, markers in logs so the index can be reconstructed, sha2 support, and passphrase support.
* 0.4: AES encryption.
* 0.3: Added splitlog backend, and fixed a .meta file typo.
* 0.2: Initial public release.
* 0.1: Internal development release.