# Introduction

Ugarit is a backup/archival system based around content-addressable storage.

This allows it to upload incremental backups to a remote server or a local filesystem such as an NFS share or a removable hard disk, yet have the archive instantly able to produce a full snapshot on demand rather than needing to download a full snapshot plus all the incrementals since. The content-addressable storage technique means that the incrementals can be applied to a snapshot on various kinds of storage without needing intelligence in the storage itself - so the snapshots can live within Amazon S3 or on a removable hard disk.

Also, the same storage can be shared between multiple systems that all back up to it - and the incremental upload algorithm means that any files shared between the servers will only need to be uploaded once. If you back up a complete server, then go and back up another that is running the same distribution, then all the files in `/bin` and so on that are already in the storage will not need to be backed up again; the system will automatically spot that they're already there, and not upload them again.

## So what's that mean in practice?

You can run Ugarit to back up any number of filesystems to a shared archive, and on every backup, Ugarit will only upload files or parts of files that aren't already in the archive - be they from the previous snapshot, earlier snapshots, snapshots of entirely unrelated filesystems, etc. Every time you do a snapshot, Ugarit builds an entire complete directory tree of the snapshot in the archive - but reusing any parts of files, files, or entire directories that already exist anywhere in the archive, and only uploading what doesn't already exist. The support for parts of files means that, in many cases, gigantic files like database tables and virtual disks for virtual machines will not need to be uploaded entirely every time they change, as only the changed sections will be identified and uploaded.

Because a complete directory tree exists in the archive for any snapshot, the extraction algorithm is incredibly simple - and, therefore, incredibly reliable and fast. Simple, reliable, and fast are just what you need when you're trying to reconstruct the filesystem of a live server.

Also, it means that you can do lots of small snapshots. If you run a snapshot every hour, then only a megabyte or two might have changed in your filesystem, so you only upload a megabyte or two - yet you end up with a complete history of your filesystem at hourly intervals in the archive.

Conventional backup systems usually store a full backup and then incrementals in their archives, meaning that doing a restore involves reading the full backup and then reading and applying every incremental since - so to do a restore, you either have to download *every version* of the filesystem you've ever uploaded, or you have to do periodic full backups (even though most of your filesystem won't have changed since the last full backup) to reduce the number of incrementals required for a restore. Better results are had from systems that use a special backup server to look after the archive storage, which accepts incremental backups and applies them to the snapshot it keeps in order to maintain a most-recent snapshot that can be downloaded in a single run; but these restrict you to using dedicated servers as your archive stores, ruling out cheaply scalable solutions like Amazon S3, or just backing up to a removable USB or eSATA disk you attach to your system whenever you do a backup.
And dedicated backup servers are complex pieces of software; can you rely on something complex for the fundamental foundation of your data security system?

## System Requirements

Ugarit should run on any POSIX-compliant system that can run [Chicken Scheme](http://www.call-with-current-continuation.org/). It stores and restores all the file attributes reported by the `stat` system call - POSIX mode permissions, UID, GID, mtime, and optionally atime and ctime (although the ctime cannot be restored due to POSIX restrictions). Ugarit will store files, directories, device and character special files, symlinks, and FIFOs.

Support for extended filesystem attributes - ACLs, alternative streams, forks and other metadata - is possible, due to the extensible directory entry format; support for such metadata will be added as required.

Currently, only local filesystem-based archive storage backends are complete: these are suitable for backing up to a removable hard disk or a filesystem shared via NFS or other protocols. However, a backend can be accessed via an SSH tunnel, so any remote server you are able to install Ugarit on can be used as a remote archive. The next backends to be implemented will be one for Amazon S3, and an SFTP backend for storing archives anywhere you can ssh to. Other backends will be implemented on demand; an archive can, in principle, be stored on anything that can store files by name, report on whether a file already exists, and efficiently download a file by name. This rules out magnetic tapes, due to their requirement for sequential access.

Although we need to trust that a backend won't lose data (for now), we don't need to trust the backend not to snoop on us, as Ugarit optionally encrypts everything sent to the archive.

## Terminology

A Ugarit backend is the software module that handles backend storage. An archive is an actual storage system storing actual data, accessed through the appropriate backend for that archive. The backend may run locally under Ugarit itself, or via an SSH tunnel, on a remote server where it is installed.

For example, if you use the recommended "splitlog" filesystem backend, your archive might be `/mnt/bigdisk` on the server `prometheus`. The backend (which is compiled along with the other filesystem backends in the `backend-fs` binary) must be installed on `prometheus`, and Ugarit clients all over the place may then use it via ssh to `prometheus`. However, even with the filesystem backends, the actual storage might not be on `prometheus` where the backend runs - `/mnt/bigdisk` might be an NFS mount, or a mount from a storage-area network.

This ability to delegate via SSH is particularly useful with the "cache" backend, which reduces latency by storing a cache of what blocks exist in a backend, thereby making it quicker to identify already-stored files; a cluster of servers all sharing the same archive might all use SSH tunnels to access an instance of the "cache" backend on one of them (using some local disk to store the cache), which proxies the actual archive storage to an archive on the other end of a high-latency Internet link, again via an SSH tunnel.
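To make the `prometheus` example concrete: a client in that setup might refer to the archive with an identifier along the following lines, combining ssh delegation with the splitlog backend described under "Setting up an archive" below (the log and metadata paths here are purely illustrative):

    "ssh prometheus 'backend-fs splitlog /mnt/bigdisk/log /mnt/bigdisk/metadata 900000000'"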
## What's in an archive?

A Ugarit archive contains a load of blocks, each up to a maximum size (usually 1MiB, although other backends might impose smaller limits). Each block is identified by the hash of its contents; this is how Ugarit avoids ever uploading the same data twice, by checking to see if the data to be uploaded already exists in the archive by looking up the hash. The contents of the blocks are compressed and then encrypted before upload.

Every file uploaded is, unless it's small enough to fit in a single block, chopped into blocks, and each block uploaded. This way, the entire contents of your filesystem can be uploaded - or, at least, only the parts of it that aren't already there! The blocks are then tied together to create a snapshot by uploading blocks full of the hashes of the data blocks, and directory blocks are uploaded listing the names and attributes of files in directories, along with the hashes of the blocks that contain the files' contents. Even the blocks that contain lists of hashes of other blocks are subject to checking for pre-existence in the archive; if only a few MiB of your hundred-GiB filesystem has changed, then even the index blocks and directory blocks are re-used from previous snapshots.

Once uploaded, a block in the archive is never again changed. After all, if its contents changed, its hash would change, so it would no longer be the same block! However, every block has a reference count, tracking the number of index blocks that refer to it. This means that the archive knows which blocks are shared between multiple snapshots (or shared *within* a snapshot - if a filesystem has more than one copy of the same file, still only one copy is uploaded), so that if a given snapshot is deleted, then the blocks that only that snapshot is using can be deleted to free up space, without corrupting other snapshots by deleting blocks they share. Keep in mind, however, that not all storage backends may support this - there are certain advantages to being an append-only archive. For a start, you can't delete something by accident! The supplied fs backend supports deletion, while the splitlog backend does not yet. However, the actual snapshot deletion command hasn't been implemented yet either, so it's a moot point for now...

Finally, the archive contains objects called tags. Unlike the blocks, the tags' contents can change, and they have meaningful names rather than being identified by hash. Tags identify the top-level blocks of snapshots within the system, from which (by following the chain of hashes down through the index blocks) the entire contents of a snapshot may be found. Unless you happen to have recorded the hash of a snapshot somewhere, the tags are where you find snapshots when you want to do a restore!

Whenever a snapshot is taken, as soon as Ugarit has uploaded all the files, directories, and index blocks required, it looks up the tag you have identified as the target of the snapshot. If the tag already exists, then the snapshot it currently points to is recorded in the new snapshot as the "previous snapshot"; a snapshot header containing the previous snapshot's hash, along with the date and time and any comments you provide for the snapshot, is then uploaded (as another block, identified by its hash). The tag is then updated to point to the new snapshot.

This way, each tag actually identifies a chronological chain of snapshots. Normally, you would use a tag to identify a filesystem being backed up; you'd keep snapshotting the filesystem to the same tag, resulting in all the snapshots of that filesystem hanging from the tag. But if you wanted to remember any particular snapshot (perhaps if it's the snapshot you take before a big upgrade or other risky operation), you can duplicate the tag, in effect 'forking' the chain of snapshots much like a branch in a version control system.
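To illustrate the decision Ugarit makes for every block, here is a minimal sketch in Chicken Scheme - it is not Ugarit's actual code, and it ignores compression, encryption, and hash keying. It models the archive as an in-memory table and uses SHA-256 from the `sha2` and `message-digest` eggs (mentioned under configuration below) in place of whichever hash you configure:

    (use sha2 message-digest srfi-69)

    (define pretend-archive (make-hash-table))   ; stands in for the real backend: key -> block

    (define (store-block! data)
      (let ((key (message-digest-string (sha256-primitive) data)))
        (unless (hash-table-exists? pretend-archive key)  ; ask "is this block already there?"
          (hash-table-set! pretend-archive key data))     ; upload only if it wasn't
        key))                                             ; callers keep just the key

    ;; Two identical "files" end up as a single stored block:
    (store-block! "hello, world")
    (store-block! "hello, world")
    (hash-table-size pretend-archive)   ; => 1

The real system applies the same hash-then-check-then-upload decision to data blocks, index blocks, and directory blocks alike, which is why unchanged parts of a tree cost nothing to re-snapshot.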
# Using Ugarit

## Installation

Install [Chicken Scheme](http://www.call-with-current-continuation.org/) using their [installation instructions](http://chicken.wiki.br/Getting%20started#Installing%20Chicken).

Ugarit can then be installed by typing (as root):

    chicken-install ugarit

See the [chicken-install manual](http://wiki.call-cc.org/manual/Extensions#chicken-install-reference) for details if you have any trouble, or wish to install into your home directory.

## Setting up an archive

Firstly, you need to know the archive identifier for the place you'll be storing your archives. This depends on your backend. The archive identifier is actually the command line used to invoke the backend for a particular archive; communication with the archive is via standard input and output, which is how it's easy to tunnel via ssh.

### Local filesystem backends

These backends use the local filesystem to store the archives. Of course, the "local filesystem" on a given server might be an NFS mount or mounted from a storage-area network.

#### Logfile backend

The logfile backend works much like the original Venti system. It's append-only - you won't be able to delete old snapshots from a logfile archive, even when I implement deletion. It stores the archive in two sets of files; one is a log of data blocks, split at a specified maximum size, and the other is the metadata: an sqlite database used to track the location of blocks in the log files, the contents of tags, and a count of the logs so a filename can be chosen for a new one.

To set up a new logfile archive, just choose where to put the two parts. It would be nice to put the metadata file on a different physical disk to the logs directory, to reduce seeking. If you only have one disk, you can put the metadata file in the log directory ("metadata" is a good name).

You can then refer to it using the following archive identifier:

    "backend-fs splitlog ...log directory... ...metadata file... max-logfile-size"

For most platforms, a max-logfile-size of 900000000 (900 MB) should suffice. For now, don't go much bigger than that on 32-bit systems until Chicken's `file-position` function is fixed to work with files more than 1GB in size.

#### Filesystem backend

The filesystem backend creates archives by storing each block or tag in its own file, in a directory. To keep the objects-per-directory count down, it'll split the files into subdirectories. Because of this, it uses a stupendous number of inodes (more than the filesystem being backed up). Only use it if you don't mind that; splitlog is much more efficient.

To set up a new filesystem-backend archive, just create an empty directory that Ugarit will have write access to when it runs. It will probably run as root in order to be able to access the contents of files that aren't world-readable (although that's up to you), so be careful of NFS mounts that have `maproot=nobody` set!

You can then refer to it using the following archive identifier:

    "backend-fs fs ...path to directory..."
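As a concrete illustration (the paths here are made up), a splitlog archive kept under `/mnt/backup` and a filesystem-backend archive kept in `/mnt/backup/blocks` might use identifiers such as:

    "backend-fs splitlog /mnt/backup/log /mnt/backup/metadata 900000000"

    "backend-fs fs /mnt/backup/blocks"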
### Proxying backends

These backends wrap another archive identifier, which the actual storage task is delegated to, but add some value along the way.

### SSH tunnelling

It's easy to access an archive stored on a remote server. The caveat is that the backend then needs to be installed on the remote server! Since archives are accessed by running the supplied command, and then talking to them via stdin and stdout, the archive identifier need only be:

    "ssh ...hostname... '...remote archive identifier...'"

### Cache backend

The cache backend is used to cache a list of what blocks exist in the proxied backend, so that it can answer queries as to the existence of a block rapidly, even when the proxied backend is on the end of a high-latency link (eg, the Internet). This should speed up snapshots, as existing files are identified by asking the backend if the archive already has them.

The cache backend works by storing the cache in a local sqlite file. Given a place for it to store that file, usage is simple:

    "backend-cache ...path to cachefile... '...proxied archive identifier...'"

The cache file will be automatically created if it doesn't already exist, so make sure there's write access to the containing directory.

- WARNING - WARNING - WARNING - WARNING - WARNING - WARNING -

If you use a cache on an archive shared between servers, make sure that you either:

* Never delete things from the archive

or

* Make sure all access to the archive is via the same cache

If a block is deleted from an archive, and a cache on that archive is not aware of the deletion (as it did not go "through" the caching proxy), then the cache will record that the block exists in the archive when it does not. This will mean that if a snapshot is made through the cache that would use that block, then it will be assumed that the block already exists in the archive when it does not. Therefore, the block will not be uploaded, and a dangling reference will result!

Some setups which *are* safe:

* A single server using an archive via a cache, not sharing it with anyone else.
* A pool of servers using an archive via the same cache.
* A pool of servers using an archive via one or more caches, and maybe some not via the cache, where nothing is ever deleted from the archive.
* A pool of servers using an archive via one cache, and maybe some not via the cache, where deletions are only performed on servers using the cache, so the cache is always aware.

## Writing a ugarit.conf

`ugarit.conf` should look something like this (entries in square brackets are optional; angle brackets mark values you supply):

    (storage <archive identifier>)
    (hash tiger "<salt>")
    [double-check]
    [(compression [deflate|lzma])]
    [(encryption aes <key>)]
    [(file-cache "<path to file cache>")]
    [(rule ...)]

The hash line chooses a hash algorithm. Currently Tiger-192 (`tiger`), SHA-256 (`sha256`), SHA-384 (`sha384`) and SHA-512 (`sha512`) are supported; if you omit the line then Tiger will still be used, but it will be a simple hash of the block with the block type appended, which reveals to attackers what blocks you have (as the hash is of the unencrypted block, and the hash is not encrypted). This is useful for development and testing or for use with trusted archives, but not advised for use with archives that attackers may snoop at. Providing a salt string produces a hash function that hashes the block, the type of block, and the salt string, producing hashes that attackers who can snoop the archive cannot use to find known blocks (see the "Security model" section below for more details).
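For example (the salt here is just an illustrative made-up string - use your own randomly-generated secret), a keyed Tiger or SHA-256 hash might be configured as:

    (hash tiger "my-secret-salt")

or

    (hash sha256 "my-secret-salt")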
Whichever hash function you use, you will need to install the required Chicken egg with one of the following commands:

    chicken-install -s tiger-hash  # for tiger
    chicken-install -s sha2        # for the SHA hashes

`double-check`, if present, causes Ugarit to perform extra internal consistency checks during backups, which will detect bugs but may slow things down.

`lzma` is the recommended compression option for low-bandwidth backends or when space is tight, but it's very slow to compress; deflate or no compression at all are better for fast local archives. To have no compression at all, just remove the `(compression ...)` line entirely. Likewise, to use compression, you need to install a Chicken egg:

    chicken-install -s z3    # for deflate
    chicken-install -s lzma  # for lzma

The `(encryption ...)` line may likewise be omitted to have no encryption; the only currently supported algorithm is aes (in CBC mode) with a key given in hex, as a passphrase (hashed to get a key), or a passphrase read from the terminal on every run. The key may be 16, 24, or 32 bytes for 128-bit, 192-bit or 256-bit AES. To specify a hex key, just supply it as a string, like so:

    (encryption aes "00112233445566778899AABBCCDDEEFF")

...for 128-bit AES,

    (encryption aes "00112233445566778899AABBCCDDEEFF0011223344556677")

...for 192-bit AES, or

    (encryption aes "00112233445566778899AABBCCDDEEFF00112233445566778899AABBCCDDEEFF")

...for 256-bit AES.

Alternatively, you can provide a passphrase, and specify how large a key you want it turned into, like so:

    (encryption aes ([16|24|32] "We three kings of Orient are, one in a taxi one in a car, one on a scooter honking his hooter and smoking a fat cigar. Oh, star of wonder, star of light; star with royal dynamite"))

Finally, the extra-paranoid can request that Ugarit prompt for a passphrase on every run and hash it into a key of the specified length, like so:

    (encryption aes ([16|24|32] prompt))

(note the lack of quotes around `prompt`, distinguishing it from a passphrase)

Please read the "Security model" section below for details on the implications of different encryption setups.

Again, as it is an optional feature, to use encryption, you must install the appropriate Chicken egg:

    chicken-install -s aes

A file cache, if enabled, significantly speeds up subsequent snapshots of a filesystem tree. The file cache is a file (which Ugarit will create if it doesn't already exist) mapping filenames to (mtime,size,hash) tuples; as it scans the filesystem, if it finds a file in the cache and the mtime and size have not changed, it will assume it is already archived under the specified hash. This saves it from having to read the entire file to hash it and then check if the hash is present in the archive. In other words, if only a few files have changed since the last snapshot, then snapshotting a directory tree becomes an O(N) operation, where N is the number of files, rather than an O(M) operation, where M is the total size of the files involved.
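Putting the pieces together before the full example below: a hypothetical minimal `ugarit.conf` for a quick local test, relying on the defaults described above (an unkeyed Tiger hash - so the tiger-hash egg must still be installed - no compression, no encryption, and no file cache), might be nothing more than the storage line, assuming `/tmp/test-archive` already exists:

    (storage "backend-fs splitlog /tmp/test-archive /tmp/test-archive/metadata 900000000")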
Here is a full example of a `ugarit.conf`:

    (storage "ssh ugarit@spiderman 'backend-fs splitlog /mnt/ugarit-data /mnt/ugarit-metadata/metadata 900000000'")
    (hash tiger "Giung0ahKahsh9ahphu5EiGhAhth4eeyDahs2aiWAlohr6raYeequ8uiUr3Oojoh")
    (encryption aes (32 "deing2Aechediequohdo6Thuvu0OLoh6fohngio9koush9euX6el9iesh6Aef4augh3WiY7phahmesh2Theeziniem5hushai5zigushohnah1quae1ooXo0eingu1Aifeo1eeSheaz9ieSie9tieneibeiPho0quu6um8weiyagh4kaeshooThooNgeyoul2Ahsahgh8imohw3hoyazai9gaph5ohhaechiedeenusaeghahghipe8ii3oo9choh5cieth5iev3jiedohquai4Thiedah5sah5kohcepheixai3aiPainozooc6zohNeiy6Jeigeesie5eithoo0ciiNae8Nee3eiSuKaiza0VaiPai2eeFooNgeengaif9yaiv9rathuoQuohy0ohth6OiL9aisaetheeWoh9aiQu0yoo6aequ3quoiChi7joonohwuvaipeuh2eiPoogh1Ie8tiequesoshaeBue5ieca8eerah0quieJoNoh3Jiesh1chei8weidixeen1yah1ioChie0xaimahWeeriex5eetiichahP9iey5ux7ahGhei7eejahxooch5eiqu0Pheir9Reiri4ahqueijuchae8eeyieMeixa4ciisioloe9oaroof1eegh4idaeNg5aepeip8mah7ixaiSohtoxaiH4oe5eeGoh4eemu7mee8ietaecu6Zoodoo0hoP5uquaish2ahc7nooshi0Aidae2Zee4pheeZee3taerae6Aepu2Ayaith2iivohp8Wuikohvae2Peange6zeihep8eC9mee8johshaech1Ubohd4Ko5caequaezaigohyai1TheeN6Gohva6jinguev4oox2eet5auv0aiyeo7eJieGheebaeMahshifaeDohy8quut4ueFei3eiCheimoechoo2EegiveeDah1sohs7ezee3oaWa2iiv2Chi1haiS5ahph4phu5su0hiocee3ooyaeghang7sho7maiXeo5aex"))
    (compression lzma)
    (file-cache "/var/ugarit/cache")

Be careful to put a set of parentheses around each configuration entry. White space isn't significant, so feel free to indent things and wrap them over lines if you want.

Keep copies of this file safe - you'll need it to do extractions! Print a copy out and lock it in your fire safe! Ok, currently, you might be able to recreate it if you remember where you put the storage, but encryption keys are harder to remember.

## Your first backup

Think of a tag to identify the filesystem you're backing up. If it's `/home` on the server `gandalf`, you might call it `gandalf-home`. If it's the entire filesystem of the server `bilbo`, you might just call it `bilbo`.

Then from your shell, run (as root):

    # ugarit snapshot <ugarit.conf> [-c] [-a] <tag> <path to filesystem root>

For example, if we have a `ugarit.conf` in the current directory:

    # ugarit snapshot ugarit.conf -c localhost-etc /etc

Specify the `-c` flag if you want to store ctimes in the archive; since it's impossible to restore ctimes when extracting from an archive, doing this is useful only for informational purposes, so it's not done by default. Similarly, atimes aren't stored in the archive unless you specify `-a`, because otherwise, there will be a lot of directory blocks uploaded on every snapshot, as the atime of every file will have been changed by the previous snapshot - so with `-a` specified, on every snapshot, every directory in your filesystem will be uploaded! Ugarit will happily restore atimes if they are found in an archive; their storage is made optional simply because uploading them is costly and rarely useful.
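Following the tag-naming examples above, snapshotting `/home` on `gandalf` or the whole filesystem of `bilbo` would then look something like this (hypothetical invocations, built from the same command form):

    # ugarit snapshot ugarit.conf gandalf-home /home
    # ugarit snapshot ugarit.conf bilbo /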
## Exploring the archive

Now you have a backup, you can explore the contents of the archive. This need not be done as root, as long as you can read `ugarit.conf`; however, if you want to extract files, run it as root so the uids and gids can be set.

    $ ugarit explore <ugarit.conf>

This will put you into an interactive shell exploring a virtual filesystem. The root directory contains an entry for every tag; if you type `ls` you should see your tag listed, and within that tag, you'll find a list of snapshots, in descending date order, with a special entry `current` for the most recent snapshot. Within a snapshot, you'll find the root directory of your snapshot, and will be able to `cd` into subdirectories, and so on:

    > ls
    Test
    > cd Test
    /Test> ls
    2009-01-24 10:28:16
    2009-01-24 10:28:16
    current
    /Test> cd current
    /Test/current> ls
    README.txt
    LICENCE.txt
    subdir
    .svn
    FIFO
    chardev
    blockdev
    /Test/current> ls -ll LICENCE.txt
    lrwxr-xr-x 1000 100 2009-01-15 03:02:49 LICENCE.txt -> subdir/LICENCE.txt
    target: subdir/LICENCE.txt
    ctime: 1231988569.0

As well as exploring around, you can also extract files or directories (or entire snapshots) by using the `get` command. Ugarit will do its best to restore the metadata of files, subject to the rights of the user you run it as.

Type `help` to get help in the interactive shell.

## Duplicating tags

As mentioned above, you can duplicate a tag, creating two tags that refer to the same snapshot and its history, but that can then have their own subsequent history of snapshots applied to each independently, with the following command:

    $ ugarit fork <ugarit.conf> <existing tag> <new tag>

## `.ugarit` files

By default, Ugarit will archive everything it finds in the filesystem tree you tell it to snapshot. However, this might not always be desired; so we provide the facility to override this with `.ugarit` files, or global rules in your `.conf` file.

Note: The syntax of these files is provisional, as I want to experiment with usability, as the current syntax is ugly. So please don't be surprised if the format changes in incompatible ways in subsequent versions!

In quick summary, if you want to ignore all files or directories matching a glob in the current directory and below, put the following in a `.ugarit` file in that directory:

    (* (glob "*~") exclude)

You can write quite complex expressions as well as just globs (a combined example is given at the end of this section). The full set of rules is:

* `(glob "`*pattern*`")` matches files and directories whose names match the glob pattern
* `(name "`*name*`")` matches files and directories with exactly that name (useful for files called `*`...)
* `(modified-within ` *number* ` seconds)` matches files and directories modified within the given number of seconds
* `(modified-within ` *number* ` minutes)` matches files and directories modified within the given number of minutes
* `(modified-within ` *number* ` hours)` matches files and directories modified within the given number of hours
* `(modified-within ` *number* ` days)` matches files and directories modified within the given number of days
* `(not ` *rule*`)` matches files and directories that do not match the given rule
* `(and ` *rule* *rule...*`)` matches files and directories that match all the given rules
* `(or ` *rule* *rule...*`)` matches files and directories that match any of the given rules

Also, you can override a previous exclusion with an explicit include in a lower-level directory:

    (* (glob "*~") include)

You can bind rules to specific directories, rather than to "this directory and all beneath it", by specifying an absolute or relative path instead of the `*`:

    ("/etc" (name "passwd") exclude)

If you use a relative path, it's taken relative to the directory of the `.ugarit` file.

You can also put some rules in your `.conf` file, although relative paths are illegal there, by adding lines of this form to the file:

    (rule * (glob "*~") exclude)
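Putting several of these rules together, a hypothetical `.ugarit` file that skips editor backup files and temporary files everywhere below it, and skips anything in a `scratch` subdirectory that hasn't been modified in the last week, might look like:

    (* (or (glob "*~") (glob "*.tmp")) exclude)
    ("scratch" (not (modified-within 7 days)) exclude)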
# Questions and Answers

## What happens if a snapshot is interrupted?

Nothing! Whatever blocks have been uploaded will be uploaded, but the snapshot is only added to the tag once the entire filesystem has been snapshotted. So just start the snapshot again. Any files that have already been uploaded will then not need to be uploaded again, so the second snapshot should proceed quickly to the point where it failed before, and continue from there.

Unless the archive ends up with a partially-uploaded corrupted block due to being interrupted during upload, you'll be fine. The filesystem backend has been written to avoid this by writing the block to a file with the wrong name, then renaming it to the correct name when it's entirely uploaded.

## Should I share a single large archive between all my filesystems?

I think so. Using a single large archive means that blocks shared between servers - eg, software installed from packages and that sort of thing - will only ever need to be uploaded once, saving storage space and upload bandwidth. However, do not share an archive between servers that do not mutually trust each other, as they can all update the same tags, so can meddle with each other's snapshots - and read each other's snapshots.

# Security model

I have designed and implemented Ugarit to be able to handle cases where the actual archive storage is not entirely trusted. However, security involves tradeoffs, and Ugarit is configurable in ways that affect its resistance to different kinds of attacks. Here I will list different kinds of attack and explain how Ugarit can deal with them, and how you need to configure it to gain that protection.

## Archive snoopers

This might be somebody who can intercept Ugarit's communication with the archive at any point, or who can read the archive itself at their leisure.

### Reading your data

If you enable encryption, then all the blocks sent to the archive are encrypted using a secret key stored in your Ugarit configuration file. As long as that configuration file is kept safe, and the AES algorithm is secure, then attackers who can snoop the archive cannot decode your data blocks. Enabling compression will also help, as the blocks are compressed before encrypting, which is thought to make cryptographic analysis harder.

Recommendations: Use compression and encryption when there is a risk of archive snooping. Keep your Ugarit configuration file safe using UNIX file permissions (make it readable only by root), and maybe store it on a removable device that's only plugged in when required. Alternatively, use the "prompt" passphrase option, and be prompted for a passphrase every time you run Ugarit, so it isn't stored on disk anywhere.

### Looking for known hashes

A block is identified by the hash of its content (before compression and encryption). If an attacker was trying to find people who own a particular file (perhaps a piece of subversive literature), they could search Ugarit archives for its hash.

However, Ugarit has the option to "key" the hash with a "salt" stored in the Ugarit configuration file. This means that the hashes used are actually a hash of the block's contents *and* the salt you supply. If you do this with a random salt that you keep secret, then attackers can't check your archive for known content just by comparing the hashes.

Recommendations: Provide a secret string to your hash function in your Ugarit configuration file. Keep the Ugarit configuration file safe, as per the advice in the previous point.

## Archive modifiers

These folks can modify Ugarit's writes into the archive, its reads back from the archive, or can modify the archive itself at their leisure.
Modifying an encrypted block without knowing the encryption key can at worst be a denial of service, corrupting the block in an unknown way. An attacker who knows the encryption key could replace a block with valid-seeming but incorrect content. In the worst case, this could exploit a bug in the decompression engine, causing a crash or even an exploit of the Ugarit process itself (thereby gaining the powers of a process inspector, as documented below). We can but hope that the decompression engine is robust. Exploits of the decryption engine, or other parts of Ugarit, are less likely due to the nature of the operations performed upon them.

However, if a block is modified, then when Ugarit reads it back, the hash will no longer match the hash Ugarit requested, which will be detected and an error reported. The hash is checked after decryption and decompression, so this check does not protect us against exploits of the decompression engine.

This protection is only afforded when the hash Ugarit asks for is not tampered with. Most hashes are obtained from within other blocks, which are therefore safe unless that block has been tampered with; the nature of the hash tree conveys the trust in the hashes up to the root. The root hashes are stored in the archive as "tags", which an archive modifier could alter at will. Therefore, the tags cannot be trusted if somebody might modify the archive. This is why Ugarit prints out the snapshot hash and the root directory hash after performing a snapshot, so you can record them securely outside of the archive.

The most likely threat posed by archive modifiers is that they could simply corrupt or delete all of your archive, without needing to know any encryption keys.

Recommendations: Secure your archives against modifiers, by whatever means possible. If archive modifiers are still a potential threat, write down a log of your root directory hashes from each snapshot, and keep it safe. When extracting your backups, use the `ls -ll` command in the interface to check the "contents" hash of your snapshots, and check they match the root directory hash you expect.

## Process inspectors

These folks can attach debuggers or similar tools to running processes, such as Ugarit itself.

Ugarit backend processes only see encrypted data, so people who can attach to that process gain the powers of archive snoopers and modifiers, and the same conditions apply. People who can attach to the Ugarit process itself, however, will see the original unencrypted content of your filesystem, and will have full access to the encryption keys and hashing keys stored in your Ugarit configuration. When Ugarit is running with sufficient permissions to restore backups, they will be able to intercept and modify the data as it comes out, and probably gain total write access to your entire filesystem in the process.

Recommendations: Ensure that Ugarit does not run under the same user ID as untrusted software. In many cases it will need to run as root in order to gain unfettered access to read the filesystems it is backing up, or to restore the ownership of files. However, when all the files it backs up are world-readable, it could run as an untrusted user for backups, and where file ownership is trivially reconstructible, it can do restores as a limited user, too.

## Attackers in the source filesystem

These folks create files that Ugarit will back up one day.
By having write access to your filesystem, they already have some level of power, and standard Unix security practices such as storage quotas should be used to control them. They may be people with logins on your box, or, more subtly, people who can cause servers to write files; somebody who sends an email to your mailserver will probably cause that message to be written to queue files, as will people who can upload files via any means.

Such attackers might use up your available storage by creating large files. This creates a problem in the actual filesystem, but that problem can be fixed by deleting the files. If those files get archived into Ugarit, then they are a part of that snapshot. If you are using a backend that supports deletion, then (when I implement snapshot deletion in the user interface) you could delete that entire snapshot to recover the wasted space, but that is a rather serious operation.

More insidiously, such attackers might attempt to abuse a hash collision in order to fool the archive. If they have a way of creating a file that, for instance, has the same hash as your shadow password file, then Ugarit will think that it already has that file when it attempts to snapshot it, and store a reference to the existing file. If that snapshot is restored, then they will receive a copy of your shadow password file. Similarly, if they can predict a future hash of your shadow password file, and create a shadow password file of their own (perhaps one giving them a root account with a known password) with that hash, they can then wait for the real shadow password file to have that hash. If the system is later restored from that snapshot, then their chosen content will appear in the shadow password file. However, doing this requires a very fundamental break of the hash function being used.

Recommendations: Think carefully about who has write access to your filesystems, directly or indirectly via a network service that stores received data to disk. Enforce quotas where appropriate, and consider not backing up "queue directories" where untrusted content might appear; migrate incoming content that passes acceptance tests to an area that is backed up. If necessary, the queue might be backed up to a non-snapshotting system, such as rsyncing to another server, so that any excessive files that appear in there are removed from the backup in due course, while still affording protection.

# Future Directions

Here's a list of planned developments, in approximate priority order:

## General

* More checks with `double-check` mode activated. Perhaps read blocks back from the archive to check they match the blocks sent, to detect hash collisions. Maybe have levels of double-check-ness.
* Everywhere I use (sql ...) to create an sqlite prepared statement, don't. Create them all up-front and reuse the resulting statement objects, it'll save memory and time. (Done for backend-fs/splitlog and backend-cache; file-cache still needs it.)
* Migrate the source repo to Fossil (when there's a kitten-technologies.co.uk migration to Fossil), and update the egg locations thingy.

## Backends

* Look at http://bugs.call-cc.org/ticket/492 - can this help?
* Extend the backend protocol with a special "admin" command that allows for arbitrary backend-specific operations, and write an ugarit-backend-admin CLI tool to administer backends with it. The input should be a single s-expression as a list, and the result should be an alist which is displayed to the user in a friendly manner, as "Key: Value\n" lines.
* Extend the backend protocol with a `flush` command, such that operations performed without a subsequent `flush` might not "stick" in failure cases (make `close!` have an implicit `flush`, of course). Then use this to let the splitlog and cache backends buffer sqlite `INSERT`s and then spit them out in a single transaction per `flush`/`close`, or when the buffer hits a determined size limit, to improve throughput.
* Implement "info" admin commands for all backends, that list any available stats, and at least the backend type and parameters.
* Support for recreating the index and tags on a backend-splitlog if they get corrupted, from the headers left in the log, as a "reindex" admin command.
* Support for flushing the cache on a backend-cache, via an admin command.
* Support for unlinking in backend-splitlog, by marking byte ranges as unused in the metadata (and by touching the headers in the log so we maintain the invariant that the metadata is a reconstructible cache) and removing the entries for the unlinked blocks. Perhaps provide an option to attempt to re-use existing holes to put blocks in for online reuse, and provide an offline compaction operation. Keep stats in the index of how many byte ranges are unused, and how many bytes are unused, in each file, and report them in the info admin interface, along with the option to compact any or all files.
* Have read-only and unlinkable config flags in the backend-splitlog metadata file, settable via admin commands.
* Optional support in backends for keeping a log of tag changes, and admin commands to read the log.
* Support for SFTP as a storage backend. Store one file per block, as per `backend-fs`, but remotely. See http://tools.ietf.org/html/draft-ietf-secsh-filexfer-13 for the sftp protocol specs; popen an `ssh -s sftp` connection to the server then talk that simple binary protocol. Tada!
* Support for S3 as a storage backend. There is now an S3 egg!
* Support for replicated archives. This will involve a special storage backend that can wrap any number of other archives, each tagged with a trust percentage and read and write load weightings. Each block will be uploaded to enough archives to make the total trust be at least 100%, by randomly picking the archives weighted by their write load weighting. A read-only archive automatically gets its write load weighting set to zero, and a warning issued if it was configured otherwise. A local cache will be kept of which backends carry which blocks, and reads will be serviced by picking the archive that carries it and has the highest read load weighting. If that archive is unavailable or has lost the block, then they will be tried in read load order; and if none of them have it, an exhaustive search of all available archives will be performed before giving up, and the cache updated with the results if the block is found. In order to correctly handle archives that were unavailable during this, we might need to log an "unknown" for that block key / archive pair, rather than assuming the block is not there, and check it later. Users will be given an admin command to notify the backend of an archive going missing forever, which will cause it to be removed from the cache. Affected blocks should be examined and re-replicated if their replication count is now too low.
Another command should be available to warn of impending deliberate removal, which will again remove the archive from the cluster and re-replicate, the difference being that the disappearing archive is usable for re-replicating FROM, so this is a safe operation for blocks that are only on that one archive. The individual physical archives that we put replication on top of won't be "valid" archives unless they are 100% replicated, as they'll contain references to blocks that are on other archives. It might be a good idea to mark them as such with a special tag to avoid people trying to restore directly from them. A copy of the replication configuration could be stored under a special tag to mark this fact, and to enable easy finding of the proper replicated archive to work from. There should be a configurable option to snapshot the cache to the archives whenever the replicated archive is closed, too. The command line to the backend, "backend-replicated", should point to an sqlite file for the configuration and cache, and users should use admin commands to add/remove/modify archives in the cluster.

## Core

* Log all WARNINGs produced during a snapshot job, and attach them to the snapshot object as a text file.
* Clarify what characters are legal in tag names sent to backends, and what are legal in human-supplied tag names, and check that human-supplied tag names match a regular expression. Leave space for system-only tag names for storing archive metadata; suggest making a # sign illegal in tag names.
* Clarify what characters are legal in block keys. Ugarit will only issue hex characters for normal blocks, but may use other characters for special metadata blocks; establish a contract of what backends must support (a-z, A-Z, 0-9, hyphen?)
* API documentation for the modules we export
* Encrypt tags, with a hash inside to check it's decrypted correctly. Add a special "#ugarit-archive-format" tag that records a format version number, to note that this change has been applied. Provide an upgrade tool. Don't do auto-upgrades, or attackers will be able to drop in plaintext tags.
* Store a test block in the archive that is used to check the same encryption and hash settings are used for an archive, consistently (changing compression setting is supported, but changing encryption or hash will lead to confusion). Encrypt the hash of the passphrase and store it in the test block, which should have a name that cannot clash with any actual hash (eg, use non-hex characters in its name). When the block does not exist, create it; when it does exist, check it against the current encryption and hashing settings to see if it matches. When creating a new block, if the "prompt" passphrase specification mechanism is in use, prompt again to confirm the passphrase. If no encryption is in use, check the hash algorithm doesn't change by storing the hash of a constant string, unencrypted. To make brute-forcing the passphrase or hash-salt harder, consider applying the hash a large number of times, to increase the compute cost of checking it. Thanks to Andy Bennett for this idea.
* More `.ugarit` actions. Right now we just have exclude and include; we might specify less-safe operations such as commands to run before and after snapshotting certain subtrees, or filters (don't send this SVN repository; instead send the output of `svnadmin dump`), etc.
Running arbitrary commands is a security risk if random users write their own `.ugarit` files - so we'd need some trust-based mechanism; they'd need to be explicitly enabled in `ugarit.conf`, then a `.ugarit` option could disable all unsafe operations in a subtree.
* `.ugarit` rules for file sizes. In particular, a rule to exclude files above a certain size. Thanks to Andy Bennett for this idea.
* Support for FFS flags, Mac OS X extended filesystem attributes, NTFS ACLs/streams, FAT attributes, etc... Ben says to look at Box Backup for some code to do that sort of thing.
* Implement lock-tag! etc. in backend-fs, as a precaution against two concurrent snapshots racing over updating the tag, where concurrent access to the archive is even possible.
* Deletion support - letting you remove snapshots. Perhaps you might want to remove all snapshots older than a given number of days on a given tag. Or just remove X out of Y snapshots older than a given number of days on a given tag. We have the core support for this; just find a snapshot and `unlink-directory!` it, leaving a dangling pointer from the snapshot, and write the snapshot handling code to expect this. Again, check Box Backup for that.
* Some kind of accounting for storage usage by snapshot. It'd be nice to track, as we write a snapshot to the archive, how many bytes we reuse and how many we back up. We can then store this in the snapshot metadata, and so report them somewhere. The blocks uploaded by a snapshot may well then be reused by other snapshots later on, so it wouldn't be a true measure of 'unique storage', nor a measure of what you'd reclaim by deleting that snapshot, but it'd be interesting anyway.
* Option, when backing up, to not cross mountpoints
* Option, when backing up, to store inode number and mountpoint path in directory entries, and then when extracting, keeping a dictionary of this unique identifier to pathname, so that if a file to be extracted is already in the dictionary and the hash is the same, a hardlink can be created.
* Archival mode as well as snapshot mode. Whereas a snapshot record takes a filesystem tree and adds it to a chain of snapshots of the same filesystem tree, archival mode takes a filesystem tree and inserts it into a search tree anchored on the specified tag, indexing it on a list of key+value properties supplied at archival time. An archive tag is represented in the virtual filesystem as a directory full of archive objects, each identified by their full hash; each archive object references the filesystem root as well as the key+value properties, and optionally a parent link like a snapshot, as an archive can be made that explicitly replaces an earlier one and should replace it in the index; there is also a virtual directory for each indexed property which contains a directory for each value of the property, full of symlinks to the archive objects, and subdirectories that allow multi-property searches on other properties. The index itself is stored as a B-Tree with a reasonably small block size; when it's updated, the modified index blocks are replaced, thereby gaining new hashes, so their parents need replacing, all the way up the tree until a new root block is created. The existing block unlink mechanism in the backends will reclaim storage for blocks that are superseded, if the backend supports it.
When this is done, ugarit will offer the option of snapshotting to a snapshot tag, or archiving to an archive tag, or archiving to an archive tag while replacing a specified archive object (nominated by path within the tag), which causes it to be removed from the index (except from the directory listing all archives by hash), and the new archive object is inserted, referencing the old one as a parent.
* Dump/restore format. On a dump, walk an arbitrary subtree of an archive, serialising objects. Do not put any hashes in the dump format - dump out entire files, and just identify objects with sequential numbers when forming the directory / snapshot trees. On a restore, read the same format and slide it into an archive (creating any required top-level snapshot objects if the dump doesn't start from a snapshot) and putting it onto a specified tag. The intention is that this format can be used to migrate your stuff between archives, perhaps to change to a better backend.

## Front-end

* Better error messages
* Line editing in the "explore" CLI, ideally with tab completion
* API mode: Works something like the backend API, except at the archive level. Supports all the important archive operations, plus access to sexpr stream writers and key stream writers, archive-node-fold, etc. Requested by andyjpb; perhaps I can write the framework for this and then let him add API functions as he desires.
* Command-line support to extract the contents of a given path in the archive, rather than needing to use explore mode. Also the option to extract given just a block key (useful when reading from keys logged manually at snapshot time, or from a backend that has a tag log).
* FUSE support. Mount it as a read-only filesystem :-D Then consider adding Fossil-style writing to the `current` of a snapshot, with copy-on-write of blocks to a buffer area on the local disk, then the option to make a snapshot of `current`.
* Filesystem watching. Even with the hash-caching trick, a snapshot will still involve walking the entire directory tree and looking up every file in the hash cache. We can do better than that - some platforms provide an interface for receiving real-time notifications of changed or added files. Using this, we could allow ugarit to run in continuous mode, keeping a log of file notifications from the OS while it does an initial full snapshot. It can then wait for a specified period (one hour, perhaps?), accumulating names of files changed since it started, before then creating a new snapshot by uploading just the files it knows to have changed, while subsequent file change notifications go to a new list.

## Testing

* An option to verify a snapshot, walking every block in it checking there are no dangling references, and that everything matches its hash, without needing to put it into a filesystem, and applying any other sanity checks we can think of en route. Optionally compare it to an on-disk filesystem, while we're at it.
* A unit test script around the `ugarit` command-line tool; the corpus should contain a mix of tiny and huge files and directories, awkward cases for sharing of blocks (many identical files in the same dir, etc), complex forms of file metadata, and so on. It should archive and restore the corpus several times over with each hash, compression, and encryption option.

# Acknowledgements

The original idea came from Venti, a content-addressed storage system from Plan 9.
Venti is usable directly by user applications, and is also integrated with the Fossil filesystem to support snapshotting the status of a Fossil filesystem. Fossil allows references to either be to a block number on the Fossil partition or to a Venti key; so when a filesystem has been snapshotted, all it now contains is a "root directory" pointer into the Venti archive, and any files modified thereafter are copied-on-write into Fossil, where they may be modified until the next snapshot.

We're nowhere near that exciting yet, but using FUSE, we might be able to do something similar, which might be fun. However, Venti inspired me when I read about it years ago; it showed me how elegant content-addressed storage is. Finding out that the Git version control system used the same basic tricks really just confirmed this for me.

Also, I'd like to tip my hat to Duplicity. With the changing economics of storage presented by services like Amazon S3 and rsync.net, I looked to Duplicity as it provided both SFTP and S3 backends. However, it worked in terms of full and incremental backups, a model that I think made sense for magnetic tapes, but loses out to content-addressed snapshots when you have random-access media. Duplicity inspired me by its adoption of multiple backends, the very backends I want to use, but I still hungered for a content-addressed snapshot store.

I'd also like to tip my hat to Box Backup. I've only used it a little, because it requires a special server to manage the storage (and I want to get my backups *off* of my servers), but it also inspires me with directions I'd like to take Ugarit. It's much more aware of real-time access to random-access storage than Duplicity, and has a very interesting continuous background incremental backup mode, moving away from the tape-based paradigm of backups as something you do on a special day of the week, like some kind of religious observance. I hope the author Ben, who is a good friend of mine, won't mind me plundering his source code for details on how to request real-time notification of changes from the filesystem, and how to read and write extended attributes!

Moving on from the world of backup, I'd like to thank the Chicken Team for producing Chicken Scheme. Felix and the community at #chicken on Freenode have particularly inspired me with their can-do attitudes to combining programming-language elegance and pragmatic engineering - two things many would think un-unitable enemies. Of course, they didn't do it all themselves - R5RS Scheme and the SRFIs provided a solid foundation to build on, and there's a cast of many more in the Chicken community, working on other bits of Chicken or just egging everyone on. And I can't not thank Henry Baker for writing the seminal paper on the technique Chicken uses to implement full tail-calling Scheme with cheap continuations on top of C; Henry already had my admiration for his work on combining elegance and pragmatism in linear logic. Why doesn't he return my calls? I even sent flowers.

A special thanks should go to Christian Kellermann for porting Ugarit to use Chicken 4 modules, too, which was otherwise a big bottleneck to development, as I was stuck on Chicken 3 for some time! And to Andy Bennett for many insightful conversations about future directions. Thanks to the early adopters who brought me useful feedback, too!

And I'd like to thank my wife for putting up with me spending several evenings and weekends and holiday days working on this thing...
# Version history

* 1.0.1: Consistency check on read blocks by default. Removed warning about deletions from backend-cache; we need a new mechanism to report warnings from backends to the user. Made backend-cache and backend-fs/splitlog commit periodically rather than after every insert, which should speed up snapshotting a lot, and reused the prepared statements rather than re-preparing them all the time. BUGFIX: splitlog backend now creates log files with "rw-------" rather than "rwx------" permissions; and all sqlite databases (splitlog metadata, cache file, and file-cache file) are created with "rw-------" rather than "rw-r--r--".
* 1.0: Migrated from gdbm to sqlite for metadata storage, removing the GPL taint. Unit test suite. backend-cache made into a separate backend binary. Removed backend-log. BUGFIX: file caching uses mtime *and* size now, rather than just mtime. Error handling so we skip objects that we cannot do something with, and proceed to try the rest of the operation.
* 0.8: Decoupled backends from the core and into separate binaries, accessed via standard input and output, so they can be run over SSH tunnels and other such magic.
* 0.7: File cache support, sorting of directories so they're archived in canonical order, autoloading of hash/encryption/compression modules so they're not required dependencies any more.
* 0.6: .ugarit support.
* 0.5: Keyed hashing so attackers can't tell what blocks you have, markers in logs so the index can be reconstructed, sha2 support, and passphrase support.
* 0.4: AES encryption.
* 0.3: Added splitlog backend, and fixed a .meta file typo.
* 0.2: Initial public release.
* 0.1: Internal development release.