Proposal for a lightweight fixity tool

Greg Janée and John Kunze
{gjanee,jak}@ucop.edu

March 2010

Motivation

Checksumming files, periodically verifying the integrity of files, and replicating files to reduce the risk of loss are among the most basic operations any data repository must perform. In fact, these operations are so basic that they are already performed in various and overlapping ways by filesystems (e.g., ZFS), storage networks (RAID systems, SANs), and some higher-level systems (e.g., iRODS).

Nevertheless, there's value in having a tool that can be employed by repositories to monitor file integrity independent of storage technologies, underlying systems, and vendors. Such a tool would allow a repository to verify the integrity of its contents itself and to its own satisfaction; to avoid having to place total faith in underlying systems; to ensure the correctness of migrations across storage systems and institutions; and to employ a uniform integrity mechanism across heterogeneous systems.

Our interest in integrity, or "fixity," checking is motivated by three use cases:

  1. Ongoing fixity checking. Given a set of files that are relatively (but not necessarily completely) stable, both in terms of content and location, periodically check integrity and report any changes. Provide a mechanism to "bless" (okay) changes that do occur.
  2. Transmission correctness. Ensure that files copied across dissimilar systems arrive unchanged.
  3. Replication consistency. Given a set of files that are replicated across possibly heterogeneous storage systems, periodically check that the copies are consistent.

The ability to accommodate and bless change is an important consideration in use case 1. While the essence of fixity is ensuring that nothing changes, or at least that nothing changes unintentionally, it must be recognized that, given enough time, change is all but inevitable. Collections may grow over time, for example, and individual objects may be modified (to migrate across formats, to support new technologies), augmented (by creating new derivatives), or corrected or amended. Furthermore, accommodating change allows fixity checking to be adopted earlier in repository and data processing workflows.

We might ask why a fixity tool is required at all, since fixity checking can for the most part be accomplished by stringing together a few standard Unix utilities:

find /root -type f -print0 | xargs -0 md5sum | sort -k 2 > manifest.new
diff manifest.old manifest.new

But a dedicated tool can provide convenience, more readable logging output, and additional options.

Use case 1 has been addressed by at least two tools to date, ACE and Tripwire. ACE has a heavy footprint in that it requires Tomcat and MySQL and thus inherits the configuration and maintenance burdens those dependencies entail. Also, to use ACE's Integrity Management Service, a connection to a master hash server at UMIACS is required. Thus a decision to use ACE is not one that can be made lightly. Tripwire can be used for fixity checking, though it is really targeted at operating system-wide intrusion detection. Its configuration process is inflexible and onerous, requiring, for example, that configuration files be encrypted.

For use cases 2 and 3, rsync can be used for actual file transfers, and it can be used as a cross-system consistency checker of sorts by using its --checksum and --dry-run options. However, rsync's reporting is not optimal for this purpose, and the ability to customize its reporting is limited and sensitive to the exact versions of rsync involved.

The proposal here is to develop a tool that requires no real installation or up-front configuration; that uses a simple, text-based file format for storing checksums; that can be used directly from the command line for both ad hoc checks and, in conjunction with cron, periodic checks; that is scriptable; and that works cooperatively with rsync.

File formats

Two file formats are required. For storing checksums we propose to use a slight extension of the Checkm manifest format. A Checkm manifest is a line-oriented, pipe-delimited text file that lists one filename per line along with checksum information and, optionally, other file attributes not relevant for our purposes. For example:

# filename | algorithm | digest
/path/to/file1.c | md5 | 49afbd86a1ca9f34b677a3f09655eae9
/path/to/file2.h | md5 | 408ad21d50cef31da4df6d9ed81b01a7
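
As an illustration only, the following Python sketch shows one way such manifest lines could be produced and read back. The function names (file_digest, format_entry, parse_manifest) are hypothetical and not part of Checkm or of the proposed tool, and the sketch assumes that filenames contain no pipe characters and that only the first three columns matter.

```python
import hashlib

def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in fixed-size chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def format_entry(filename, algorithm, digest):
    """Format one pipe-delimited manifest line."""
    return f"{filename} | {algorithm} | {digest}"

def parse_manifest(lines):
    """Map filename -> (algorithm, digest), skipping comments and blank lines.

    Assumption: filenames contain no '|' and extra columns can be ignored.
    """
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        filename, algorithm, digest = [f.strip() for f in line.split("|")[:3]]
        entries[filename] = (algorithm, digest)
    return entries
```

Reading files in chunks keeps memory use constant regardless of file size, which matters for the large objects a repository is likely to hold.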

As is, this format supports detection of file changes and file deletions. However, detecting file additions requires some conception of a universe of possible files, e.g., a root directory or a filename glob pattern. The universe could be specified at invocation time (i.e., as a command line argument), but then the definition of the universe would ultimately be buried in a shell script somewhere. Given the important role the universe plays in use case 1, we propose to put the universe specification in the manifest itself using structured comments of the following form:

#%fileset filesets...

Each such comment defines one or more sets of files (or "filesets") to be included in the universe; there may be multiple such comments at the head of a manifest. Syntactically, each fileset is an extended glob pattern (cf. Ant patterns); directory recursion is implied. A fileset beginning with a minus sign defines a negative or exclusion set (the first fileset listed in a manifest must be a positive or inclusion set). Thus, for example, to specify all files under a directory /root, but omitting log files contained anywhere therein:

#%fileset /root
#%fileset -/root/**/*.log
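
To make the intended matching behavior concrete, here is a rough Python sketch of how fileset comments might be read and applied. It rests on several assumptions that the proposal does not settle: filesets within a comment are separated by whitespace, "**/" matches zero or more directory levels as in Ant patterns, and a file belongs to the universe if it matches at least one inclusion set and no exclusion set. All function names are hypothetical.

```python
import re

def parse_filesets(lines):
    """Collect filesets from '#%fileset' structured comments."""
    filesets = []
    for line in lines:
        line = line.strip()
        if line.startswith("#%fileset"):
            # Tolerate an optional colon after the directive name.
            filesets.extend(line[len("#%fileset"):].lstrip(":").split())
    return filesets

def fileset_to_regex(pattern):
    """Translate an Ant-style glob into a regex (assumed semantics)."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")      # zero or more directory levels
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")            # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")         # anything within one path segment
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")          # one character within a segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    # Directory recursion is implied: also accept anything beneath a match.
    return re.compile("".join(out) + r"(?:/.*)?\Z")

def in_universe(path, filesets):
    """True if path matches some inclusion set and no exclusion set."""
    included = excluded = False
    for fileset in filesets:
        negative = fileset.startswith("-")
        pattern = fileset[1:] if negative else fileset
        if fileset_to_regex(pattern).match(path):
            if negative:
                excluded = True
            else:
                included = True
    return included and not excluded
```

Under these assumptions exclusions always win regardless of order; an order-sensitive rule (last match wins) would be an equally plausible reading and would need only a small change to in_universe.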

In addition to a manifest file format, it will be convenient to have a file format for recording the results of a fixity check. We propose to use the manifest file format as above, but with a column prepended indicating a file's status. If using single characters to denote status (M for modified, R for removed, A for added), the log file resulting from a fixity check against the manifest above might be:

# status | filename | algorithm | digest
M | /path/to/file1.c | md5 | 3b94d82b61cfe178a620cc5451fa3f58
R | /path/to/file2.h | md5 | 408ad21d50cef31da4df6d9ed81b01a7
A | /path/to/file3.o | md5 | fd2478a7b1fd0cc92610d184be1776a8

Files still present and unchanged need not be listed; hence an empty log file represents an entirely successful fixity check.
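
The comparison that yields such a log can be sketched in a few lines of Python. This is a minimal illustration, not a specification: the function name compare_manifests is hypothetical, both inputs are assumed to be dictionaries mapping filename to an (algorithm, digest) pair, and the choice to report the old digest for removed files and the new digest for modified files simply follows the example above.

```python
def compare_manifests(manifest, current):
    """Return log lines describing how `current` differs from `manifest`.

    Both arguments map filename -> (algorithm, digest). Unchanged files
    produce no output, so an empty result means a clean check.
    """
    log = []
    for filename in sorted(set(manifest) | set(current)):
        if filename not in current:
            status = "R"                       # removed since the manifest
            algorithm, digest = manifest[filename]
        elif filename not in manifest:
            status = "A"                       # added since the manifest
            algorithm, digest = current[filename]
        elif current[filename] != manifest[filename]:
            status = "M"                       # present but digest differs
            algorithm, digest = current[filename]
        else:
            continue
        log.append(f"{status} | {filename} | {algorithm} | {digest}")
    return log
```

Applied to the manifest shown earlier and a directory state in which file1.c has changed, file2.h is gone, and file3.o is new, this produces the three log lines in the example.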

Command line functions

We propose a new command line utility, fixity. In the following, manifests and other files may be referenced by pathnames (relative and absolute), HTTP URLs, SSH-style paths ([user@]host:path), SSH URLs (ssh://[user@]host/path), and possibly other protocols and storage API naming methods. The options listed below are suggestive, not exhaustive.

fixity create manifest filesets...

Create and populate a manifest with files in the named filesets. Recursive descent into directories is implied.

--relative
    Record pathnames relative to the including fileset.
--algorithm md5|sha1|etc
    The checksum algorithm to use.
--wait n
    Wait n seconds between computing checksums.

fixity verify manifest

Check the integrity of the files listed in the manifest and output a log file. The log file can be fed back to fixity; see below.

--root path
    Add a root directory to relative pathnames.
--replacepath from to
    Perform regular expression-based path prefix replacement.
--email address
    Email the log file.
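
The two path-rewriting options are what would let a single manifest be verified against a replica stored under a different location (use case 3). A possible interpretation is sketched below in Python; the function name map_path, the order in which the two options are applied, and the anchoring of the expression at the start of the path are all assumptions made for illustration.

```python
import posixpath
import re

def map_path(path, root=None, replacepath=None):
    """Rewrite a manifest pathname before checking the file it names.

    root:        directory prepended to relative pathnames (as --root)
    replacepath: (from, to) pair; `from` is a regular expression matched
                 only at the start of the path (as --replacepath)
    """
    if root is not None and not posixpath.isabs(path):
        path = posixpath.join(root, path)
    if replacepath is not None:
        pattern, replacement = replacepath
        # \A anchors the match so only a leading prefix is replaced.
        path = re.sub(r"\A(?:%s)" % pattern, replacement, path, count=1)
    return path
```

For example, map_path("/data/a/b.c", replacepath=("/data", "/mnt/replica")) yields "/mnt/replica/a/b.c", while a path that merely contains "/data" somewhere in the middle is left alone.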

fixity update manifest log

Update a manifest given a log file.

--ignore A|M|R
    Ignore additions/modifications/removals.
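
The update step, which is how changes get "blessed," might work along the following lines. This Python sketch is illustrative only: apply_log is a hypothetical name, the manifest is assumed to be a dictionary mapping filename to an (algorithm, digest) pair, and the log is assumed to use the single-character statuses and column order shown earlier.

```python
def apply_log(manifest, log_lines, ignore=()):
    """Return a copy of `manifest` updated with the changes in a log.

    `ignore` is a collection of status characters ("A", "M", "R") whose
    log entries should be skipped, mirroring the --ignore option.
    """
    updated = dict(manifest)
    for line in log_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        status, filename, algorithm, digest = [
            field.strip() for field in line.split("|")[:4]
        ]
        if status in ignore:
            continue
        if status == "R":
            updated.pop(filename, None)              # drop the removed file
        elif status in ("A", "M"):
            updated[filename] = (algorithm, digest)  # record the new digest
        else:
            raise ValueError(f"unknown status {status!r}")
    return updated
```

Returning a new dictionary rather than modifying the input leaves the old manifest available, for instance so that it can be kept as a record of what was blessed and when.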

fixity backup manifest destination

Copy a manifest to an external location, e.g., to a remote system or a web email account or web storage service. This verb is obviously an outlier compared to the others, since it is not actually processing a manifest in any way, just copying it. But the integrity of manifests themselves is critical to the overall integrity of the fixity process, and replication of critical information on distributed, heterogeneous storage systems must be convenient for it to be implemented as a standard practice.

Conclusion

The above functions and options provide the basis for a simple integrity checking system.