Distributing Files

This page compares and contrasts the different methods system administrators may use to distribute files. It does not cover remote file systems or P2P file sharing protocols.

First read this paper on rdist and alternatives as of 1992.

Connectivity

File distribution relies on some kind of network substrate. These substrates have different trade-offs.

rcmd

The rcmd(3) function call is used by rsh. Programs written in C (such as rdist) may use this function to run commands on the remote host. To send multiple files using rcmd(3) you need to make repeated calls or use some kind of serializer, because rcmd(3) provides only a single TCP stream to the remote host.

Using rcmd(3) also requires some synchronization of user information between the hosts.

The rcmd(3) function was mostly useful because it was a form of connectivity usually configured very early in a machine's "life". This is less true today because of security concerns and ssh.

TODO: talk about security issues here

rsh

Some programs (such as scripted ones) may invoke rsh instead of using rcmd(3) directly. Using rsh has the same trade-offs as rcmd(3).

ssh

The ssh programs provide essentially the same service as rsh, but are much more secure. For modern (or modernized) sites, it is usually the case that ssh access is present. This makes it a very useful transport mechanism for system administration.
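
As a rough sketch of ssh used as a transport, the Python below builds the two halves of a tar-over-ssh pipeline and connects them. The function names build_push_commands and push_tree, and any host or path passed to them, are hypothetical. It assumes that tar and ssh are installed on both ends and that non-interactive login already works.

```python
import subprocess


def build_push_commands(src_parent, name, host, dest_dir):
    """Return (local, remote) argv lists for a tar-over-ssh push.

    All arguments are illustrative; nothing here is specific to any site.
    """
    # Local side: write an archive of src_parent/name to standard output.
    local = ["tar", "-cf", "-", "-C", src_parent, name]
    # Remote side: ssh runs tar on the far host, reading the archive from stdin.
    remote = ["ssh", host, "tar", "-xpf", "-", "-C", dest_dir]
    return local, remote


def push_tree(src_parent, name, host, dest_dir):
    """Run the pipeline and return the exit status of the ssh side."""
    local, remote = build_push_commands(src_parent, name, host, dest_dir)
    producer = subprocess.Popen(local, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(remote, stdin=producer.stdout)
    producer.stdout.close()  # let tar see a broken pipe if ssh exits early
    status = consumer.wait()
    producer.wait()
    return status
```

Because ssh hands the remote command to a shell on the far side, a destination path containing spaces or shell metacharacters would need quoting before use.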

TCP

There are a great many file distribution programs that require a dedicated TCP port on one endpoint of a communication. Because of the difficulty involved in configuring the TCP port and daemon on all the destination machines, TCP-based file distribution systems tend to be "pull" systems, rather than "push" systems.

FTP

The File Transfer Protocol (FTP) is efficient at transferring files, but it requires one TCP connection per file and is difficult for firewalls to treat properly. FTP is waning in popularity, losing a lot of its market share to HTTP.

HTTP

The HyperText Transfer Protocol (HTTP) requires a web server running at one end of the connection, limiting its utility as a transfer substrate for sysadmin work. However, it is common for people to use web servers to share data with the public, so it is still quite useful for downloading that kind of data.

Transfer Styles

See this excellent paper on push vs. pull transfers. A third option is peer-to-peer.

Serializers

Some tools are designed to make one stream (file) out of many. This is useful in file distribution to avoid the cost of creating and destroying a TCP connection for every file.
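
A minimal sketch of the idea, using Python's standard tarfile module: several named files are packed into one byte stream, which could then be sent over a single connection and unpacked at the other end. The function names pack and unpack and the sample file names are made up for illustration.

```python
import io
import tarfile


def pack(files):
    """Serialize a {name: bytes} mapping into one tar stream (as bytes)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def unpack(stream):
    """Recover the {name: bytes} mapping from a tar stream."""
    files = {}
    with tarfile.open(fileobj=io.BytesIO(stream), mode="r") as tar:
        for member in tar:
            if member.isfile():
                files[member.name] = tar.extractfile(member).read()
    return files


files = {"etc/motd": b"welcome\n", "etc/hosts": b"127.0.0.1 localhost\n"}
stream = pack(files)  # one byte stream, suitable for one connection
assert unpack(stream) == files
```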

tar

This is the most popular serializer. File distribution between machines in different administrative domains is almost always in the form of tar files. You may wish to see the GNU tar home page.

Reading a file in order to archive it updates the file's access time. GNU tar has an option (--atime-preserve) to reset the access time after reading, but resetting it changes the inode change time instead.

This program stores UNIX permissions, the owner and group (by name and, in the ustar and GNU formats, by numeric ID as well), and the modification time. The tar program tries to restore this metadata on extraction, but a few factors get in the way. Permissions are affected by your umask, unless you use an argument to preserve them. Non-root users cannot set the owner and group, so the extracted files end up owned by the effective UID/GID. If root extracts an archive whose owner and group names do not exist on the destination, GNU tar falls back to the numeric IDs stored in the archive. The ustar and GNU formats can also store special files, such as the device files in /dev, although normally only root can recreate them. I believe that tar properly handles hard links, so I occasionally use a tar pipeline (one in create mode and one in extract mode) in preference to cp to copy hierarchies of files.
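
To make the stored metadata visible, here is a small sketch that uses Python's tarfile module rather than the tar command itself. It archives a scratch directory containing one file with two names (a hard link) and lists what was recorded for each member. The helper name describe_archive and the archive prefix "tree" are arbitrary choices for this example, and it assumes the scratch file system supports hard links.

```python
import io
import os
import tarfile
import tempfile


def describe_archive(root):
    """Archive the directory `root` in memory and list what tar recorded."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        tar.add(root, arcname="tree")
    buf.seek(0)
    rows = []
    with tarfile.open(fileobj=buf, mode="r") as tar:
        for m in tar.getmembers():
            kind = "hardlink" if m.islnk() else "dir" if m.isdir() else "file"
            rows.append((m.name, kind, oct(m.mode), m.uname, m.gname, m.mtime))
    return rows


with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "a"), "w") as f:
        f.write("payload\n")
    # Give the same inode a second name.
    os.link(os.path.join(root, "a"), os.path.join(root, "b"))
    for row in describe_archive(root):
        print(row)
```

The second name is stored as a link record pointing at the first rather than as a second copy of the data, which is what lets an extract step rebuild the link.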

cpio

This serializer used to be popular, but it seems to have fallen out of favor for a variety of reasons. Its "copy-pass mode" is probably the most useful feature, and is often used to make a copy of a hierarchy of files. The GNU version of cpio supports a variety of formats, including tar files. You may wish to visit the GNU cpio home page for more information. It would appear that cpio stores checksums as well as other metadata. I need to research this more.

dump

Modern BSD systems have a program called dump for backing up file systems or portions of file systems. It uses a master-slave architecture so that it can parallelize the acquisition of data from the partition. This improves its speed significantly.

The dump program allows you to specify a single file system, or a list of files and directories (using globbing syntax) on a single file system. This means that dump cannot serialize a set of files that span multiple partitions. Dump cares about things like inode numbers, which are useful primarily when backing up to tape and not so useful for general sysadmin tasks.

You may wish to read the NetBSD dump manpage.

rdist

The rdist program is designed for distributing files. Originally (in BSD Unix) it only supported rsh as a transport, but now it supports ssh. You can use globbing and exceptions when specifying lists of files to distribute, and it has multiple targets within one distfile, similar to make(1).

You probably want to have a synchronized /etc/passwd and /etc/group file if you use rdist, because it attempts to preserve owner and group for files when distributing them around. This doesn't sound like a big deal until you realize that having a file that can't be read by the correct processes is relatively useless. For example, suppose you have a group www on your web server, but not on your other machines. If you rdist a group of files from your desktop box, they will not be owned by group www, and so the web server may not be able to read them. Manually fixing ownership and permissions after each dist defeats the purpose of automatic file distribution, and rdist has limited (and apparently buggy) facilities for fixing up files after a dist (via the "special" directive).
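
One way to catch this problem before a dist is to check, on each destination, that the owner and group names you are about to distribute actually exist there. The sketch below is not part of rdist; missing_names and preflight are hypothetical helpers built on Python's standard pwd and grp modules.

```python
import grp
import pwd


def missing_names(names, lookup):
    """Return the names for which `lookup` raises KeyError (unknown here)."""
    missing = []
    for name in names:
        try:
            lookup(name)
        except KeyError:
            missing.append(name)
    return missing


def preflight(owners, groups):
    """Check owner and group names against the local user and group databases."""
    return {
        "users": missing_names(owners, pwd.getpwnam),
        "groups": missing_names(groups, grp.getgrnam),
    }
```

Run on a machine that lacks the www group from the example above, preflight(["root"], ["www"]) would list www under "groups", which is a cue to fix the group file before distributing.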

I have noted that rdist seems to unlink hard links, so it appears to remove a file instead of overwriting it. I do not know if it writes to a temp file, then mv's it to the right spot, or whether it rm's it then writes a new copy.
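
Either of those strategies would explain the broken hard links, because both give the path a new inode. The sketch below does not examine rdist itself; it only contrasts overwriting a file in place with replacing it by rename, using helper names invented for this example.

```python
import os
import tempfile


def overwrite_in_place(path, data):
    """Rewrite the existing inode; every hard link sees the new contents."""
    with open(path, "w") as f:
        f.write(data)


def replace_via_rename(path, data):
    """Write a temporary file, then rename it over `path` (a new inode)."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(data)
    os.replace(tmp, path)


def link_survives(update):
    """Return True if a second hard link still shares the file after `update`."""
    with tempfile.TemporaryDirectory() as d:
        target, alias = os.path.join(d, "file"), os.path.join(d, "alias")
        with open(target, "w") as f:
            f.write("old")
        os.link(target, alias)
        update(target, "new")
        return os.path.samefile(target, alias)


print("in place:", link_survives(overwrite_in_place))    # link kept
print("via rename:", link_survives(replace_via_rename))  # link broken
```

The rename approach has the advantage that readers never see a half-written file, which may be why a distribution tool would accept the loss of hard links.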

pax

The pax program is the POSIX archiver (the name stands for portable archive interchange). It supports multiple archive formats and should probably be used instead of tar or cpio.

shar

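A shar (shell archive) file is a shell script that recreates the archived files when it is run, so running shell archives created by someone you don't trust is not a very bright idea. Shar files are not very common these days, having been crowded out by tar.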

unison

The unison program is designed to maintain consistency between two copies of data.

Systems

rcp

Perhaps the most widely used ad-hoc mechanism for file distribution is rcp. The rcp system uses filename globbing to specify the files to distribute, so it is somewhat difficult to say things like "all these files except for foo and bar".
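
A wrapper script can work around this by computing the file list itself and handing the result to rcp. The following is a minimal sketch of such a filter; the function name select and the sample names are illustrative only.

```python
import fnmatch


def select(names, include, exclude=()):
    """Keep names matching any `include` glob and no `exclude` glob."""
    chosen = []
    for name in names:
        wanted = any(fnmatch.fnmatchcase(name, pat) for pat in include)
        skipped = any(fnmatch.fnmatchcase(name, pat) for pat in exclude)
        if wanted and not skipped:
            chosen.append(name)
    return chosen


names = ["foo", "bar", "baz", "qux.conf"]
# "All these files except for foo and bar":
print(select(names, include=["*"], exclude=["foo", "bar"]))
```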

Rcp uses rcmd(3) and has some problems of its own; for example, if you copy a file that is too large for the destination partition, it will create what is known as a hole in the file. The net result of this is that the file appears to have been copied successfully, yet the last part of the file will be all NUL characters.
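
For readers unfamiliar with the term, a hole is a range of a file that was never written; reading that range returns NUL bytes. The sketch below creates one deliberately by seeking past unwritten bytes. It does not reproduce the rcp failure itself, and make_file_with_hole is a name chosen for this example.

```python
import os
import tempfile


def make_file_with_hole(path, hole_size):
    """Write 'head', skip `hole_size` bytes without writing them, write 'tail'."""
    with open(path, "wb") as f:
        f.write(b"head")
        f.seek(hole_size, os.SEEK_CUR)  # move past unwritten bytes: the hole
        f.write(b"tail")


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "holey")
    make_file_with_hole(path, 4096)
    with open(path, "rb") as f:
        data = f.read()
    # The skipped range reads back as NUL characters.
    print(len(data), data[4:4100] == b"\0" * 4096)
```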

rsync

The rsync program is a replacement for rcp with many more features. Its most important differences are that it can copy multiple files, and it does incremental updates. It uses rsh or ssh as a transport mechanism. It can also preserve all of the inode attributes (owner, symlink, permissions, etc.).
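
The first step of an incremental update is deciding which files need attention at all. By default rsync skips files whose size and modification time already match on both sides. The sketch below imitates only that quick check for a flat directory; it is not rsync's code, it leaves out the delta-transfer step that sends only the changed parts of a file, and needs_transfer and plan are invented names.

```python
import os


def needs_transfer(src, dst):
    """Quick check: transfer if dst is missing or differs in size or mtime."""
    try:
        d = os.stat(dst)
    except FileNotFoundError:
        return True
    s = os.stat(src)
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)


def plan(src_dir, dst_dir):
    """List the regular files directly in src_dir that an update would copy."""
    todo = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        if os.path.isfile(src) and needs_transfer(src, os.path.join(dst_dir, name)):
            todo.append(name)
    return todo
```

A check like this is only reliable if the copy step preserves modification times, which is one reason preserving inode attributes matters for incremental updates.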

bbcp

The bbcp program is a peer-to-peer application that uses ssh to start instances of the program at both ends (source and target). This enables you to do third-party transfers. It uses ssh as a transport mechanism.

Track

Daniel Nachbar, "When Network File Systems Aren't Enough: Automatic Software Distribution Revisited", in USENIX Conference Proceedings, pp. 159-171, USENIX, Atlanta, GA, Summer 1986. Reviewed here.

HTTPsync

See HTTPsync.

Ninstall

Mike Rodriguez, "Software Distribution in a Network Environment", in Large Installation System Administrators Workshop Proceedings, p.20, USENIX Philadelphia, PA, April 9-10, 1987. Reviewed here.

Ru

Tim Sigmon, "Automatic Software Distribution," in Large Installation System Administrators Workshop Proceedings, p. 21, USENIX, Philadelphia, PA, April 9-10, 1987. Reviewed here.

sup

Need a reference here... where is its homepage?

cvsup

See the cvsup home page.

SystemImager

The SystemImager program is based on rsync.

mirror

The mirror program works via FTP. It can either put or get files, and it supports exclusion lists.

Old URLs, need new ones!:

ftp://wuarchive.wustl.edu/packages/mirror/
http://www.shsu.edu/tex-archive/archive-tools/mirror/mirror-2.3.tar.gz

wget

The wget program works via HTTP.


Travcom auto92089@hushmail.com

Original date: Mon May 26 10:09:56 EDT 2003