Version 1.3 (April 13, 2008)
  - changed xbup_helper script to run mkdir to create subdirectories
    as necessary...this allows one to run "xbup --local" in a 
    newly created directory

Version 1.2 (Jan. 17, 2008)
  - changed behavior of splitf_xattr, so that now
    every file gets an entry in the cattr catalog, even if there
    is no metadata -- this ensures that restoring using
    joinf_xattr works more like join_xattr.

Version 1.1 (November 1, 2007)

Copyright 2007 Victor Shoup.
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

*******************************************************************

Mac OSX extended attribute (xattr) tools
****************************************
****************************************

These are some simple tools for backing up a directory on an HFS+ filesystem to
a non-HFS+ filesystem on a remote Unix machine using rsync.


The problem: 
************

What I really want is a way to back up the critical files on my laptop to a file
server at my university (and from there, it will get backed up even further, so
I can get at old versions of files, if necessary).

The OSX file system HFS+ associates various non-standard metadata with files,
including resource forks and special information for the Finder.  If you use
rsync to back up your data, you will simply lose this metadata.  Some
applications actually store important information in resource forks, so this is
bad.  While Apple has patched rsync to deal with metadata (with the -E option),
their implementation is notoriously buggy, and moreover, does not allow backup
to non-OSX machines.  


Possible Solutions
******************

Use UFS
-------

One solution is to keep the files you want to back up on a UFS-formatted
partition.  This way, all the metadata is stored in explicit Apple Double
files:  the metadata (if any) of a file called "foo" is stored in a file called
"._foo" (in the same directory as "foo").

Now one can use rsync to back up directly to a standard, non-OSX Unix machine.  

Also, on Tiger (OSX 10.4), commands such as cp and mv have been patched to keep
these Apple Double files consistent, so using UFS is not so bad. 

Also note that when using rsync to copy from the non-OSX machine to OSX, you
may have to run rsync twice to get all the Apple Double files:  it seems that if
rsync first creates "._foo" locally, and then creates "foo", the file "._foo"
will be deleted (this does make sense at some level).  A second rsync will just
create "._foo", and all is well.

The bad news: UFS is deprecated -- it is not being actively maintained on OSX,
and may soon disappear altogether.   Moreover, UFS does not provide
"journaling", which means that in the event of a system crash, it is more
susceptible to corruption than a journaled HFS+ filesystem.



Use various patched versions of rsync
-------------------------------------

Besides Apple's patched version of rsync, there are a few different patched
versions of rsync floating around the web that should do the job, more or less.
However, it is not at all clear if these are well maintained, and they all have
various limitations.  Also, as rsync evolves, these patches will invariably
fall behind.

Here are some relevant patched rsyncs I've found on the web:
   http://www.quesera.com/reynhout/misc/rsync+hfsmode/
   http://www.onthenet.com.au/~q/rsync/
   http://lartmaker.nl/rsync/




Use rsync 3.0
-------------

A new version of rsync is being developed that directly supports xattrs.
Hopefully, this will work better than Apple's patched version, but still, it
may not provide a good solution to the problem at hand:

  * It will only work at all if the remote Unix file system itself
    supports xattrs

  * Even if that is the case, there may be limitations that prevent it
    from working, e.g., size restrictions (resource forks can be quite
    large -- several megabytes -- and some Unix filesystems only support
    small xattr values)

  * Even if the first two points above are satisfied, it may be slow:
    as I understand it, rsync 3.0 will send *all* of the xattr data 
    *all* of the time, as part of its "file list"

  * There is some other metadata, besides xattrs, that will
    not be preserved  (although arguably, this metadata is less important)


Use rdiff-backup
----------------

rdiff-backup is a free program that apparently does many things, and maybe even
does what I want here.  However, it is not clear if this program is really
being actively maintained.  


Use Commercial backup tools
---------------------------

One reasonable commercial backup tool is Chronosync
(http://www.econtechnologies.com).  While it can be used to back up to a remote
filesystem, the latter must be "mounted" in some way, e.g., via NFS.  Even if
this works, it would not likely be nearly as fast as rsync.


Use Time Machine
----------------

Time Machine is Apple's new backup software to be bundled with Leopard (OSX
10.5).  Of course, one problem is that it is not available yet.  Another likely
problem is that it probably won't do what I really want --- it seems likely it
would require the remote filesystem to be mounted, as discussed above.


Use the tools provided here
---------------------------

The tools provided here work in conjunction with any standard rsync.   There
are several commands provided, but the most important ones are split_xattr and
join_xattr.

Suppose I want to back up the directory /Users/shoup/mystuff to the remote
directory access.cims.nyu.edu:/home/shoup/mystuff.  Working in my home
directory (/Users/shoup), I execute:

  split_xattr mystuff mystuff-xattr

This creates a directory mystuff-xattr, which has the same directory structure
as mystuff, but contains special "xattr container" files that store all the
non-standard metadata.  So for example, if I have a file "mystuff/path/to/foo"
with funny metadata, then there will be an xattr container
"mystuff-xattr/path/to/@_foo".  Note that "mystuff/path/to/foo" may itself be a
directory -- directories may have funny metadata too -- in which case, the
corresponding xattr container is "mystuff-xattr/path/to/foo/@_.".
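
The naming rule just described can be sketched as a small shell function.
This is only an illustration of the rule as stated here; the function name
container_path is hypothetical and is not one of the provided commands.

```shell
# Hypothetical helper: print the xattr container path (relative to the
# xattr tree) for a path given relative to the data tree.
# Usage: container_path <relative-path> <f|d>   (f = file, d = directory)
container_path() {
    path=$1
    kind=$2
    if [ "$kind" = d ]; then
        # a directory keeps its container inside the mirrored directory
        printf '%s/@_.\n' "$path"
    else
        dir=$(dirname "$path")
        base=$(basename "$path")
        if [ "$dir" = . ]; then
            printf '@_%s\n' "$base"
        else
            printf '%s/@_%s\n' "$dir" "$base"
        fi
    fi
}

container_path path/to/foo f    # prints path/to/@_foo
container_path path/to/foo d    # prints path/to/foo/@_.
```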

Now I run rsync twice:

  rsync --rsh=ssh -avz --delete mystuff/ access.cims.nyu.edu:/home/shoup/mystuff
  rsync --rsh=ssh -avz -c --delete mystuff-xattr/ access.cims.nyu.edu:/home/shoup/mystuff-xattr

Note the "-c" option in the second rsync.  This forces a "checksum" to
determine which files are to be transferred.  This is the safest way to do it;
however, it is still pretty safe to leave this off (see detailed discussion
below).  While the checksum takes time, it will typically be significantly
faster than actually transmitting all the data over the network.

The next time I want to back up my files, I first remove the directory
mystuff-xattr, and run the three commands above.  So a complete backup script
is:

  rm -rf mystuff-xattr
  split_xattr mystuff mystuff-xattr
  rsync --rsh=ssh -avz --delete mystuff/ access.cims.nyu.edu:/home/shoup/mystuff
  rsync --rsh=ssh -avz -c --delete mystuff-xattr/ access.cims.nyu.edu:/home/shoup/mystuff-xattr
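
For more than one directory, the same four steps could be wrapped in a
function.  The following is a minimal sketch, not part of the tools: the names
run, backup_tree, and DRY_RUN are made up for illustration, and it assumes the
xattr tree always lives next to the data tree with "-xattr" appended, as in
the example above.

```shell
# Print the command when DRY_RUN is set; otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: backup_tree <local-dir> <remote-host:remote-dir>
# Each step runs only if the previous one succeeded.
backup_tree() {
    src=$1
    dest=$2
    run rm -rf "$src-xattr" &&
    run split_xattr "$src" "$src-xattr" &&
    run rsync --rsh=ssh -avz --delete "$src/" "$dest" &&
    run rsync --rsh=ssh -avz -c --delete "$src-xattr/" "$dest-xattr"
}

# Show the commands without running anything:
DRY_RUN=1 backup_tree mystuff access.cims.nyu.edu:/home/shoup/mystuff
```

Chaining the steps with && means that a failed split_xattr does not lead to
an rsync with --delete against a half-built xattr tree.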

To restore mystuff, I use rsync to restore the directories mystuff and
mystuff-xattr, and then run the command:

  join_xattr mystuff mystuff-xattr

This will set the funny metadata for all files in mystuff, using the
information contained in mystuff-xattr.

Note that by naming the xattr container for a subdirectory "foo/@_." rather than
"@_foo", one can use rsync to sync just this subdirectory (in both the data and
xattr tree), and preserve the metadata of the directory itself.

There is also a Perl script xbup provided (see below) which provides even more
functionality -- but feel free to "roll your own".



Macintosh Metadata Madness
**************************

A file consists of its data, together with various types of metadata.

Standard Unix metadata
----------------------

   owner/group
   permissions
   
   atime -- access time (time of last read)
   mtime -- modification time (time of last write)
   ctime -- change time (time of last write or change of metadata)

BSD flags
---------

   immutable/append-only flags -- the immutable flag corresponds to the Locked 
                                  status in an Info panel

                                  these are the "user" versions of these
                                  flags...there are also "system" versions
                                  of these flags, as well as a few other
                                  flags, but these are not treated here.


Extended Attributes (xattrs)
-----------------------------

   These are arbitrary name/value pairs.
   Two special and important xattrs are
      com.apple.FinderInfo -- encodes various Finder flags
      com.apple.ResourceFork -- the so-called "resource fork" of a file,
                                which may contain lots of application-specific
                                data, like custom icons, and other stuff

   In fact, "under the hood", these are not really xattrs in the
   underlying filesystem, but are made to appear as such via the
   getxattr/setxattr function interface.

Other timestamps
----------------

   crtime -- creation time (time the file was created)

Access control lists (ACLs)
---------------------------

   These provide more refined access control.
   On OSX, this feature is turned off by default.
   In a single user environment, there is little need for these.


Discussion
----------

By default, rsync will preserve permissions and mtime, and if run as root, it
will also preserve owner/group.  It will not preserve BSD flags, xattrs,
crtime, or ACLs.  Except for ACLs, this is the information stored in the "xattr
container" files created by split_xattr.  Actually, by default, this command
*does not* store the crtime: to get this, you have to invoke it with the option
--crtime, e.g.,

   split_xattr --crtime myfiles myfiles-xattr

The reason that crtime is not stored by default is that few files have BSD
flags or xattrs, while *all* files have a crtime.  Thus, storing crtime
generates a lot more xattr container files, and so is more costly, both in
terms of space and time.  Most people (like me) don't care much about crtime,
and losing it is normally not a problem.  However, some folks care passionately
about it, so it is there as an option.

Note that in discussion boards on the Net, there is much confusion about ctime
and crtime -- they have *nothing* to do with each other, other than the fact
that their names both start with the letter "c".



Installation:
*************

The tarball is called xattr.tgz.

First, if necessary, unpack the tarball:

   tar -xzvf xattr.tgz

This puts all files in a subdirectory called "xattr".
Then do:

   cd xattr
   make
   make install  

Note that make install will copy executables into "~/bin" by default.  Edit the
makefile to change this behavior, or copy the executables by hand.  Make sure
your $PATH environment variable includes this location.



The Commands
************


split_xattr
-----------

split_xattr options filedir xattrdir

options:  --crtime
          --files-from listfile
          --recycle old-xattrdir

filedir should be a directory, xattrdir should not exist, but its parent should
(so that "mkdir xattrdir" works).  split_xattr creates a directory structure at
xattrdir that mirrors that of filedir.  For every file "foo" in filedir with
xattrs or BSD flags, a corresponding xattr container file "@_foo" is placed in
xattrdir -- if "foo" is a directory, the xattr container file is "foo/@_.".

Options:

  --crtime
        store creation time.  This creates an xattr container file
        for every file/directory in filedir.

  --files-from listfile
        listfile contains a list of file/directory names,
        which are relative to filedir.  Only those files that
        appear in this list (or whose ancestor directories appear)
        will be processed.  Names in listfile should be one per line,
        with no extra blanks (although empty lines are ignored)
        and no leading or trailing slashes.

        For example, to back up just my iPhoto and iTunes stuff
        in my home directory, I could place the following
        in listfile:
           Pictures
           Music
        or possibly:
           Pictures/iPhoto Library
           Music/iTunes
        The same listfile can be used in conjunction with the
        files-from option in rsync to back up the data files.

  --recycle old-xattrdir
        An experimental optimization to speed things up if you
        have *lots* of metadata.  split_xattr creates an xattr container
        for a file with an mtime equal to the ctime of the data file.
        Theoretically, if the metadata of a file changes, then its ctime
        should change.  Unfortunately, this is not quite true:
        removing com.apple.FinderInfo via removexattr does not update ctime,
        which is probably a bug.  Ignoring this limitation,
        the --recycle option can be used as follows.  Suppose you 
        run split_xattr once, and then move xattrdir to old-xattrdir.
        If you later run
           split_xattr --recycle old-xattrdir filedir xattrdir
        then for each file, if an xattr container needs to be generated,
        and if a corresponding xattr container file exists in old-xattrdir
        whose mtime matches the ctime of the data file, then instead
        of generating the xattr container file in xattrdir,
        the file in old-xattrdir is *moved* to xattrdir.

        In backing up the xattr containers via rsync, one could also
        leave off the -c (checksum) option -- the above mtime/ctime
        correspondence should work well enough, modulo the above-mentioned
        limitation/bug.  One could also spoof split_xattr by renaming 
        files with the same ctime (the mv command does not change the ctime).
        Of course, the same kind of spoofing can happen with rsync
        (if mtimes match).

        Note: the speed-up is not as dramatic as I would like, and given
        the possibility that some changes are not properly tracked,
        it is not clear if the use of this flag should be recommended.
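
The formatting rules for the --files-from list file (one name per line, no
extra blanks, no leading or trailing slashes, empty lines allowed) can be
checked before running a backup.  The following is a sketch under those
stated rules only; the function name check_listfile is hypothetical, and it
treats a whitespace-only line as an error, which is an assumption.

```shell
# Hypothetical checker: print (with line numbers) every line of a list
# file that starts or ends with a slash or with whitespace.
# Exit status 0 means no offending line was found.
check_listfile() {
    if grep -n -E '^/|/$|^[[:space:]]|[[:space:]]$' "$1"; then
        return 1
    fi
    return 0
}

# Example: names with embedded blanks are fine, as are empty lines.
list=$(mktemp)
printf '%s\n' 'Pictures/iPhoto Library' '' 'Music/iTunes' > "$list"
check_listfile "$list" && echo "list file looks fine"
rm -f "$list"
```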


join_xattr
----------

join_xattr [ --files-from listfile ] filedir xattrdir


The "opposite" of split_xattr:  xattr container files in xattrdir are used to
set the xattrs of files in filedir.  Note that if a file in filedir does not
have a corresponding xattr container in xattrdir, that file will be stripped of
xattrs.  The --files-from option works in exactly the same way as in
split_xattr.  If you ran split_xattr with a --files-from option, you probably
also want to run join_xattr with the same --files-from option, as otherwise,
you will lose xattrs.


split1_xattr
------------

split1_xattr [ --crtime ] file 

Writes xattr container for file to stdout



join1_xattr
-----------

join1_xattr file 

Reads xattr container for file from stdin


Example: transfer xattrs (including crtime) from file foo to file bar:

   split1_xattr --crtime foo | join1_xattr bar



splitf_xattr
------------

splitf_xattr options filedir 

options:  --crtime
          --files-from listfile


Works much like split_xattr, but writes an "xattr catalog", consisting of a
list of file name/xattr container pairs, to standard output (the file names are
relative to filedir).  Such a pair is generated for each file, even if there is
no metadata.  If there is a lot of metadata, this can be significantly faster
than split_xattr.  This is especially useful for backups to hard drives, where
there is little advantage of having the xattr containers represented as
individual files. I find it significantly faster to write one big file to a
hard drive than to create a whole directory structure.

If errors are detected while processing, a file may be skipped (and processing
continues), or the processing may be aborted.  Of course, an error message is
issued, and in either case, the output is guaranteed to be either a valid
catalog, or a prefix of a valid catalog.



joinf_xattr
-----------

joinf_xattr filedir

Works much like join_xattr, but reads an "xattr catalog", as created by
splitf_xattr, from standard input.  Unlike join_xattr, if a file in filedir
does not have a corresponding xattr container, it is left alone, rather than
being stripped of its metadata.  It is really the xattr catalog read from
standard input that drives the action, rather than the directory structure of
filedir.  Also, if a given xattr container has no corresponding file in
filedir, then it is simply skipped over.  This is useful for restoring just a
subdirectory.

In the current implementation, processing stops when any "unrecoverable" error
occurs (e.g., the input file is corrupt, or memory is exhausted), but continues
if an error is deemed "recoverable" (e.g., a file in filedir has funny
ownership).  This ensures that as much metadata is restored as is reasonably
possible.

An example:

The following will copy a directory dir to dir-copy, transferring all
metadata, including creation time:

   rsync -a dir/ dir-copy
   splitf_xattr --crtime dir | joinf_xattr dir-copy
   


strip_locks
-----------

strip_locks [ --files-from listfile ] filedir

This will strip any BSD lock flags from all files and directories in filedir.
The --files-from listfile will restrict the action to all files in listfile
(including all descendants *and* ancestors).

This command may be useful if you want to restore files *in place* using rsync:
rsync does not know about locks, and the restoration will likely fail if
there are any locks.  After running strip_locks and rsync, join_xattr will
restore the locks.
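
The in-place restore sequence described above (strip_locks, then rsync, then
join_xattr) might be scripted roughly as follows.  This is a sketch under
assumptions: the rsync options are a guess at a reasonable restore, since
they are not spelled out here, and the names run, restore_in_place, and
DRY_RUN are made up for illustration.

```shell
# Print the command when DRY_RUN is set; otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: restore_in_place <remote-host:remote-dir> <local-dir>
# Assumes the xattr tree sits next to the data tree with "-xattr" appended.
restore_in_place() {
    src=$1
    dir=$2
    run strip_locks "$dir" &&                               # unlock first
    run rsync --rsh=ssh -avz "$src/" "$dir" &&              # data files
    run rsync --rsh=ssh -avz -c "$src-xattr/" "$dir-xattr" &&  # containers
    run join_xattr "$dir" "$dir-xattr"                      # metadata, locks
}

# Show the commands without running anything:
DRY_RUN=1 restore_in_place access.cims.nyu.edu:/home/shoup/mystuff mystuff
```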

In the current implementation, running the command

   strip_locks --files-from listfile filedir

will strip locks from all files in listfile, including both ancestors and
descendants.  However, running join_xattr with the same --files-from option
will not restore locks to ancestors.  For example, if listfile contains
foo1/foo2/foo3, then foo1, foo1/foo2, and foo1/foo2/foo3 all get their locks
stripped, while only foo1/foo2/foo3 will get its locks restored by join_xattr.
One reason for this is that in restoring a file, rsync needs to have the parent
directory of a file unlocked, in addition to the file itself.  In any case,
when restoring files, it is probably best to simply restore them into an empty
directory.
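
To see which paths are affected for a given list entry, the ancestor
expansion in the foo1/foo2/foo3 example can be reproduced with a few lines of
shell.  This is only an illustration of the rule as described; the helper
name path_with_ancestors is hypothetical, and the sketch does not cover
descendants.

```shell
# Hypothetical helper: print a relative path together with all of its
# ancestor directories, one per line, outermost first.
path_with_ancestors() {
    rest=$1
    prefix=
    while [ -n "$rest" ]; do
        head=${rest%%/*}                  # first remaining component
        prefix=${prefix:+$prefix/}$head   # extend the path seen so far
        printf '%s\n' "$prefix"
        case $rest in
            */*) rest=${rest#*/} ;;       # drop the first component
            *)   rest= ;;                 # that was the last one
        esac
    done
}

path_with_ancestors foo1/foo2/foo3
# prints:
#   foo1
#   foo1/foo2
#   foo1/foo2/foo3
```

According to the description above, strip_locks touches all three of these
paths, while join_xattr with the same list restores locks only on the last.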



xat
---

This is a general utility program for inspecting/modifying xattrs.

usage: xat option file
   where option is one of the following:

--list               list xattr names and their lengths
--get <name>         write value of xattr <name> to stdout
--print <name>       same as above, but human readable
--del <name>         delete xattr <name>
--set <name>         set value of xattr <name> to value read from stdin
--set <name>=<value> set value of xattr <name> to <value>
--has <name>         test if xattr <name> exists (useful in find scripts)
--has-any            test if any xattrs exist (useful in find scripts)
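
As an example of the "find scripts" use, the following sketch lists every
regular file under a directory that has at least one xattr.  It assumes that
xat exits with status 0 when the test succeeds and non-zero otherwise, which
the option descriptions suggest but do not state.  The function name and the
XAT variable are made up for illustration.

```shell
# Path of the xat program; override XAT if it is not on your PATH.
XAT=${XAT:-xat}

# Usage: list_files_with_xattrs <dir>
# find prints a file only if the xat test on it succeeds.
list_files_with_xattrs() {
    find "$1" -type f -exec "$XAT" --has-any {} \; -print
}

# Typical call (not run here):
#   list_files_with_xattrs ~/mystuff
```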


xbup
----

A Perl script for backup/restore that uses the commands split/join_xattr,
strip_locks, along with rsync.  xbup provides a number of conveniences:
  * quick backup of a subdirectory, or just selected files
  * automatic archival of changed/deleted files (which themselves
    get automatically deleted after a specified amount of time).

usage: xbup options
 
options: --local              make effective backup directory the
                              current working directory, rather than the
                              root of the backup tree

         --files              backup only those files and directories
                              listed in .bupfiles (located in the
                              effective backup directory)
  
         --files-from file    like --files, but use specified file
                              instead of .bupfiles
                
         --config file        read config from file, instead of ~/.xbupconfig

         --checksum           always checksum data files
                              by default, no transfer occurs if
                              modtime agrees.  Note that xattr containers
                              are always checksummed.

         --dry-run            just a dry run
                              tip: use --checksum --dry-run
                              to compare source and destination

         --restore            restores files, instead of backing them up


Before using this command, you will also have to: 
  1) copy the script xbup_helper to the remote machine
  2) set various options in a file ~/.xbupconfig
     (see sample-.xbupconfig)





Quirks and Implementation Notes
*******************************

Accessing metadata
------------------

The functions listxattr, getxattr, setxattr, and removexattr are used to access
xattrs.  The functions getattrlist and setattrlist are used to access the
crtime.  The function setattrlist is also used to set BSD flags,
access mode bits, and mtimes (see discussion below on symlinks).


Resource Forks
--------------
        
If one reads the resource fork of a file, the atime of the file is updated.
Worse, if one writes the resource fork, the mtime is updated.  The
implementation works around this by restoring the mtime of a file after
updating any xattrs.
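
The same save-and-restore idea can be tried out from the shell with touch -r.
This is only an analogy for what the tools do internally in C: the helper name
with_saved_mtime is hypothetical, and touch -r copies the atime as well as
the mtime.

```shell
# Usage: with_saved_mtime <file> <command> [args...]
# Runs the command, then puts the file's original timestamps back.
with_saved_mtime() {
    file=$1
    shift
    ref=$(mktemp) || return 1
    touch -r "$file" "$ref"      # remember the timestamps of the file
    "$@"
    status=$?
    touch -r "$ref" "$file"      # restore them after the modification
    rm -f "$ref"
    return $status
}

# Example: append to a file without disturbing its mtime.
f=$(mktemp)
touch -t 200801170000 "$f"
with_saved_mtime "$f" sh -c 'echo data >> "$1"' sh "$f"
ls -l "$f"
rm -f "$f"
```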

Another quirk of resource forks is that setting com.apple.ResourceFork via
setxattr does not replace the resource fork, but rather overwrites it.  This
means if you set the resource fork to "abcd", and then set it again to "123",
its value actually is "123d".   The solution is to first remove the resource
fork using removexattr -- this is how the current implementation works.
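
The overwrite behavior is the same as writing to an ordinary file without
truncating it first, which is easy to reproduce.  The following is an analogy
only (it uses a plain file and dd, not a resource fork), and the function
name overwrite_in_place is made up for illustration.

```shell
# Usage: overwrite_in_place <file> <data>
# Writes the data at offset 0 and leaves any remaining old bytes alone.
overwrite_in_place() {
    printf '%s' "$2" | dd of="$1" conv=notrunc 2>/dev/null
}

f=$(mktemp)
printf 'abcd' > "$f"
overwrite_in_place "$f" 123
cat "$f"; echo                # prints 123d

# Removing the old contents first gives the intended value, much as
# calling removexattr before setxattr does for the fork.
rm -f "$f"
printf '123' > "$f"
cat "$f"; echo                # prints 123
rm -f "$f"
```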

Resource forks can be fairly big: apparently, up to 16MB.  While get/setxattr
provide a special interface for reading/writing resource forks in small chunks,
the current implementation does not use this:  one big buffer is allocated via
malloc.  This should be OK, as otherwise the memory usage of these programs is
fairly small.  Typically, resource forks are around 30-50KB (for custom icons).

Other xattrs are small -- com.apple.FinderInfo is 32 bytes, and there seems to
be a 4KB limit on all other xattrs (but that could change).


Write permissions
-----------------

To set *any* xattr of a file, one must have write permission on the file.  In
the implementation, join_xattr will make the file writeable before setting
xattrs, and then it will restore permissions afterward.  


Ownership
---------

If some files are not owned by the user, join_xattr may have to be executed
with root privileges.


Directories
-----------

Directories can have xattrs, but apparently not a resource fork.  Directories
can also have BSD flags and a crtime.  These are all preserved.


Symbolic Links
--------------

Symbolic links can have xattrs, but apparently not a resource fork.  In fact,
all symlinks get created with a com.apple.FinderInfo xattr (with special
"type" and "creator" values), which is actually kind of annoying.

The current implementation will preserve xattrs for symbolic links.  

Symbolic links also apparently have a crtime.  With the --crtime flag,
split_xattr will store the crtime of the symlink, and while join_xattr attempts
to restore it (using setattrlist), it quietly fails (this seems to be a
bug/limitation in setattrlist).

Another quirk of symlinks is that they can have BSD flags on an HFS+ volume.
This is contrary to the BSD documentation, which says they cannot.  In the
implementation, lstat is used to obtain these flags, but they are set with
setattrlist, rather than chflags (there is no "lchflags").  

Similarly, to make a symlink writeable (as discussed above), setattrlist,
rather than chmod, is used, because the latter always follows through symlinks
(there is no "lchmod").  

Similarly, to set the mtime of a symlink, the function setattrlist is called,
rather than utimes, because setattrlist does not follow symlinks and also
does not set the atime -- unfortunately, just as for crtime, this
does not really work.

get/setattrlist weirdness
-------------------------

The interface to these functions is hopelessly Byzantine, and the documentation
is out of sync with the implementation (the source code is at least available
to look at).  For example, in setting the access modes (with the
ATTR_CMN_ACCESSMASK option), the man page says the value should be a mode_t,
while in fact, it should be a uint32_t (and indeed, the call to setattrlist
will fail if a mode_t is used, as this is only a 16-bit type).  Similarly, the
length field of an attribute buffer and the value used when using the
ATTR_CMN_FLAGS options, should both be uint32_t's, while the man page says
these should be unsigned long's (this is only relevant on 64-bit machines).
Moreover, on 64-bit machines (compiling w/ -m64), one needs
to specify special alignment for the structures passed to
get/setattrlist.



ctime anomalies
---------------

As already mentioned above, changing any metadata of a file (permissions, BSD
flags, or xattrs) should (in theory) update the ctime of the file.  This
appears to be the case in all circumstances, except for one: removing the
com.apple.FinderInfo extended attribute via the removexattr function *does
not*.




32-bit crtime
-------------

crtime is encoded as a 32-bit quantity when stored externally in an xattr
container.  This makes it susceptible to the "Y2038 bug".  Note that rsync also
currently stores mtimes externally as 32-bit integers.
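
To make the limit concrete: if the stored value is a signed 32-bit count of
seconds since January 1, 1970 (an assumption; the signedness is not stated
here), the largest representable time works out as follows.

```shell
# Largest value of a signed 32-bit seconds counter.
max32=$(( (1 << 31) - 1 ))

# Split it into whole days since the epoch plus a time of day.
days=$(( max32 / 86400 ))
secs=$(( max32 % 86400 ))

printf '%d seconds = %d days + %02d:%02d:%02d\n' \
    "$max32" "$days" \
    $(( secs / 3600 )) $(( secs % 3600 / 60 )) $(( secs % 60 ))
# prints: 2147483647 seconds = 24855 days + 03:14:07
```

Day 24855 after January 1, 1970 is January 19, 2038, so under this assumption
the last representable time is 03:14:07 UTC on that date.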


copyfile, anyone?
-----------------

Instead of calling the functions get/setxattr directly, there is a function
called copyfile which can be used to "split and join" so-called Apple Double
files (which serve the same role as my xattr containers, but the format is
different).  I didn't use copyfile because (a) it is undocumented and the header
file isn't even included in the installation, (b) it is notoriously buggy, and
(c) it doesn't do anything you can't do yourself with other (documented)
functions.


Leopard
-------

Hopefully, the code here will work "as is" with Leopard, but who knows: Apple
does not seem to care much about stability and backward compatibility, and
anything could happen. I'll have to test it out when Leopard arrives.  


