
Let's back that sh*t up

Now that I find myself with gigs of online storage space to be used for backups, I have a perfect excuse to organize my files in a sensible manner and schedule them to be backed up regularly.  "Sensible manner" in my case ended up involving many directories and a host of symlinks (which appear as shortcuts under Windows...  nice touch by Cygwin), so the backup script had to apply some filtering rules.  It turned out to be rather more involved than one would think necessary.  rsync's include/exclude mechanism is rather tricky, and many people recommend using find for filtering and piping the results to rsync.  It is doubtful whether that is a more straightforward solution: for instance, in order to select the files with extensions .pl, .h, and .cpp, I ended up with the following gem:

eval ls -A1 $srcpath*.{$exts} 2>/dev/null | \
xargs -I{} echo {} | eval sed 's/^.\\{${#srcpath}\\}//g' | \
rsync -vzaEHxuh --progress --stats --append --delete \
--files-from=- $srcpath $hostname:~/$srcdir

Whoa.  What's going on here?..  Many subtle things.  The first line creates a list of files with the given extensions.  eval is necessary to combine brace expansion with variable expansion, while the output redirection kills stderr messages, e.g. in case there are no files with the .h extension.  However, rsync has problems unless the files are given relative to the source path.  The second line, then, strips out the first ${#srcpath} characters (this is the bash syntax for the length of the string in $srcpath).  If we wanted to remove 5 chars from the start of $foo, we would do
echo $foo | sed 's/^.\{5\}//g'
-- pretty standard usage of sed.  Here, we need once again to use eval in order to get to the ${#srcpath} variable, and since we are using eval the braces in sed have to be escaped twice.
The xargs construct simply takes the list of files and feeds it to the sed construct line by line (-I{} instructs xargs to place the line where {} are in the following code).
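
As a side note, the eval is only needed because the sed program sits in single quotes. Here is a minimal sketch of the same prefix stripping with double quotes, where the shell expands the length directly. strip_prefix is a made-up helper name and the path is just an example, neither is from the original script.

```shell
# strip_prefix PREFIX: read path names on stdin and cut as many leading
# characters from each line as PREFIX is long.
# (hypothetical helper, not part of the original backup script)
strip_prefix() {
    local prefix
    prefix=$1
    # Double quotes let the shell substitute ${#prefix} (the string length);
    # the backslashes before the braces reach sed unchanged.
    sed "s/^.\{${#prefix}\}//"
}

# Example with a made-up path:
printf '%s\n' /home/leo/src/main.cpp | strip_prefix /home/leo/src/
# prints: main.cpp
```

Like the original, this cuts by length only, so it silently mangles any line that does not actually start with the prefix.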

Now, of course, after feeling great about myself for being at ease with all aspects of this eval sed construction, I realized (and by that I mean, Andrei realized) that  echo $foo | eval sed 's/^.\\{${#srcpath}\\}//g' is, for files sitting directly in $srcpath as they do here, completely equivalent to basename $foo. I like this sort of brevity, although it's a little disappointing that as of late my attempts to showcase knowledge of sed and awk have been torpedoed by these sorts of ready-made standard commands.
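
The file selection and the prefix stripping can also be folded into a plain loop with no eval, ls, or sed. This is only a sketch: list_relative is a hypothetical helper, and it assumes (as in the script above) that $srcpath ends in a slash. The ${f#"$srcpath"} expansion removes the prefix; unlike basename, it would keep subdirectory components if the files were nested deeper.

```shell
# list_relative SRCPATH EXT...: print the files in SRCPATH that have one of
# the given extensions, as names relative to SRCPATH.
# (hypothetical helper; SRCPATH is assumed to end with a slash)
list_relative() {
    local srcpath ext f rel
    srcpath=$1
    shift
    for ext in "$@"; do
        for f in "$srcpath"*."$ext"; do
            # A glob with no matches stays literal, so skip names that don't exist.
            [ -e "$f" ] || continue
            # Strip the leading $srcpath; what remains is relative to it.
            rel=${f#"$srcpath"}
            printf '%s\n' "$rel"
        done
    done
}

# Possible use with the variables from the script above (not run here):
# list_relative "$srcpath" pl h cpp |
#     rsync -a --files-from=- "$srcpath" "$hostname:~/$srcdir"
```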

Here's another one: I want to back up all the hidden (dot-) files and directories, as well as their contents.

ls $srcdir -A1 | grep -e "^\." 2>/dev/null | \
rsync -vzraEHxuh --progress --stats --append --delete \
--files-from=- $srcdir $hostname:~/
# note that it is necessary to explicitly use the r switch!  The file
# list includes the directories, but not the contents of their directories.
# Even though -a switch in rsync implies -r, it won't go into the directories
# if -r isn't there explicitly.

Luckily, the explicit -r switch forced rsync to look into the directories given in the list and process them recursively.  If that didn't work, I would have had to use
find ~ -path ~/.\*
-- (notice the escaped *), and I'd have to strip the ~ from the start of full path strings using the complicated construct above.
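
For what it's worth, the ls | grep step can also be done with shell globs, which sidesteps parsing ls output. A sketch, assuming only the top-level names are wanted (rsync -r then recurses into the directories as described above); list_dotfiles is an invented name.

```shell
# list_dotfiles DIR: print the names of hidden entries directly inside DIR,
# one per line, relative to DIR. (hypothetical helper)
list_dotfiles() {
    local dir f
    dir=$1
    # .[!.]* matches names like .bashrc; ..?* matches names like ..foo.
    # Neither pattern matches the special entries . and ..
    for f in "$dir"/.[!.]* "$dir"/..?*; do
        # Skip patterns that matched nothing, but keep dangling symlinks.
        [ -e "$f" ] || [ -L "$f" ] || continue
        # Drop everything up to the last slash, leaving the bare name.
        printf '%s\n' "${f##*/}"
    done
}

# Possible use (not run here):
# list_dotfiles "$srcdir" | rsync -ra --files-from=- "$srcdir" "$hostname:~/"
```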

The other good side effect of this project (in addition to having my data safely backed up) is that I understand the find command much better now.  In hindsight, it is quite simple, actually.  find starts in a given directory and recursively walks the file tree below it, applying a set of expressions to each element.  If the expressions evaluate to true, it performs certain actions (the default is to print the pathname).  Expressions can involve options, tests (-name, -path, -regex, -type, -user, etc.), and actions (-delete, -exec, -prune, ...).    So what does a statement like
find $mydir  \! -name $mydirname -prune  -name ".*"
actually mean?

Well, the first part means: if the name of the file isn't $mydirname (\! -name $mydirname), then prune it, i.e. don't descend into it.  (Notice that negation has high precedence -- higher than the implicit AND between -name and -prune.)   For instance,  find ~  \! -name Leo -prune  will see /home/Leo, see that the (base)name is Leo, and will let it pass unpruned, but everything else gets pruned: since /home/Leo is the first "file" visited, find looks at its immediate children but won't go into any of the subdirectories.  Using this trick, we can preclude find from recursing into subdirectories.  The trailing -name ".*" then keeps only the hidden entries among those children.
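
To see the trick in action, here is a small sketch that wraps the command above in a function. top_level_hidden is a hypothetical name, and the function assumes that no entry inside the directory shares the directory's own name (such an entry would not be pruned).

```shell
# top_level_hidden DIR: print the hidden entries directly inside DIR without
# descending into any subdirectory. (hypothetical wrapper)
top_level_hidden() {
    local dir name
    dir=$1
    name=$(basename "$dir")
    # DIR itself is the only entry named $name, so it is the only directory
    # find descends into; every child is pruned, and -name '.*' then keeps
    # only the hidden ones for the implicit -print.
    find "$dir" \! -name "$name" -prune -name '.*'
}

# Possible use: top_level_hidden "$HOME"
```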

Another example (taken from the manual):
find . -name .snapshot -prune -o  \! -name '*~' -print0
This is a common construct used to ignore certain files and directories while performing some other action.  OR is needed because for files that don't match our pruning criterion, the -name .snapshot -prune will be false.  If we had an implicit AND, the expressions that follow would also be short-circuited to false.
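
A small sketch of the same prune-or-act pattern, easy to try on a scratch directory. list_keepers is an invented name, and compared with the manual's example it adds -type f and uses -print instead of -print0 so the output is readable.

```shell
# list_keepers DIR: print regular files under DIR, skipping everything inside
# .snapshot directories and any backup files whose names end in ~.
# (hypothetical helper)
list_keepers() {
    # Left of -o: prune .snapshot directories.
    # Right of -o: for everything else, print regular files not ending in ~.
    find "$1" -name .snapshot -prune -o \! -name '*~' -type f -print
}

# Possible use: list_keepers . | sort
```

Because -print appears explicitly on the right-hand side only, the pruned .snapshot directory itself is not printed either.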

Another very common usage (one of my earliest memories of find, actually):
find . -name "*.nb" -exec grep -H PATTERN {} +
This simply runs a command (grep in this case), putting the matching file(s) in place of {}.  (The older form terminates the expression with an escaped semicolon \; and runs the command once per file; + passes many files to a single invocation, which is faster.)   Apparently, this runs quite fast and in efficiency rivals piping the result of find to xargs:
find . -name "*.nb" -print0 | xargs -0 grep PATTERN
(note the -print0 / xargs -0 pair -- this guards against non-standard filenames).
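
The two variants side by side, as a sketch for comparing them on your own files. grep_exec and grep_xargs are made-up names; -H is added to the xargs variant so both print the file name, and -- keeps a pattern that starts with a dash from being read as an option.

```shell
# Both functions take DIR and PATTERN and search the *.nb files under DIR.
# (hypothetical helpers)

# Variant 1: find batches the file names itself.
grep_exec() {
    find "$1" -name '*.nb' -exec grep -H -- "$2" {} +
}

# Variant 2: NUL-separated names piped to xargs, safe for odd file names.
grep_xargs() {
    find "$1" -name '*.nb' -print0 | xargs -0 grep -H -- "$2"
}

# Possible use: grep_exec . PATTERN
```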

Other things that I learned this fine holiday weekend: to be wary of the wc character count (it counts newlines as characters!  cf.   echo "123" | wc -m vs printf "123" | wc -m or echo -n "123" | wc -m).  I also got reminded of the basename command and refreshed basic awk syntax...  Not bad, not bad at all.
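
The wc gotcha in runnable form. char_count is a hypothetical helper, and the tr is there only because some wc implementations pad the number with spaces.

```shell
# char_count STRING: number of characters in STRING, with no trailing
# newline sneaking into the count. (hypothetical helper)
char_count() {
    # printf '%s' emits the string exactly as given, without a newline.
    printf '%s' "$1" | wc -m | tr -d ' '
}

char_count 123                  # prints 3
echo 123 | wc -m | tr -d ' '    # prints 4: echo appends a newline
s=123; echo "${#s}"             # prints 3: shell-side length, no pipe needed
```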
