Backing up with duplicity and Amazon S3

Duplicity is a great little backup tool that can do incremental
backups to a variety of different servers (ftp, scp, and Amazon S3).

The Install

I am installing on a CentOS 5 system, which uses the yum tool. I also
have the http://rpmforge.net/ repository enabled. I need that for
librsync and librsync-devel, which provide rsync capabilities when the
remote file details are not available (as with FTP).
All of these requirements are listed on the duplicity main
page.

yum install librsync librsync-devel
# GnuPG for python (stable, hasn't changed in years)
wget http://internap.dl.sourceforge.net/sourceforge/py-gnupg/GnuPGInterface-0.3.2.tar.gz
tar -xzf GnuPGInterface-0.3.2.tar.gz
cd GnuPGInterface-0.3.2
python setup.py install
cd ..
# boto for python (Python S3 interface -- active development, check version #)
wget http://boto.googlecode.com/files/boto-1.1c.tar.gz
tar -xzvf boto-1.1c.tar.gz
cd boto-1.1c
python setup.py install
cd ..
# duplicity (active development, check version #)
# assumes duplicity-0.4.10.tar.gz has already been downloaded
tar -xzf duplicity-0.4.10.tar.gz
cd duplicity-0.4.10
python setup.py install
cd ..

Make sure duplicity works by running the command “duplicity”.
You should see something like this (and not errors about GnuPG
instances).

[[email protected] ~]# duplicity 
Command line error: Expected 2 args, got 0
Enter 'duplicity --help' for help screen.

Make your gpg keys

Do this as the user that you’ll be running the backup as – it makes
things easier. If you have your own existing keys and know how to import
them, you can skip this step. Otherwise, we’ll create a key just for
encrypting the backups before sending them to our backup server (in case
of rogue system admins at Amazon).

[[email protected] ~]# gpg --gen-key
gpg (GnuPG) 1.2.6; Copyright (C) 2004 Free Software Foundation, Inc.
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions. See the file COPYING for details.
gpg: failed to create temporary file `/home/ytjohn/.gnupg/.#lk0x9d199d8.serv02.example.com.12823': No such file or directory
gpg: /home/ytjohn/.gnupg: directory created
gpg: new configuration file `/home/ytjohn/.gnupg/gpg.conf' created
gpg: WARNING: options in `/home/ytjohn/.gnupg/gpg.conf' are not yet active during this run
gpg: keyring `/home/ytjohn/.gnupg/secring.gpg' created
gpg: keyring `/home/ytjohn/.gnupg/pubring.gpg' created
Please select what kind of key you want: 
  (1) DSA and ElGamal (default) 
  (2) DSA (sign only) 
  (4) RSA (sign only)
Your selection? 1
DSA keypair will have 1024 bits.
About to generate a new ELG-E keypair.
              minimum keysize is  768 bits
              default keysize is 1024 bits
    highest suggested keysize is 2048 bits
What keysize do you want? (1024) 2048
Requested keysize is 2048 bits
Please specify how long the key should be valid.
         0 = key does not expire
      <n>  = key expires in n days
      <n>w = key expires in n weeks
      <n>m = key expires in n months
      <n>y = key expires in n years
Key is valid for? (0) 0
Key does not expire at all
Is this correct (y/n)? y                        
You need a User-ID to identify your key; the software constructs the user id
from Real Name, Comment and Email Address in this form:
    "Heinrich Heine (Der Dichter) <[email protected]>"
Real name: Backup Key
Email address: [email protected]
Comment: Backup key for duplicity  
You selected this USER-ID:
    "Backup Key (Backup key for duplicity) <[email protected]>"
Change (N)ame, (C)omment, (E)mail or (O)kay/(Q)uit? O
You need a Passphrase to protect your secret key.    

At this point, let me interrupt and talk about the passphrase. You can
make this anything, but I would recommend avoiding special characters
(especially < > ' " and `) that might be interpreted by
your system shell. I generated a 15-character password online using
only digits, lowercase letters, and uppercase letters. Keep track of the
password: you will need it later when you write the backup script.
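
A passphrase of that shape can also be generated locally instead of
online. This is only a minimal sketch, assuming Python 3.6 or newer
(for the standard secrets module); the name make_passphrase is made up
for illustration and is not part of duplicity or GnuPG:

```python
import secrets
import string

# Digits plus lowercase and uppercase letters only, so nothing in the
# passphrase can be interpreted by the shell.
ALPHABET = string.ascii_letters + string.digits


def make_passphrase(length=15):
    """Return a random passphrase drawn from ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    print(make_passphrase())
```

Run it once, paste the result at the gpg prompt, and keep a copy for the
backup script.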

We need to generate a lot of random bytes. It is a good idea to perform
some other action (type on the keyboard, move the mouse, utilize the
disks) during the prime generation; this gives the random number
generator a better chance to gain enough entropy.
+++++++++++++++++++++++++.+++++++++++++++++++++++++++++++++++..+++++.++
+++++++++++++..++++++++++.+++++++++++++++..++++++++++++++++++++++++++++
++.>+++++..............................................................
.......................................................................
......+++++
gpg: /home/ytjohn/.gnupg/trustdb.gpg: trustdb created
public and secret key created and signed.
key marked as ultimately trusted.

pub  1024D/53F0891A 2008-04-08 Backup Key (Backup key for duplicity) <[email protected]>
     Key fingerprint = 135A 1533 5C94 3A58 5398  7467 98A0 C424 5BF0 8C2E
sub  2048g/630FAA4F 2008-04-08

We see our key ID is 53F0891A – make a note of this for the backup
script.

The backup script

Essentially, what we want is a script that you just run and it will
perform the backup for you. For my purposes, I want to back up to
Amazon’s Simple Storage Service (S3). To do this, you will need to
sign up for the service (no cost to sign up, just pay for space/bandwidth
used) and get the AWS access key and secret key (think of them like a
username and password).

The following script will back up a directory called /mnt/backup to an
Amazon bucket called j123backup (the bucket name must be unique among
all Amazon S3 users). Please note that while you can use the same script
and gpg keys on multiple servers (or have multiple backup scripts on the
same server backing up different directories), you will want to make a
separate bucket for each different source backup.

#!/bin/bash
# Export some ENV variables so you don't have to type anything
export AWS_ACCESS_KEY_ID=accesskey
export AWS_SECRET_ACCESS_KEY=secretkey
# GPG passphrase we used earlier
export PASSPHRASE=123456789012345
GPG_KEY=53F0891A
# The source of your backup
SOURCE=/mnt/backup
# The destination
# Note that the bucket need not exist
# but does need to be unique amongst all
# Amazon S3 users. So, choose wisely.
DEST=s3+http://j123backup
# You can of course change your destination to an ftp or
# scp (ssh copy) server:
#DEST=scp://[email protected]/backups
duplicity \
    --encrypt-key=${GPG_KEY} \
    --sign-key=${GPG_KEY} \
    ${SOURCE} ${DEST}
# this is an example of backing up multiple
# directories at once and excluding others
# (SOURCE must contain the included paths, e.g. SOURCE=/;
# the first matching option wins, so the exclude comes first):
#
# duplicity \
#     --encrypt-key=${GPG_KEY} \
#     --sign-key=${GPG_KEY} \
#     --exclude='/var/www/html/cache/*' \
#     --include=/home \
#     --include=/var/www/html \
#     --exclude='**' \
#     ${SOURCE} ${DEST}

# Reset the ENV variables. Don't need them sitting around
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
export PASSPHRASE=
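
An alternative to exporting the secrets and blanking them afterwards is
to hand them only to the duplicity process itself. The following is a
rough sketch of that idea in Python, not a drop-in replacement: the
function names are hypothetical, and it uses only the options and
environment variables that already appear in the script above.

```python
import os
import subprocess


def build_command(gpg_key, source, dest):
    """Assemble the same duplicity invocation the shell script runs."""
    return [
        "duplicity",
        "--encrypt-key=" + gpg_key,
        "--sign-key=" + gpg_key,
        source,
        dest,
    ]


def build_env(access_key, secret_key, passphrase, base=None):
    """Copy the environment and add the three secrets duplicity reads."""
    env = dict(os.environ if base is None else base)
    env["AWS_ACCESS_KEY_ID"] = access_key
    env["AWS_SECRET_ACCESS_KEY"] = secret_key
    env["PASSPHRASE"] = passphrase
    return env


def run_backup(cmd, env):
    """Run duplicity and return its exit code.

    The secrets live only in the child's environment, so there is
    nothing to reset in the calling shell afterwards.
    """
    return subprocess.call(cmd, env=env)
```

Calling run_backup(build_command("53F0891A", "/mnt/backup",
"s3+http://j123backup"), build_env("accesskey", "secretkey",
"123456789012345")) would then mirror the shell version.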

As a final note, you can definitely point your backup at another
destination such as ftp or scp. I later ended up choosing an scp server
over Amazon S3.

Backing up with duplicity and Amazon S3, by YourTech John on October 26,
2009 7:00 AM

Forgotten Realms

Yesterday I was working on a very old server: a Cobalt RaQ 4.  I
actually leased a RaQ 3 for web hosting back in 2001 and thought it was
the neatest thing, but found it too limiting in many ways.  These
servers could be administered using a series of push buttons and an LCD
screen up front.  They also had a web interface that controlled
everything.  They were highly popular back in their day.

The Cobalt product line went away back in 2004 or so, but obviously a
good number of these servers are still in use 5 years later.  As this
server booted up, I saw an email address ending in @cobaltnet.com and
wondered if that was still around.

Registrant:
Sun Microsystems, Inc.
   4150 Network Circle
   Santa Clara, CA 95054
   US

   Domain Name: COBALTNET.COM

   Administrative Contact, Technical Contact:
      Sun Microsystems, Inc.            [email protected]
      4150 Network Circle
      Santa Clara, CA 95054
      US
      1-650-960-1300 fax: 650 336 6623

   Record expires on 15-Jun-2010.
   Record created on 30-Jul-2004.
   Database last updated on 30-Sep-2009 00:06:34 EDT.

   Domain servers in listed order:

   NS1.COBALT.COM
   NS2.COBALT.COM

So the cobaltnet.com domain is still in existence, at least until 2010. 
However, it doesn’t resolve to anything.  The reason is that
the cobalt.com domain name (notice ns1 and ns2.cobalt.com) itself has
been purchased by another company, completely unrelated to the Cobalt
server product.  So this is interesting in that Sun not only let
cobalt.com go, but they never bothered to update the cobaltnet.com
domain to point to an active name server. Sun paid about $2 billion for
the Cobalt name and now it sits in a neglected corner of the Internet,
just a few months away from finally expiring.

This is one of many such examples found on the net of things that vanish
without a trace. Cobalt was one of the first companies to really produce
a polished interface for managing a web server.  Part of me wonders what
would have happened if Cobalt/Sun had released that code as open source
before the end came.  Would the community have picked it up and
developed something amazing with it, or would it have vanished like the
parent company?  Based on recent happenings with BeOS and Haiku,
I suspect the former would have occurred.

Back into programming, monitoring notes

I have been out of the programming circuit for a few years and have been
looking at getting back into it.  My traditional programming style is an
ssh window into my server and all my editing takes place on a
development server, in vi.

Recently, I’ve been trying to work out a decent way to determine, from
within my own network, how the world sees my connectivity.  Essentially, I
wanted to simulate accessing one of my locally connected machines from
the Internet.  Typically, you have to subscribe to a third-party service
to do this for you.  Coincidentally, I have been reading up
on Google App Engine and saw the potential in using GAE for my
purpose.  I could envision writing a monitoring system that runs
entirely on the GAE.  Unfortunately, I had no programming experience in
Python or Java.  I did see that PHP has been ported to Java and
someone got PHP running on GAE.  The possibility of either creating
a new monitoring system in PHP (or modifying an existing PHP-based
monitoring system) entered my mind.

I decided that rather than stay with PHP, I would use this project as a
method to learn Python.  I started digging into Python resources and
contemplating how I would want my monitoring system to work.  Ultimately
though, I decided I didn’t want to create a brand-new monitoring system
(even a basic one) when existing ones such as Nagios and Zabbix do
perfectly well. In my research, I found a project called
mirrorrr that used GAE as a web proxy. 

The solution was immediately obvious.  My existing monitoring system
(Zabbix) has support for fetching web pages.  I could place a file with
the word “OK” on my local web server and then fetch it through GAE.   I
could even determine through the returned page whether my server was
down or if GAE was down.
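
The decision logic behind that check could be sketched roughly as
follows. This is not Zabbix or mirrorrr code; the function names are
hypothetical, and it assumes that a proxy which answers at all is up, so
anything other than a 200 response containing “OK” points at the local
server. Where exactly to draw that line depends on how the proxy reports
upstream failures.

```python
import urllib.error
import urllib.request


def classify(proxy_reachable, status, body):
    """Turn one fetch result into 'ok', 'server down' or 'proxy down'."""
    if not proxy_reachable:
        return "proxy down"
    if status == 200 and body.strip() == "OK":
        return "ok"
    # The proxy answered, but not with our OK page.
    return "server down"


def check(url, timeout=10):
    """Fetch the OK page through the proxy URL and classify the result."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return classify(True, resp.status, body)
    except urllib.error.HTTPError as err:
        # The proxy responded, but with an error status.
        return classify(True, err.code, "")
    except OSError:
        # No response from the proxy at all (URLError is an OSError).
        return classify(False, None, "")
```

The monitoring system only needs to fetch the proxied URL and compare the
page content, which is the part Zabbix already handles.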

I set to work testing out the mirrorrr code under my own account.  The
major issue I observed is that mirrorrr is configured to cache pages,
meaning that when I changed my OK to FAIL, mirrorrr never updated the
page.  In the closed-source world, that is the end of the story. 
However, since this is an open-source project, this was an opportunity.

I’ve been wanting to get back into programming, and when I start back,
I want to be familiar with using an IDE (namely Eclipse).  In
preparation for creating a monitoring system, I had set up a development
station in (gasp) Windows Server 2003.  I connect to this via remote
desktop and generally leave Eclipse running 24×7. I had also gone
through the steps of installing the PyDev plugin, the GAE plugins, and
SVN plugins.

The Process

I downloaded a copy of the code using SVN checkout and set to work
editing the mirror.py file to disable CACHE.  I call this the “dive in
and learn to swim later” process.  Here, I could make changes to
existing code and test them out immediately.  In fact, the SDK for GAE
works as a sort of mini-server.  Once I run the code within the SDK, any
change I make to the source affects the running instance. 

I was able to read through and alter the code to allow me to switch
between caching and non-caching.  I ran into an issue with their “recent
urls” feature.  This shows the last 5 urls you have visited.  When you
are not caching your data, this never gets set and starts throwing
errors.  I realized I would have to improve that section of the code
before I could truly implement a configurable “enable cache” option.

At this point, I backed out of my file and considered my options.  I
wanted to make two distinct changes to the source, one of which requires
the other.  The author of the project hasn’t made a change to his (her?)
code since December of 2008.  However, I did see recent entries in the
Wiki, indicating this wasn’t an abandoned project.  I realized that to
truly make my changes worthwhile, I should try and get them included
back in the upstream.  To do so, I should submit each change separately.
That required more tracking than I had been doing.

So, I took care of another todo list item.  I went ahead and set up an
“official” repository server for YourTech, checked out a fresh copy of
mirrorrr, and then imported it into mine.  Now I could work.  I imported
the project from my repo server into Eclipse and started recreating my
work.  First I added a feature to disable the recent urls.  At the
same time, I made an improvement by moving a chunk of code into a
self-contained method (which is apparently what Python calls functions,
as near as I can tell at this juncture).  Once this was committed, I went
ahead and proceeded with recreating my work on disabling the cache.

Once I was done, you could enable or disable each feature separately. 
However, if you disabled the cache but left recent urls enabled, your
recent urls would never update.  On the flip side, if recent urls were
recorded before disabling the cache, then you could see the most recent
urls before caching was disabled.  Some person may want that feature —
they could start up the mirrorrr, visit several links, then disable the
cache, preserving those links on the main page forever.
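
To make the interaction between the two switches concrete, here is a
small stand-alone model of the behaviour described above. It is purely
illustrative and is not mirrorrr’s actual code: the class, attribute and
method names are invented, and it assumes the recent-urls list is only
updated as a side effect of caching a page.

```python
class Mirror:
    """Toy model of a proxy with two independent feature switches."""

    MAX_RECENT = 5  # the feature shows the last 5 urls

    def __init__(self, enable_cache=True, enable_recent_urls=True):
        self.enable_cache = enable_cache
        self.enable_recent_urls = enable_recent_urls
        self.cache = {}
        self.recent = []

    def fetch(self, url, fetcher):
        """Return the page for url, calling fetcher(url) on a cache miss."""
        if self.enable_cache and url in self.cache:
            return self.cache[url]
        content = fetcher(url)
        if self.enable_cache:
            self.cache[url] = content
            # Recent urls are recorded only when a page gets cached.
            others = [u for u in self.recent if u != url]
            self.recent = ([url] + others)[: self.MAX_RECENT]
        return content

    def recent_urls(self):
        """Return the list shown on the main page (empty if disabled)."""
        return list(self.recent) if self.enable_recent_urls else []
```

With the cache switched off after a few visits, recent_urls() keeps
returning the older entries, which is the frozen-list effect described
above.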

At this point, the only remaining task was to submit my changes to the
project maintainer.  I opened issues 6 & 7 and now I await a
response.

Conclusion

The more I use Eclipse, the more I like it.  The ability to perform
every step of the process in one program is extremely useful.  Steps
like comparing history or checking out a specific version are much easier
to grasp than when you are tooling around the command line.  In the
shell, I would typically have most of my files open as a background
task.  Reverting a file from subversion required switching back to the
file, closing it, then reverting.  In Eclipse, it’s a right-click
operation, regardless of whether the file is open or not.

Python’s syntax is a bit weird coming from a Perl and PHP background,
but it’s learn-able.   As I plan to make several more improvements to
mirrorrr, I hope to become proficient in this language as well. 
However, I may be picking up a Perl project in the near future using the
Mojo toolkit, so everything is up in the air.