[H-GEN] Excalibur patches for S3 backup

Sat Jul 4 20:54:27 EDT 2009

Stephen,

The following URL contains the patches to Excalibur so it will do hourly
backups to Amazon S3:

http://www.stuart.id.au/russell/files/pub/humbug-s3-rdiff-image-patch.tar.gz

Could you apply them, please?  The patch replaces the image-backup stuff.
The backup bit functions in a similar way to image-backup, with a
similar [1] config file.

Overview
--------

This patch creates two backups of the excalibur VM:

1.  The "secret" backup contains just the files with sensitive
information, such as passwords.  It is GPG encrypted using the keys
present in /etc/rdiff-image/gpg-keys/.

2.  The "base" backup is an image of the entire VM, but with all
passwords [2] changed to "x".  Because the passwords have been replaced
as opposed to being removed entirely, this backup can be used to
re-create the VM so it can be worked on, improved, and the resulting
patches posted here for review and inclusion.  Because it contains no
sensitive information anyone can do this and contribute to its
development.  Indeed, as I don't have an account on excalibur, that is
how this patch was developed.

These backups are then made available at a URL on www.humbug.org.au, and
are sent to an Amazon S3 account.  The current configuration [3] holds
on Amazon 24 hourly backups, 7 daily backups, 5 weekly backups and 6
monthly backups.

The backups are optimised so that the communications overhead in sending
them on an hourly basis to S3 (or any other remote location) is
minimised.  This is done by sending the entire backup the first time,
then by sending only the differences between now and that first backup
thereafter.  The program automatically calculates when it is more
efficient to send a new complete backup using the formula [4]:

  Send new backup if:
    current_diff_size > complete_backup_size * 2 / diffs_sent        (1)
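
A minimal Python sketch of rule (1), for illustration only.  It is not
the code in rdiff-image-backup.sh; the function name is made up here,
and the rule is rearranged as a multiplication (an assumption of this
sketch) so that it still works before any diff has been sent.

```python
def should_send_complete(current_diff_size, complete_backup_size, diffs_sent):
    """Return True when rule (1) says a new complete backup should be sent.

    Rule (1) is: current_diff_size > complete_backup_size * 2 / diffs_sent.
    It is rearranged here as a multiplication so that diffs_sent == 0
    (no diffs sent yet in this cycle) does not divide by zero.
    """
    return current_diff_size * diffs_sent > complete_backup_size * 2


# Example with made-up sizes in bytes and a 1G complete backup.
complete = 10**9
print(should_send_complete(500_000, complete, 1000))    # diff is still cheap
print(should_send_complete(1_200_000, complete, 1833))  # diff has grown too big
```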

If we assume:

- The complete backup size is 1G (pessimistic: current is 800M).
- The difference in backups grows linearly by 100K a week (compressed, 
  _very_ pessimistic).
- We are doing hourly backups.

Then using the above formula it can be shown [5]:

- The Amazon charges per year will be less than ............ USD$9.
- Average www load for transferring backups will be ........ 486K/hour.
- When the diffs are at their max size, the load is approx . 1M/hour.
- The peak www load ........................................ 1G/hour.
  (for 1 hour every 76 days, when sending a complete backup)
- Max transfers per month .................................. 1.03G.
- Max transfers per month allowed by hosting provider ...... 200G.

The backup system writes a web page showing its current status.
Included on the web page is a link to a log of all activities done on
S3.  This includes the total storage used on S3, and the number of bytes
transferred and received for each activity.

At the moment the S3 account being used is one I have created and
am paying for.  Its name is humbug-excalibur-backup.  Credentials
to access the backup will be sent to Stephen in a separate email.
He will have to put them in the rdiff-image configuration file
on excalibur.  They will be backed up to the secret backup, of 
course.

If/when the patch is applied, I will write some doco on how to download
the images and use the VM, and put it on the wiki.

Instructions for applying the patch
-----------------------------------

# cd /tmp
# wget http://www.stuart.id.au/russell/files/pub/humbug-s3-rdiff-image-patch.tar.gz
# cd ..
# tar xzf /tmp/humbug-s3-rdiff-image-patch.tar.gz
# ed /etc/rdiff-image/rdiff-image.conf # Add S3 credentials
# patch -p1 < /tmp/humbug-s3-rdiff-image.patch
# apt-get install python-boto rdiff
# /etc/init.d/cron restart
# /etc/init.d/apache2 restart
# #
# # Purge the old stuff
# #
# crontab -r
# rm -r /etc/image-backup
# rm -r /srv/http/humbug.org.au/www/machine-image/*
# rm /tmp/humbug-s3-image*
# #
# # Get rid of sundry crap
# #
# rm -f /srv/http/humbug.org.au/tmp

Contents of the Patch
---------------------

/etc/rdiff-image/

  This directory holds the rdiff-image programs and data.  Hopefully
  putting them here means they will one day be in a VCS.

/etc/rdiff-image/rdiff-image.conf

  Configuration file.  The documentation for it is the comments at the
  top.

/etc/rdiff-image/rdiff-image-backup.sh

  Creates the "base" and "secret" backups according to rdiff-image.conf,
  and writes the results to the output file names given on the command
  line.  It is the program that generates the diffs, and decides when
  it would be better to generate a complete backup.

  Full documentation is in comments at the start.

/etc/rdiff-image/rdiff-image-cron.sh

  Run hourly from cron, this program runs rdiff-image-backup.sh to
  create new backups, encrypts the secret backup, makes the whole
  shebang available on the www.humbug.org.au web site, and runs
  rdiff-image-s3.py to transfer the result to S3.

  Full documentation is in comments at the start.

/etc/rdiff-image/rdiff-image-s3.py

  Transfers the backups passed on the command line to S3, and purges
  existing backups on S3 according to the backup cycle information
  passed on the command line.  It understands differential backups,
  and does not delete anything a current differential backup depends
  on.  It can work with any type of backup, not just the style
  generated by rdiff-image-backup.sh.  It requires python >= 2.4 and
  python-boto.

  Full documentation is in comments at the start.
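
  The idea of purging by backup cycle while protecting dependencies can
  be sketched in Python as below.  This is an illustration under stated
  assumptions, not the logic in rdiff-image-s3.py: the function names,
  the bucketing by calendar hour/day/week/month, and the base_of mapping
  are all hypothetical.

```python
from datetime import datetime, timedelta

def newest_per_bucket(stamps, key, count):
    """Keep the newest backup in each of the `count` most recent buckets."""
    newest = {}
    for ts in sorted(stamps):      # ascending, so later stamps overwrite earlier
        newest[key(ts)] = ts
    return set(sorted(newest.values(), reverse=True)[:count])

def backups_to_keep(stamps, base_of, hourly=24, daily=7, weekly=5, monthly=6):
    """Return the set of backup timestamps that must not be purged.

    base_of maps a differential backup's timestamp to the timestamp of the
    complete backup it was diffed against (complete backups are absent).
    """
    keep = set()
    keep |= newest_per_bucket(stamps, lambda t: (t.year, t.month, t.day, t.hour), hourly)
    keep |= newest_per_bucket(stamps, lambda t: t.date(), daily)
    keep |= newest_per_bucket(stamps, lambda t: t.isocalendar()[:2], weekly)
    keep |= newest_per_bucket(stamps, lambda t: (t.year, t.month), monthly)
    # Never delete a complete backup that a kept differential still needs.
    keep |= {base_of[ts] for ts in keep if ts in base_of}
    return keep

# Two days of hourly backups: one complete backup, then differentials.
start = datetime(2009, 7, 1)
stamps = [start + timedelta(hours=h) for h in range(48)]
base_of = {ts: start for ts in stamps[1:]}
keep = backups_to_keep(stamps, base_of, hourly=2, daily=1, weekly=0, monthly=0)
print(sorted(keep))
```

  In this small example the two newest hourly backups are kept, and the
  complete backup from the first hour survives only because the kept
  differentials depend on it.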

/etc/rdiff-image/gpg-keys/

  This directory is pointed to by the rdiff-image.conf "gpgdir:"
  entry.  The secret backups are encrypted with the gpg keys in here.
  If there are no keys in there, no secret backup is generated.  So
  for now, to ensure a secret backup is generated, it has my gpg
  key in there.

  For full documentation see the comments in rdiff-image.conf.

/var/cron.d/rdiff-image.cron

  This runs rdiff-image-cron.sh hourly.

/var/lib/rdiff-image/

  This directory is pointed to by the rdiff-image.conf "work:"
  entry.  For full documentation see the comments in rdiff-image.conf.

/tmp/humbug-s3-rdiff-image.patch

  Contains patches to existing files to make this all work:

  1.  The apache2 config is altered so host names typically used when
      running the VM locally will work.  That may have already been
      done, in which case you will get conflicts.

  2.  The apache2 config is altered to make the status page work.

[1] The rdiff-image configuration file is an extended version of the old
image-backup configuration file.  The keywords are different, but what
it does and how it does it is similar.

[2] As far as I am aware, the only truly sensitive data on excalibur is
passwords.  To see what data is protected, look at the "secret:" lines
in /etc/rdiff-image/rdiff-image.conf in the patch.

[3] To see/change what backups are held on S3, look at the "s3" line
in /etc/rdiff-image/rdiff-image.conf in the patch.

[4] A description of how the communications are optimised is contained in
comments in /etc/rdiff-image/rdiff-image-backup.sh, at about line 400.

[5] How those figures are arrived at.  Since the diff grows linearly we
    can figure out how much the diff grows between backups, ie growth
    rate per hour:

      growth_rate           = 100K / week / (24 hours/day) / (7 days/week)
                            = 530 bytes per backup.

    Other things we know are:

      current_diff_size     = growth_rate * diffs_sent
      complete_backup_size  = 1G

    so using (1) we can compute the value diffs_sent will reach when
    we are forced to send a new complete backup (ie start a new cycle):

      growth_rate * total_diffs            = 1G * 2 / (total_diffs + 1)
      total_diffs^2 + total_diffs - 3360K  = 0
      total_diffs                          = 1833
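
    The root of the quadratic can be checked numerically.  A small
    Python sketch, taking the 3360K constant from the equation as
    given (the function name is made up for this example):

```python
import math

def diffs_per_cycle(c):
    """Positive root of n^2 + n - c = 0, from the quadratic formula."""
    return (-1 + math.sqrt(1 + 4 * c)) / 2

# c is 2 * complete_backup_size / growth_rate; the text gives it as 3360K.
n = diffs_per_cycle(3_360_000)
print(math.ceil(n))  # whole number of diffs before a new complete backup
```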

    To determine the charges from S3, we need to know how much is
    transferred, and how much is stored.  To know how much is stored
    we in turn need to know how many backups of each sort (complete,
    and diffs) we will store.  To recover a diff, you need it and
    its matching complete backup, so the worst case will be when all
    backups are diffs of the maximum size plus the maximum possible
    number of accompanying complete backups.  The maximum number of
    complete backups will be at least 1, plus one for each multiple
    of the cycle period we keep.  We know:

      max_age_of_backup      = 6 months      [in the current config]

      cycle_period [months]  = diffs_sent [per hour] / hours_per_month
                             = 1833 / (30 * 24)
                             = 2.5 [months]

      max_complete_backups   = ceil(6/2.5) + 1
                             = 4

    So we can now calculate the maximum storage space required:

      max_diff_size = 100K/week * 6months * (30 days/month) / (7 days/week)
                    = 2.6M

      max_storage   =   max_complete_backups * complete_backup_size 
                      + backups_kept * max_diff_size
                    = 4 * 1G + 39 * 2.6M
                    = 4.1G

    From http://aws.amazon.com/s3/#pricing we know Amazon charges 
    USD$0.15 / Gigabyte / Month for storage, so the maximum storage
    charges will be:

      s3_yearly_storage_fees = max_storage * USD$0.15 / month * (12 months/year)
                             = 4.1 * 0.15 * 12
                             = USD$7.40    [pessimistically].

    Amazon charges for requests, but at USD$0.00001 per REST
    request it is noise.  Amazon also charges USD$0.10/G for
    transfers in.  We won't be transferring out.

      cycles_per_year             =   (24 hours/day) * (360 days/year)
                                    / ((total_diffs+1) hours/cycle)
                                  = 4.8

    Using the result for the sum of an arithmetic series, sum(N...M) =
    (M-N+1)*(M+N)/2, we can calculate the total number of bytes of
    the diffs sent as:

      total_size_of_diffs = (1833-1+1)*(1833+1)/2 * growth_rate = 1G.

      bytes_per_cycle     = complete_backup_size + total_size_of_diffs
                          = 1G + 1G
                          = 2G

      s3_yearly_transfer_fees = bytes_per_cycle * cycles_per_year * USD$0.10/G
                              = 2G * 4.8 * USD$0.10/G
                              = USD$0.96

    Total S3 charges will be:

      s3_yearly_fees  = s3_yearly_storage_fees + s3_yearly_transfer_fees
                      = USD$7.40 + USD$0.96
                      = USD$8.36
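
    The fee estimate can be reproduced with a short Python sketch.
    The defaults are the figures quoted above; the function name and
    the use of decimal gigabytes are assumptions of this example.
    Because it does not round intermediate values up, the results come
    out slightly below the hand-rounded figures.

```python
GB = 10**9   # assumption: decimal gigabytes, for simplicity
MB = 10**6

def yearly_s3_fees(max_complete_backups=4, complete_backup_size=1 * GB,
                   backups_kept=39, max_diff_size=2.6 * MB,
                   total_diffs=1833, diff_bytes_per_cycle=1 * GB,
                   storage_price=0.15, transfer_price=0.10):
    """Return (storage, transfer, total) yearly fees in USD.

    Defaults are the figures quoted in the text; prices are per GB.
    """
    max_storage = (max_complete_backups * complete_backup_size
                   + backups_kept * max_diff_size)
    storage = max_storage / GB * storage_price * 12

    cycles_per_year = 24 * 360 / (total_diffs + 1)
    bytes_per_cycle = complete_backup_size + diff_bytes_per_cycle
    transfer = bytes_per_cycle / GB * cycles_per_year * transfer_price

    return storage, transfer, storage + transfer

storage, transfer, total = yearly_s3_fees()
print("storage %.2f  transfer %.2f  total %.2f" % (storage, transfer, total))
```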

    Calculating the www transfer load the backups put on excalibur is
    difficult because it spikes.  The worst that happens is we have
    to transfer a new complete backup.  Outside of that spike, we 
    are transferring diffs.  I have chosen the two interesting 
    points: when the diffs are an average size, and when they are
    at their biggest.

      complete_backup_load = 1G per hour, this happens every 76 days.

      average_diff_load    = total_diffs * growth_rate / 2, per hour
                           = 1833 * 530 / 2, per hour
                           = 486K per hour.

      max_diff_load        = total_diffs * growth_rate, per hour
                           = 1833 * 530, per hour
                           = 972K per hour.

    The maximum bytes transferred per month will occur when a complete
    backup is sent on the last day of the month.  Again using the series
    summation formula sum(N...M) = (M-N+1)*(M+N)/2:

      sum_max_months_diffs =   growth_rate 
                             * (total_diffs-(total_diffs-29)+1) 
                             * (total_diffs + (total_diffs-29))
                             / 2
                           = 530 * (1833-(1833-29)+1) * (1833+(1833-29)) / 2
                           < 30M

      max_month_transfer   = complete_backup_size + sum_max_months_diffs
                           = 1G + 30M
                           = 1.03G.
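
    The series summation can be checked with a few lines of Python.
    This sketch uses the growth rate of 530 bytes per backup quoted
    earlier and treats 1G as 10^9 bytes (an assumption made here for
    simplicity); the helper name is made up for the example.

```python
def series_sum(n, m):
    """Sum of the integers n..m inclusive: (m - n + 1) * (m + n) / 2."""
    return (m - n + 1) * (m + n) // 2

growth_rate = 530              # bytes per backup, as quoted earlier
total_diffs = 1833
complete_backup_size = 10**9   # 1G, taken as decimal for simplicity

# The 30 terms total_diffs-29 .. total_diffs, as in the expression above.
sum_max_months_diffs = growth_rate * series_sum(total_diffs - 29, total_diffs)
max_month_transfer = complete_backup_size + sum_max_months_diffs

print(sum_max_months_diffs, round(max_month_transfer / 1e9, 2))
```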