Backup and Restore

Guild lets you easily backup runs to the cloud and restore those runs at a later time. This is useful for safeguarding runs from accidental deletion and can also be used for collaboration.

This guide demonstrates Guild’s backup and restore capabilities on Amazon S3. Refer to Requirements below for more information.

Important

Storing files on S3 will incur ongoing costs. Refer to Cleanup below for steps to delete all files uploaded to S3 from this guide.

Requirements

This guide is a continuation of Train an Image Classifier. Complete that guide before proceeding below.

In addition, you must complete the following steps for Amazon S3 bucket support:

Note the name of the S3 bucket created.

Note the AWS access key ID and secret access key of the IAM user who has write access to the S3 bucket.

Create a remote entry in Guild config

Using your text editor, open ~/.guild/config.yml.

Add the following to the bottom of the file:

remotes:
  s3:
    type: s3
    bucket: <name of S3 bucket>
Adding s3 to the remotes section in ~/.guild/config.yml

Replace <name of S3 bucket> with the name of the S3 bucket created in Requirements above.

Note

If a remotes section already exists in config.yml, omit that line from the snippet above and only copy the lines after remotes.

Save you changes to ~/.guild/config.yml.

Set AWS environment variables

In the same command console you’ll use throughout this guide, set the following environment variables:

export AWS_ACCESS_KEY_ID=<access key ID>
export AWS_SECRET_ACCESS_KEY=<secret access key>

Replace <access key ID> and <secret access key> with the respective values associated with the IAM user who has write access to the S3 bucket.

In the same console, verify that you have access to the S3 bucket from Guild by running:

guild remote status s3

If you have access, you will see:

s3 (S3 bucket BUCKET_NAME) is available

where BUCKET_NAME is the name of the S3 bucket associated with the remote.

Run fashion_mnist_mlp.py

Train the Fashion-MNIST classifier by running:

guild run fashion_mnist_mlp.py

Press Enter to start training.

This creates a run that we backup in the next step.

Backup latest run to S3

Copy the latest run to the s3 remote by running:

guild push 1 s3

The 1 in the command tells Guild to only copy the latest run. s3 is the name of the remote you configured above.

Guild prompts before copying:

You are about to copy (push) the following runs to s3:
  [0def3262]  fashion_mnist_mlp.py  2019-03-22 11:53:44  completed
  Continue? (Y/n)

Press Enter to continue.

Guild copies the latest run to the S3 bucket. You can verify this using the AWS Management Console.

Additionally, you can list runs available on the s3 remote using:

guild runs -r s3
Synchronizing runs with s3
[1:0def3262]  fashion_mnist_mlp.py  2019-03-22 11:53:44  completed

Delete the latest local run

Later we restore the fashion_mnist_mlp.py run from S3 — so let’s delete the local run first:

guild runs rm 1

Once again, the number 1 in the command tells Guild to only delete the latest run.

Press Enter to delete the run.

Restore the deleted run

In this step, we restore the deleted run by copying it from its backup location on S3.

To copy latest the run, use:

guild pull 1 s3

As with the earlier commands, the number 1 indicates that we only want to copy the latest run from S3. If you omit this argument, Guild will copy all of the runs from S3.

Press Enter to copy the run from S3.

You can verify that the run has been copied with:

guild runs info

Compare that information to the same command applied to the remote:

guild runs info -r s3

Cleanup

To delete all of the runs from S3, run:

guild runs rm --permanent -r s3

Note that this command deletes all of the runs from the s3 remote bucket — not just the latest run. Guild prompts you before deleting.

Verify the list and press y followed by Enter.

Note

If you don’t use the ‑‑permanent command line option, the run is not deleted from S3 as it can be restored using guild restore ‑r s3. If you want to truly delete the run, use the ‑‑permanent option.

Verify that there are no runs on the s3 remote:

guild runs -r s3

And that there are no restorable runs:

guild runs -r s3 --deleted

Summary

In this guide we used S3 to backup and restore a run.

This feature is useful for guarding against accidental run deletion, but it’s also useful when collaborating on teams:

  1. One or more researchers or engineers train models and push results to S3

  2. Other users pull runs for various uses including comparison, summary, and release processes

Next steps

Train a model remotely to take advantage of cloud based GPUs.
Explore Guild's comprehensive docs to learn more.