Train on EC2

  1. Requirements
  2. Define a remote
  3. Check remote status
  4. Start remote
  5. Guild commands with remote
  6. Train Fashion-MNIST on EC2
    1. Pull remote runs
  7. Stop remote
  8. Summary

This guide describes how to run Guild operations on EC2 through Guild’s remote facility.

Requirements

Define a remote

Guild remotes are defined in ~/.guild/config.yml. You must edit this file to add and modify remote definitions.

In this guide we add a remote named ec2‑k80 that uses an EC2 p2.xlarge GPU instance (running a Tesla K80 GPU).

In addition to the instance type the EC2 remote requires:

  • AWS region to start instance in
  • AMI
  • Path to your SSH public key

Modify ~/.guild/config.yml and add the following at the end of the file:

remotes:
  ec2-k80:
    type: ec2
    description: Tesla K80 running on EC2
    instance-type: p2.xlarge
    region: us-east-2
    ami: ami-4f62582a
    public-key: ~/.ssh/id_rsa.pub
    user: ubuntu
    init: |
      set -ex
      echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64' >> ~/.bashrc
      . ~/.bashrc
      sudo pip install --pre --upgrade guildai tensorflow-gpu
      guild check

Note

If you already have a remotes section, add ec2‑k80 within that section—don’t add a second remotes section.

Save your changes to ~/.guild/config.yml.
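As a quick sanity check on the note above, the following sketch counts top-level remotes keys in a config file. The function name count_remotes_sections is hypothetical and not part of Guild; it assumes the key is written as remotes: at the start of a line, as in the definition above.

```shell
# Count top-level "remotes:" keys in a Guild user config file.
# A result other than 1 suggests a missing or duplicated section.
# $1: path to the config file (for example ~/.guild/config.yml)
count_remotes_sections() {
  grep -c '^remotes:' "$1"
}
```

Running count_remotes_sections ~/.guild/config.yml should print 1 when the file contains a single remotes section.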

In a command console, list available remotes:

guild remotes

Guild should show:

ec2-k80        ec2  Tesla K80 running on EC2

If you don’t see the remote or Guild exits with an error, verify the step above and try again.

Check remote status

Use remote status to check status for ec2‑k80:

guild remote status ec2-k80

Guild should exit with this error message:

guild: missing required AWS_ACCESS_KEY_ID environment variable

Guild requires AWS access keys to check server status in EC2. You must define the following two environment variables to use EC2 remotes in Guild:

AWS_ACCESS_KEY_ID
Access key ID for your AWS security credentials.
AWS_SECRET_ACCESS_KEY
Secret access key for your AWS security credentials.

Note

If you don’t have these values, refer to Requirements above for help.

Define the required environment variables, replacing <...> with your access key values:

export AWS_ACCESS_KEY_ID=<your access key id>
export AWS_SECRET_ACCESS_KEY=<your secret access key>
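Before checking status again, it can help to confirm that both variables are actually exported to child processes such as guild. This is a minimal sketch; check_aws_env is a hypothetical helper, not a Guild command, and it relies on the standard printenv utility being available.

```shell
# Report which of the two required AWS variables are missing from the
# environment. Values are never printed, only the variable names.
check_aws_env() {
  rc=0
  for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
    # printenv exits non-zero when the variable is not exported
    if ! printenv "$name" > /dev/null; then
      echo "$name is not exported" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

Calling check_aws_env returns 0 only when both variables are exported, so it can guard later commands, for example check_aws_env && guild remote status ec2-k80.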

Check status again:

guild remote status ec2-k80

Guild should show:

guild: remote ec2-k80 is not available (not started)

If Guild exits with an error, verify that the requirements above are met. If you cannot resolve the issue, open an issue on GitHub.

Important

Do not post AWS security credentials to GitHub issues or otherwise make them available in plain text to others.

Start remote

Start the ec2‑k80 remote by running:

guild remote start ec2-k80

Press Enter to confirm.

Guild uses Terraform to create and start the various services on AWS used by the remote. This may take several minutes.

Guild uses the script specified in the remote’s init to initialize the server when it starts. You can customize this script as needed to initialize your own servers.

Here’s the script that we’re using in this guide:

set -ex
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64' >> ~/.bashrc
. ~/.bashrc
sudo pip install --pre --upgrade guildai tensorflow-gpu
guild check

Init scripts are specific to the remote AMI. In this case, the AMI we’re using requires a few things:

  • Minor change to the environment init
  • Installation of TensorFlow and Guild AI

We use set -ex at the start of the script: -e makes the script exit at the first failing command, and -x prints each command as it runs, which helps when debugging issues.

We run guild check at the end of the script to verify that the environment is working as expected.
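To illustrate what set -ex does in an init script, here is a small standalone sketch that is unrelated to Guild itself. It runs a three-command script in a child shell and captures the result; the variable names demo_output and demo_status are illustrative only.

```shell
# Run a tiny script in a child shell so its failure does not end this session.
# -e stops at the first failing command; -x traces each command to stderr
# (the trace is discarded here to keep the captured output clean).
demo_status=0
demo_output=$(sh -c 'set -ex; echo before; false; echo after' 2>/dev/null) || demo_status=$?

echo "output: $demo_output"
echo "exit status: $demo_status"
```

The captured output contains only before, and the exit status is non-zero: the failing false command ends the script, so echo after never runs. In the init script this means a failed pip install prevents guild check from running against a half-configured server.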

If the server fails to start, note the error message and open an issue on GitHub to get help.

If the command succeeds, Guild shows the host name of the EC2 instance.

Confirm that the remote is available by checking status:

guild remote status ec2-k80

Guild updates status for the remote and should display the following (host name will differ):

ec2-k80 (ec2-18-222-63-152.us-east-2.compute.amazonaws.com) is available
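If the status check is part of a script, a polling helper can be convenient. The sketch below is a generic retry loop; wait_until is a hypothetical name, and using it with Guild assumes that guild remote status exits with a non-zero status while the remote is unavailable, which the error-style message shown earlier suggests but this guide does not state.

```shell
# Poll a command until it succeeds or the attempts run out.
# $1: maximum attempts, $2: seconds between attempts, rest: command to run
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only when another attempt remains
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, wait_until 30 10 guild remote status ec2-k80 would check status up to 30 times, 10 seconds apart, and return 0 as soon as one check succeeds.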

Guild commands with remote

The ec2-k80 remote is now available for use. Let’s run some basic Guild commands with the remote. Commands that support remote execution all support a -r, --remote option, which indicates that the command applies to the specified remote.

Check the Guild environment:

guild check -r ec2-k80

Guild shows information for the remote. You can use this to quickly check a remote environment.

List remote runs:

guild runs -r ec2-k80

As we haven’t run any operations on the remote, the list is empty.

Train Fashion-MNIST on EC2

In this section we train the Basic Fashion-MNIST image classifier example on the ec2‑k80 server.

First, clone the Guild Examples repository:

git clone https://github.com/guildai/examples.git

Change to the examples/fashion directory:

cd examples/fashion

Before training, we need to prepare the Fashion-MNIST images.

Run prepare‑data on the ec2‑k80 remote:

guild run prepare-data -r ec2-k80

Press Enter to confirm the operation.

Guild packages the local project and installs it on the remote. It then runs the operation in EC2.

When the operation finishes, view the remote run:

guild runs -r ec2-k80

Guild shows the runs on the remote (ID and times will differ):

[1:43e252de]  fashion/fashion:prepare-data  2018-10-25 13:24:21  completed

Next, run train on the remote:

guild run train -r ec2-k80

Press Enter to continue.

Guild similarly packages the local project and runs it on EC2.

When the train operation finishes, list remote runs:

guild runs -r ec2-k80

Guild shows two remote runs (IDs and times will differ):

[1:b50bea64]  fashion/fashion:train         2018-10-25 13:34:41  completed
[2:43e252de]  fashion/fashion:prepare-data  2018-10-25 13:24:21  completed

Both of these runs reside on the remote server in EC2. In the next section we copy them to the local system.
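The two run commands above can also be chained so that training starts only if data preparation succeeds. This is a minimal sketch; run_on_remote is a hypothetical helper, and it uses only the guild run OPERATION -r REMOTE form shown in this guide, so each operation still prompts for confirmation.

```shell
# Run operations in order on a remote, stopping at the first failure.
# $1: remote name, rest: operation names
run_on_remote() {
  remote=$1
  shift
  for op in "$@"; do
    guild run "$op" -r "$remote" || return 1
  done
}
```

From the examples/fashion directory, run_on_remote ec2-k80 prepare-data train would run both operations in sequence.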

Pull remote runs

We use EC2 to perform computation but we ultimately want to capture the results. We can do that using the pull command, which synchronizes runs on a remote to our local environment.

Pull all of the remote runs:

guild pull ec2-k80

Press Enter to confirm.

Guild copies the runs on EC2 to the local system.

When the command finishes, view the local runs:

guild runs

Guild shows the two remote runs, which have been copied to the local system.

Once remote runs are pulled, they are like any other local run.

With the runs safely copied, we can stop the remote server so we don’t pay for unused EC2 resources.

Stop remote

Use remote stop to terminate the ec2‑k80 instance and all of its supporting EC2 services:

guild remote stop ec2-k80

Type y and press Enter to confirm that you want to stop the remote.

Important

Stopping an EC2 remote will terminate the associated EC2 instance. You will lose any files stored on non-persistent storage. If you want to keep remote runs, copy them locally using pull first.
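One way to guard against stopping a remote before its runs are copied is to make the stop conditional on a successful pull. The sketch below assumes that guild pull exits with a non-zero status on failure; pull_then_stop is a hypothetical helper, and both Guild commands still prompt for confirmation.

```shell
# Stop a remote only after its runs have been copied locally.
# $1: remote name as defined in ~/.guild/config.yml
pull_then_stop() {
  remote=$1
  if guild pull "$remote"; then
    guild remote stop "$remote"
  else
    # Leave the instance running so the runs are not lost
    echo "pull failed; leaving $remote running" >&2
    return 1
  fi
}
```

The trade-off is that a failed pull leaves the instance running and accruing cost until the problem is resolved and the remote is stopped manually.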

Summary

In this guide we use a Guild remote to start a server in EC2 that is used to train the basic Fashion-MNIST example. Once trained, we copy the remote runs to the local system and stop the remote to avoid ongoing AWS costs.