Runs

Overview

A Guild run is a computation that generates a result you’re interested in saving. We often refer to runs as experiments. Runs can perform any type of task including:

  • Train models
  • Prepare data sets
  • Evaluate and test trained models
  • Analyze and summarize sets of runs
  • Optimize model size and performance
  • Deploy models

The term run refers to one of two things:

  • An in-process run, represented by an operating system process
  • A file system artifact associated with a run's operating system process

Runs play a central role in systematic model improvement. By capturing experiment details, you establish a series of baseline measurements against which to compare future experiments. You maintain a record that can be used to check progress and decide on informed next steps.

Runs serve as a unit of reproducibility. By automating and capturing experiments in your workflow, you provide a smooth path for others to rerun your work and compare results.

Run Artifacts

Guild saves runs on standard file systems. Guild is different in this respect from experiment tracking systems that save experiment results in databases or exotic file systems…

Runs are stored under a runs directory located in Guild home. To show where Guild saves runs, refer to the guild_home attribute shown by guild check. Each run is saved in a unique subdirectory. For more information, see Run Directory below.

Run Directory

A run directory is a unique directory created for a run process. Each run directory is a persistent record of a run, including files generated by the operation script and metadata describing the run.

Run directories are stored together in a runs subdirectory of Guild home.

Run directories are named using a Guild-generated unique identifier. This value corresponds to the full run ID. Run IDs are globally unique to differentiate them across systems over time.

Generated Files

When Guild starts a run, it changes the current directory to the unique run directory. Relative paths written during the operation are written to the run directory. In this way, a run “captures” script output.
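
The effect of running with the run directory as the current directory can be illustrated with a small sketch. The names write_summary, run_in_directory, and summary.txt are hypothetical and not part of Guild. The point is only that a relative path resolves against the current directory, which under Guild is the run directory.

```python
import os
import tempfile


def write_summary(text):
    """Write text to a relative path and return the absolute path it resolved to."""
    with open("summary.txt", "w") as f:  # relative path: lands in the current directory
        f.write(text)
    return os.path.abspath("summary.txt")


def run_in_directory(run_dir, text):
    """Mimic the effect of changing to a run directory before running a script."""
    previous = os.getcwd()
    os.chdir(run_dir)
    try:
        return write_summary(text)
    finally:
        os.chdir(previous)  # always restore the caller's working directory


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as run_dir:  # stand-in for a real run directory
        print(run_in_directory(run_dir, "loss: 0.25\n"))
```

A script that writes to absolute paths bypasses this mechanism, so its output would not end up in the run directory.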

Guild additionally saves run metadata in the run directory. This information is shown when you run guild runs info.

Resource Links

Guild supports required resources for runs, which are files that an operation needs to run successfully. For example, a model training operation may require a data set prepared by another operation. These dependencies are defined in a Guild file.

Resource links may include:

  • Files from other runs
  • Downloaded files
  • Project files

When preparing a run directory, Guild resolves required resources by locating the appropriate files (downloading them if needed) and creating symbolic links to those files in the run directory. These links are part of the run artifact.
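
Resolution by symbolic link can be pictured with the following sketch. It is not Guild's implementation: link_resource and the file and directory names are hypothetical, and the sketch assumes a platform where creating symbolic links is permitted.

```python
import os
import tempfile


def link_resource(source_path, run_dir):
    """Create a symbolic link in run_dir that points at source_path."""
    link_path = os.path.join(run_dir, os.path.basename(source_path))
    os.symlink(os.path.abspath(source_path), link_path)
    return link_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # A file produced elsewhere, for example by a data preparation run.
        upstream = os.path.join(tmp, "prepared-data.csv")
        with open(upstream, "w") as f:
            f.write("x,y\n1,2\n")

        run_dir = os.path.join(tmp, "run")  # stand-in for a run directory
        os.mkdir(run_dir)

        link = link_resource(upstream, run_dir)
        print(os.path.islink(link))
```

Because the run directory holds a link rather than a copy, the operation reads the upstream file in place and no data is duplicated.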

For more information on defining and using required resources, see Dependencies.

Run Metadata

Run metadata consists of:

  • Flags
  • Snapshot of the source code used by the operation
  • OS process information such as command, environment, process ID, and exit code
  • Run output (content written to standard output and standard error during the operation)

Run metadata are stored in files under a Guild-generated .guild subdirectory of the run directory. You can list these files for a run using guild ls as follows:

guild ls -a -p .guild [RUN]

Guild provides various commands to show run metadata. See Get Run Information below for details.

Start a Run

Start a run using guild run. The run command is a multi-feature command that supports the following actions:

  • Run a Python script
  • Run an operation defined in a Guild file
  • Start a batch of runs
  • Stage runs for deferred execution
  • Restart a stopped run
  • Start a run from a prototype
  • Show help for an operation
  • Test an operation

Run a Python Script

In Get Started you run a mock training script named train.py.

guild run train.py

Example of running a Python script directly. train.py in this case is a file in the current directory

This is often a good place to start because it doesn’t require additional configuration. For more control over a run, see Run an Operation below.

Note While Guild supports other languages, Guild currently only supports direct execution of Python scripts. To run a script in a different language, define an operation. See Run an Operation below for more information.

For more information on Guild’s default behavior, see Default Behavior - Python Scripts.

Tip Guild’s support for direct script execution is a convenience to get started quickly. Without additional information, Guild makes various assumptions that may not hold true for your script. Consider using operations for more control over how Guild runs your script. See Run an Operation below for more information.

Auto-Detect Python Script Flags

To run a Python script without additional configuration, Guild inspects the script to determine script flags, which provide information to the script.

If the script uses argparse, Guild assumes the script accepts flags as command line options. Otherwise, Guild assumes that the script defines flags as global variables.

For example, Guild detects a single flag learning_rate from this code:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=0.01)

args = parser.parse_args()
print("learning rate is %f" % args.learning_rate)

Guild detects use of argparse and assumes that command options are used as flags

Guild detects learning_rate from this code:

learning_rate = 0.01
print("learning rate is %s" % learning_rate)

When argparse is not used, Guild considers global variables not starting with _ and that are assigned a constant value (string, number, or boolean) as flags
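
The detection rule described above can be approximated with Python's ast module. This is a minimal sketch under stated assumptions, not Guild's actual detection code: uses_argparse and detect_global_flags are hypothetical names, and the sketch ignores cases such as negative literals or values computed from expressions.

```python
import ast


def uses_argparse(source):
    """Return True if the source imports argparse anywhere."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            if any(alias.name == "argparse" for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom) and node.module == "argparse":
            return True
    return False


def detect_global_flags(source):
    """Return {name: value} for top-level constant assignments not starting with '_'."""
    flags = {}
    for node in ast.parse(source).body:  # top-level statements only
        if not isinstance(node, ast.Assign) or len(node.targets) != 1:
            continue
        target, value = node.targets[0], node.value
        if not isinstance(target, ast.Name) or target.id.startswith("_"):
            continue
        # Keep only string, number, and boolean constants.
        if isinstance(value, ast.Constant) and isinstance(
            value.value, (str, int, float, bool)
        ):
            flags[target.id] = value.value
    return flags


if __name__ == "__main__":
    script = "learning_rate = 0.01\n_private = 1\nprint(learning_rate)\n"
    print(uses_argparse(script))        # False
    print(detect_global_flags(script))  # {'learning_rate': 0.01}
```

Running a check like this on your own script is a quick way to predict which variables are candidates for flags before relying on auto-detection.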

Auto-Detect Python Script Scalars

Guild logs as scalars any numeric values that the script writes to output using the format:

<key>: <value>

<key> must not contain spaces and <value> must be parsable as a number. Output must not contain leading spaces.

For example, the following script implicitly logs the scalars loss and accuracy:

loss, accuracy = train_model()

print("loss: %f" % loss)
print("accuracy: %f" % accuracy)

By default, Guild logs numeric values printed as KEY: VALUE as scalars

This behavior can be modified by defining output-scalars for an operation in a Guild file. See Scalars for more information.
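
The default KEY: VALUE rule can be approximated in a few lines. The sketch below is an illustration of the rule as described in this section, not Guild's parser. The name parse_scalars is hypothetical, and using Python's float() to decide what counts as a number is an assumption.

```python
import re

# Key without spaces at the start of the line, then ": ", then a single token.
_SCALAR_LINE = re.compile(r"^(\S+): (\S+)\s*$")


def parse_scalars(output):
    """Return {key: value} for lines shaped like 'key: number'."""
    scalars = {}
    for line in output.splitlines():
        match = _SCALAR_LINE.match(line)
        if not match:
            continue  # leading spaces or spaces in the key do not match
        key, raw = match.groups()
        try:
            scalars[key] = float(raw)  # a later line for the same key overwrites
        except ValueError:
            continue  # value is not parsable as a number
    return scalars


if __name__ == "__main__":
    sample = "loss: 0.25\naccuracy: 0.9\n  loss: 1.0\nstatus: done\n"
    print(parse_scalars(sample))  # {'loss': 0.25, 'accuracy': 0.9}
```

The indented line and the non-numeric line are skipped, which mirrors the two constraints stated above.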

Run an Operation

Run an operation by specifying the operation name. To show the list of available operations for a project, use guild ops. To include installed operations (that is, operations defined in installed packages), use -i with the ops command.

Once you know the operation name, start it with guild run. Refer to the command help for options.

Run a Batch

To run a batch implicitly, specify a flag value that starts a batch run, such as a list of values. Start a batch explicitly by specifying --optimize or --optimizer with guild run.

See Batches below for more information.

Stage a Run

You can stage a run to start later. To stage a run, use --stage with guild run.

Start a staged run using --start RUN-ID with run.

Staged runs are often used with queues to schedule runs. See Queues for more information.

To stage multiple runs at once using flag list values, specify --stage-trials instead of --stage. See Batches below for information on generating multiple trials using flag list values.

Restart a Run

Restart a run using --restart RUN-ID with run.

Use this to resume a run that terminated early and has a checkpoint to restart from.

Note Your operation code must support restarts from a checkpoint. This requires specific coding to handle restarts. You generally do this by looking for a checkpoint file and restarting from the checkpoint if it’s present. When Guild restarts a run, the run has access to any checkpoints it wrote previously.
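
The checkpoint pattern described in the note might look like the following sketch. The file name checkpoint.json, the train function, and the stop_after parameter (used only to simulate early termination) are hypothetical. A real operation would save model state rather than a step counter.

```python
import json
import os
import tempfile

CHECKPOINT = "checkpoint.json"  # hypothetical name, written to the current directory


def train(total_steps, stop_after=None):
    """Run up to total_steps, resuming from CHECKPOINT if it is present."""
    step = 0
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            step = json.load(f)["step"]  # resume where the previous attempt stopped

    executed = 0
    while step < total_steps:
        if stop_after is not None and executed >= stop_after:
            break  # simulate a run that terminates early
        step += 1  # stand-in for one unit of training work
        executed += 1
        with open(CHECKPOINT, "w") as f:
            json.dump({"step": step}, f)
    return step


if __name__ == "__main__":
    start_dir = os.getcwd()
    with tempfile.TemporaryDirectory() as run_dir:  # stand-in for a run directory
        os.chdir(run_dir)
        try:
            print(train(10, stop_after=4))  # 4: first attempt stops early
            print(train(10))                # 10: restart resumes from step 4
        finally:
            os.chdir(start_dir)
```

Because the checkpoint is written with a relative path, it is stored in the run directory and is available when the run is restarted.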

By default, Guild uses the source code associated with the run when restarting. To force Guild to use the current source code, specify the --force-sourcecode option along with --restart. Use this when you want to apply a fix to a terminated run.

Guild uses the run flag values when restarting. You can redefine flag values as needed when restarting a run.

Run From a Prototype

To start a new run using an existing run as a prototype, use --proto RUN-ID with run.

By default, Guild uses the source code associated with the prototype run. To force Guild to use the current source code, specify the --force-sourcecode option along with --proto.

Show Operation Help

Show help for an operation by specifying the --help-op option with run.

Command help is generated from the Guild file.

Test an Operation

There are various ways to test an operation. The first way is to simply run it. Consider implementing the 10 Second Rule for your operations so you can run the full code path in less than 10 seconds.

Test Source Code

Show what Guild copies as source code for an operation by specifying the --test-sourcecode option with run. Guild shows the rules used to include and exclude source code files. It shows the files that are selected and those that are skipped.

Test Output Scalars

Use --test-output-scalars to test operation output scalar configuration. You can test a file or test standard input using the filename -.

Test output scalars for a run by piping run output to the run command:

guild cat --output | guild run --test-output-scalars -

You can type input per line for evaluation using this command:

guild run --test-output-scalars -

When you’re done evaluating input, enter a blank line by pressing Enter or press Ctrl-C.

Test Flags

Flag detection, flag imports, and flag definitions are often a process of refinement as you develop an operation. Use the --test-flags option with run to show how Guild processes flags for an operation.

Stage a Run in a Directory

Use --stage with --run-dir to stage a run in a directory. Guild initializes the run as it would normally. This lets you inspect the run directory and even run the operation directly with Python. This is often an effective way to troubleshoot issues. When you’ve resolved any issues, run the operation normally.

Run on a Remote System

To run an operation on a remote, use the --remote option with guild run.

You must define the specified remote in user configuration.

See Remotes for information on defining and using remotes.

Manage Runs

Run Filters

Run-related commands support a common set of run filter options. Use these options to limit the command to runs that match the specified filters.

The table below lists the common run filter options.

Option                Description
-l, --label VAL       Filter runs with labels matching VAL.
-U, --unlabeled       Filter only runs without labels.
-M, --marked          Filter only marked runs.
-N, --unmarked        Filter only unmarked runs.
-R, --running         Filter only runs that are still running.
-C, --completed       Filter only completed runs.
-E, --error           Filter only runs that exited with an error.
-T, --terminated      Filter only runs terminated by the user.
-P, --pending         Filter only pending runs.
-G, --staged          Filter only staged runs.
-S, --started RANGE   Filter only runs started within RANGE. See above for valid time ranges.
-D, --digest VAL      Filter only runs with a matching source code digest.

Commands that support these filters include cat, compare, export, import, label, ls, mark, open, publish, pull, push, runs diff, runs info, runs purge, runs restore, runs rm, runs, select, tensorboard, and view.

List Runs

List runs using guild runs. This is an alias for the full command guild runs list.

guild runs

By default Guild shows the latest 20 runs. You can show more runs using one or more occurrences of the -m option. Each time you specify -m, Guild shows 20 more runs.

Show all runs using the -a option.

You can filter runs using various filter options. See runs list for details.

You can list runs in an export archive by specifying the directory with the -A, --archive option.

Get Run Information

Use guild runs info to show run information. This command outputs a number of run attributes.

By default Guild shows information for the latest run. Specify a run index or run ID to show information for a different run.

To export runs info as JSON, use the --json option.

To list run files, use guild ls.

To open a file from a local run using an associated program, use guild open.

To open a new command line shell for a run, use --shell with guild open. The shell is configured with the run directory as the current directory.

To show the contents of a run file, use guild cat.

Compare Runs

Guild supports various methods for comparing runs.

The table below describes when to use each command.

Guild Command   When to Use
compare         Compare run flags and scalars in a tabular view. The command is terminal based and works the same when run remotely and locally. See Guild Compare for details.
view            Graphical application (web based) for exploring and comparing runs. See Guild View for details.
tensorboard     Use TensorBoard to compare run scalars, images, hyperparameters, and other logged summaries. See TensorBoard for details.
runs diff       Use a diff program to compare runs.

Delete Runs

Use guild runs rm or guild runs delete to delete runs. These commands are identical. Use the form you prefer.

By default, when you delete a run, Guild moves the run to a trash directory in case you want to restore it. Use guild runs restore to restore deleted runs. List deleted runs using guild runs -d.
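
The delete, restore, and purge life cycle can be pictured as moving directories between two locations. The sketch below is only an illustration: the runs and trash directory names under home, and the function names, are assumptions for this example and do not describe Guild's actual on-disk layout.

```python
import os
import shutil
import tempfile


def delete_run(home, run_id):
    """Move a run directory to the trash instead of removing it."""
    os.makedirs(os.path.join(home, "trash"), exist_ok=True)
    shutil.move(os.path.join(home, "runs", run_id), os.path.join(home, "trash", run_id))


def restore_run(home, run_id):
    """Move a trashed run back to the runs directory."""
    shutil.move(os.path.join(home, "trash", run_id), os.path.join(home, "runs", run_id))


def purge(home):
    """Permanently remove everything in the trash. This cannot be undone."""
    shutil.rmtree(os.path.join(home, "trash"), ignore_errors=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as home:
        os.makedirs(os.path.join(home, "runs", "abc123"))
        delete_run(home, "abc123")
        print(os.listdir(os.path.join(home, "trash")))  # ['abc123']
        restore_run(home, "abc123")
        print(os.listdir(os.path.join(home, "runs")))   # ['abc123']
```

Only the purge step frees disk space. Until then a deleted run occupies the same space as before.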

You can permanently delete deleted runs using guild runs purge. This operation cannot be undone. Purging runs frees disk space.

Tip Avoid the temptation to permanently delete runs until you need to. Deleted runs are surprisingly useful sources of information. When they’re permanently deleted, the information is gone.

Use guild check --space to see space used by runs. When purging runs to free disk space, consider deleting old runs using the -S option (short form of --started). For example, purge runs older than 30 days with guild runs purge -S "before 30 days ago".

Export and Import Runs

Use guild export to move or copy runs to a local directory.

Use --move with export to move runs out of the environment into a directory. This is a good method for organizing runs over time. Use different export directories to organize and archive runs. This keeps your environment free of older runs that you aren’t working with. You can import these runs back into the environment as needed.

Use guild import to move or copy runs from a directory to your environment.

Copy Runs to and from a Remote System

Use guild push to copy runs from your local environment to a remote. This is useful for backing up runs to a server or to a cloud service like S3. You can also use this method to share runs with colleagues or deploy models.

Use guild pull to copy runs from a remote to your local environment. Use this method to restore archived runs or get runs from colleagues.

Labels

A label is used to describe a run. Run labels are shown in various contexts including runs lists and run comparisons.

To show the label for a specific run, use guild runs info — the label is designated by the label attribute.

Use labels to filter runs. See Run Filters above for commands that support the label filter. For example, to show runs with labels containing “best”, run:

guild runs -l best

Set a Label for a New Run

By default, Guild generates a label using the label template configured for an operation. If an operation doesn’t specify a label template, Guild generates a default label with flag assignments.

Set a different label using the --label option with guild run.

You can alternatively tag a run with a string that’s prepended to the default label. Specify the tag with the --tag option.
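
As a rough picture of how a default label and a tag relate, consider the sketch below. The exact label format Guild generates is not specified in this section, so the space-separated name=value form, the sorting of flag names, and the function names are all assumptions made for illustration.

```python
def default_label(flags):
    """Build a label from flag assignments, for example 'batch_size=100 lr=0.1'."""
    return " ".join(f"{name}={value}" for name, value in sorted(flags.items()))


def tagged_label(tag, flags):
    """Prepend a tag to the default label."""
    return f"{tag} {default_label(flags)}".strip()


if __name__ == "__main__":
    flags = {"lr": 0.1, "batch_size": 100}
    print(default_label(flags))         # batch_size=100 lr=0.1
    print(tagged_label("best", flags))  # best batch_size=100 lr=0.1
```

A tag keeps the flag assignments visible in the label, whereas setting a label with --label replaces them.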

Modify a Run Label

Modify the label for one or more runs using guild label. When modifying a label, you have several options.

The table below lists ways of modifying labels.

Action                         Command
Replace labels                 guild label --set LABEL
Prepend a value to labels      guild label --tag VALUE (alias for guild label --prepend VALUE)
Append a value to labels       guild label --append VALUE
Remove a value from labels     guild label --untag VALUE (alias for guild label --remove VALUE)
Clear labels                   guild label --clear

By default Guild applies the label command to the latest run. To apply it to different runs, use one or more RUN arguments. To label all runs matching a filter, use the : argument.

Batches

A batch run is a run that generates other runs. A run generated by a batch is referred to as a trial.

The following command starts a batch. It uses the default batch operation to generate two trials — one for each flag value specified:

guild run train lr='[0.01,0.1]'

The following command also starts a batch. It generates 20 runs using randomly selected values from the log-uniform distribution over the specified range:

guild run train lr=loguniform[1e-5:1e-1]

Both examples start batches implicitly based on the specified flag values. Start a batch explicitly by specifying either --optimize or --optimizer. Optimizers are batch operations. They generate one or more trials.
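
The two implicit forms can be sketched in Python to show what a batch conceptually produces. This is not Guild's implementation: expand_grid and sample_loguniform are hypothetical names, and the sketch assumes that several list-valued flags combine as a Cartesian product.

```python
import itertools
import math
import random


def expand_grid(flags):
    """Yield one flag dict per combination of list-valued flags."""
    names = list(flags)
    choices = [v if isinstance(v, list) else [v] for v in flags.values()]
    for combo in itertools.product(*choices):
        yield dict(zip(names, combo))


def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform over [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


if __name__ == "__main__":
    # Counterpart of lr='[0.01,0.1]': two trials, one per value.
    for trial in expand_grid({"lr": [0.01, 0.1]}):
        print(trial)

    # Counterpart of lr=loguniform[1e-5:1e-1]: 20 randomly sampled trials.
    rng = random.Random(0)  # seeded only to make the sketch repeatable
    trials = [{"lr": sample_loguniform(1e-5, 1e-1, rng)} for _ in range(20)]
    print(len(trials))
```

Sampling in log space spreads trials evenly across orders of magnitude, which is usually what you want for values such as learning rates.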

For more information, see Hyperparameter Optimization.

Batch Files

Trials are also generated when you specify a batch file for a run. Batch files are specified using @PATH where PATH is the path to a file containing trial flags.

Batch files must be defined using one of the following formats:

  • CSV
  • YAML
  • JSON

Sample CSV formatted batch file:

lr,batch_size,dropout
0.1,100,0.1
0.1,100,0.2
0.01,100,0.1
0.01,100,0.2

Sample YAML formatted batch file:

- lr: 0.1
  batch_size: 100
  dropout: 0.1
- lr: 0.1
  batch_size: 100
  dropout: 0.2
- lr: 0.01
  batch_size: 100
  dropout: 0.1
- lr: 0.01
  batch_size: 100
  dropout: 0.2

Sample JSON formatted batch file:

[{"lr": 0.1, "batch_size": 100, "dropout": 0.1},
 {"lr": 0.1, "batch_size": 100, "dropout": 0.2},
 {"lr": 0.01, "batch_size": 100, "dropout": 0.1},
 {"lr": 0.01, "batch_size": 100, "dropout": 0.2}]

You can override flag values when using batch files.
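
Batch files can also be generated by a script rather than written by hand. The sketch below writes the same four trials used in the samples above as CSV and JSON text. The helper names trials_to_csv and trials_to_json are hypothetical, and only the Python standard library is used.

```python
import csv
import io
import json


def trials_to_csv(trials):
    """Render a list of flag dicts as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(trials[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(trials)
    return buf.getvalue()


def trials_to_json(trials):
    """Render a list of flag dicts as a JSON array."""
    return json.dumps(trials, indent=1)


if __name__ == "__main__":
    trials = [
        {"lr": lr, "batch_size": 100, "dropout": dropout}
        for lr in (0.1, 0.01)
        for dropout in (0.1, 0.2)
    ]
    print(trials_to_csv(trials))
    print(trials_to_json(trials))
```

Save the output to a file and pass it to guild run using the @PATH form described above.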