Train an Image Classifier
In this guide, we train an image classifier on the Fashion-MNIST data set.
If you haven’t already, install and verify Guild AI before following this guide. The commands below must be entered in a command console or prompt for your system. If you are unfamiliar with command consoles, Getting to Know the Command Line by David Baumgold provides a number of helpful tips.
Image classifier training script
In this step, we create a script named fashion_mnist_mlp.py. It is an image classifier training script adapted from one of the official Keras examples.
If you haven’t done so already, create a new directory for the project:
Create a file named fashion_mnist_mlp.py in the project directory.
Verify that your project structure is:
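Guild can run a plain Python script like this without modification: it treats simple module-level assignments as flags, which is why the run prompt in the next step lists batch_size, dropout, and the rest. The skeleton below is a hypothetical sketch of that flag interface, not the actual training script (the real script builds and fits a Keras model):

```python
# Hypothetical skeleton of fashion_mnist_mlp.py (not the full script).
# Guild AI detects simple module-level assignments like these as flags,
# which is why `guild run` prompts with batch_size, dropout, and so on.

batch_size = 128   # samples per gradient update
dropout = 0.2      # dropout rate between dense layers
epochs = 5         # number of training epochs
lr = 0.001         # learning rate
lr_decay = 0.0     # learning-rate decay

def describe_config():
    """Return the current flag values (stand-in for real training)."""
    return {
        "batch_size": batch_size,
        "dropout": dropout,
        "epochs": epochs,
        "lr": lr,
        "lr_decay": lr_decay,
    }

if __name__ == "__main__":
    # The real script builds and fits a Keras MLP here; this sketch only
    # shows the flag interface that Guild interacts with.
    print(describe_config())
```

When Guild runs the script with, say, lr=0.01, it rewrites the corresponding global before the script executes, so the script itself needs no argument-parsing code.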
Train with default settings
In a command console, change to the project directory and run:
guild run fashion_mnist_mlp.py
You are about to run fashion_mnist_mlp.py
  batch_size: 128
  dropout: 0.2
  epochs: 5
  lr: 0.001
  lr_decay: 0.0
Continue? (Y/n)
Press Enter to start training.
By default, the script is configured to train over 5 epochs.
You can view the training results in various ways:
- List available runs, which includes the run you just generated:

guild runs

This command shows the last 20 runs. If you’re only interested in listing runs of fashion_mnist_mlp.py, you can filter the list using the --operation command line option:

guild runs -o fashion_mnist_mlp.py
You can type a portion of the operation name with this option. For example, guild runs -o mnist would show all operations whose names contain “mnist”.
- Show information for the run
guild runs info
This command shows information for the latest run.
- List generated run files:

guild ls

This command lists files for the latest run. In the case of fashion_mnist_mlp.py, we see:

Directory and file names will differ on your system.
The events file under logs is a TensorBoard event log generated by the training script, specifically by the TensorBoard Keras callback that the script uses.
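To make the layout concrete, here is a small sketch that builds a mock run directory matching the paths this guide refers to: the flags attribute under .guild/attrs and an event log under logs. The file names are illustrative; real run directories contain more files and use generated event-file names:

```python
import pathlib
import tempfile

def make_mock_run():
    """Create a mock run directory mirroring the paths mentioned above.

    This is an illustration only: real Guild run directories contain
    additional files, and the events-log name is generated.
    """
    root = pathlib.Path(tempfile.mkdtemp())
    (root / ".guild" / "attrs").mkdir(parents=True)
    (root / ".guild" / "attrs" / "flags").write_text("lr: 0.001\n")
    (root / "logs").mkdir()
    (root / "logs" / "events.out.tfevents.example").touch()
    # Return the relative paths of all files, like a recursive listing
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )

print(make_mock_run())
```

The flags path shown here (.guild/attrs/flags) is the same file that guild diff compares later in this guide.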
View results in TensorBoard
View the fashion_mnist_mlp.py training run in TensorBoard:
guild tensorboard --operation mnist
This command shows any run matching “mnist” in TensorBoard. If you run this command in a separate command console, you can leave TensorBoard running in the background while you run more operations — Guild automatically syncs TensorBoard with the current runs.
See the tensorboard command for more information on running TensorBoard from Guild.
When you’re done viewing results in TensorBoard, return to the command prompt and press Ctrl-C to stop TensorBoard.
Train a second time
Run fashion_mnist_mlp.py again, this time specifying a different value for the lr flag:
guild run fashion_mnist_mlp.py lr=0.01
This changes the learning rate from the default of 0.001 to 0.01. It turns out that this value is too high, but we use the scenario to demonstrate a simple troubleshooting process in Guild.
Press Enter to start training.
As the model trains, note the validation accuracy (reported as val_acc in the training output). It is roughly 10%, which is no better than random guessing, so we know our model isn’t learning.
You can stop the training at any point by typing
Ctrl‑C — or
let it run to completion.
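The failure mode here is generic, not specific to this model: a learning rate above the stable range makes each update overshoot the minimum. The toy gradient-descent example below (plain Python, unrelated to the Keras script) shows the same effect on f(x) = x², where a step size below the stability threshold shrinks |x| each step while a larger one diverges:

```python
def gradient_descent(lr, steps=50, x0=5.0):
    """Minimize f(x) = x**2 (gradient 2*x) and return the final |x|."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # standard gradient-descent update
    return abs(x)

# lr inside the stable range for this loss (lr < 1): |x| shrinks each step
print(gradient_descent(lr=0.1))   # converges toward 0
# lr outside the stable range: every step overshoots and |x| grows
print(gradient_descent(lr=1.1))   # diverges
```

For a neural network the stable range depends on the model and data, so a value like 0.01 can be fine for one setup and too high for another, which is exactly why tracking and comparing runs matters.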
Compare the runs:
guild compare --table --operation mnist --strict-cols =lr,val_acc
This variation of the compare command uses --strict-cols to show only the columns we’re interested in comparing, in this case lr and val_acc. The syntax =lr means “the flag lr” and is used to distinguish flag values from scalars. val_acc is the name of the scalar used for validation accuracy.
For details on compare options, see the compare command.
Show differences between runs
In the previous step, we tried a learning rate that was too high — our model failed to learn anything at all.
Let’s assume for a moment we didn’t know why this happened. How could we troubleshoot the problem?
Let’s use Guild’s
diff command to compare our last two
runs. Specifically, we compare changes to flags and source code.
guild diff --flags
--- ~/.guild/runs/7327dbd44bce11e98af6c85b764bbf34/.guild/attrs/flags
+++ ~/.guild/runs/925f38e44bce11e98af6c85b764bbf34/.guild/attrs/flags
@@ -1,5 +1,5 @@
 batch_size: 128
 dropout: 0.2
 epochs: 5
-lr: 0.001
+lr: 0.01
 lr_decay: 0.0
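The comparison itself is ordinary line diffing of the two runs’ flags files. As an illustration (with the file contents written inline rather than read from real run directories), Python’s standard difflib produces the same kind of unified diff:

```python
import difflib

# Illustrative contents of two runs' .guild/attrs/flags files
flags_run1 = """batch_size: 128
dropout: 0.2
epochs: 5
lr: 0.001
lr_decay: 0.0
""".splitlines(keepends=True)

flags_run2 = """batch_size: 128
dropout: 0.2
epochs: 5
lr: 0.01
lr_decay: 0.0
""".splitlines(keepends=True)

# Produce a unified diff, as `diff -u` (and the output above) would
diff = list(difflib.unified_diff(
    flags_run1, flags_run2, fromfile="run1/flags", tofile="run2/flags"))
print("".join(diff))
```

Because each flag sits on its own line in a plain-text file, any standard diff tool can pinpoint the changed hyperparameter.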
By default, Guild uses the diff program to show differences. You can specify an alternative program when running diff with the --cmd command line option. For example, if you have Meld available on your system, you can compare the last two runs by running the diff command with --cmd meld.
You can configure the default program used for diffing in user configuration.
Here’s the diff of flags in Meld:
We can see from this comparison exactly what changed across the two runs: the learning rate went from 0.001 to 0.01. While this is a simple example, it demonstrates the value of systematically tracking your runs.
Summary
In this guide we trained a simple image classifier and used TensorBoard and diffing tools to view and compare runs.
- The training script used in this guide is a realistic example of a machine learning training workload
- We didn’t modify the script to take advantage of Guild’s experiment tracking and comparison features
- We used a simple method of troubleshooting, diffing two runs, to explain a result