Train an Image Classifier

In this guide, we train an image classifier on the Fashion-MNIST data set.

Requirements

If you haven’t already, install and verify Guild AI for this guide.

Image classifier training script

In this step, we create a script named fashion_mnist_mlp.py, which is an image classifier training script adapted from the official Keras examples.[1]

If you haven’t done so already, create a new directory for the project:

mkdir guild-start

Create a file named fashion_mnist_mlp.py, located in the guild-start directory:

from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.datasets import fashion_mnist
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.utils import to_categorical

# Training parameters (Guild detects these module-level
# assignments as flags)
batch_size = 128
epochs = 5
dropout = 0.2
lr = 0.001
lr_decay = 0.0

# Load Fashion-MNIST and flatten each 28x28 image to a
# 784-element vector
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)

# Scale pixel values from [0, 255] to [0, 1]
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

# One-hot encode the 10 class labels
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

# Multilayer perceptron: two hidden layers with dropout
model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dropout(dropout))
model.add(Dense(512, activation='relu'))
model.add(Dropout(dropout))
model.add(Dense(10, activation='softmax'))

model.compile(
    loss='categorical_crossentropy',
    optimizer=RMSprop(lr=lr, decay=lr_decay),
    metrics=['accuracy'])

# The TensorBoard callback writes event logs to the run directory
model.fit(
    x_train, y_train,
    batch_size=batch_size,
    epochs=epochs,
    verbose=1,
    validation_data=(x_test, y_test),
    callbacks=[TensorBoard(".")])
guild-start/fashion_mnist_mlp.py  
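The script one-hot encodes the labels with to_categorical. If that transform is unfamiliar, here is a minimal pure-Python sketch of what it does (the function name one_hot is ours, for illustration only, not part of Keras):

```python
def one_hot(labels, num_classes):
    # Map each integer label to a vector with a 1.0 at the
    # label's index and 0.0 everywhere else
    return [[1.0 if i == label else 0.0 for i in range(num_classes)]
            for label in labels]

print(one_hot([0, 2], 3))  # [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
```

This encoding matches the model's 10-unit softmax output layer, which is why the script passes 10 as the number of classes.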

Verify that your project structure is:

  • guild-start
    • fashion_mnist_mlp.py
    • echo.py (from Quick Start - not used in this guide)
    • train.py (from Quick Start - not used in this guide)

Train with default settings

In a command console, change to the guild-start project directory:

cd guild-start

Run fashion_mnist_mlp.py:

guild run fashion_mnist_mlp.py
You are about to run fashion_mnist_mlp.py
  batch_size: 128
  dropout: 0.2
  epochs: 5
  lr: 0.001
  lr_decay: 0.0
Continue? (Y/n)

Press Enter to start training.

By default, the script is configured to train over 5 epochs.

You can view the training results in various ways:

  • List available runs, which include the fashion_mnist_mlp.py run
guild runs

This command shows the last 20 runs. If you’re only interested in listing runs of fashion_mnist_mlp.py, you can filter the list using the -o or --operation command line option:

guild runs -o fashion_mnist_mlp.py

Tip

You can type a portion of the operation name with -o. For example, guild runs -o mnist would show all operations that contain mnist.
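Conceptually, the -o filter behaves like a substring match over operation names. A rough Python sketch of that behavior (our simplification for intuition, not Guild's actual implementation):

```python
def filter_runs(operations, pattern):
    # Keep only operations whose name contains the pattern
    return [op for op in operations if pattern in op]

runs = ["fashion_mnist_mlp.py", "mnist_cnn.py", "train.py"]
print(filter_runs(runs, "mnist"))  # ['fashion_mnist_mlp.py', 'mnist_cnn.py']
```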

  • Show information for the run
guild runs info

This command shows information for the latest run.

  • List generated run files
guild ls

This command lists files for the latest run. In the case of fashion_mnist_mlp.py, we see:

~/.guild/runs/0fbff0c44b4011e9a325d017c2ab916f:
  events.out.tfevents.1553107414.localhost

Note

Directory and filenames will differ on your system.

The events file is a TensorBoard event log generated by the training script, specifically by the TensorBoard Keras callback the script passes to model.fit.

View results in TensorBoard

View the fashion_mnist_mlp.py training run in TensorBoard:

guild tensorboard --operation mnist

This command shows any run matching “mnist” in TensorBoard. If you run this command in a separate command console, you can leave TensorBoard running in the background while you run more operations — Guild automatically syncs TensorBoard with the current runs.

Guild integrates with TensorBoard and automatically synchronizes filtered runs

See the tensorboard command for more information on running TensorBoard from Guild.

When you’re done viewing results in TensorBoard, return to the command prompt and type Ctrl-C to stop TensorBoard.

TensorBoard 1.13.0 at http://localhost:65397 (Press CTRL+C to quit)
^C
Type Ctrl-C to stop TensorBoard

Train a second time

Run fashion_mnist_mlp.py again — this time specify a different learning rate:

guild run fashion_mnist_mlp.py lr=0.01

This changes the learning rate from the default 0.001 to 0.01. This value turns out to be too high, but we use the scenario to demonstrate a simple troubleshooting process in Guild.

Press Enter to start training.

As the model trains, note the validation accuracy (val_acc in the training output): it hovers around 10%, which is no better than random guessing across 10 classes. So we know our model isn’t learning.
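To see why a learning rate can be "too high", consider plain gradient descent on the toy objective f(w) = w². This is far simpler than our network and optimizer, but it shows the same failure mode:

```python
def descend(lr, steps=20, w=1.0):
    # The gradient of w^2 is 2w, so each update multiplies
    # w by (1 - 2*lr)
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(descend(0.1))  # small steps settle toward the minimum at 0
print(descend(1.1))  # large steps overshoot and grow without bound
```

RMSprop's adaptive scaling changes the exact dynamics, but the intuition carries over: steps that are too large bounce past the minimum instead of settling into it, and the loss never improves.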

You can stop the training at any point by typing Ctrl-C, or let it run to completion.

Compare the runs:

guild compare --table --operation mnist --strict-cols =lr,val_acc

This variation of compare uses --strict-cols to show only the columns we’re interested in comparing, in this case lr and val_acc. The syntax =lr means “the flag lr” and distinguishes the flag value from a scalar. val_acc is the name of the scalar used for validation accuracy.
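The flag-versus-scalar distinction can be pictured as a tiny prefix convention on column specs. A hypothetical parser (ours, written to illustrate the syntax, not Guild's code):

```python
def parse_col(spec):
    # "=name" selects a flag value; a bare name selects a
    # logged scalar such as val_acc
    if spec.startswith("="):
        return ("flag", spec[1:])
    return ("scalar", spec)

print([parse_col(s) for s in "=lr,val_acc".split(",")])
# [('flag', 'lr'), ('scalar', 'val_acc')]
```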

For details on compare options, see the compare command.

Show differences between runs

In the previous step, we tried a learning rate that was too high — our model failed to learn anything at all.

Let’s assume for a moment we didn’t know why this happened. How could we troubleshoot the problem?

Let’s use Guild’s diff command to compare our last two runs. Specifically, we compare the flag values used by each run.

guild diff --flags
--- ~/.guild/runs/7327dbd44bce11e98af6c85b764bbf34/.guild/attrs/flags
+++ ~/.guild/runs/925f38e44bce11e98af6c85b764bbf34/.guild/attrs/flags
@@ -1,5 +1,5 @@
 batch_size: 128
 dropout: 0.2
 epochs: 5
-lr: 0.001
+lr: 0.01
 lr_decay: 0.0

Tip

By default, Guild uses the diff program to show differences. You can specify an alternative program with the -c or --cmd command line option. For example, if you have Meld available on your system, you can compare the last two runs by running guild diff -c meld.

You can configure the default program used for diffing in user configuration.


We can see from this comparison exactly what changed across the two runs: the learning rate went from 0.001 to 0.01. While this is a simple example, it demonstrates the value of systematically tracking experiment details.
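Stripped of file paths and diff formatting, comparing flags across two runs reduces to comparing two key-value maps. A minimal sketch of that idea:

```python
def diff_flags(a, b):
    # Map each flag whose value differs between two runs to
    # its (old, new) pair
    return {k: (a.get(k), b.get(k))
            for k in sorted(set(a) | set(b))
            if a.get(k) != b.get(k)}

run1 = {"batch_size": 128, "dropout": 0.2, "epochs": 5,
        "lr": 0.001, "lr_decay": 0.0}
run2 = {"batch_size": 128, "dropout": 0.2, "epochs": 5,
        "lr": 0.01, "lr_decay": 0.0}
print(diff_flags(run1, run2))  # {'lr': (0.001, 0.01)}
```

Because Guild records every run's flags automatically, this kind of comparison is always available after the fact, with no extra bookkeeping in the training script.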

Summary

In this guide we trained a simple image classifier and used TensorBoard and diffing tools to view and compare runs.

  • The training script used in this guide is a realistic example of a machine learning training script
  • We didn’t modify the script to take advantage of Guild’s experiment tracking and comparison features
  • We used a simple troubleshooting method, diffing two runs, to explain a result

Next steps

Learn about Guild files and how they're used to support simple reproducibility in machine learning.
Guild makes it easy to back up and restore runs, including backups to AWS S3 and on-prem servers.
Train a model remotely to take advantage of cloud based GPUs.

  1. The training script for the image classifier is adapted from keras/examples/mnist_mlp.py on GitHub.