TensorFlow/Keras programming for IPU

This tutorial is intended for developers who want to accelerate the training and evaluation of their models on IPU, for both tabular and computer vision workloads.

Performance difference

The figure below compares the performance of a GPU (Tesla, released in 2018) with that of an IPU (released in 2017):

Experimental settings:

20 neural networks from the keras.applications package were each run with batch sizes of 1 and 32. The identical code was executed on both the IPU and the GPU. The horizontal axis shows the speedup ratio of the IPU over the GPU, and the vertical axis shows the arithmetic intensity (FLOPs per parameter). Each data point on the plot is labeled with the code name of the neural network architecture and its batch size. For example, "eff1_32" corresponds to EfficientNetB1 with a batch size of 32. Note that both axes use a logarithmic scale.

Overall, training all of these tasks one by one for 1 epoch takes 3h14 on the GPU and 1h06 on the IPU.
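The overall speedup implied by these two timings can be worked out directly. The short sketch below only converts the durations quoted above to minutes and divides them; the helper name `to_minutes` is introduced here for illustration.

```python
def to_minutes(hours, minutes):
    """Convert a duration written as 'XhYY' into minutes."""
    return hours * 60 + minutes


gpu_minutes = to_minutes(3, 14)  # 3h14 on the GPU (figure quoted in the text)
ipu_minutes = to_minutes(1, 6)   # 1h06 on the IPU (figure quoted in the text)

# Ratio > 1 means the IPU finished the whole benchmark faster than the GPU
overall_speedup = gpu_minutes / ipu_minutes
print(f"GPU: {gpu_minutes} min, IPU: {ipu_minutes} min, speedup: {overall_speedup:.2f}x")
```

This gives an aggregate speedup of roughly 2.9x over the whole benchmark; the per-network ratios shown in the figure can be higher or lower than this average.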

Installing ML frameworks:

Local wheels in the ULHPC graphcore1 server: ~/poplar_sdk-ubuntu_20_04-3.0.0+1145-1b114aac3a/

  • TensorFlow:

    • TensorFlow for AMD EPYC CPU: pip install ~/poplar_sdk-ubuntu_20_04-3.0.0+1145-1b114aac3a/tensorflow-2.6.3+gc3.0.0+236842+d084e493702+amd_znver1-cp38-cp38-linux_x86_64.whl --no-index
    • Keras front-end: pip install ~/poplar_sdk-ubuntu_20_04-3.0.0+1145-1b114aac3a/keras-2.6.0+gc3.0.0+236851+1744557f-py2.py3-none-any.whl --no-index
    • IPU add-ons for TensorFlow: pip install ~/poplar_sdk-ubuntu_20_04-3.0.0+1145-1b114aac3a/ipu_tensorflow_addons-2.6.3+gc3.0.0+236851+2e46901-py3-none-any.whl --no-index
  • JAX: pip install jax==0.3.16+ipu jaxlib==0.3.15+ipu.sdk300 -f

  • HuggingFace: pip install 'transformers @'

  • PyTorch for IPU (named "PopTorch"):

Check your ML framework installation for IPU:

After installing JAX and TensorFlow, you should see your frameworks and the associated IPU add-ons in the list of installed packages.
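One possible way to perform this check from Python is sketched below. It uses only the standard library (`importlib.metadata`, available from Python 3.8) to list installed distributions whose name contains a keyword. The function name `find_installed` and the keyword list are assumptions made for illustration; the Graphcore builds can be recognised by the "+gc" tag visible in the wheel file names above.

```python
from importlib import metadata


def find_installed(keywords):
    """Return sorted (name, version) pairs of installed distributions
    whose name contains one of the keywords (case-insensitive)."""
    matches = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name and any(key in name.lower() for key in keywords):
            matches.add((name, dist.version or ""))
    return sorted(matches)


# Hypothetical keyword list; adjust it to the frameworks you installed
for name, version in find_installed(["tensorflow", "keras", "jax", "ipu"]):
    print(f"{name}=={version}")
```

If nothing is printed, the packages were probably installed in a different Python environment from the one running the script.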

Tensorflow IPU code

Begin by importing Keras and TensorFlow. We'll encapsulate our training/testing loop within the 'scope' object. This object facilitates the definition of how AI accelerators (GPU or IPU) are utilized and sets the storage strategy for the model's parameters.

from tensorflow import keras
import tensorflow as tf
import numpy as np

scope = tf.distribute.get_strategy().scope  # Scope of the default strategy (CPU/GPU)

Run the code below only if you use an IPU. This is the only code difference between the IPU and GPU implementations.

##############################IPU CONTEXT###############################
from tensorflow.python import ipu

# do not directly use tensorflow.compiler but ipu_compiler if the below application is Tensorflow (and not Keras).
from tensorflow.python.ipu import ipu_compiler as compiler

# Below code is detailed in:
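cfg = ipu.config.IPUConfig()  # Create the IPU hardware configuration
cfg.auto_select_ipus = 1  # Attach one IPU to the current process (or MPI rank)
# TODO: other settings include FP32, FP16, ...
cfg.configure_ipu_system()  # Apply the hardware configuration to the IPU

# Assumed step: replace the default scope with the IPU strategy scope,
# so that the model built under scope() is placed on the IPU
scope = ipu.ipu_strategy.IPUStrategy().scope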

Define the main hyperparameters and the deep learning architecture
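The code in this tutorial refers to the names BATCH_SIZE, IMG_SIZE and MAX_IMG. A minimal sketch of such a hyperparameter block is given below. The three values are hypothetical placeholders chosen only so that the following code can run; they are not the settings used for the results above.

```python
# Hypothetical values, for illustration only
BATCH_SIZE = 32  # one of the two batch sizes used in the benchmark above
IMG_SIZE = 64    # images are resized to IMG_SIZE x IMG_SIZE (assumed value)
MAX_IMG = 1024   # number of CIFAR-10 images kept for the demo (assumed value)

# Keeping MAX_IMG a multiple of BATCH_SIZE means no image is dropped
# when the dataset is later truncated to whole batches
assert MAX_IMG % BATCH_SIZE == 0
```

Smaller values of IMG_SIZE and MAX_IMG shorten the demo; larger values bring it closer to a realistic training run.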


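def get_model():
    # The batch size is fixed in the input layer so that all shapes are static
    input_layer = tf.keras.layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3), batch_size=BATCH_SIZE)

    # ResNet50 backbone trained from scratch (no pretrained weights, no classification head)
    x = tf.keras.applications.ResNet50(weights=None, include_top=False, classes=10)(input_layer)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = keras.layers.Flatten()(x)
    # 10-way softmax classifier for the CIFAR-10 classes
    x = keras.layers.Dense(10, activation='softmax')(x)
    model = keras.Model(inputs=input_layer, outputs=x)

    return model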

Read the data

Read the data, scale it, and subsample the dataset to speed up the demo.

#### READ THE DATA      ######

# Reading raw images
(trainImages, trainLabels), (testImages, testLabels) = keras.datasets.cifar10.load_data()

trainImages = trainImages[:MAX_IMG]
trainLabels = trainLabels[:MAX_IMG]
testImages = testImages[:MAX_IMG]
testLabels = testLabels[:MAX_IMG]

# Preprocessing data from [0;255] to [0;1.0]
trainImages = trainImages.astype(np.float32) / 255.0
testImages = testImages.astype(np.float32) / 255.0
trainLabels = trainLabels.astype(np.int32)
testLabels = testLabels.astype(np.int32)

# Selection of images. The number of images should be a multiple of the batch size; otherwise the remaining images are ignored.
training_images = int(
    (len(trainImages) // BATCH_SIZE) * BATCH_SIZE
)  # Force all steps to be the same size
testing_images = int((len(testImages) // BATCH_SIZE) * BATCH_SIZE)
trainImages = trainImages[:training_images]
trainLabels = trainLabels[:training_images]
testImages = testImages[:testing_images]
testLabels = testLabels[:testing_images]
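The truncation above is plain integer arithmetic: integer division by the batch size followed by a multiplication rounds the sample count down to a whole number of batches. The sketch below shows it on example numbers; the function name `usable_samples` and the sample counts are illustrative assumptions.

```python
def usable_samples(n_samples, batch_size):
    """Largest multiple of batch_size that does not exceed n_samples."""
    return (n_samples // batch_size) * batch_size


n_samples, batch_size = 1000, 32  # example numbers, not the tutorial's settings
kept = usable_samples(n_samples, batch_size)
print(f"kept {kept} samples, dropped {n_samples - kept}")  # kept 992 samples, dropped 8
```

With 1000 samples and a batch size of 32, 31 full batches are kept and the last 8 samples are ignored.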

Efficient data access

Design an efficient TensorFlow I/O pipeline in which image resizing runs asynchronously with respect to training.

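# Some Keras versions display spurious "INVALID_ARGUMENT" alert messages. Please ignore them.
train_dataset = tf.data.Dataset.from_tensor_slices(
    (tf.cast(trainImages, tf.float32), tf.cast(trainLabels, tf.float32))
)
eval_dataset = tf.data.Dataset.from_tensor_slices(
    (tf.cast(testImages, tf.float32), tf.cast(testLabels, tf.float32))
)

# Batch first, then resize whole batches. The prefetch step is an assumption: it lets
# tf.data prepare the next batches in the background while the current one is processed.
xy_train_gen = (
    train_dataset.batch(BATCH_SIZE, drop_remainder=True)
    .map(lambda image, label: (tf.image.resize(image, (IMG_SIZE, IMG_SIZE)), label))
    .prefetch(tf.data.AUTOTUNE)
)
xy_test_gen = (
    eval_dataset.batch(BATCH_SIZE, drop_remainder=True)
    .map(lambda image, label: (tf.image.resize(image, (IMG_SIZE, IMG_SIZE)), label))
    .prefetch(tf.data.AUTOTUNE)
)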
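The idea of overlapping data preparation with training can be illustrated without TensorFlow. The sketch below is a simplified analogue, not what tf.data does internally: a background thread prepares batches and puts them in a small buffer while the consumer processes the previous ones. In tf.data this role is played by `Dataset.prefetch`. The names `prefetch`, `batches` and `preprocess` are hypothetical helpers.

```python
import queue
import threading


def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable` while a background thread prepares the next ones."""
    buffer = queue.Queue(maxsize=buffer_size)
    end = object()  # marker signalling that the producer is done

    def producer():
        try:
            for item in iterable:
                buffer.put(item)  # blocks when the buffer is full
        finally:
            buffer.put(end)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is end:
            return
        yield item


def batches(n_items, batch_size):
    """Full batches of indices only (same effect as drop_remainder=True)."""
    for start in range(0, n_items - batch_size + 1, batch_size):
        yield list(range(start, start + batch_size))


def preprocess(batch):
    """Stand-in for the image resizing step."""
    return [2 * value for value in batch]


prepared = list(prefetch(preprocess(batch) for batch in batches(10, 4)))
print(prepared)  # [[0, 2, 4, 6], [8, 10, 12, 14]]
```

The last two of the ten items are dropped because they do not fill a batch, and the preprocessing of each batch runs in the producer thread rather than in the consuming loop.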

Building, training, and evaluating the model


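with scope():
  # Build the model inside the strategy scope so that its variables live on the selected device
  model = get_model()

  # Call the Adam optimizer
  if hasattr(tf.keras.optimizers, "legacy"):  # Recent TF2 versions
      optimizer = tf.keras.optimizers.legacy.Adam(0.01)
  else:  # Older TF2 versions
      optimizer = tf.keras.optimizers.Adam(0.01)

  # Compute the number of steps
  train_steps_per_exec = len(trainImages) // BATCH_SIZE
  eval_steps_per_exec = len(testImages) // BATCH_SIZE

  # Keras computing graph construction. Plugs together: the model, the loss, the optimizer.
  # The compile/fit arguments below are a reconstruction (assumed), not the original lines.
  model.compile(
      optimizer=optimizer,
      loss="sparse_categorical_crossentropy",
      metrics=["accuracy"],
      steps_per_execution=train_steps_per_exec,
  )
  model.fit(xy_train_gen, epochs=1)  # The number of epochs is an assumption

  # Re-compile so that one execution covers the whole evaluation set (assumed)
  model.compile(
      optimizer=optimizer,
      loss="sparse_categorical_crossentropy",
      metrics=["accuracy"],
      steps_per_execution=eval_steps_per_exec,
  )
  (loss, accuracy) = model.evaluate(xy_test_gen, batch_size=BATCH_SIZE)

  print(f"test loss: {round(loss, 4)}, test acc: {round(accuracy, 4)}")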

The output is:

Please note the presence of the line below:

[##################################################] 100% Compilation Finished

which shows that the computing graph was successfully compiled for the IPU. This line is not present when the code is run on a GPU.
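If the training output is saved to a log, the presence of this line can be checked automatically. The sketch below is one possible way to do so. It assumes the log text has already been read into a string; the function name `ipu_compilation_finished` and the sample log lines are hypothetical.

```python
def ipu_compilation_finished(log_text):
    """Return True if the log contains the IPU 'Compilation Finished' progress line."""
    return any("100% Compilation Finished" in line for line in log_text.splitlines())


# Hypothetical log excerpts, for illustration only
ipu_log = "[" + "#" * 50 + "] 100% Compilation Finished\n"
gpu_log = "some other training output\n"

print(ipu_compilation_finished(ipu_log))  # True
print(ipu_compilation_finished(gpu_log))  # False
```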

Monitoring IPU activity

Run gc-monitor to check that your process is actually running on the IPU, as illustrated below:
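gc-monitor can also be called from a script, for instance to record the IPU state alongside training logs. The sketch below is a minimal example that assumes gc-monitor is on the PATH once the Poplar SDK is enabled. It calls the tool without any option and returns None when it is not found; the function name `run_gc_monitor` is hypothetical.

```python
import shutil
import subprocess


def run_gc_monitor():
    """Return the text printed by gc-monitor, or None if the tool is not on PATH."""
    executable = shutil.which("gc-monitor")
    if executable is None:
        return None
    completed = subprocess.run([executable], capture_output=True, text=True, check=False)
    return completed.stdout


report = run_gc_monitor()
if report is None:
    print("gc-monitor not found; is the Poplar SDK enabled in this shell?")
else:
    print(report)
```

Run it while training is in progress and look for your process in the printed report.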

Going further

  • Don't use TensorFlow for graph neural networks; use PopTorch instead. With TensorFlow, the performance is either very slow ( or the process gets stuck (
  • Switching between multi-IPU and multi-GPU. Multi-IPU is backed by PopDist and multi-GPU by Horovod: