
Runtime Configurations

Available runtime configurations

In Model Garden, runtime configurations are a set of attributes used inside train_lib.py to ensure that training and/or evaluation jobs are properly configured for the target hardware and software environment. These attributes include, for example, the distribution strategy, which controls how training is distributed across multiple devices, and the computation resources, such as the number of GPUs or CPUs used for training. Runtime configurations are important for achieving optimal performance and efficiency. A concrete example for running an image classification task on TPU with bfloat16 mixed_precision_dtype can be found here.

runtime:
  distribution_strategy: 'tpu'
  mixed_precision_dtype: 'bfloat16'
task:
  ……

In this section, we walk you through the available options, which we have grouped into three categories:

  • Common parameters: configurations applicable to all hardware and software setups
  • TPU specific parameters: configurations applicable to TPU jobs only
  • GPU specific parameters: configurations applicable to GPU jobs only

Summary table

  • Common Parameters: distribution_strategy, mixed_precision_dtype, loss_scale, all_reduce_alg, run_eagerly, worker_hosts, task_index, enable_xla
  • TPU Specific Parameters: tpu, tpu_enable_xla_dynamic_padder
  • GPU Specific Parameters: num_gpus
  • Others: gpu_thread_mode, per_gpu_thread_count, dataset_num_private_threads, num_packs

Common Parameters

  • distribution_strategy:

This parameter controls the exact tf.distribute.Strategy used for setting up distributed training across multiple GPUs, multiple machines, or TPUs. It allows users to easily distribute and parallelize their training workloads across multiple machines, making it easier to scale up the training process. Distributed training helps to reduce the time required to train a model. The available options are:

  • tpu distribution strategy: it lets you run your TensorFlow training on Tensor Processing Units (TPUs) through synchronous distributed training. TPUs provide their own implementation of efficient all-reduce and other collective operations across multiple TPU cores, which are used in tpu strategy.
  • mirrored distribution strategy: it implements synchronous training across multiple GPUs on one machine. It creates a copy of all variables in the model on each GPU device.
  • multi_worker_mirrored distribution strategy: this strategy implements synchronous distributed training across multiple workers, each with potentially multiple GPUs.
  • parameter_server distribution strategy: parameter server training is a common data-parallel method to scale up model training on multiple machines. A parameter server training cluster consists of workers and parameter servers. Variables are created on parameter servers and they are read and updated by workers in each step.

Note that the distribution_strategy needs to be configured based on the target software and hardware environment.
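
For illustration, a minimal runtime block selecting a strategy might look like the sketch below; exactly one option should be kept, matching the target environment (the commented-out alternatives are shown only for reference):

runtime:
  distribution_strategy: 'tpu'                      # TPU training
  # distribution_strategy: 'mirrored'               # single machine, multiple GPUs
  # distribution_strategy: 'multi_worker_mirrored'  # multiple machines, multiple GPUs
  # distribution_strategy: 'parameter_server'       # parameter server training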

  • mixed_precision_dtype:

Mixed precision is the use of both 16-bit and 32-bit floating-point types in a model during training to make it run faster and use less memory. By keeping certain parts of the model in 32-bit types for numeric stability, the model will have a lower step time and train equally well in terms of evaluation metrics such as accuracy. The mixed_precision_dtype parameter specifies the mixed precision policy, and the available options are:

  • float32
  • float16
  • bfloat16 (TPU only)

If the mixed_precision_dtype is set to float16, lower-precision dtypes are used whenever possible on those devices. However, variables and a few computations are kept in float32 for numeric reasons so that the model trains to the same quality. Modern accelerators can run operations faster in 16-bit dtypes, as they have specialized hardware for 16-bit computations, and 16-bit dtypes can be read from memory faster.

  • loss_scale:
    • Optional parameter
    • Default value: None
    • Data type: String or Float

This parameter specifies the loss scale, either as a string type (e.g. 'dynamic') or a float value. It is used when setting the mixed precision policy. Loss scaling is a process that multiplies the loss by a multiplier called the loss scale and divides each gradient by the same multiplier. Loss scaling can help avoid numerical underflow in intermediate gradients when float16 tensors are used for mixed precision training. By multiplying the loss, each intermediate gradient has the same multiplier applied. The most commonly used option is the dynamic loss scale, where the loss scale is dynamically updated over time using an algorithm that keeps it at approximately its optimal value.
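
As an illustrative sketch (the values are examples, not defaults), float16 mixed precision is typically paired with a dynamic loss scale:

runtime:
  mixed_precision_dtype: 'float16'
  loss_scale: 'dynamic'   # or a fixed float value, e.g. 128.0 (illustrative)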

  • all_reduce_alg:
    • Optional parameter
    • Default value: None
    • Data type: String

This parameter specifies the algorithm used to perform the all-reduce operation, which synchronizes variables across multiple devices and machines. For the mirrored strategy, valid values are nccl and hierarchical_copy. For the multi_worker_mirrored strategy, valid values are ring and nccl. If None, the distribution strategy will choose an algorithm based on the device topology.
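
For example, a sketch of explicitly requesting NCCL all-reduce for a single-machine mirrored job (the GPU count is illustrative):

runtime:
  distribution_strategy: 'mirrored'
  num_gpus: 8             # illustrative value
  all_reduce_alg: 'nccl'  # valid for mirrored; use 'ring' or 'nccl' with multi_worker_mirrored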

  • run_eagerly:
    • Required parameter
    • Default value: False
    • Data type: Boolean

This Boolean parameter decides whether or not to run the experiment eagerly. If it is set to True, the training and evaluation logic will not be wrapped in a tf.function. It is recommended to leave this as False unless your logic cannot run inside a tf.function, or you would like to perform step-by-step debugging.
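
For example, when debugging the training step interactively, you could enable eager execution as sketched below:

runtime:
  run_eagerly: true   # run the training/evaluation logic eagerly instead of inside a tf.function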

  • worker_hosts:
    • Optional parameter
    • Default value: None
    • Data type: String

worker_hosts is a parameter used to specify the network addresses of the worker nodes in a distributed training setup. It is typically used when performing multi-worker training with the TensorFlow distributed strategy. The value should be a comma-separated list of the worker nodes in the form 'host1:port,host2:port'.

Example: worker_hosts: $HOST1:port,$HOST2:port, where $HOST1 and $HOST2 are the IP addresses of the hosts, and port can be any free port on the hosts. Only the first host will write TensorBoard summaries and save checkpoints.

  • task_index:
    • Optional parameter
    • Default value: -1
    • Data type: Int

task_index is a parameter typically used when performing multi-worker training with the TensorFlow distributed strategy. It specifies the index of the worker node in the cluster. Setting the task index is important, as the index is used to keep track of the worker nodes and to ensure that each worker performs its assigned tasks correctly. For example, with worker_hosts: $HOST1:port,$HOST2:port, you set task_index: 0 on the first host, task_index: 1 on the second, and so on.
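
Putting worker_hosts and task_index together, a sketch of a two-worker setup could look like the following ($HOST1, $HOST2 and port are placeholders):

# On the first worker ($HOST1):
runtime:
  distribution_strategy: 'multi_worker_mirrored'
  worker_hosts: '$HOST1:port,$HOST2:port'
  task_index: 0

# On the second worker ($HOST2):
runtime:
  distribution_strategy: 'multi_worker_mirrored'
  worker_hosts: '$HOST1:port,$HOST2:port'
  task_index: 1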

  • enable_xla:
    • Required parameter
    • Default value: False
    • Data type: Boolean

enable_xla enables or disables the XLA compiler in TensorFlow. XLA is a just-in-time optimizing compiler that can improve the performance of TensorFlow models: it performs compiler optimizations, such as fusion, and attempts to emit more efficient code, which may drastically improve performance. If set to True, the whole training function needs to be compilable by XLA, or an errors.InvalidArgumentError is thrown.
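
A minimal sketch of turning XLA compilation on:

runtime:
  enable_xla: true   # the training step must be fully compilable by XLA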

TPU Specific Parameters

  • tpu:
    • Optional parameter
    • Default value: None
    • Data type: String

A string that represents the TPU address to connect to, if any. It must not be None if distribution_strategy is set to tpu.
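
For example, a sketch of connecting to a TPU (the TPU name below is a placeholder, not a real address):

runtime:
  distribution_strategy: 'tpu'
  tpu: 'my-tpu-name'   # placeholder: name or address of the TPU to connect to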

  • tpu_enable_xla_dynamic_padder:
    • Optional parameter
    • Data type: Boolean

This parameter is used to enable dynamic padding for XLA (Accelerated Linear Algebra) operations, which allows XLA-compiled TPU programs to handle inputs with dynamic shapes.

GPU Specific Parameters

  • num_gpus:
    • Required parameter
    • Default value: 0
    • Data type: Int

This attribute specifies the number of GPUs to use at each worker with the distribution strategies. Note that with the default value of 0, the training process will not utilize any GPUs even if they are present.
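
For instance, a minimal sketch of a single-machine job using 4 GPUs (the count is illustrative):

runtime:
  distribution_strategy: 'mirrored'
  num_gpus: 4   # number of GPUs to use on each worker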

In addition to the above parameters, we support more, but less commonly used, parameters such as gpu_thread_mode, per_gpu_thread_count, dataset_num_private_threads, and num_packs for optimizing performance on GPU. Refer to the GPU performance guide.

Note: These parameters correspond to the TF environment variables described here, but they only take effect when keras_utils.set_gpu_thread_mode_and_count is called manually. So far, only legacy code, benchmarks, and code from other parties call it, so these parameters are not used automatically and have no effect if set without calling keras_utils.set_gpu_thread_mode_and_count.
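
For reference only, a hedged sketch of what these fields look like in the runtime block; all values are illustrative and, per the note above, they have no effect unless keras_utils.set_gpu_thread_mode_and_count is called:

runtime:
  gpu_thread_mode: 'gpu_private'     # illustrative value
  per_gpu_thread_count: 2            # illustrative value
  dataset_num_private_threads: 16    # illustrative value
  num_packs: 1                       # illustrative value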

Please check here for the full list of parameters.

How to set runtime configurations

This section of the user guide illustrates some of the most common use cases for setting the runtime configurations. The most common configurations include the device to use for training, the optimizer and the loss function, the metric to use for evaluation, the number of workers, and the distribution strategy.

Additionally, there may be other configuration settings used to fine-tune model performance, such as the number of training epochs, the batch size, the learning rate, the weight decay, the learning rate decay, gradient clipping, and more.

Below we list a few of the most commonly encountered use cases for user reference.

Training on TPU

  • mixed_precision_dtype: bfloat16 (Recommended)
runtime:
  distribution_strategy: 'tpu'
  mixed_precision_dtype: 'bfloat16'
task:
  train_data:
    is_training: true
    global_batch_size: 4096
    dtype: 'bfloat16'
  validation_data:
    is_training: false
    global_batch_size: 4096
    dtype: 'bfloat16'
    drop_remainder: false
    ....

Please refer to this config file for a full example of running image classification with bfloat16 mixed_precision_dtype and tpu distribution_strategy.

  • mixed_precision_dtype: float32
runtime:
  distribution_strategy: 'tpu'
  mixed_precision_dtype: 'float32'
task:
  train_data:
    is_training: true
    global_batch_size: 4096
    dtype: 'float32'
  validation_data:
    is_training: false
    global_batch_size: 4096
    dtype: 'float32'
    drop_remainder: false
    ....

Please refer to this config file for a full example of running semantic segmentation with float32 mixed_precision_dtype and tpu distribution_strategy.

Training on GPU

  • mixed_precision_dtype: float16 (Recommended)
runtime:
  distribution_strategy: 'mirrored'
  num_gpus: 4
  mixed_precision_dtype: 'float16'
  loss_scale: 'dynamic'
task:
  ……
  train_data:
    is_training: true
    global_batch_size: 4096
    dtype: 'float16'
  validation_data:
    is_training: false
    global_batch_size: 4096
    dtype: 'float16'
    drop_remainder: false
    ……

Please refer to this config file for a full example of image classification with float16 mixed_precision_dtype and mirrored distribution_strategy.

  • mixed_precision_dtype: float32
runtime:
  distribution_strategy: 'mirrored'
  num_gpus: 4
  mixed_precision_dtype: 'float32'
  loss_scale: 'dynamic'
task:
  ……
  train_data:
    is_training: true
    global_batch_size: 4096
    dtype: 'float32'
  validation_data:
    is_training: false
    global_batch_size: 4096
    dtype: 'float32'
    drop_remainder: false
    ……

Please refer to this config file for a full example of image classification with float32 mixed_precision_dtype and mirrored distribution_strategy.

How to adjust according to different runtime configurations

While tuning the runtime configurations of your job, it is important to be aware that some task-related configurations should be adjusted accordingly as well. For example, if the number of accelerators is reduced, the batch_size should be reduced accordingly; otherwise, each accelerator will be allocated proportionally more data.

Below are some commonly encountered use cases for reference.

Reduce number of accelerators

Consider a use case where the template YAML uses 8 GPUs for training but the user has only 4 GPUs; it is recommended to follow the tips below (a rough sketch follows the list). This will help ensure that the model is trained as efficiently as possible and will help avoid performance issues due to limited GPU resources.

  • Reduce batch size
  • Increase number of steps for train and validation
  • Modify learning_rate schedule
  • Decrease learning rate
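
As a rough sketch (all numbers are illustrative and should be tuned for the actual task), halving the number of GPUs from 8 to 4 could be reflected as follows:

# Template (illustrative): 8 GPUs
runtime:
  num_gpus: 8
task:
  train_data:
    global_batch_size: 4096
trainer:
  train_steps: 156000

# Adjusted for 4 GPUs: halve the batch size, double the number of steps,
# and scale the learning rate down accordingly
runtime:
  num_gpus: 4
task:
  train_data:
    global_batch_size: 2048
trainer:
  train_steps: 312000

The concrete ImageNet example below (with global_batch_size 4096 and 2048) shows the corresponding learning rate and interval adjustments in full.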

Increase number of accelerators

If we want to increase the number of accelerators, the adjustment will be the opposite of the case above.

  • Increase batch size
  • Reduce number of steps for train and validation
  • Modify learning_rate schedule
  • Increase learning rate

We have provided a concrete example below for image classification on ImageNet with global_batch_size of 4096 and 2048:

With global_batch_size: 4096:

global_batch_size: 4096
trainer:
  train_steps: 156000  # 500 epochs
  validation_steps: 13
  validation_interval: 312
  steps_per_loop: 312  # NUM_EXAMPLES (1281167) // global_batch_size
  summary_interval: 312
  checkpoint_interval: 312
  optimizer_config:
    learning_rate:
      type: 'exponential'
      exponential:
        initial_learning_rate: 0.256  # 0.008 * batch_size / 128
        decay_steps: 780  # 2.5 * steps_per_epoch
        decay_rate: 0.94
        staircase: true
    warmup:
      type: 'linear'
      linear:
        warmup_steps: 1560

With global_batch_size: 2048:

global_batch_size: 2048
trainer:
  train_steps: 312000  # 500 epochs
  validation_steps: 26
  validation_interval: 624
  steps_per_loop: 624  # NUM_EXAMPLES (1281167) // global_batch_size
  summary_interval: 624
  checkpoint_interval: 624
  optimizer_config:
    learning_rate:
      type: 'exponential'
      exponential:
        initial_learning_rate: 0.128  # 0.008 * batch_size / 128
        decay_steps: 1560  # 2.5 * steps_per_epoch
        decay_rate: 0.94
        staircase: true
    warmup:
      type: 'linear'
      linear:
        warmup_steps: 3120

Switch from GPU to TPU

Switching from GPU to TPU allows users to take advantage of the TensorFlow TPU distribution strategy, which in turn allows you to run your models on TPUs. Users may follow the suggestions below to better take advantage of the TPU's strengths:

  • mixed_precision_dtype needs to be changed from float16 to bfloat16
  • The dtype of train_data and validation_data should be modified accordingly
  • The batch_size may be increased, since TPUs are typically more powerful
  • The batch size of any model should always be at least 64 (8 per TPU core), since the TPU always pads the tensors to this size. The ideal batch size when training on the TPU is 1024 (128 per TPU core), since this eliminates inefficiencies related to memory transfer and padding.

Refer to the config comparison of GPU and TPU using the image classification examples below:

GPU (mirrored, float16):

runtime:
  distribution_strategy: 'mirrored'
  mixed_precision_dtype: 'float16'
  loss_scale: 'dynamic'
task:
  ……
  train_data:
    ……
    global_batch_size: 1024
    dtype: 'float16'
  validation_data:
    ……
    global_batch_size: 1024
    dtype: 'float16'
  ……

TPU (bfloat16):

runtime:
  distribution_strategy: 'tpu'
  mixed_precision_dtype: 'bfloat16'
task:
  ……
  train_data:
    ……
    global_batch_size: 4096
    dtype: 'bfloat16'
  validation_data:
    ……
    global_batch_size: 4096
    dtype: 'bfloat16'
  ……