
Frequently Asked Questions

FAQs of TF-Vision


Q1: How to get started with TensorFlow Model Garden TF-Vision?

This user guide is a walkthrough on how to train and fine-tune models, and perform hyperparameter tuning in TF-Vision. For each model/task supported in TF-Vision, please refer to the corresponding tutorial to get more detailed instructions.


Q2: How to use the models under tensorflow_models/official/vision/?

  • Available models under TF-Vision: There is a good collection of models available in TF-Vision for various vision tasks: image classification, object detection, video classification, semantic segmentation, and instance segmentation. Please check this page to learn more about the available models. We will keep adding support for new models, and your suggestions are appreciated.

  • Fine-tune from a checkpoint: TF-Vision supports loading pretrained checkpoints for fine-tuning. This is done simply by specifying task.init_checkpoint and task.init_checkpoint_modules in the task configuration. The valid values of task.init_checkpoint_modules depend on how the pretrained modules are implemented; in general they include all, backbone, and decoder (the latter for detection and segmentation models). If set to all, all weights from the checkpoint are loaded. If set to backbone, only the weights in the backbone component are loaded and the remaining weights are initialized from scratch. An example yaml file can be found here.

  • Export SavedModel for serving: To export any trained TF 2.x model, whether a tf.keras.Model or a plain tf.Module, use the tf.saved_model.save() API. Our exporting library offers functionality to export SavedModels for CPU/GPU/TPU serving.


Q3: How to fully/partially load pretrained checkpoints (e.g. backbone) to perform transfer learning using TF-Vision?

TF-Vision supports loading pretrained checkpoints for fine-tuning. This is done simply by specifying task.init_checkpoint and task.init_checkpoint_modules in the task configuration. The valid values of task.init_checkpoint_modules depend on how the pretrained modules are implemented; in general they include all, backbone, and decoder (the latter for detection and segmentation models). If set to all, all weights from the checkpoint are loaded. Let's use a concrete example for elaboration. Suppose the requirements are:

  1. Train a classification model with 10 classes.

  2. Save the checkpoint of the model from step 1 (but only the backbone, i.e. everything before the last Conv2D + softmax).

  3. Use the checkpoint from step 2 to train a new classification model with 4 novel classes.

For step 2, the model needs to specify the components to be saved in the checkpoint via checkpoint_items. For step 3, specify init_checkpoint and set init_checkpoint_modules = 'backbone'. The new 4-class model will then initialize only the backbone from the checkpoint, so you can fine-tune the head. In this example, the backbone is everything before the global average pooling layer of the classification model.
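
As an illustration, a minimal YAML sketch of the relevant task fields for step 3 might look like the following (the checkpoint path is a placeholder, and the exact layout can vary by task):

task:
  init_checkpoint: '/path/to/step2/ckpt'   # placeholder checkpoint path
  init_checkpoint_modules: 'backbone'      # or 'all'; detection/segmentation configs may also list 'decoder'
  model:
    num_classes: 4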


Q4: How to export TensorFlow models trained using the TF-Vision package?

To export any trained TF 2.x model, whether a tf.keras.Model or a plain tf.Module, use the tf.saved_model.save() API. Our exporting library offers functionality to export SavedModels for CPU/GPU/TPU serving. Moreover, the exported SavedModel can be further converted to a TFLite model for on-device inference.
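
For reference, a minimal sketch of exporting and reloading a model with tf.saved_model.save (the model and paths below are placeholders, not a TF-Vision model):

import tensorflow as tf

# Build a small placeholder model; in practice this would be the trained model.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
model.build(input_shape=(None, 224))

# Export it as a SavedModel.
tf.saved_model.save(model, '/tmp/exported_model')

# The exported SavedModel can be reloaded for serving or converted to TFLite.
reloaded = tf.saved_model.load('/tmp/exported_model')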


Q5: Where can I look for a config file and documentation for the TF-Vision pretrained models?

The TF-Vision modeling library provides a collection of baselines and checkpoints for various vision tasks, including image classification, object detection, video classification, and segmentation. The supported pretrained models and their corresponding config files can be found here. Since we are actively developing new models, we also recommend checking our repository for anything that has been added but not yet reflected in the documentation.


Q6: How to train a custom model for TF-Vision using models/official/vision?

We have provided an example project to demonstrate how to use TF Model Garden's building blocks to implement a new vision project from scratch. All the internal/external projects built on top of TFM can be found here for reference.


Q7: How to profile FLOPs? Looking for template code for profiling FLOPs on a TF2 SavedModel. Any suggestions?

Set log_model_flops_and_params to true when exporting a SavedModel to log the params and FLOPs, as shown here.
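
If you need a standalone snippet outside the exporter, one common approach (not specific to TF-Vision) is to profile the frozen graph of a serving signature with the TF1-compat profiler. This is only a rough sketch; the signature name and input handling may differ for your model:

import tensorflow as tf
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

def estimate_flops(saved_model_dir, signature='serving_default'):
  """Returns a rough FLOPs count for one serving signature of a SavedModel."""
  model = tf.saved_model.load(saved_model_dir)
  concrete_fn = model.signatures[signature]
  # Freeze variables into constants so the whole graph can be profiled.
  frozen_fn = convert_variables_to_constants_v2(concrete_fn)
  graph_def = frozen_fn.graph.as_graph_def()

  with tf.Graph().as_default() as graph:
    tf.graph_util.import_graph_def(graph_def, name='')
    opts = tf.compat.v1.profiler.ProfileOptionBuilder.float_operation()
    flops = tf.compat.v1.profiler.profile(graph=graph, options=opts)
  return flops.total_float_ops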


Q8: Does turning on regenerate_source_id in the Mask R-CNN data pipeline slow down the input pipeline?

regenerate_source_id adds some extra computation but rarely creates a bottleneck. You can run a quick proof of concept to check whether the input pipeline is the bottleneck or not.


Q9: Are pre-trained models trained without any data preprocessing (e.g. mean/variance normalization or scaling to [-1, 1]), i.e. do they expect inputs in the range [0.0, 255.0]?

All the pre-trained models are trained with well-structured input pipelines defined in the data loaders, which typically include normalization and augmentation. The normalization approach is task dependent, so we recommend checking each task's corresponding input pipeline for confirmation.

For example, mean and standard-deviation normalization is applied for classification tasks by default.


Q10: How does the model garden library write a summary? How to add image summary?

Here are the general steps to write a summary:

  • The save_summary argument of run_experiment controls whether or not to write a summary to the folder [ref].
  • Orbit controller writes the train/eval outputs to a folder with a summary writer [ref].
    • It requires an eval_summary_manager to write the summary [ref]. The default eval_summary_manager only writes scalar summaries.

We support writing image summaries that show predicted bounding boxes for the RetinaNet task. This can be adapted to write other types of summaries. Here are the steps:

  • We have created a custom summary manager that can write image summaries [ref].

  • We optionally build this summary manager if the corresponding task supports writing such summaries [ref], and pass it into the trainer as eval_summary_manager [ref].

  • In the task, we collect the necessary predictions [ref] in validation_step, update them in aggregate_logs [ref], and add the visualization to the returned logs in reduce_aggregated_logs [ref], so that the summary manager can identify this information and write it to the summary.

  • We also need to set allow_image_summary to True in the task config to enable this [ref].
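
For reference, the config change in the last step is a one-line override (assuming the task config exposes allow_image_summary as referenced above):

task:
  allow_image_summary: true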


Q11: ViT model: Running inference a second time throws an OOM error when using the ViT model in inference-only mode inside a Colab with some modifications. It seems we can only run inference once with it; the second time an input is fed, even if it's the same image, it runs out of GPU memory.

Check whether any large intermediate tensors or objects from the previous inference are still alive. If any Python variables still refer to those tensors, delete them. You can also import gc and run garbage collection with gc.collect().
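
A minimal sketch of that suggestion (outputs is a hypothetical variable holding the results of the previous inference):

import gc

# Drop the Python reference to the previous inference results so the backing
# GPU tensors can be released, then force a garbage-collection pass.
del outputs  # hypothetical variable from the previous run
gc.collect()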


Q12: Is there a way to add a post-train_step process similar to aggregate_logs and reduce_aggregated_logs for the validation step? How to include the individual training losses, i.e. L = L_1 + L_2 + ... + L_n, as part of the plots?

To do this, you will need to create a custom trainer. To include the individual training losses, create a Mean metric for each loss and propagate the loss value to that metric during the train step. It also depends on whether you need to run these metrics on the CPU; if not, you can follow the Mask R-CNN task: define the losses (reference) and define the metrics (reference).

Individual training losses will show up on TensorBoard if they are added to the returned logs. Average precision is reported in reduce_aggregated_logs.
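
A minimal sketch of that pattern inside a custom task (the method body and loss names are illustrative, not the exact library code):

import tensorflow as tf

def build_metrics(self, training: bool = True):
  # One Mean metric per loss term; the names are illustrative.
  metric_names = ['loss_1', 'loss_2', 'total_loss']
  return [tf.keras.metrics.Mean(name, dtype=tf.float32) for name in metric_names]

# Inside train_step, after the individual losses are computed:
#   all_losses = {'loss_1': loss_1, 'loss_2': loss_2, 'total_loss': total_loss}
#   for metric in metrics:
#     metric.update_state(all_losses[metric.name])
#     logs.update({metric.name: metric.result()})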


Q13: How to run task.eval_step (or task.train_step) in eager mode?

You can call tf.config.run_functions_eagerly(True) in the main function to enable eager mode. Refer to the code here.
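
A minimal sketch of the call:

import tensorflow as tf

# Force tf.function-decorated code (e.g. task.train_step / task.eval_step) to
# run eagerly so it can be stepped through with a regular Python debugger.
tf.config.run_functions_eagerly(True)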


Q14: Does TFM support computing and reporting eval metrics separately on each dataset or should each custom task figure out how to do it?

Please find experiment config for single-task training and multi-task evaluation here.


Q15: How to continue training after a run has finished (e.g. the previous run trained for 30k steps and I want to train an additional 10k)?

If your previous training completed 30k steps and you want to train an additional 10k, there are two ways:

  • set init_checkpoint to the last saved checkpoint
  • set model_dir to the training directory

Pay attention to the optimizer config: after you modify the number of training steps, the learning-rate schedule will change.

Also, if you start the training in the same model dir, you will lose the checkpoints from the previous training run, since we only keep the last 5. So if you are planning to experiment with fine-tuning, we suggest starting a new run.
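
For the first option, a minimal YAML sketch might look like the following (the checkpoint path and step count are placeholders):

task:
  init_checkpoint: '/path/to/previous_run/ckpt-30000'  # placeholder path
  init_checkpoint_modules: 'all'
trainer:
  train_steps: 10000  # steps for the new run; the LR schedule follows this value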

Check out these configs for storing the best checkpoint.


Q16: Does TF-Vision support multiple workers with multiple GPUs?

The prerequisite is to configure MultiWorkerMirroredStrategy. tf.distribute.MultiWorkerMirroredStrategy implements synchronous distributed training across multiple workers, each with potentially multiple GPUs. It creates copies of all of the model's variables on each device across all workers. Please follow the guidelines here.
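
In a Model Garden experiment config, this typically means selecting the strategy in the runtime section, roughly as follows (the GPU count is illustrative):

runtime:
  distribution_strategy: 'multi_worker_mirrored'
  num_gpus: 8  # GPUs per worker; illustrative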


Q17: When running multiple eval jobs together with training jobs and modifying the model architecture under the task in the config yaml file using MultiEvalExperimentConfig, the eval jobs fail when loading the model. Is this expected behavior?

No, this is not expected behavior. The issue arises because the eval jobs do not read the model architecture from the task config; they reconstruct the model from an eval_task copy.

To address this, refrain from using the eval_task model configurations. The model should be constructed from the task: the MultiTaskEvaluator class takes the eval data tasks, and the model should be created here.


Q18: What is the advised approach for determining whether it is in the training phase within the Task.build_losses() method?

You can add a training argument to the build_losses() method. build_losses is invoked from either train_step or validation_step, so you can pass the correct training argument from each step.
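
A minimal sketch of that idea (the signature and helper below are illustrative, not the exact library code):

import tensorflow as tf

def build_losses(self, labels, model_outputs, training=True, aux_losses=None):
  loss = compute_main_loss(labels, model_outputs)  # hypothetical helper
  if training and aux_losses:
    loss += tf.add_n(aux_losses)  # e.g. include auxiliary terms only in training
  return loss

# In train_step:      loss = self.build_losses(labels, outputs, training=True)
# In validation_step: loss = self.build_losses(labels, outputs, training=False)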


Q19: How to mix two input datasets with fixed ratio in image classification training?

We provide an implementation that supports sampling from multiple training datasets for all major tasks, such as the classification, RetinaNet, Mask R-CNN, and segmentation tasks. The create_combine_fn of input_reader.py creates and returns a combine_fn for dataset mixing and is called in the build_inputs method of the respective tasks.

Refer to the sample config below:

train_data:
  input_path:
    d1: 'train1*'
    d2: 'train2*'
  weights:
    d1: 0.8
    d2: 0.2

Q20: How to add gradient magnitude logging to the metrics reported to TensorBoard if training from scratch using a model like mobilenet_imagenet?

You can add gradient magnitude logging into your metric log in the task class as a new dictionary key-value pair.

Please refer to the Image Classification task: there you can obtain the gradients and trainable variables (grads, tvars), compute the gradient magnitude, and update the metric logs. It will then be processed in the summary manager's write_summaries method.
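
A minimal sketch of a helper you could call from the task's train_step right after the gradients are computed (the helper name and log key are illustrative):

import tensorflow as tf

def add_gradient_norm_to_logs(logs, grads):
  """Adds the global gradient magnitude to the step logs (illustrative helper)."""
  logs.update({'gradient_norm': tf.linalg.global_norm(grads)})
  return logs

# In train_step, after grads are computed:
#   logs = add_gradient_norm_to_logs(logs, grads)
# The new key is then written out by the summary manager's write_summaries method.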


Q21: Does TFM support computing and reporting eval metrics separately on each dataset or should each custom task figure out how to do it?

Please find experiment config for single-task training and multi-task evaluation here.


Q22: I am training the new YOLOv7 model on my own dataset but encountered an OOM error on the TPU worker after approximately 6k steps, whereas with the COCO dataset it works fine. How to debug this OOM issue?

Add prefetch_buffer_size to the config file. There is a known issue with the auto-tuning of prefetch_buffer_size, so consider setting a suitable value explicitly instead. Learn more about prefetch_buffer_size here. Refer to the example below.

train_data:
  global_batch_size: 4096
  dtype: 'bfloat16'
  prefetch_buffer_size: 8
  input_path: 'Input Path'
validation_data:
  global_batch_size: 32
  ...

Q23: Is there a way to export a TF Model Garden model with arbitrary shape?

The user can set input_image_size to None if the model itself can be built with an arbitrary image size.

Refer to the example below.

export_saved_model_lib.export_inference_graph(
    input_type='image_tensor',
    batch_size=1,
    input_image_size=[None, None],
    params=exp_config,
    checkpoint_path=tf.train.latest_checkpoint(model_dir),
    export_dir=export_dir)


Q24: What is the number of images the model (e.g. Mask R-CNN with a ResNet-FPN backbone) sees during training?

The number of images seen during training is train_steps * global_batch_size. The relationship between global_batch_size and train_steps can be explained as follows:

train_epochs = 400
train_steps = math.floor(train_epochs * num_train_examples / train_data.global_batch_size)

# steps_per_loop = steps_per_epoch
steps_per_epoch = math.floor(num_train_examples / train_data.global_batch_size)
validation_steps = math.floor(num_val_examples / validation_data.global_batch_size)

# number of training steps to run between evaluations
validation_interval = steps_per_epoch

This assumes a training dataset with num_train_examples images and a validation dataset with num_val_examples images; train_epochs is a hyperparameter that you choose. train_steps depends on train_epochs and num_train_examples.
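
A quick worked example with illustrative numbers (100,000 training images, global batch size 256, 50 epochs):

import math

num_train_examples = 100_000
global_batch_size = 256
train_epochs = 50

steps_per_epoch = math.floor(num_train_examples / global_batch_size)  # 390
train_steps = train_epochs * steps_per_epoch                          # 19,500
images_seen = train_steps * global_batch_size                         # 4,992,000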


Q25: Is there an early stopping option in Model Garden? Is there any documentation, or an example config?

Early stopping is not currently integrated into the Model Garden. An alternative approach is to set up the training pipeline to export the best model based on your specified criteria. The NewBestMetric class keeps track of the best metric value seen so far. You can then train for an ample duration, and if signs of overfitting become apparent, you have the flexibility to halt the run. That works well for one-off experiments.

The best_checkpoint_eval_metric attribute of config_definition can be used for exporting the best checkpoint; it specifies the evaluation metric the trainer should monitor. Refer to the YAML file.
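
A minimal sketch of the relevant trainer fields (the metric name is task dependent and illustrative; we assume the companion fields best_checkpoint_export_subdir and best_checkpoint_metric_comp are set alongside it):

trainer:
  best_checkpoint_export_subdir: 'best_ckpt'
  best_checkpoint_eval_metric: 'AP'      # e.g. COCO AP for detection; illustrative
  best_checkpoint_metric_comp: 'higher'  # keep the checkpoint with the higher metric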


Glossary

Acronym | Meaning
------- | -------
TFM | TensorFlow Models
FAQs | Frequently Asked Questions
YAQ | Yet Another Question
TF | TensorFlow