Frequently Asked Questions
FAQs of TF-Vision
Q1: How to get started with Tensorflow Model Garden TF-Vision?
This user guide walks through how to train and fine-tune models and perform hyperparameter tuning in TF-Vision. For each model/task supported in TF-Vision, please refer to the corresponding tutorial for more detailed instructions.
Q2: How to use the models under tensorflow_models/official/vision/?
- Available models under TF-Vision: TF-Vision provides a good collection of models for various vision tasks: image classification, object detection, video classification, semantic segmentation, and instance segmentation. Please check this page to learn more about the available models. We will keep adding new support, and your suggestions are appreciated.
- Fine-tune from a checkpoint: TF-Vision supports loading pretrained checkpoints for fine-tuning. This is done simply by specifying task.init_checkpoint and task.init_checkpoint_modules in the task configuration. The value of task.init_checkpoint_modules depends on the pretrained modules' implementation and in general can be all, backbone, and/or decoder (for detection and segmentation). If set to all, all weights from the checkpoint will be loaded. If set to backbone, only the weights in the backbone component will be loaded and the other weights will be initialized from scratch. An example YAML file can be found here.
- Export SavedModel for serving: To export any TF 2.x model we trained, including tf.keras.Model and the plain tf.Module, we use the tf.saved_model.save() API. Our exporting library offers functionality to export a SavedModel for CPU/GPU/TPU serving.
Q3: How to fully/partially load pretrained checkpoints (e.g. backbone) to perform transfer learning using TF-Vision?
TF-Vision supports loading pretrained checkpoints for fine-tuning. This is done simply by specifying task.init_checkpoint and task.init_checkpoint_modules in the task configuration. The value of task.init_checkpoint_modules depends on the pretrained modules' implementation and in general can be all, backbone, and/or decoder (for detection and segmentation). If set to all, all weights from the checkpoint will be loaded. Let's use a concrete example for elaboration. Suppose the requirements are:
- Train a classification model with 10 classes.
- Save off a checkpoint of the model from step 1 (but only save the backbone, i.e. everything before the last Conv2D + softmax).
- Use the checkpoint from step 2 to train a new classification model with 4 novel classes.
For step 2, the model needs to specify the components to be saved in the checkpoint in the checkpoint_items.
For step 3, you can specify init_checkpoint and init_checkpoint_modules = 'backbone'. The new model with 4 classes will then only initialize the backbone, so that you can fine-tune the head. In this example, the backbone is everything before the global average pooling layer of the classification model. A config sketch follows below.
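A minimal sketch of the step-3 task configuration (the checkpoint path is hypothetical and other fields are omitted):
task:
  init_checkpoint: 'gs://your-bucket/step2_backbone_ckpt'  # hypothetical path
  init_checkpoint_modules: 'backbone'
  model:
    num_classes: 4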
Q4: How to export the tensorflow models trained using the TF-Vision package?
To export any TF 2.x model we trained, including tf.keras.Model and the plain tf.Module, we use the tf.saved_model.save() API. Our exporting library offers functionality to export a SavedModel for CPU/GPU/TPU serving. Moreover, the exported SavedModel can be further converted to a TFLite model for on-device inference.
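For illustration, a minimal sketch of the tf.saved_model.save() call; the model and export directory are placeholders:
import tensorflow as tf

# Any trained tf.keras.Model (or tf.Module) can be saved this way.
model = tf.keras.applications.ResNet50(weights=None)  # placeholder model
export_dir = '/tmp/resnet50_saved_model'               # placeholder output path
tf.saved_model.save(model, export_dir)

# The exported SavedModel can be reloaded, e.g. to sanity-check serving signatures.
reloaded = tf.saved_model.load(export_dir)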
Q5: Where can I look for a config file and documentation for the TF-Vision pretrained models?
The TF-Vision modeling library provides a collection of baselines and checkpoints for various vision tasks, including image classification, object detection, video classification and segmentation. The supported pretrained models and corresponding config files can be found here. Since we are actively developing new models, you are also recommended to check our repository for anything that has been added but not yet reflected in the documentation.
Q6: How to train a custom model for TF-Vision using models/official/vision?
We have provided an example project to demonstrate how to use TF Model Garden's building blocks to implement a new vision project from scratch. All the internal/external projects built on top of TFM can be found here for reference.
Q7: How to profile FLOPs? I am looking for template code for profiling FLOPs on a TF2 SavedModel. Any suggestions?
Set log_model_flops_and_params to true when exporting a SavedModel to log the params and FLOPs, as shown here.
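A sketch of the export call, assuming the export_inference_graph entry point accepts a log_model_flops_and_params flag (verify the exact argument name against the current export library); exp_config, model_dir and export_dir are placeholders:
import tensorflow as tf
from official.vision.serving import export_saved_model_lib

export_saved_model_lib.export_inference_graph(
    input_type='image_tensor',
    batch_size=1,
    input_image_size=[224, 224],
    params=exp_config,
    checkpoint_path=tf.train.latest_checkpoint(model_dir),
    export_dir=export_dir,
    log_model_flops_and_params=True)  # writes FLOPs/params logs during export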
Q8: Would turning on regenerate_source_id in the mask_r_cnn data pipeline slow down the input pipeline?
The regenerate_source_id option adds some extra computation but rarely creates the bottleneck. You can run a quick proof of concept to check whether the input pipeline is the bottleneck or not.
Q9: Are pre-trained models trained without any data preprocessing (e.g. mean/variance normalization or scaling to [-1, 1]), i.e. do they expect inputs in the range [0.0, 255.0]?
All the pre-trained models are trained with well-structured input pipelines defined in data loaders, which typically include normalization and augmentation. The normalization approach is task dependent, and you are recommended to check each task's corresponding input pipeline for confirmation:
- Classification: classification_input.py.
- Object Detection and Instance Segmentation: maskrcnn_input.py and retinanet_input.py.
- Semantic Segmentation: segmentation_input.py.
For example, mean and std normalization is applied for classification tasks by default.
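A rough sketch of what that normalization looks like; the constants below are the common ImageNet values, so check classification_input.py for the exact ones used:
import tensorflow as tf

MEAN_RGB = (0.485 * 255, 0.456 * 255, 0.406 * 255)
STDDEV_RGB = (0.229 * 255, 0.224 * 255, 0.225 * 255)

def normalize_image(image):
  # `image` is expected in the [0, 255] range with shape [H, W, 3].
  image = tf.cast(image, tf.float32)
  image -= tf.constant(MEAN_RGB, shape=[1, 1, 3])
  image /= tf.constant(STDDEV_RGB, shape=[1, 1, 3])
  return image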
Q10: How does the Model Garden library write a summary? How do I add an image summary?
Here are the general steps to write a summary:
- The save_summary argument of run_experiment controls whether or not to write a summary to the folder [ref].
- The Orbit controller writes the train/eval outputs to a folder with a summary writer [ref].
- It requires an eval_summary_manager to write the summary [ref]. The default eval_summary_manager only writes scalar summaries.
We support writing an image summary to show predicted bounding boxes for the RetinaNet task. It can be adapted to write other types of summaries. Here are the steps (a config sketch follows the list):
- We have created a custom summary manager that can write image summaries [ref].
- We optionally build the summary manager if the corresponding task supports writing such a summary [ref], and pass it into the trainer as eval_summary_manager [ref].
- In the task, we collect the necessary predictions in validation_step [ref], update them in aggregate_logs [ref], and add visualizations to the returned logs in reduce_aggregated_logs [ref], so that the summary manager can identify this information and write it to the summary.
- We also need to set allow_image_summary to True in the task config to enable this [ref].
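A minimal sketch of the last step; only the relevant task config field is shown:
task:
  allow_image_summary: true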
Q11: ViT model: running inference a second time throws an OOM error when using the ViT model in inference-only mode inside a Colab with some modifications. It seems we can only run inference once; the second time an input is fed, even if it is the same image, it runs out of GPU memory.
Check whether any large intermediate tensors or objects are still alive from the previous inference. If any Python variables refer to those tensors, delete them. You can also import gc and run garbage collection via gc.collect().
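A minimal sketch of the pattern; the tiny Keras model stands in for the loaded ViT model:
import gc
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.GlobalAveragePooling2D()])  # placeholder model
images = np.zeros((1, 224, 224, 3), dtype=np.float32)                    # placeholder batch

outputs = model(images)   # first inference
del outputs               # drop Python references to the large result tensors
gc.collect()              # reclaim memory before running inference again
outputs = model(images)   # second inference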
Q12: Is there a way to add a post train_step process similar to aggregate_logs and reduce_aggregated_logs for the validation step? How to include the individual training losses i.e. L = L_1 + L_2 + ... + L_n as part of the plots?
To do this, you will need to create a custom trainer. To include the individual training losses, create a Mean metric for each loss and propagate the loss value to that metric during the train step. It also depends on whether you need to run these metrics on CPU; if not, you can do it like Mask R-CNN does: define losses reference and define metrics reference.
Individual training losses should show up on TensorBoard if they are added to the returned logs. Average precision is reported in reduce_aggregated_logs.
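A sketch of the per-loss metrics (simplified, not the library's exact API):
import tensorflow as tf

# One Mean metric per loss component so each shows up as its own TensorBoard scalar.
loss_metrics = {
    'loss/l1': tf.keras.metrics.Mean('loss/l1'),
    'loss/l2': tf.keras.metrics.Mean('loss/l2'),
}

def update_loss_logs(l1_value, l2_value):
  loss_metrics['loss/l1'].update_state(l1_value)
  loss_metrics['loss/l2'].update_state(l2_value)
  # Returning these in the train-step logs lets the summary manager write them.
  return {name: m.result() for name, m in loss_metrics.items()}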
Q13: How to run task.eval_step (or task.train_step) in eager mode?
You can call tf.config.run_functions_eagerly(True) in the main function to enable eager mode. Refer to the code here.
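A minimal sketch of where to put the call:
import tensorflow as tf

# Call this early in main() so tf.function-decorated steps such as
# task.train_step / task.eval_step run eagerly and can be debugged with pdb/print.
tf.config.run_functions_eagerly(True)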
Q14: Does TFM support computing and reporting eval metrics separately on each dataset or should each custom task figure out how to do it?
Please find experiment config for single-task training and multi-task evaluation here.
Q15: An experiment ran 30k steps and the user wants to run ~10k more steps starting from where he left off. What's the recommended way to do this? Does he need to run a new job for 10k train_steps, with the init checkpoint set to the last checkpoint of the previous run?
If your previous training completed 30k steps and you want to train for an additional 10k, there are two ways (a config sketch follows the list):
- set init_checkpoint to the last saved checkpoint, or
- set model_dir to the training directory.
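A minimal config sketch for each option; the path and step counts are illustrative:
# Option 1: new run initialized from the last checkpoint.
task:
  init_checkpoint: 'gs://your-bucket/previous_run/ckpt-30000'  # hypothetical path
trainer:
  train_steps: 10000   # the additional steps for the new run

# Option 2: resume in the same model_dir and extend the total step count.
trainer:
  train_steps: 40000   # the original 30k plus the additional 10k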
Please pay attention to the optimizer config: after you modify the training steps, the LR curve will change.
Also, if you start the training in the same model dir, you will lose checkpoints for the previous training run since we only keep the last 5. So if you are planning to experiment with fine-tuning, it is suggested to start a new run.
Check out these configs for storing the best checkpoint.
Q16: Does TF-Vision support multi workers with multi GPUs?
The prerequisite is to configure "MultiWorkerMirroredStrategy". The
tf.distribute.MultiWorkerMirroredStrategy implements synchronous distributed
training across multiple workers, each with potentially multiple GPUs. It
creates copies of all variables in the model on each device across all workers.
Please follow the guidelines
here.
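A minimal sketch of creating the strategy; cluster membership typically comes from the TF_CONFIG environment variable, and the model here is a placeholder:
import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
  model = tf.keras.applications.ResNet50(weights=None)  # placeholder model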
Q17: When running multiple eval jobs alongside training jobs and modifying the model architecture under the task in the config YAML file using MultiEvalExperimentConfig, the eval jobs fail when loading the model. Is this expected behavior?
No, this is not expected behavior. The cause of the issue is that the eval jobs are not reading the model architecture from the task config but from an eval_task copy when reconstructing the model.
To address this issue, refrain from using the eval_task model configurations. The model should be constructed from the task. The MultiTaskEvaluator class takes the eval data tasks, and the model should be created here.
Q18: What is the advised approach for determining whether it is in the training phase within the Task.build_losses() method?
Users can add a training argument to the build_losses() method. Since build_losses is invoked from either train_step or validation_step, you can pass the correct training argument from each step, as in the sketch below.
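A minimal sketch with simplified signatures (not the exact task API):
import tensorflow as tf

class ExampleTask:

  def build_losses(self, labels, outputs, training=True):
    loss = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            labels, outputs, from_logits=True))
    if training:
      loss += 0.1  # placeholder for a training-only loss term (e.g. regularization)
    return loss

  def train_step(self, labels, outputs):
    return self.build_losses(labels, outputs, training=True)

  def validation_step(self, labels, outputs):
    return self.build_losses(labels, outputs, training=False)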
Q19: How to mix two input datasets with fixed ratio in image classification training?
We have an implementation that supports sampling from multiple training datasets for all major tasks, such as the classification, RetinaNet, Mask R-CNN and segmentation tasks. The create_combine_fn of input_reader.py creates and returns a combine_fn for dataset mixing and is called in the build_inputs method of the respective tasks.
Refer to the sample config below:
train_data:
  input_path:
    d1: 'train1*'
    d2: 'train2*'
  weights:
    d1: 0.8
    d2: 0.2
Q20: How to add gradient magnitude logging to the metrics reported to TensorBoard if training from scratch using a model like mobilenet_imagenet?
You can add gradient magnitude logging to your metric logs in the task class as a new dictionary key-value pair.
Please refer to the Image Classification Task; there you can obtain the pairs of gradients and trainable variables (grads, tvars), compute the gradient magnitude, and update the metric logs.
It will then be processed in the summary_manager's write_summaries method. A sketch follows below.
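A sketch of what could be added inside a custom train_step (simplified):
import tensorflow as tf

def build_step_logs(loss, grads):
  # Add the global gradient norm to the step logs so the summary manager writes it
  # to TensorBoard alongside the other scalar metrics.
  logs = {'loss': loss}
  logs['gradient/global_norm'] = tf.linalg.global_norm(grads)
  return logs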
Q21: I am training the new YOLOv7 model on my own dataset, but I encounter an OOM error in the TPU worker after approximately 6k steps, whereas it works fine with the COCO dataset. How can I debug this OOM issue?
Set prefetch_buffer_size in the config file. A known issue exists regarding the auto-tuning of prefetch_buffer_size, so consider setting a suitable value explicitly instead. Learn more about prefetch_buffer_size here.
Refer to the example below.
train_data:
  global_batch_size: 4096
  dtype: 'bfloat16'
  prefetch_buffer_size: 8
  input_path: 'Input Path'
validation_data:
  global_batch_size: 32
  ...
Q22: Is there a way to export a TF Model Garden model with arbitrary shape?
You can set input_image_size to None if the model itself can be built with an arbitrary image size.
Refer to the example below.
export_saved_model_lib.export_inference_graph(
    input_type='image_tensor',
    batch_size=1,
    input_image_size=[None, None],
    params=exp_config,
    checkpoint_path=tf.train.latest_checkpoint(model_dir),
    export_dir=export_dir)
Q23: What is the number of images the model (e.g. Mask R-CNN with a ResNet-FPN backbone) sees during training?
The number of images seen during training is train_steps * global_batch_size. The relationship between global_batch_size and train_steps can be explained as follows:
train_epochs = 400
train_steps = math.floor(train_epochs * num_train_examples / train_data.global_batch_size)
# steps_per_loop = steps_per_epoch
steps_per_epoch = math.floor(num_train_examples / train_data.global_batch_size)
validation_steps = math.floor(num_val_examples / validation_data.global_batch_size)
# number of training steps to run between evaluations
validation_interval = steps_per_epoch
Here we assume a training dataset with num_train_examples images and a validation dataset with num_val_examples images; train_epochs is a hyperparameter that you need to choose. train_steps then depends on train_epochs and num_train_examples.
Q24: Is there an early stopping option in Model Garden? Is there any documentation, or an example config?
Early stopping is not currently integrated into the Model Garden. An alternative approach is to set up the training pipeline to export the best model based on your specified criteria. The NewBestMetric class keeps track of the best metric value seen so far. Subsequently, you can train for an ample duration, and if signs of overfitting become apparent, you have the flexibility to halt the run accordingly. That works well for one-off experiments.
The best_checkpoint_eval_metric attribute of
config_definition
can be used for exporting the best checkpoint, specifying the evaluation metric
the trainer should monitor. Refer to the
YAML
file.
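A rough sketch of the relevant trainer fields; the metric name is just an example, and the export subdir and comparison fields are assumptions based on the trainer config, so verify them against config_definition:
trainer:
  best_checkpoint_export_subdir: 'best_ckpt'   # assumed field: where the best checkpoint is written
  best_checkpoint_eval_metric: 'accuracy'      # metric the trainer monitors (example value)
  best_checkpoint_metric_comp: 'higher'        # assumed field: whether higher or lower is better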
Glossary
| Acronym | Meaning |
|---|---|
| TFM | TensorFlow Models |
| FAQs | Frequently Asked Questions |
| YAQ | Yet Another Question |
| TF | TensorFlow |