Frequently Asked Questions

Introduction

The goal of this document is to capture Frequently Asked Questions (FAQs) related to TensorFlow-Models-NLP (TF-NLP). The questions are drawn from external sources (GitHub, StackOverflow, Google Groups, etc.).

FAQs of TF-NLP


Q1: How do I cite TF-NLP when the libraries are used in external research code bases?

If you use TensorFlow Model Garden in your research GitHub repositories, please cite this repository in your publication. The citation is at the following location.


Q2: How do I load pretrained NLP models?

  • How to initialize from a checkpoint: If you use the TF-NLP training library, you can specify the checkpoint path directly when launching your job. For example, follow the BERT fine-tuning command and initialize the model from a checkpoint with
    --params_override=task.init_checkpoint=PATH_TO_INIT_CKPT

  • How to load a TF-Hub SavedModel: TF-NLP's fine-tuning tasks, such as question answering (SQuAD) and sentence prediction (GLUE), support loading a model from TF-Hub. These built-in tasks expose a task.hub_module_url parameter. To set this parameter, follow the BERT fine-tuning command and replace --params_override=task.init_checkpoint=... with
    --params_override=task.hub_module_url=TF_HUB_URL.
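
Outside the training library, a TF-Hub encoder can also be loaded directly as a Keras layer for quick experimentation. Below is a minimal, hedged sketch; the hub handle is just an example BERT encoder, and the input feature names follow the TF-Hub BERT SavedModel convention:

    import tensorflow as tf
    import tensorflow_hub as hub

    # Example TF-Hub handle; substitute the SavedModel you want to load.
    TF_HUB_URL = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"
    encoder = hub.KerasLayer(TF_HUB_URL, trainable=True)

    # These BERT SavedModels expect a dict of int32 features.
    encoder_inputs = dict(
        input_word_ids=tf.keras.Input(shape=(128,), dtype=tf.int32),
        input_mask=tf.keras.Input(shape=(128,), dtype=tf.int32),
        input_type_ids=tf.keras.Input(shape=(128,), dtype=tf.int32),
    )
    outputs = encoder(encoder_inputs)  # dict with 'sequence_output', 'pooled_output', ...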


Q3: How do I change the pretraining loss function for BERT?

You can change the pretraining loss function in the code here.


Q4: The transformer code extends keras.Model. Can I use constructs like model.fit() for training, as with any TF2/Keras model? Are there any tutorials or starting points for setting up training and evaluation of a transformer model with TF-NLP?

The native Keras Model fit() and predict() do not work for the seq2seq transformer model. TF Model Garden uses the workflow defined here.
The code defines the translation task.


Q5: Is there an easy way to set up a model server from a checkpoint (as opposed to an exported saved_model)?

A model server requires a SavedModel. If you just want to inspect the outputs, this colab can help.
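
If you do need a serving artifact, a common path is to rebuild the model, restore the checkpoint, and export a SavedModel yourself. A minimal sketch, where build_model(), PATH_TO_INIT_CKPT, and EXPORT_DIR are placeholders and the checkpoint key must match how the checkpoint was written:

    import tensorflow as tf

    model = build_model()                    # placeholder: rebuild the same model architecture
    ckpt = tf.train.Checkpoint(model=model)  # key ('model') must match the writer side
    ckpt.restore(PATH_TO_INIT_CKPT).expect_partial()

    # Export a SavedModel that a model server can load.
    tf.saved_model.save(model, EXPORT_DIR)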


Q6: How do I override values in the experiment configuration (for example, the batch size) from the command line?

The experiment configuration can be overridden with the --params_override flag on the command line. It only supports scalar values. Please find the implementation here.
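
For example, a scalar field such as the training batch size can be overridden with a dotted path (illustrative; the exact path depends on your experiment configuration):

    --params_override=task.train_data.global_batch_size=4096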


Q7: Training with a global batch size of 4096 and a local batch size of 128 on 4x4 TPUs is very slow. Will the quality change if I move to 8x8 TPUs with the local batch size fixed at 128 (global batch size 16384)?

The global batch size should be the key factor. As you increase the batch size, you may need to tune the learning rate to match the quality achieved with the smaller batch size. If the task is retrieval, using a global softmax is recommended. An example can be found here.


Q8: In some TF-NLP examples, the model output logits are cast to float32. Aren't the logits already floating point?

For mixed precision training, the activations inside the model can be in bfloat16/float16 format. The model output logits are cast to float32 to make sure the softmax and losses are calculated in float32. This avoids numeric issues that may occur if the intermediate tensor flowing from the softmax to the loss is float16 or bfloat16. You can also refer to the mixed precision guide for more information.
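
As an illustrative sketch of the pattern (not the exact model code), the final cast under a Keras mixed precision policy looks like this:

    import tensorflow as tf

    # Activations run in float16; variables stay in float32.
    tf.keras.mixed_precision.set_global_policy('mixed_float16')

    inputs = tf.keras.Input(shape=(128,))
    x = tf.keras.layers.Dense(64, activation='relu')(inputs)  # float16 activations
    logits = tf.keras.layers.Dense(10)(x)                     # still float16
    logits = tf.cast(logits, tf.float32)  # cast back so softmax/loss run in float32
    model = tf.keras.Model(inputs, logits)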


Q9: Is it possible to use gradient clipping in the optimizer used with the BERT encoder? If yes, is there a sample of its usage?

We have the gradient_clip_norm argument in AdamW. The newer Keras optimizers also offer global_clipnorm, clipnorm, and clipvalue as kwargs (see the sketch after the config example below).

Please refer to the example below:

optimizer:
  adamw:
    beta_1: 0.9
    beta_2: 0.999
    weight_decay_rate: 0.05
    gradient_clip_norm: 0.0
  type: adamw

The legacy implementation used in the BERT paper can be found here [ref].
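
With the newer Keras optimizers, the clipping kwargs can be passed directly. A minimal sketch with illustrative values (requires a TF version that ships tf.keras.optimizers.AdamW):

    import tensorflow as tf

    optimizer = tf.keras.optimizers.AdamW(
        learning_rate=3e-5,
        weight_decay=0.05,
        global_clipnorm=1.0,  # clip the global norm of all gradients
    )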


Q10: I am trying to create an embedding table with 4.7 million rows and 512 dimensions. However, nlp.modeling.layers.OnDeviceEmbedding fails with the following error: UnknownError: Attempting to allocate 4.54G. That was not possible. There are 2.94G free.
Is there a way to increase this capacity, or are there alternatives to OnDeviceEmbedding that work within the same framework?

An embedding table with 4.7 million rows and 512 dimensions is very large, and it will be placed on the TPU TensorCore.
The tips below might help:

  • Try to reduce the number of rows
  • Consider mixed_precision_dtype: 'bfloat16' training to reduce memory cost.

Q11: What is the difference between seq_length in glue_mnli_matched.yaml and max_position_embeddings in bert_en_uncased_base.yaml? Why are they not the same?

seq_length is the padded input length, and max_position_embeddings is the size of the learned position embeddings. The seq_length value should always be less than or equal to the max_position_embeddings value (seq_length <= max_position_embeddings).


Q12: While running a model with the TF-NLP framework, I noticed that when the number of validation steps is increased (even by 10), the experiments get much slower. Is that expected?

This is not expected for 10 validation steps. Recommended tips:

  • Increase the validation interval
  • Use --add_eval to start a side-car job for eval
  • Collect an xprof profile for the eval job. TF2 eager execution is known to be slow.

Q13: How do I load checkpoints for the BERT model? Any recommendations on how to deal with a variable mismatch error?

We recommend using tf.train.Checkpoint and managing the objects (including inner layers) directly. Details on restoring the encoder weights can be found here. A TF-NLP checkpoint tutorial is here.
The variable mismatch error is due to the classifier_model not being equal to the threephil model. The recommendation is to use the same code and class as the threephil model to read the checkpoint. A Keras functional model cannot guarantee that the Python objects match if the model creation code is different.
More to read: https://www.tensorflow.org/guide/checkpoint
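
A minimal sketch of object-based restoring with tf.train.Checkpoint; the encoder constructor arguments are illustrative, and the checkpoint key (here encoder) must match how the checkpoint was written:

    import tensorflow as tf
    import tensorflow_models as tfm

    # Build the encoder with the same code/class that produced the checkpoint.
    bert_encoder = tfm.nlp.networks.BertEncoder(vocab_size=30522, num_layers=12)

    checkpoint = tf.train.Checkpoint(encoder=bert_encoder)
    status = checkpoint.read(PATH_TO_INIT_CKPT)  # placeholder path
    status.assert_existing_objects_matched()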


Q14: Why does saving a Bert2Bert model instance fail without passing the label input, i.e. target_ids?

Bert2Bert needs input_ids, input_mask, segment_ids, and target_ids to train. You should save the model with all of these features provided.

If you care about inference and there are no target_ids, you should not use Keras model.save(); Keras does not support None as an input. Instead, directly define a tf.Module that includes the Bert2Bert core model and save the tf.function using the tf.saved_model.save() API. Refer to the example for the translation task. In general, seq2seq models are not friendly to Keras assumptions.
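
A minimal sketch of that pattern; the bert2bert object and the dict of features it accepts are assumptions, so see the translation task example for the real exporter:

    import tensorflow as tf

    class InferenceModule(tf.Module):
      """Wraps the core model so it can be exported without target_ids."""

      def __init__(self, model):
        super().__init__()
        self.model = model

      @tf.function(input_signature=[{
          'input_ids': tf.TensorSpec([None, None], tf.int32),
          'input_mask': tf.TensorSpec([None, None], tf.int32),
          'segment_ids': tf.TensorSpec([None, None], tf.int32),
      }])
      def serve(self, features):
        # Assumption: the core model runs decoding when no target_ids are given.
        return self.model(features)

    module = InferenceModule(bert2bert)  # bert2bert: the core seq2seq model (assumed to exist)
    tf.saved_model.save(
        module, '/tmp/bert2bert_export',
        signatures={'serving_default': module.serve})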


Q15: How do I fix the TPU inference error with the Transformer?

A potential cause of the error is having multiple inputs where the batch size of one of them differs from the rest.

Here are some explanations and troubleshooting tips:

  • Resolve the batching issue by implementing signature batching.
  • Address the dynamic dimension problem by setting max_batch_size and allowed_batch_sizes to 1.

Q16: Are there any models/methods that can improve the latency of the feed-forward neural network portion of the transformer encoder block (on CPU and GPU)?

There are sparse-mixture and conditional-computation blocks that can provide a speed-up. The block sparse feedforward layer might be promising for performance purposes. It would work nicely on CPU and GPU, since the reshaping ops in this layer are free on CPUs/GPUs, and it offers a speed-up for models of similar size (a caveat is that we observed some quality drop with block sparse feedforward in the past).

Refer to the Sparse Mixer encoder network and the FNet encoder network for more sparse-mixture references.

Conditional computation is a model architecture paradigm in which specific sections of the computational graph are activated based on input conditions. Models following this paradigm can be more efficient, especially when increasing model capacity or reducing inference latency.

Refer to the ExpandCondense tensor network layer and the Gated linear feedforward layer for FFN blocks. The above-mentioned techniques work especially well with long sequence lengths.

Please refer to the additional notes below based on your specific use cases.

  • For small student models, we used only 1 expert and routed far fewer tokens to the FFN expert.
  • Set routing_group_size so that each routing step combines the tokens from multiple sequences and selects, for example, 1/4 of the tokens.
  • This works well for distillation or when the model can be pretrained. There will be a quality gap because many tokens skip the FFN computation.

Q17: How do I obtain the final-layer embeddings from a model? Is there an example?

Refer to the call method of the Transformer-based BERT encoder network. The sequence_output is the last-layer embeddings with shape [batch_size, seq_length, hidden_size].
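
A minimal sketch of pulling sequence_output out of a BERT encoder; the constructor arguments and dummy inputs are illustrative:

    import tensorflow as tf
    import tensorflow_models as tfm

    encoder = tfm.nlp.networks.BertEncoder(vocab_size=30522, num_layers=2)
    dummy_inputs = dict(
        input_word_ids=tf.ones((2, 16), tf.int32),
        input_mask=tf.ones((2, 16), tf.int32),
        input_type_ids=tf.zeros((2, 16), tf.int32),
    )
    outputs = encoder(dummy_inputs)
    sequence_output = outputs['sequence_output']  # [batch_size, seq_length, hidden_size]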


Q18: Is it possible to convert public TF-Hub models like sentence-t5 for TPU use?

The Inference Converter V2 deploys user-provided function(s) on the XLA device (TPU or XLA GPU) and optimizes them.


Q19: Is it possible to have a dynamic batch size for edit5 models using sampling modules?

This may depend on the decoding algorithm. For beam_search, the source of the issue is that at sampling-initialization time it needs to allocate the [batch_size, beam_size, ...] buffer, so the batch size is fixed. Note that a dynamic batch size may not be easily achievable.

Users can also see this in the AutoMUM distillation sampling module, which makes the batch size static.

For greedy decoding it can possibly be done, since it doesn't require beam_size.


Q20: Is multi-label tagging distillation supported by text tagging distillation?

Currently the template only does basic per-token binary classification. If you intend to perform multi-label classification for each token, it shouldn't be overly challenging; it mainly involves adjusting the number of classes and switching to a multi-label loss.
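
A minimal sketch of the loss switch, assuming per-token logits and multi-hot labels of shape [batch, seq_length, num_classes]:

    import tensorflow as tf

    # Dummy shapes for illustration: 2 sequences, 8 tokens, 5 possible tags per token.
    logits = tf.random.normal([2, 8, 5])
    multi_hot_labels = tf.cast(tf.random.uniform([2, 8, 5]) > 0.5, tf.float32)

    # Multi-label per-token loss: sigmoid cross-entropy instead of softmax.
    per_token_loss = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=multi_hot_labels, logits=logits)
    loss = tf.reduce_mean(per_token_loss)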


Q21: The TFM BERT intentionally uses an OnDeviceEmbedding. Is it possible to add an option that forces the embedding table onto the CPU, i.e. placing the embeddings for transformer models on CPU to save HBM memory?

For this optimization, users can simply place the word embeddings on the CPU. Using the input_word_embeddings path in the BertEncoderV2 class is sufficient for optimizing HBM usage during serving.


Q22: Is there a possibility of getting TF2 versions of Gemini/MUM? Basically, a checkpoint converter and a TF2 variant for instantiating the corresponding Transformer?

JAX is the way forward at the moment for Gemini.


Q23: Is it possible to perform MLM pretraining in text tagging as well?

The MLM functionality in text_tagging is currently not available.


Glossary

Acronym   Meaning
TFM       TensorFlow Models
FAQs      Frequently Asked Questions
TF        TensorFlow