Transformer Transducer Model
class openspeech.models.transformer_transducer.model.TransformerTransducerModel(configs: omegaconf.dictconfig.DictConfig, tokenizer: openspeech.tokenizers.tokenizer.Tokenizer)

In Transformer-Transducer, every layer is identical for both the audio and label encoders. Unlike the basic Transformer structure, the audio encoder and label encoder are separate, so alignment is handled by a separate forward-backward process within the RNN-T architecture, and the LSTM encoders of RNN-T are replaced with Transformer encoders.
- Parameters
configs (DictConfig) – configuration set
tokenizer (Tokenizer) – tokenizer is in charge of preparing the inputs for a model.
- Inputs:
inputs (torch.FloatTensor): An input sequence passed to the encoders. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).
input_lengths (torch.LongTensor): The length of each input sequence. (batch)
- Returns
Result of model predictions.
- Return type
outputs (dict)
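A minimal usage sketch, assuming a fully populated configs (DictConfig) and an OpenSpeech tokenizer are already available; their exact construction depends on the training setup and is not shown::

    import torch

    # Hypothetical feature shape; the actual input dimension comes from
    # the audio configuration (e.g. 80-dim filter banks).
    batch, seq_length, dimension = 4, 200, 80

    inputs = torch.randn(batch, seq_length, dimension)  # padded audio features
    input_lengths = torch.full((batch,), seq_length, dtype=torch.long)

    # Assuming `configs` and `tokenizer` are prepared elsewhere:
    # model = TransformerTransducerModel(configs=configs, tokenizer=tokenizer)
    # outputs = model(inputs, input_lengths)  # dict of model predictions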
greedy_decode(encoder_outputs: torch.Tensor, max_length: int) → torch.Tensor

Decode encoder_outputs.
- Parameters
encoder_outputs (torch.FloatTensor) – An output sequence of the encoders. FloatTensor of size (seq_length, dimension)
max_length (int) – max decoding time step
- Returns
model’s predictions.
- Return type
y_hats (torch.IntTensor)
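To illustrate the decoding loop, here is a schematic greedy transducer decode. The embedding and linear projection are hypothetical stand-ins for the model's label encoder and joint network, and the blank-skipping rule is the generic transducer policy; the internal implementation in OpenSpeech may differ in detail::

    import torch
    import torch.nn as nn

    vocab_size, dim = 32, 512
    embed = nn.Embedding(vocab_size, dim)      # stand-in label encoder
    joint_fc = nn.Linear(dim * 2, vocab_size)  # stand-in joint network

    def greedy_decode_sketch(encoder_outputs, max_length, sos_id=1, blank_id=0):
        """Greedily decode encoder outputs of size (seq_length, dimension)."""
        pred_tokens = []
        token = torch.tensor([sos_id])
        for t in range(max_length):
            dec_out = embed(token).squeeze(0)  # label-side features
            logits = joint_fc(torch.cat([encoder_outputs[t], dec_out]))
            pred = int(logits.argmax())
            if pred != blank_id:               # emit only non-blank tokens
                pred_tokens.append(pred)
                token = torch.tensor([pred])   # feed the prediction back
        return torch.IntTensor(pred_tokens)

    y_hats = greedy_decode_sketch(torch.randn(10, dim), max_length=10)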
Transformer Transducer Configuration
class openspeech.models.transformer_transducer.configurations.TransformerTransducerConfigs(model_name: str = 'transformer_transducer', encoder_dim: int = 512, d_ff: int = 2048, num_audio_layers: int = 18, num_label_layers: int = 2, num_attention_heads: int = 8, audio_dropout_p: float = 0.1, label_dropout_p: float = 0.1, decoder_hidden_state_dim: int = 512, decoder_output_dim: int = 512, conv_kernel_size: int = 31, max_positional_length: int = 5000, optimizer: str = 'adam')

This is the configuration class to store the configuration of a TransformerTransducer. It is used to initialize a TransformerTransducer model. Configuration objects inherit from OpenspeechDataclass.
- Parameters
model_name (str) – Model name (default: transformer_transducer)
encoder_dim (int) – Dimension of encoder. (default: 512)
d_ff (int) – Dimension of feed forward network. (default: 2048)
num_attention_heads (int) – The number of attention heads. (default: 8)
num_audio_layers (int) – The number of audio layers. (default: 18)
num_label_layers (int) – The number of label layers. (default: 2)
audio_dropout_p (float) – The dropout probability of encoder. (default: 0.1)
label_dropout_p (float) – The dropout probability of decoder. (default: 0.1)
decoder_hidden_state_dim (int) – Hidden state dimension of decoder (default: 512)
decoder_output_dim (int) – Dimension of model output. (default: 512)
conv_kernel_size (int) – Kernel size of convolution layer. (default: 31)
max_positional_length (int) – Max length of positional encoding. (default: 5000)
optimizer (str) – Optimizer for training. (default: adam)
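Since this class is a dataclass, a configs object can be built directly with OmegaConf.structured. A minimal sketch, assuming the defaults above apart from one override (how fields are grouped in a full Hydra run may differ)::

    from omegaconf import OmegaConf

    from openspeech.models.transformer_transducer.configurations import (
        TransformerTransducerConfigs,
    )

    # Build a typed DictConfig from the dataclass, overriding one default.
    configs = OmegaConf.structured(TransformerTransducerConfigs(num_audio_layers=12))

    print(configs.encoder_dim)       # 512 (default)
    print(configs.num_audio_layers)  # 12  (override)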