Transformer Transducer Model

class openspeech.models.transformer_transducer.model.TransformerTransducerModel(configs: omegaconf.dictconfig.DictConfig, tokenizer: openspeech.tokenizers.tokenizer.Tokenizer)[source]

In Transformer-Transducer, every layer is identical for both the audio and label encoders. Unlike the basic Transformer structure, the audio encoder and label encoder are separate, so the alignment is handled by a separate forward-backward process within the RNN-T architecture, and the LSTM encoders of the RNN-T architecture are replaced with Transformer encoders.
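
As in standard RNN-T, the outputs of the two separate encoders are combined per (time step, label step) pair by a joint network. The following is a minimal sketch of such a joint computation; the class name, the concatenation-based joint, and the dimensions are illustrative assumptions, not the library's exact implementation.

import torch
import torch.nn as nn

class JointNetworkSketch(nn.Module):
    """Illustrative RNN-T joint: combines audio- and label-encoder outputs."""

    def __init__(self, encoder_dim: int = 512, num_classes: int = 100) -> None:
        super().__init__()
        self.fc = nn.Linear(encoder_dim * 2, num_classes)

    def forward(self, audio_enc: torch.Tensor, label_enc: torch.Tensor) -> torch.Tensor:
        # audio_enc: (batch, T, dim) from the audio (Transformer) encoder
        # label_enc: (batch, U, dim) from the label (Transformer) encoder
        T, U = audio_enc.size(1), label_enc.size(1)
        audio_enc = audio_enc.unsqueeze(2).expand(-1, -1, U, -1)  # (batch, T, U, dim)
        label_enc = label_enc.unsqueeze(1).expand(-1, T, -1, -1)  # (batch, T, U, dim)
        joint = torch.cat((audio_enc, label_enc), dim=-1)         # (batch, T, U, 2 * dim)
        return self.fc(joint).log_softmax(dim=-1)                 # log-probabilities per (t, u)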

Parameters
  • configs (DictConfig) – configuration set

  • tokenizer (Tokenizer) – tokenizer is in charge of preparing the inputs for a model.

Inputs:

inputs (torch.FloatTensor): An input sequence passed to the encoders. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

input_lengths (torch.LongTensor): The length of each input sequence. (batch)

Returns

Result of model predictions.

Return type

outputs (dict)
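
A hedged usage sketch follows. How configs and tokenizer are built depends on the surrounding Hydra/OmegaConf setup, so the model construction and forward call are only indicated in comments; the tensor shapes follow the description above.

import torch

inputs = torch.randn(4, 1200, 80)                        # (batch, seq_length, dimension)
input_lengths = torch.LongTensor([1200, 1000, 900, 800])  # valid length of each sequence

# model = TransformerTransducerModel(configs=configs, tokenizer=tokenizer)
# outputs = model(inputs, input_lengths)                  # dict of model predictions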

greedy_decode(encoder_outputs: torch.Tensor, max_length: int) → torch.Tensor[source]

Decode encoder_outputs.

Parameters
  • encoder_outputs (torch.FloatTensor) – An output sequence of the encoders. FloatTensor of size (seq_length, dimension)

  • max_length (int) – max decoding time step

Returns

model’s predictions.

Return type

y_hats (torch.IntTensor)
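
Conceptually, greedy decoding walks the encoder output frame by frame, feeds the last emitted token back through the label encoder, and emits the argmax label whenever it is not blank. The minimal, self-contained sketch below illustrates this; decoder_step, joint, and blank_id are illustrative assumptions, not the library's internal API.

import torch

def greedy_decode_sketch(encoder_outputs, decoder_step, joint, max_length, blank_id=0):
    # encoder_outputs: (seq_length, dimension); decoder_step and joint are assumed callables
    hypothesis, label, state = [], None, None
    for t in range(encoder_outputs.size(0)):
        if len(hypothesis) >= max_length:
            break
        dec_out, state = decoder_step(label, state)   # label-encoder (prediction network) step
        log_probs = joint(encoder_outputs[t], dec_out)
        pred = int(log_probs.argmax(dim=-1))
        if pred != blank_id:                          # emit only non-blank labels
            hypothesis.append(pred)
            label = pred
    return torch.IntTensor(hypothesis)                # y_hats

This simplified loop emits at most one label per frame; practical RNN-T implementations allow several non-blank emissions per time step.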

set_beam_decode(beam_size: int = 3, expand_beam: float = 2.3, state_beam: float = 4.6)[source]

Set beam search decoding.
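
A hedged usage note: after calling set_beam_decode, subsequent predictions are produced by beam search instead of greedy decoding. The exact inference flow depends on the surrounding pipeline, so the calls are shown as comments only.

# model = TransformerTransducerModel(configs=configs, tokenizer=tokenizer)
# model.set_beam_decode(beam_size=5, expand_beam=2.3, state_beam=4.6)
# outputs = model(inputs, input_lengths)   # predictions now come from beam search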

Transformer Transducer Configuration

class openspeech.models.transformer_transducer.configurations.TransformerTransducerConfigs(model_name: str = 'transformer_transducer', encoder_dim: int = 512, d_ff: int = 2048, num_audio_layers: int = 18, num_label_layers: int = 2, num_attention_heads: int = 8, audio_dropout_p: float = 0.1, label_dropout_p: float = 0.1, decoder_hidden_state_dim: int = 512, decoder_output_dim: int = 512, conv_kernel_size: int = 31, max_positional_length: int = 5000, optimizer: str = 'adam')[source]

This is the configuration class to store the configuration of a TransformerTransducer.

It is used to instantiate a TransformerTransducer model.

Configuration objects inherit from openspeech.dataclass.configs.OpenspeechDataclass.

Parameters
  • model_name (str) – Model name (default: transformer_transducer)

  • encoder_dim (int) – Dimension of encoder. (default: 512)

  • d_ff (int) – Dimension of feed forward network. (default: 2048)

  • num_attention_heads (int) – The number of attention heads. (default: 8)

  • num_audio_layers (int) – The number of audio layers. (default: 18)

  • num_label_layers (int) – The number of label layers. (default: 2)

  • audio_dropout_p (float) – The dropout probability of encoder. (default: 0.1)

  • label_dropout_p (float) – The dropout probability of decoder. (default: 0.1)

  • decoder_hidden_state_dim (int) – Hidden state dimension of decoder. (default: 512)

  • decoder_output_dim (int) – Dimension of model output. (default: 512)

  • conv_kernel_size (int) – Kernel size of convolution layer. (default: 31)

  • max_positional_length (int) – Max length of positional encoding. (default: 5000)

  • optimizer (str) – Optimizer for training. (default: adam)
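
The configuration dataclass can be turned into an OmegaConf DictConfig. The sketch below is illustrative; the field names are taken from the signature above, but how this config is nested into the full training configuration is an assumption.

from omegaconf import OmegaConf

from openspeech.models.transformer_transducer.configurations import TransformerTransducerConfigs

model_configs = OmegaConf.structured(TransformerTransducerConfigs())
model_configs.num_audio_layers = 12                          # override a default
print(model_configs.encoder_dim, model_configs.optimizer)    # 512 adam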