Encoders

Openspeech Encoder

class openspeech.encoders.openspeech_encoder.OpenspeechEncoder[source]

Base Interface of Openspeech Encoder.

Inputs:

  • inputs (torch.FloatTensor): An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor): The length of the input tensor. (batch)

count_parameters() → int[source]

Count the parameters of the encoder.

forward(inputs: torch.Tensor, input_lengths: torch.Tensor)[source]

Forward propagate for encoder training.

Inputs:

  • inputs (torch.FloatTensor): An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor): The length of the input tensor. (batch)

update_dropout(dropout_p: float) → None[source]

Update the dropout probability of the encoder.

Conformer Encoder

class openspeech.encoders.conformer_encoder.ConformerEncoder(num_classes: int, input_dim: int = 80, encoder_dim: int = 512, num_layers: int = 17, num_attention_heads: int = 8, feed_forward_expansion_factor: int = 4, conv_expansion_factor: int = 2, input_dropout_p: float = 0.1, feed_forward_dropout_p: float = 0.1, attention_dropout_p: float = 0.1, conv_dropout_p: float = 0.1, conv_kernel_size: int = 31, half_step_residual: bool = True, joint_ctc_attention: bool = True)[source]

Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively. Conformer achieves the best of both worlds by combining convolutional neural networks and transformers to model both the local and global dependencies of an audio sequence in a parameter-efficient way.

Parameters
  • num_classes (int) – Number of classes

  • input_dim (int, optional) – Dimension of input vector

  • encoder_dim (int, optional) – Dimension of conformer encoders

  • num_layers (int, optional) – Number of conformer blocks

  • num_attention_heads (int, optional) – Number of attention heads

  • feed_forward_expansion_factor (int, optional) – Expansion factor of feed forward module

  • conv_expansion_factor (int, optional) – Expansion factor of conformer convolution module

  • feed_forward_dropout_p (float, optional) – Probability of feed forward module dropout

  • attention_dropout_p (float, optional) – Probability of attention module dropout

  • conv_dropout_p (float, optional) – Probability of conformer convolution module dropout

  • conv_kernel_size (int or tuple, optional) – Size of the convolving kernel

  • half_step_residual (bool) – Flag indicating whether to use half-step residual connections

  • joint_ctc_attention (bool, optional) – Flag indicating whether to use joint CTC-attention training or not

Inputs: inputs, input_lengths
  • inputs (batch, time, dim): Tensor containing input vector

  • input_lengths (batch): list of sequence input lengths

Returns: outputs, output_lengths
  • outputs (batch, out_channels, time): Tensor produced by the Conformer encoder.

  • output_lengths (batch): list of sequence output lengths

Reference:

Anmol Gulati et al: Conformer: Convolution-augmented Transformer for Speech Recognition https://arxiv.org/abs/2005.08100

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Forward propagate inputs for encoder training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of the input tensor. (batch)

Returns

  • outputs: An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • encoder_logits: Log probabilities of the encoder outputs, to be passed to the CTC loss.

    If joint_ctc_attention is False, returns None.

  • output_lengths: The lengths of the encoder outputs. (batch)

Return type

(Tensor, Tensor, Tensor)
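
Example:

A minimal usage sketch assuming the import path shown above; batch size, sequence length, and num_classes are arbitrary, and the unpacking order follows the Returns section.

    import torch
    from openspeech.encoders.conformer_encoder import ConformerEncoder

    encoder = ConformerEncoder(
        num_classes=10,        # illustrative vocabulary size
        input_dim=80,          # e.g. 80-dim filterbank features
        encoder_dim=512,
        num_layers=17,
        joint_ctc_attention=True,
    )

    inputs = torch.randn(4, 200, 80)                          # (batch, seq_length, dimension)
    input_lengths = torch.full((4,), 200, dtype=torch.long)   # (batch)

    outputs, encoder_logits, output_lengths = encoder(inputs, input_lengths)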

ContextNet Encoder

class openspeech.encoders.contextnet_encoder.ContextNetEncoder(num_classes: int, model_size: str = 'medium', input_dim: int = 80, num_layers: int = 5, kernel_size: int = 5, num_channels: int = 256, output_dim: int = 640, joint_ctc_attention: bool = False)[source]

ContextNetEncoder converts the input into higher-level feature representations through 23 convolution blocks.

Parameters
  • num_classes (int) – Number of classes

  • model_size (str, optional) – Size of the model[‘small’, ‘medium’, ‘large’] (default : ‘medium’)

  • input_dim (int, optional) – Dimension of input vector (default : 80)

  • num_layers (int, optional) – The number of convolutional layers (default : 5)

  • kernel_size (int, optional) – Value of convolution kernel size (default : 5)

  • num_channels (int, optional) – The number of channels in the convolution filter (default: 256)

  • output_dim (int, optional) – Dimension of encoder output vector (default: 640)

  • joint_ctc_attention (bool, optional) – Flag indicating whether to use joint CTC-attention or not

Inputs: inputs, input_lengths
  • inputs: Parsed audio input. FloatTensor of size (batch, seq_length, dimension)

  • input_lengths: Tensor representing the sequence lengths of the inputs (batch)

Returns: output, encoder_logits, output_lengths
  • output: Tensor of encoder outputs. FloatTensor of size

    (batch, seq_length, dimension)

  • encoder_logits: Log probabilities of the encoder outputs, to be passed to the CTC loss.

    If joint_ctc_attention is False, returns None.

  • output_lengths: Tensor representing the lengths of the encoder outputs (batch)

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Forward propagate inputs through the audio encoder.

Parameters
  • inputs (torch.FloatTensor) – Parsed audio input. FloatTensor of size (batch, seq_length, dimension)

  • input_lengths (torch.LongTensor) – Tensor representing the sequence lengths of the inputs. LongTensor of size (batch)

Returns

  • output (torch.FloatTensor): Tensor of encoder outputs. FloatTensor of size

    (batch, seq_length, dimension)

  • encoder_logits (torch.FloatTensor): Log probabilities of the encoder outputs, to be passed to the CTC loss.

    If joint_ctc_attention is False, returns None.

  • output_lengths (torch.LongTensor): Tensor representing the lengths of the encoder outputs.

    LongTensor of size (batch)

Return type

(Tensor, Tensor, Tensor)
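
Example:

A minimal usage sketch assuming the import path shown above; sizes are arbitrary, and with joint_ctc_attention=False the encoder_logits value is None per the Returns section.

    import torch
    from openspeech.encoders.contextnet_encoder import ContextNetEncoder

    encoder = ContextNetEncoder(
        num_classes=10,        # illustrative
        model_size='medium',
        input_dim=80,
        joint_ctc_attention=False,
    )

    inputs = torch.randn(4, 300, 80)                          # (batch, seq_length, dimension)
    input_lengths = torch.full((4,), 300, dtype=torch.long)   # (batch)

    output, encoder_logits, output_lengths = encoder(inputs, input_lengths)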

Convolutional LSTM Encoder

class openspeech.encoders.convolutional_lstm_encoder.ConvolutionalLSTMEncoder(input_dim: int, num_classes: int = None, hidden_state_dim: int = 512, dropout_p: float = 0.3, num_layers: int = 3, bidirectional: bool = True, rnn_type: str = 'lstm', extractor: str = 'vgg', conv_activation: str = 'hardtanh', joint_ctc_attention: bool = False)[source]

Converts low-level speech signals into higher-level features with a convolutional extractor.

Parameters
  • input_dim (int) – dimension of input vector

  • num_classes (int) – number of classes

  • hidden_state_dim (int) – the number of features in the encoders hidden state h

  • num_layers (int, optional) – number of recurrent layers (default: 3)

  • bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)

  • extractor (str) – type of CNN extractor (default: vgg)

  • conv_activation (str) – activation function of convolutional extractor (default: hardtanh)

  • rnn_type (str, optional) – type of RNN cell (default: lstm)

  • dropout_p (float, optional) – dropout probability of the encoder (default: 0.3)

  • joint_ctc_attention (bool, optional) – Flag indicating whether to use joint CTC-attention or not

Inputs: inputs, input_lengths
  • inputs: list of sequences, whose length is the batch size and within which each sequence is list of tokens

  • input_lengths: list of sequence lengths

Returns: encoder_outputs, encoder_log_probs, output_lengths
  • encoder_outputs: tensor containing the encoded features of the input sequence

  • encoder_log_probs: tensor containing log probabilities for the encoder-only loss

  • output_lengths: list of sequence lengths produced by Listener

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Forward propagate inputs for encoder training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of the input tensor. (batch)

Returns

  • outputs: An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • encoder_logits: Log probabilities of the encoder outputs, to be passed to the CTC loss.

    If joint_ctc_attention is False, returns None.

  • encoder_output_lengths: The lengths of the encoder outputs. (batch)

Return type

(Tensor, Tensor, Tensor)
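
Example:

A minimal usage sketch assuming the import path shown above; sizes are arbitrary, and the unpacking order follows the Returns section.

    import torch
    from openspeech.encoders.convolutional_lstm_encoder import ConvolutionalLSTMEncoder

    encoder = ConvolutionalLSTMEncoder(
        input_dim=80,
        num_classes=10,        # illustrative; used for the CTC projection when joint_ctc_attention=True
        extractor='vgg',
        joint_ctc_attention=True,
    )

    inputs = torch.randn(4, 400, 80)                          # (batch, seq_length, dimension)
    input_lengths = torch.full((4,), 400, dtype=torch.long)   # (batch)

    outputs, encoder_logits, encoder_output_lengths = encoder(inputs, input_lengths)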

Convolutional Transformer Encoder

class openspeech.encoders.convolutional_transformer_encoder.ConvolutionalTransformerEncoder(num_classes: int, input_dim: int, extractor: str = 'vgg', d_model: int = 512, d_ff: int = 2048, num_layers: int = 6, num_heads: int = 8, dropout_p: float = 0.3, conv_activation: str = 'relu', joint_ctc_attention: bool = False)[source]

The TransformerEncoder is composed of a stack of N identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.

Parameters
  • input_dim – dimension of feature vector

  • extractor (str) – convolutional extractor

  • d_model – dimension of model (default: 512)

  • d_ff – dimension of feed forward network (default: 2048)

  • num_layers – number of encoders layers (default: 6)

  • num_heads – number of attention heads (default: 8)

  • dropout_p (float, optional) – probability of dropout (default: 0.3)

  • conv_activation (str, optional) – activation function of convolutional extractor (default: relu)

  • joint_ctc_attention (bool, optional) – Flag indicating whether to use joint CTC-attention or not (default: False)

Inputs:
  • inputs: list of sequences, whose length is the batch size and within which each sequence is list of tokens

  • input_lengths: list of sequence lengths

Returns

  • outputs: An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • encoder_logits: Log probabilities of the encoder outputs, to be passed to the CTC loss.

    If joint_ctc_attention is False, returns None.

  • output_lengths: The lengths of the encoder outputs. (batch)

Return type

(Tensor, Tensor, Tensor)

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Forward propagate inputs for encoder training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of the input tensor. (batch)

Returns

  • outputs: An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • encoder_logits: Log probabilities of the encoder outputs, to be passed to the CTC loss.

    If joint_ctc_attention is False, returns None.

  • output_lengths: The lengths of the encoder outputs. (batch)

Return type

(Tensor, Tensor, Tensor)
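
Example:

A minimal usage sketch assuming the import path shown above; sizes are arbitrary.

    import torch
    from openspeech.encoders.convolutional_transformer_encoder import ConvolutionalTransformerEncoder

    encoder = ConvolutionalTransformerEncoder(
        num_classes=10,        # illustrative
        input_dim=80,
        extractor='vgg',
        d_model=512,
        num_layers=6,
        joint_ctc_attention=False,
    )

    inputs = torch.randn(4, 400, 80)                          # (batch, seq_length, dimension)
    input_lengths = torch.full((4,), 400, dtype=torch.long)   # (batch)

    outputs, encoder_logits, output_lengths = encoder(inputs, input_lengths)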

DeepSpeech2

class openspeech.encoders.deepspeech2.DeepSpeech2(input_dim: int, num_classes: int, rnn_type='gru', num_rnn_layers: int = 5, rnn_hidden_dim: int = 512, dropout_p: float = 0.1, bidirectional: bool = True, activation: str = 'hardtanh')[source]

DeepSpeech2 is a speech recognition model based on Baidu's Deep Speech 2 architecture. It is trained with CTC loss.

Parameters
  • input_dim (int) – dimension of input vector

  • num_classes (int) – number of classes

  • rnn_type (str, optional) – type of RNN cell (default: gru)

  • num_rnn_layers (int, optional) – number of recurrent layers (default: 5)

  • rnn_hidden_dim (int) – the number of features in the hidden state h

  • dropout_p (float, optional) – dropout probability (default: 0.1)

  • bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)

  • activation (str) – type of activation function (default: hardtanh)

Inputs: inputs, input_lengths
  • inputs: list of sequences, whose length is the batch size and within which each sequence is list of tokens

  • input_lengths: list of sequence lengths

Returns

  • predicted_log_probs (torch.FloatTensor): Log probabilities of the model predictions.

  • output_lengths (torch.LongTensor): The length of output tensor (batch)

Return type

(Tensor, Tensor)

Reference:

Dario Amodei et al.: Deep Speech 2: End-to-End Speech Recognition in English and Mandarin https://arxiv.org/abs/1512.02595

count_parameters() → int[source]

Count the parameters of the encoder.

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs for encoder-only training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of the input tensor. (batch)

Returns

  • predicted_log_probs (torch.FloatTensor): Log probabilities of the model predictions.

  • output_lengths (torch.LongTensor): The length of output tensor (batch)

Return type

(Tensor, Tensor)

update_dropout(dropout_p: float) → None[source]

Update the dropout probability of the encoder.
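
Example:

A minimal usage sketch assuming the import path shown above; sizes are arbitrary.

    import torch
    from openspeech.encoders.deepspeech2 import DeepSpeech2

    model = DeepSpeech2(
        input_dim=80,
        num_classes=10,        # illustrative
        rnn_type='gru',
        num_rnn_layers=5,
        bidirectional=True,
    )

    inputs = torch.randn(4, 300, 80)                          # (batch, seq_length, dimension)
    input_lengths = torch.full((4,), 300, dtype=torch.long)   # (batch)

    predicted_log_probs, output_lengths = model(inputs, input_lengths)
    print(model.count_parameters())                           # parameter count, per count_parameters() above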

Jasper

class openspeech.encoders.jasper.Jasper(configs: omegaconf.dictconfig.DictConfig, input_dim: int, num_classes: int)[source]

Jasper (Just Another Speech Recognizer) is an ASR model composed of 54 layers, proposed by NVIDIA. Jasper achieved a sub-3% word error rate (WER) on the LibriSpeech dataset.

Parameters
  • configs (DictConfig) – hydra configuration set.

  • input_dim (int) – dimension of input vector

  • num_classes (int) – number of classes

  • version (str) – version of jasper, marked as BxR: B - number of blocks, R - number of sub-blocks

Inputs: inputs, input_lengths
  • inputs: tensor contains input sequence vector

  • input_lengths: tensor contains sequence lengths

Returns

  • outputs (torch.FloatTensor): Log probability of model predictions. (batch, seq_length, num_classes)

  • output_lengths (torch.LongTensor): The length of output tensor (batch)

Return type

(Tensor, Tensor)

Reference:

Jason Li. et al.: Jasper: An End-to-End Convolutional Neural Acoustic Model https://arxiv.org/pdf/1904.03288.pdf

count_parameters() → int[source]

Count the parameters of the model.

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs for encoder-only training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of the input tensor. (batch)

Returns

  • outputs (torch.FloatTensor): Log probability of model predictions. (batch, seq_length, num_classes)

  • output_lengths (torch.LongTensor): The length of output tensor (batch)

Return type

(Tensor, Tensor)

update_dropout(dropout_p: float) → None[source]

Update the dropout probability of the model.
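
Example:

A sketch only: Jasper is driven by a hydra DictConfig whose schema is defined by the project's configuration files and is not reproduced here, so configs is assumed to be loaded elsewhere (e.g. with OmegaConf); sizes and num_classes are arbitrary.

    import torch
    from omegaconf import DictConfig
    from openspeech.encoders.jasper import Jasper

    def run_jasper(configs: DictConfig) -> None:
        model = Jasper(configs=configs, input_dim=80, num_classes=10)
        inputs = torch.randn(4, 300, 80)                          # (batch, seq_length, dimension)
        input_lengths = torch.full((4,), 300, dtype=torch.long)   # (batch)
        outputs, output_lengths = model(inputs, input_lengths)    # log probabilities and lengths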

LSTM Encoder

class openspeech.encoders.lstm_encoder.LSTMEncoder(input_dim: int, num_classes: int = None, hidden_state_dim: int = 512, dropout_p: float = 0.3, num_layers: int = 3, bidirectional: bool = True, rnn_type: str = 'lstm', joint_ctc_attention: bool = False)[source]

Converts low-level speech signals into higher-level features.

Parameters
  • input_dim (int) – dimension of input vector

  • num_classes (int) – number of classes

  • hidden_state_dim (int) – the number of features in the encoders hidden state h

  • num_layers (int, optional) – number of recurrent layers (default: 3)

  • bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)

  • rnn_type (str, optional) – type of RNN cell (default: lstm)

  • dropout_p (float, optional) – dropout probability of the encoder (default: 0.3)

  • joint_ctc_attention (bool, optional) – Flag indicating whether to use joint CTC-attention or not

Inputs: inputs, input_lengths
  • inputs: list of sequences, whose length is the batch size and within which each sequence is list of tokens

  • input_lengths: list of sequence lengths

Returns

  • outputs: An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • encoder_logits: Log probabilities of the encoder outputs, to be passed to the CTC loss.

    If joint_ctc_attention is False, returns None.

  • encoder_output_lengths: The lengths of the encoder outputs. (batch)

Return type

(Tensor, Tensor, Tensor)

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Forward propagate inputs for encoder training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of the input tensor. (batch)

Returns

  • outputs: An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • encoder_logits: Log probabilities of the encoder outputs, to be passed to the CTC loss.

    If joint_ctc_attention is False, returns None.

  • encoder_output_lengths: The lengths of the encoder outputs. (batch)

Return type

(Tensor, Tensor, Tensor)
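
Example:

A minimal usage sketch assuming the import path shown above; sizes are arbitrary, and encoder_logits is None when joint_ctc_attention=False.

    import torch
    from openspeech.encoders.lstm_encoder import LSTMEncoder

    encoder = LSTMEncoder(
        input_dim=80,
        num_classes=10,        # illustrative; only used when joint_ctc_attention=True
        hidden_state_dim=512,
        num_layers=3,
        bidirectional=True,
        joint_ctc_attention=False,
    )

    inputs = torch.randn(4, 300, 80)                          # (batch, seq_length, dimension)
    input_lengths = torch.full((4,), 300, dtype=torch.long)   # (batch)

    outputs, encoder_logits, encoder_output_lengths = encoder(inputs, input_lengths)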

QuartzNet

class openspeech.encoders.quartznet.QuartzNet(configs: omegaconf.dictconfig.DictConfig, input_dim: int, num_classes: int)[source]

QuartzNet is a fully convolutional automatic speech recognition model. The model is composed of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers. It is trained with CTC loss.

Parameters
  • configs (DictConfig) – hydra configuration set.

  • input_dim (int) – dimension of input.

  • num_classes (int) – number of classes.

Inputs:

  • inputs (torch.FloatTensor): An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor): The length of the input tensor. (batch)

Returns

  • outputs (torch.FloatTensor): Log probability of model predictions. (batch, seq_length, num_classes)

  • output_lengths (torch.LongTensor): The length of output tensor (batch)

Return type

(Tensor, Tensor)

Reference:

Samuel Kriman et al.: QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions https://arxiv.org/abs/1910.10261

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs for encoder-only training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of the input tensor. (batch)

Returns

  • outputs (torch.FloatTensor): Log probability of model predictions. (batch, seq_length, num_classes)

  • output_lengths (torch.LongTensor): The length of output tensor (batch)

Return type

(Tensor, Tensor)
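
Example:

A sketch only: like Jasper, QuartzNet is constructed from a hydra DictConfig whose schema comes from the project's configuration files, so configs is assumed to be loaded elsewhere; sizes and num_classes are arbitrary.

    import torch
    from omegaconf import DictConfig
    from openspeech.encoders.quartznet import QuartzNet

    def run_quartznet(configs: DictConfig) -> None:
        model = QuartzNet(configs=configs, input_dim=80, num_classes=10)
        inputs = torch.randn(4, 300, 80)                          # (batch, seq_length, dimension)
        input_lengths = torch.full((4,), 300, dtype=torch.long)   # (batch)
        outputs, output_lengths = model(inputs, input_lengths)    # (batch, seq_length, num_classes), (batch)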

RNN Transducer Encoder

class openspeech.encoders.rnn_transducer_encoder.RNNTransducerEncoder(input_dim: int, hidden_state_dim: int, output_dim: int, num_layers: int, rnn_type: str = 'lstm', dropout_p: float = 0.2, bidirectional: bool = True)[source]

Encoder of RNN-Transducer.

Parameters
  • input_dim (int) – dimension of input vector

  • hidden_state_dim (int) – hidden state dimension of the encoder

  • output_dim (int) – output dimension of the encoder and decoder

  • num_layers (int) – number of encoder layers

  • rnn_type (str, optional) – type of RNN cell (default: lstm)

  • dropout_p (float, optional) – dropout probability of the encoder (default: 0.2)

  • bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)

Inputs: inputs, input_lengths

  • inputs (torch.FloatTensor): An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor): The length of the input tensor. (batch)

Returns

(Tensor, Tensor)

  • outputs (torch.FloatTensor): An output sequence of the encoder. FloatTensor of size

    (batch, seq_length, dimension)

  • hidden_states (torch.FloatTensor): A hidden state of the encoder. FloatTensor of size

    (batch, seq_length, dimension)

Reference:

A. Graves: Sequence Transduction with Recurrent Neural Networks https://arxiv.org/abs/1211.3711

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs for encoder training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of the input tensor. (batch)

Returns

(Tensor, Tensor)

  • outputs (torch.FloatTensor): An output sequence of the encoder. FloatTensor of size

    (batch, seq_length, dimension)

  • output_lengths (torch.LongTensor): The length of output tensor. (batch)
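
Example:

A minimal usage sketch assuming the import path shown above; hidden_state_dim, output_dim, and num_layers are required by the signature, and the values here are arbitrary.

    import torch
    from openspeech.encoders.rnn_transducer_encoder import RNNTransducerEncoder

    encoder = RNNTransducerEncoder(
        input_dim=80,
        hidden_state_dim=320,
        output_dim=512,
        num_layers=4,
        rnn_type='lstm',
        bidirectional=True,
    )

    inputs = torch.randn(4, 300, 80)                          # (batch, seq_length, dimension)
    input_lengths = torch.full((4,), 300, dtype=torch.long)   # (batch)

    outputs, output_lengths = encoder(inputs, input_lengths)  # unpacking follows the forward Returns above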

Transformer Encoder

class openspeech.encoders.transformer_encoder.TransformerEncoder(num_classes: int, input_dim: int = 80, d_model: int = 512, d_ff: int = 2048, num_layers: int = 6, num_heads: int = 8, dropout_p: float = 0.3, joint_ctc_attention: bool = False)[source]

The TransformerEncoder is composed of a stack of N identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.

Parameters
  • input_dim – dimension of feature vector

  • d_model – dimension of model (default: 512)

  • d_ff – dimension of feed forward network (default: 2048)

  • num_layers – number of encoders layers (default: 6)

  • num_heads – number of attention heads (default: 8)

  • dropout_p – probability of dropout (default: 0.3)

  • joint_ctc_attention (bool, optional) – Flag indicating whether to use joint CTC-attention or not

Inputs:
  • inputs: list of sequences, whose length is the batch size and within which each sequence is list of tokens

  • input_lengths: list of sequence lengths

Returns

  • outputs: An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • encoder_logits: Log probabilities of the encoder outputs, to be passed to the CTC loss. (batch, seq_length, num_classes)

    If joint_ctc_attention is False, returns None.

  • output_lengths: The lengths of the encoder outputs. (batch)

Return type

(Tensor, Tensor, Tensor)

Reference:

Ashish Vaswani et al.: Attention Is All You Need https://arxiv.org/abs/1706.03762

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Forward propagate inputs for encoder training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of the input tensor. (batch)

Returns

  • outputs: An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • encoder_logits: Log probabilities of the encoder outputs, to be passed to the CTC loss. (batch, seq_length, num_classes)

    If joint_ctc_attention is False, returns None.

  • output_lengths: The lengths of the encoder outputs. (batch)

Return type

(Tensor, Tensor, Tensor)
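
Example:

A minimal usage sketch assuming the import path shown above; sizes are arbitrary.

    import torch
    from openspeech.encoders.transformer_encoder import TransformerEncoder

    encoder = TransformerEncoder(
        num_classes=10,        # illustrative
        input_dim=80,
        d_model=512,
        num_layers=6,
        num_heads=8,
        joint_ctc_attention=False,
    )

    inputs = torch.randn(4, 300, 80)                          # (batch, seq_length, dimension)
    input_lengths = torch.full((4,), 300, dtype=torch.long)   # (batch)

    outputs, encoder_logits, output_lengths = encoder(inputs, input_lengths)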

class openspeech.encoders.transformer_encoder.TransformerEncoderLayer(d_model: int = 512, num_heads: int = 8, d_ff: int = 2048, dropout_p: float = 0.3)[source]

The encoder layer is made up of self-attention and a feed-forward network. This standard encoder layer is based on the paper “Attention Is All You Need”.

Parameters
  • d_model – dimension of model (default: 512)

  • num_heads – number of attention heads (default: 8)

  • d_ff – dimension of feed forward network (default: 2048)

  • dropout_p – probability of dropout (default: 0.3)

Inputs:

  • inputs (torch.FloatTensor): input sequence of the transformer encoder layer

  • self_attn_mask (torch.BoolTensor): mask of self attention

Returns

(Tensor, Tensor)

  • outputs (torch.FloatTensor): output of transformer encoder layer

  • attn (torch.FloatTensor): attention of transformer encoder layer

forward(inputs: torch.Tensor, self_attn_mask: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs through the transformer encoder layer.

Inputs:

  • inputs (torch.FloatTensor): input sequence of the transformer encoder layer

  • self_attn_mask (torch.BoolTensor): mask of self attention

Returns

  • outputs (torch.FloatTensor): output of the transformer encoder layer

  • attn (torch.FloatTensor): attention of the transformer encoder layer

Return type

(Tensor, Tensor)
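
Example:

A minimal sketch of a single layer applied to an already-projected (batch, seq_length, d_model) tensor; the self-attention mask is omitted here.

    import torch
    from openspeech.encoders.transformer_encoder import TransformerEncoderLayer

    layer = TransformerEncoderLayer(d_model=512, num_heads=8, d_ff=2048, dropout_p=0.3)

    hidden = torch.randn(4, 100, 512)        # (batch, seq_length, d_model)
    outputs, attn = layer(hidden)            # self_attn_mask defaults to None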

Transformer Transducer Encoder

class openspeech.encoders.transformer_transducer_encoder.TransformerTransducerEncoder(input_size: int = 80, model_dim: int = 512, d_ff: int = 2048, num_layers: int = 18, num_heads: int = 8, dropout: float = 0.1, max_positional_length: int = 5000)[source]

Converts the audio signal into higher-level feature representations.

Parameters
  • input_size (int) – dimension of input vector (default : 80)

  • model_dim (int) – the number of features in the audio encoder (default : 512)

  • d_ff (int) – the number of features in the feed forward layers (default : 2048)

  • num_layers (int) – the number of audio encoder layers (default: 18)

  • num_heads (int) – the number of heads in the multi-head attention (default: 8)

  • dropout (float) – dropout probability of audio encoder (default: 0.1)

  • max_positional_length (int) – Maximum length to use for positional encoding (default : 5000)

Inputs: inputs, inputs_lens
  • inputs: Parsed audio input of the batch

  • inputs_lens: Tensor of sequence lengths

Returns

  • outputs (torch.FloatTensor): (batch, seq_length, dimension)

  • input_lengths (torch.LongTensor): (batch)

Return type

(Tensor, Tensor)

Reference:

Qian Zhang et al.: Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss https://arxiv.org/abs/2002.02562

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs through the audio encoder.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the audio encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of the input tensor. (batch)

Returns

  • outputs (Tensor): (batch, seq_length, dimension)

  • input_lengths (Tensor): (batch)

Return type

(Tensor, Tensor)
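
Example:

A minimal usage sketch assuming the import path shown above; sizes are arbitrary, and the second returned value is the sequence lengths per the Returns section.

    import torch
    from openspeech.encoders.transformer_transducer_encoder import TransformerTransducerEncoder

    encoder = TransformerTransducerEncoder(
        input_size=80,
        model_dim=512,
        num_layers=18,
        num_heads=8,
    )

    inputs = torch.randn(4, 300, 80)                          # (batch, seq_length, dimension)
    input_lengths = torch.full((4,), 300, dtype=torch.long)   # (batch)

    outputs, output_lengths = encoder(inputs, input_lengths)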

class openspeech.encoders.transformer_transducer_encoder.TransformerTransducerEncoderLayer(model_dim: int = 512, d_ff: int = 2048, num_heads: int = 8, dropout: float = 0.1)[source]

Repeated layers common to audio encoders and label encoders

Parameters
  • model_dim (int) – the number of features in the encoder (default : 512)

  • d_ff (int) – the number of features in the feed forward layers (default : 2048)

  • num_heads (int) – the number of heads in the multi-head attention (default: 8)

  • dropout (float) – dropout probability of encoder layer (default: 0.1)

Inputs: inputs, self_attn_mask
  • inputs: Audio feature or label feature

  • self_attn_mask: Self attention mask to use in multi-head attention

Returns: outputs, attn_distribution

(Tensor, Tensor)

  • outputs (torch.FloatTensor): Tensor containing higher (audio, label) feature values

  • attn_distribution (torch.FloatTensor): Attention distribution in multi-head attention

forward(inputs: torch.Tensor, self_attn_mask: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs through the encoder layer.

Parameters
  • inputs – An input sequence passed to the encoder layer. (batch, seq_length, dimension)

  • self_attn_mask – Self attention mask to cover up padding (batch, seq_length, seq_length)

Returns

  • outputs (Tensor): (batch, seq_length, dimension)

  • attn_distribution (Tensor): (batch, seq_length, seq_length)

Return type

(Tensor, Tensor)
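
Example:

A minimal sketch of one shared encoder layer applied to (batch, seq_length, model_dim) features; a padding mask may be passed via self_attn_mask.

    import torch
    from openspeech.encoders.transformer_transducer_encoder import TransformerTransducerEncoderLayer

    layer = TransformerTransducerEncoderLayer(model_dim=512, d_ff=2048, num_heads=8, dropout=0.1)

    features = torch.randn(4, 100, 512)                  # audio or label features
    outputs, attn_distribution = layer(features)         # self_attn_mask defaults to None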