Encoders

Openspeech Encoder

class openspeech.encoders.openspeech_encoder.OpenspeechEncoder[source]

Base Interface of Openspeech Encoder.

Inputs:

  • inputs (torch.FloatTensor): An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor): The length of the input tensor. (batch)

count_parameters() → int[source]

Count the parameters of the encoder.

forward(inputs: torch.Tensor, input_lengths: torch.Tensor)[source]

Forward propagate for encoder training.

Inputs:

  • inputs (torch.FloatTensor): An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor): The length of the input tensor. (batch)

update_dropout(dropout_p: float) → None[source]

Update the dropout probability of the encoder.

Conformer Encoder

class openspeech.encoders.conformer_encoder.ConformerEncoder(num_classes: int, input_dim: int = 80, encoder_dim: int = 512, num_layers: int = 17, num_attention_heads: int = 8, feed_forward_expansion_factor: int = 4, conv_expansion_factor: int = 2, input_dropout_p: float = 0.1, feed_forward_dropout_p: float = 0.1, attention_dropout_p: float = 0.1, conv_dropout_p: float = 0.1, conv_kernel_size: int = 31, half_step_residual: bool = True, joint_ctc_attention: bool = True)[source]

Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively. Conformer achieves the best of both worlds by combining convolutional neural networks and transformers to model both the local and global dependencies of an audio sequence in a parameter-efficient way.

Parameters
  • num_classes (int) – Number of classes

  • input_dim (int, optional) – Dimension of input vector

  • encoder_dim (int, optional) – Dimension of conformer encoders

  • num_layers (int, optional) – Number of conformer blocks

  • num_attention_heads (int, optional) – Number of attention heads

  • feed_forward_expansion_factor (int, optional) – Expansion factor of feed forward module

  • conv_expansion_factor (int, optional) – Expansion factor of conformer convolution module

  • feed_forward_dropout_p (float, optional) – Probability of feed forward module dropout

  • attention_dropout_p (float, optional) – Probability of attention module dropout

  • conv_dropout_p (float, optional) – Probability of conformer convolution module dropout

  • conv_kernel_size (int or tuple, optional) – Size of the convolving kernel

  • half_step_residual (bool) – Flag indicating whether to use half-step residual connections

  • joint_ctc_attention (bool, optional) – Flag indicating whether to use joint CTC-attention training or not

Inputs: inputs, input_lengths
  • inputs (batch, time, dim): Tensor containing input vector

  • input_lengths (batch): list of sequence input lengths

Returns: outputs, output_lengths
  • outputs (batch, out_channels, time): Tensor produced by the Conformer encoder.

  • output_lengths (batch): list of sequence output lengths

Reference:

Anmol Gulati et al: Conformer: Convolution-augmented Transformer for Speech Recognition https://arxiv.org/abs/2005.08100

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Forward propagate inputs for encoder training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of the input tensor. (batch)

Returns

  • outputs: An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • encoder_logits: Log probabilities of the encoder outputs, to be passed to the CTC loss.

    If joint_ctc_attention is False, returns None.

  • output_lengths: The lengths of the encoder outputs. (batch)

Return type

(Tensor, Tensor, Tensor)
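
Example:

A minimal usage sketch assuming the import path shown above; batch size, sequence length, and num_classes are arbitrary, and the unpacking order follows the Returns section.

    import torch
    from openspeech.encoders.conformer_encoder import ConformerEncoder

    encoder = ConformerEncoder(
        num_classes=10,        # illustrative vocabulary size
        input_dim=80,          # e.g. 80-dim filterbank features
        encoder_dim=512,
        num_layers=17,
        joint_ctc_attention=True,
    )

    inputs = torch.randn(4, 200, 80)                          # (batch, seq_length, dimension)
    input_lengths = torch.full((4,), 200, dtype=torch.long)   # (batch)

    outputs, encoder_logits, output_lengths = encoder(inputs, input_lengths)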

ContextNet Encoder

class openspeech.encoders.contextnet_encoder.ContextNetEncoder(num_classes: int, model_size: str = 'medium', input_dim: int = 80, num_layers: int = 5, kernel_size: int = 5, num_channels: int = 256, output_dim: int = 640, joint_ctc_attention: bool = False)[source]

ContextNetEncoder converts the input into higher-level feature representations through 23 convolution blocks.

Parameters
  • num_classes (int) – Number of classes

  • model_size (str, optional) – Size of the model[‘small’, ‘medium’, ‘large’] (default : ‘medium’)

  • input_dim (int, optional) – Dimension of input vector (default : 80)

  • num_layers (int, optional) – The number of convolutional layers (default : 5)

  • kernel_size (int, optional) – Value of convolution kernel size (default : 5)

  • num_channels (int, optional) – The number of channels in the convolution filter (default: 256)

  • output_dim (int, optional) – Dimension of encoder output vector (default: 640)

  • joint_ctc_attention (bool, optional) – Flag indicating whether to use joint CTC-attention or not

Inputs: inputs, input_lengths
  • inputs: Parsed audio input. FloatTensor of size (batch, seq_length, dimension)

  • input_lengths: Tensor representing the sequence lengths of the inputs (batch)

Returns: output, encoder_logits, output_lengths
  • output: Tensor of encoder outputs. FloatTensor of size

    (batch, seq_length, dimension)

  • encoder_logits: Log probabilities of the encoder outputs, to be passed to the CTC loss.

    If joint_ctc_attention is False, returns None.

  • output_lengths: Tensor representing the lengths of the encoder outputs (batch)

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Forward propagate inputs through the audio encoder.

Parameters
  • inputs (torch.FloatTensor) – Parsed audio input. FloatTensor of size (batch, seq_length, dimension)

  • input_lengths (torch.LongTensor) – Tensor representing the sequence lengths of the inputs. LongTensor of size (batch)

Returns

  • output (torch.FloatTensor): Tensor of encoder outputs. FloatTensor of size

    (batch, seq_length, dimension)

  • encoder_logits (torch.FloatTensor): Log probabilities of the encoder outputs, to be passed to the CTC loss.

    If joint_ctc_attention is False, returns None.

  • output_lengths (torch.LongTensor): Tensor representing the lengths of the encoder outputs.

    LongTensor of size (batch)

Return type

(Tensor, Tensor, Tensor)
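
Example:

A minimal usage sketch assuming the import path shown above; sizes are arbitrary, and with joint_ctc_attention=False the encoder_logits value is None per the Returns section.

    import torch
    from openspeech.encoders.contextnet_encoder import ContextNetEncoder

    encoder = ContextNetEncoder(
        num_classes=10,        # illustrative
        model_size='medium',
        input_dim=80,
        joint_ctc_attention=False,
    )

    inputs = torch.randn(4, 300, 80)                          # (batch, seq_length, dimension)
    input_lengths = torch.full((4,), 300, dtype=torch.long)   # (batch)

    output, encoder_logits, output_lengths = encoder(inputs, input_lengths)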

Convolutional LSTM Encoder

class openspeech.encoders.convolutional_lstm_encoder.ConvolutionalLSTMEncoder(input_dim: int, num_classes: int = None, hidden_state_dim: int = 512, dropout_p: float = 0.3, num_layers: int = 3, bidirectional: bool = True, rnn_type: str = 'lstm', extractor: str = 'vgg', conv_activation: str = 'hardtanh', joint_ctc_attention: bool = False)[source]

Converts low-level speech signals into higher-level features with a convolutional extractor.

Parameters
  • input_dim (int) – dimension of input vector

  • num_classes (int) – number of classes

  • hidden_state_dim (int) – the number of features in the encoders hidden state h

  • num_layers (int, optional) – number of recurrent layers (default: 3)

  • bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)

  • extractor (str) – type of CNN extractor (default: vgg)

  • conv_activation (str) – activation function of convolutional extractor (default: hardtanh)

  • rnn_type (str, optional) – type of RNN cell (default: lstm)

  • dropout_p (float, optional) – dropout probability of the encoder (default: 0.3)

  • joint_ctc_attention (bool, optional) – Flag indicating whether to use joint CTC-attention or not

Inputs: inputs, input_lengths
  • inputs: list of sequences, whose length is the batch size and within which each sequence is list of tokens

  • input_lengths: list of sequence lengths

Returns: encoder_outputs, encoder_log_probs, output_lengths
  • encoder_outputs: tensor containing the encoded features of the input sequence

  • encoder_log_probs: tensor containing log probabilities for the encoder-only loss

  • output_lengths: list of sequence lengths produced by Listener

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Forward propagate inputs for encoder training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of the input tensor. (batch)

Returns

  • outputs: An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • encoder_logits: Log probabilities of the encoder outputs, to be passed to the CTC loss.

    If joint_ctc_attention is False, returns None.

  • encoder_output_lengths: The lengths of the encoder outputs. (batch)

Return type

(Tensor, Tensor, Tensor)
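
Example:

A minimal usage sketch assuming the import path shown above; sizes are arbitrary, and the unpacking order follows the Returns section.

    import torch
    from openspeech.encoders.convolutional_lstm_encoder import ConvolutionalLSTMEncoder

    encoder = ConvolutionalLSTMEncoder(
        input_dim=80,
        num_classes=10,        # illustrative; used for the CTC projection when joint_ctc_attention=True
        extractor='vgg',
        joint_ctc_attention=True,
    )

    inputs = torch.randn(4, 400, 80)                          # (batch, seq_length, dimension)
    input_lengths = torch.full((4,), 400, dtype=torch.long)   # (batch)

    outputs, encoder_logits, encoder_output_lengths = encoder(inputs, input_lengths)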

Convolutional Transformer Encoder

class openspeech.encoders.convolutional_transformer_encoder.ConvolutionalTransformerEncoder(num_classes: int, input_dim: int, extractor: str = 'vgg', d_model: int = 512, d_ff: int = 2048, num_layers: int = 6, num_heads: int = 8, dropout_p: float = 0.3, conv_activation: str = 'relu', joint_ctc_attention: bool = False)[source]

The TransformerEncoder is composed of a stack of N identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.

Parameters
  • input_dim – dimension of feature vector

  • extractor (str) – convolutional extractor

  • d_model – dimension of model (default: 512)

  • d_ff – dimension of feed forward network (default: 2048)

  • num_layers – number of encoders layers (default: 6)

  • num_heads – number of attention heads (default: 8)

  • dropout_p (float, optional) – probability of dropout (default: 0.3)

  • conv_activation (str, optional) – activation function of convolutional extractor (default: relu)

  • joint_ctc_attention (bool, optional) – Flag indicating whether to use joint CTC-attention or not (default: False)

Inputs:
  • inputs: list of sequences, whose length is the batch size and within which each sequence is list of tokens

  • input_lengths: list of sequence lengths

Returns

  • outputs: An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • encoder_logits: Log probabilities of the encoder outputs, to be passed to the CTC loss.

    If joint_ctc_attention is False, returns None.

  • output_lengths: The lengths of the encoder outputs. (batch)

Return type

(Tensor, Tensor, Tensor)

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Forward propagate inputs for encoder training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of the input tensor. (batch)

Returns

  • outputs: An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • encoder_logits: Log probabilities of the encoder outputs, to be passed to the CTC loss.

    If joint_ctc_attention is False, returns None.

  • output_lengths: The lengths of the encoder outputs. (batch)

Return type

(Tensor, Tensor, Tensor)
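
Example:

A minimal usage sketch assuming the import path shown above; sizes are arbitrary.

    import torch
    from openspeech.encoders.convolutional_transformer_encoder import ConvolutionalTransformerEncoder

    encoder = ConvolutionalTransformerEncoder(
        num_classes=10,        # illustrative
        input_dim=80,
        extractor='vgg',
        d_model=512,
        num_layers=6,
        joint_ctc_attention=False,
    )

    inputs = torch.randn(4, 400, 80)                          # (batch, seq_length, dimension)
    input_lengths = torch.full((4,), 400, dtype=torch.long)   # (batch)

    outputs, encoder_logits, output_lengths = encoder(inputs, input_lengths)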

DeepSpeech2

class openspeech.encoders.deepspeech2.DeepSpeech2(input_dim: int, num_classes: int, rnn_type='gru', num_rnn_layers: int = 5, rnn_hidden_dim: int = 512, dropout_p: float = 0.1, bidirectional: bool = True, activation: str = 'hardtanh')[source]

DeepSpeech2 is a speech recognition model based on Baidu's Deep Speech 2 architecture. It is trained with CTC loss.

Parameters
  • input_dim (int) – dimension of input vector

  • num_classes (int) – number of classes

  • rnn_type (str, optional) – type of RNN cell (default: gru)

  • num_rnn_layers (int, optional) – number of recurrent layers (default: 5)

  • rnn_hidden_dim (int) – the number of features in the hidden state h

  • dropout_p (float, optional) – dropout probability (default: 0.1)

  • bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)

  • activation (str) – type of activation function (default: hardtanh)

Inputs: inputs, input_lengths
  • inputs: list of sequences, whose length is the batch size and within which each sequence is list of tokens

  • input_lengths: list of sequence lengths

Returns

  • predicted_log_probs (torch.FloatTensor): Log probabilities of the model predictions.

  • output_lengths (torch.LongTensor): The length of output tensor (batch)

Return type

(Tensor, Tensor)

Reference:

Dario Amodei et al.: Deep Speech 2: End-to-End Speech Recognition in English and Mandarin https://arxiv.org/abs/1512.02595

count_parameters() → int[source]

Count the parameters of the encoder.

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs for encoder-only training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of the input tensor. (batch)

Returns

  • predicted_log_probs (torch.FloatTensor): Log probabilities of the model predictions.

  • output_lengths (torch.LongTensor): The length of output tensor (batch)

Return type

(Tensor, Tensor)

update_dropout(dropout_p: float) → None[source]

Update the dropout probability of the encoder.
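
Example:

A minimal usage sketch assuming the import path shown above; sizes are arbitrary.

    import torch
    from openspeech.encoders.deepspeech2 import DeepSpeech2

    model = DeepSpeech2(
        input_dim=80,
        num_classes=10,        # illustrative
        rnn_type='gru',
        num_rnn_layers=5,
        bidirectional=True,
    )

    inputs = torch.randn(4, 300, 80)                          # (batch, seq_length, dimension)
    input_lengths = torch.full((4,), 300, dtype=torch.long)   # (batch)

    predicted_log_probs, output_lengths = model(inputs, input_lengths)
    print(model.count_parameters())                           # parameter count, per count_parameters() above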

Jasper

class openspeech.encoders.jasper.Jasper(configs: omegaconf.dictconfig.DictConfig, input_dim: int, num_classes: int)[source]

Jasper (Just Another Speech Recognizer) is an ASR model composed of 54 layers, proposed by NVIDIA. Jasper achieved a sub-3% word error rate (WER) on the LibriSpeech dataset.

Parameters
  • configs (DictConfig) – hydra configuration set.

  • input_dim (int) – dimension of input vector

  • num_classes (int) – number of classes

  • version (str) – version of jasper, marked as BxR: B - number of blocks, R - number of sub-blocks

Inputs: inputs, input_lengths
  • inputs: tensor contains input sequence vector

  • input_lengths: tensor contains sequence lengths

Returns

  • outputs (torch.FloatTensor): Log probability of model predictions. (batch, seq_length, num_classes)

  • output_lengths (torch.LongTensor): The length of output tensor (batch)

Return type

(Tensor, Tensor)

Reference:

Jason Li. et al.: Jasper: An End-to-End Convolutional Neural Acoustic Model https://arxiv.org/pdf/1904.03288.pdf

count_parameters() → int[source]

Count the parameters of the model.

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs for encoder-only training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of the input tensor. (batch)

Returns

  • outputs (torch.FloatTensor): Log probability of model predictions. (batch, seq_length, num_classes)

  • output_lengths (torch.LongTensor): The length of output tensor (batch)

Return type

(Tensor, Tensor)

update_dropout(dropout_p: float) → None[source]

Update the dropout probability of the model.
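
Example:

A sketch only: Jasper is driven by a hydra DictConfig whose schema is defined by the project's configuration files and is not reproduced here, so configs is assumed to be loaded elsewhere (e.g. with OmegaConf); sizes and num_classes are arbitrary.

    import torch
    from omegaconf import DictConfig
    from openspeech.encoders.jasper import Jasper

    def run_jasper(configs: DictConfig) -> None:
        model = Jasper(configs=configs, input_dim=80, num_classes=10)
        inputs = torch.randn(4, 300, 80)                          # (batch, seq_length, dimension)
        input_lengths = torch.full((4,), 300, dtype=torch.long)   # (batch)
        outputs, output_lengths = model(inputs, input_lengths)    # log probabilities and lengths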

LSTM Encoder

class openspeech.encoders.lstm_encoder.LSTMEncoder(input_dim: int, num_classes: int = None, hidden_state_dim: int = 512, dropout_p: float = 0.3, num_layers: int = 3, bidirectional: bool = True, rnn_type: str = 'lstm', joint_ctc_attention: bool = False)[source]

Converts low-level speech signals into higher-level features.

Parameters
  • input_dim (int) – dimension of input vector

  • num_classes (int) – number of classes

  • hidden_state_dim (int) – the number of features in the encoders hidden state h

  • num_layers (int, optional) – number of recurrent layers (default: 3)

  • bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)

  • rnn_type (str, optional) – type of RNN cell (default: lstm)

  • dropout_p (float, optional) – dropout probability of the encoder (default: 0.3)

  • joint_ctc_attention (bool, optional) – Flag indicating whether to use joint CTC-attention or not

Inputs: inputs, input_lengths
  • inputs: list of sequences, whose length is the batch size and within which each sequence is list of tokens

  • input_lengths: list of sequence lengths

Returns

  • outputs: An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • encoder_logits: Log probabilities of the encoder outputs, to be passed to the CTC loss.

    If joint_ctc_attention is False, returns None.

  • encoder_output_lengths: The lengths of the encoder outputs. (batch)

Return type

(Tensor, Tensor, Tensor)

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]

Forward propagate inputs for encoder training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of the input tensor. (batch)

Returns

  • outputs: An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • encoder_logits: Log probabilities of the encoder outputs, to be passed to the CTC loss.

    If joint_ctc_attention is False, returns None.

  • encoder_output_lengths: The lengths of the encoder outputs. (batch)

Return type

(Tensor, Tensor, Tensor)
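
Example:

A minimal usage sketch assuming the import path shown above; sizes are arbitrary, and encoder_logits is None when joint_ctc_attention=False.

    import torch
    from openspeech.encoders.lstm_encoder import LSTMEncoder

    encoder = LSTMEncoder(
        input_dim=80,
        num_classes=10,        # illustrative; only used when joint_ctc_attention=True
        hidden_state_dim=512,
        num_layers=3,
        bidirectional=True,
        joint_ctc_attention=False,
    )

    inputs = torch.randn(4, 300, 80)                          # (batch, seq_length, dimension)
    input_lengths = torch.full((4,), 300, dtype=torch.long)   # (batch)

    outputs, encoder_logits, encoder_output_lengths = encoder(inputs, input_lengths)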

QuartzNet

class openspeech.encoders.quartznet.QuartzNet(configs: omegaconf.dictconfig.DictConfig, input_dim: int, num_classes: int)[source]

QuartzNet is a fully convolutional automatic speech recognition model. The model is composed of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers. It is trained with CTC loss.

Parameters
  • configs (DictConfig) – hydra configuration set.

  • input_dim (int) – dimension of input.

  • num_classes (int) – number of classes.

Inputs:

  • inputs (torch.FloatTensor): An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor): The length of the input tensor. (batch)

Returns

  • outputs (torch.FloatTensor): Log probability of model predictions. (batch, seq_length, num_classes)

  • output_lengths (torch.LongTensor): The length of output tensor (batch)

Return type

(Tensor, Tensor)

Reference:

Samuel Kriman et al.: QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions https://arxiv.org/abs/1910.10261

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs for encoder-only training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of the input tensor. (batch)

Returns

  • outputs (torch.FloatTensor): Log probability of model predictions. (batch, seq_length, num_classes)

  • output_lengths (torch.LongTensor): The length of output tensor (batch)

Return type

(Tensor, Tensor)
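
Example:

A sketch only: like Jasper, QuartzNet is constructed from a hydra DictConfig whose schema comes from the project's configuration files, so configs is assumed to be loaded elsewhere; sizes and num_classes are arbitrary.

    import torch
    from omegaconf import DictConfig
    from openspeech.encoders.quartznet import QuartzNet

    def run_quartznet(configs: DictConfig) -> None:
        model = QuartzNet(configs=configs, input_dim=80, num_classes=10)
        inputs = torch.randn(4, 300, 80)                          # (batch, seq_length, dimension)
        input_lengths = torch.full((4,), 300, dtype=torch.long)   # (batch)
        outputs, output_lengths = model(inputs, input_lengths)    # (batch, seq_length, num_classes), (batch)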

RNN Transducer Encoder

class openspeech.encoders.rnn_transducer_encoder.RNNTransducerEncoder(input_dim: int, hidden_state_dim: int, output_dim: int, num_layers: int, rnn_type: str = 'lstm', dropout_p: float = 0.2, bidirectional: bool = True)[source]

Encoder of RNN-Transducer.

Parameters
  • input_dim (int) – dimension of input vector

  • hidden_state_dim (int) – hidden state dimension of the encoder

  • output_dim (int) – output dimension of the encoder and decoder

  • num_layers (int) – number of encoder layers

  • rnn_type (str, optional) – type of RNN cell (default: lstm)

  • dropout_p (float, optional) – dropout probability of the encoder (default: 0.2)

  • bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)

Inputs: inputs, input_lengths

  • inputs (torch.FloatTensor): An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor): The length of the input tensor. (batch)

Returns

(Tensor, Tensor)

  • outputs (torch.FloatTensor): An output sequence of the encoder. FloatTensor of size

    (batch, seq_length, dimension)

  • hidden_states (torch.FloatTensor): A hidden state of the encoder. FloatTensor of size

    (batch, seq_length, dimension)

Reference:

A. Graves: Sequence Transduction with Recurrent Neural Networks https://arxiv.org/abs/1211.3711

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs for encoder training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of the input tensor. (batch)

Returns

(Tensor, Tensor)

  • outputs (torch.FloatTensor): An output sequence of the encoder. FloatTensor of size

    (batch, seq_length, dimension)

  • output_lengths (torch.LongTensor): The length of output tensor. (batch)
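
Example:

A minimal usage sketch assuming the import path shown above; hidden_state_dim, output_dim, and num_layers are required by the signature, and the values here are arbitrary.

    import torch
    from openspeech.encoders.rnn_transducer_encoder import RNNTransducerEncoder

    encoder = RNNTransducerEncoder(
        input_dim=80,
        hidden_state_dim=320,
        output_dim=512,
        num_layers=4,
        rnn_type='lstm',
        bidirectional=True,
    )

    inputs = torch.randn(4, 300, 80)                          # (batch, seq_length, dimension)
    input_lengths = torch.full((4,), 300, dtype=torch.long)   # (batch)

    outputs, output_lengths = encoder(inputs, input_lengths)  # unpacking follows the forward Returns above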

Transformer Encoder

class openspeech.encoders.transformer_encoder.TransformerEncoder(num_classes: int, input_dim: int = 80, d_model: int = 512, d_ff: int = 2048, num_layers: int = 6, num_heads: int = 8, dropout_p: float = 0.3, joint_ctc_attention: bool = False)[source]

The TransformerEncoder is composed of a stack of N identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.

Parameters
  • input_dim – dimension of feature vector

  • d_model – dimension of model (default: 512)

  • d_ff – dimension of feed forward network (default: 2048)

  • num_layers – number of encoders layers (default: 6)

  • num_heads – number of attention heads (default: 8)

  • dropout_p – probability of dropout (default: 0.3)

  • joint_ctc_attention (bool, optional) – Flag indicating whether to use joint CTC-attention or not

Inputs:
  • inputs: list of sequences, whose length is the batch size and within which each sequence is list of tokens

  • input_lengths: list of sequence lengths

Returns

  • outputs: An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • encoder_logits: Log probabilities of the encoder outputs, to be passed to the CTC loss. (batch, seq_length, num_classes)

    If joint_ctc_attention is False, returns None.

  • output_lengths: The lengths of the encoder outputs. (batch)

Return type

(Tensor, Tensor, Tensor)

Reference:

Ashish Vaswani et al.: Attention Is All You Need https://arxiv.org/abs/1706.03762

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Forward propagate inputs for encoder training.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of the input tensor. (batch)

Returns

  • outputs: An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • encoder_logits: Log probabilities of the encoder outputs, to be passed to the CTC loss. (batch, seq_length, num_classes)

    If joint_ctc_attention is False, returns None.

  • output_lengths: The lengths of the encoder outputs. (batch)

Return type

(Tensor, Tensor, Tensor)
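
Example:

A minimal usage sketch assuming the import path shown above; sizes are arbitrary.

    import torch
    from openspeech.encoders.transformer_encoder import TransformerEncoder

    encoder = TransformerEncoder(
        num_classes=10,        # illustrative
        input_dim=80,
        d_model=512,
        num_layers=6,
        num_heads=8,
        joint_ctc_attention=False,
    )

    inputs = torch.randn(4, 300, 80)                          # (batch, seq_length, dimension)
    input_lengths = torch.full((4,), 300, dtype=torch.long)   # (batch)

    outputs, encoder_logits, output_lengths = encoder(inputs, input_lengths)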

class openspeech.encoders.transformer_encoder.TransformerEncoderLayer(d_model: int = 512, num_heads: int = 8, d_ff: int = 2048, dropout_p: float = 0.3)[source]

The encoder layer is made up of self-attention and a feed-forward network. This standard encoder layer is based on the paper “Attention Is All You Need”.

Parameters
  • d_model – dimension of model (default: 512)

  • num_heads – number of attention heads (default: 8)

  • d_ff – dimension of feed forward network (default: 2048)

  • dropout_p – probability of dropout (default: 0.3)

Inputs:

  • inputs (torch.FloatTensor): input sequence of the transformer encoder layer

  • self_attn_mask (torch.BoolTensor): mask of self attention

Returns

(Tensor, Tensor)

  • outputs (torch.FloatTensor): output of transformer encoder layer

  • attn (torch.FloatTensor): attention of transformer encoder layer

forward(inputs: torch.Tensor, self_attn_mask: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs through the transformer encoder layer.

Inputs:

  • inputs (torch.FloatTensor): input sequence of the transformer encoder layer

  • self_attn_mask (torch.BoolTensor): mask of self attention

Returns

  • outputs (torch.FloatTensor): output of the transformer encoder layer

  • attn (torch.FloatTensor): attention of the transformer encoder layer

Return type

(Tensor, Tensor)
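
Example:

A minimal sketch of a single layer applied to an already-projected (batch, seq_length, d_model) tensor; the self-attention mask is omitted here.

    import torch
    from openspeech.encoders.transformer_encoder import TransformerEncoderLayer

    layer = TransformerEncoderLayer(d_model=512, num_heads=8, d_ff=2048, dropout_p=0.3)

    hidden = torch.randn(4, 100, 512)        # (batch, seq_length, d_model)
    outputs, attn = layer(hidden)            # self_attn_mask defaults to None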

Transformer Transducer Encoder

class openspeech.encoders.transformer_transducer_encoder.TransformerTransducerEncoder(input_size: int = 80, model_dim: int = 512, d_ff: int = 2048, num_layers: int = 18, num_heads: int = 8, dropout: float = 0.1, max_positional_length: int = 5000)[source]

Converts the audio signal into higher-level feature representations.

Parameters
  • input_size (int) – dimension of input vector (default : 80)

  • model_dim (int) – the number of features in the audio encoder (default : 512)

  • d_ff (int) – the number of features in the feed forward layers (default : 2048)

  • num_layers (int) – the number of audio encoder layers (default: 18)

  • num_heads (int) – the number of heads in the multi-head attention (default: 8)

  • dropout (float) – dropout probability of audio encoder (default: 0.1)

  • max_positional_length (int) – Maximum length to use for positional encoding (default : 5000)

Inputs: inputs, inputs_lens
  • inputs: Parsed audio input of the batch

  • inputs_lens: Tensor of sequence lengths

Returns

  • outputs (torch.FloatTensor): (batch, seq_length, dimension)

  • input_lengths (torch.LongTensor): (batch)

Return type

(Tensor, Tensor)

Reference:

Qian Zhang et al.: Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss https://arxiv.org/abs/2002.02562

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs through the audio encoder.

Parameters
  • inputs (torch.FloatTensor) – An input sequence passed to the audio encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension).

  • input_lengths (torch.LongTensor) – The length of the input tensor. (batch)

Returns

  • outputs (Tensor): (batch, seq_length, dimension)

  • input_lengths (Tensor): (batch)

Return type

(Tensor, Tensor)
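
Example:

A minimal usage sketch assuming the import path shown above; sizes are arbitrary, and the second returned value is the sequence lengths per the Returns section.

    import torch
    from openspeech.encoders.transformer_transducer_encoder import TransformerTransducerEncoder

    encoder = TransformerTransducerEncoder(
        input_size=80,
        model_dim=512,
        num_layers=18,
        num_heads=8,
    )

    inputs = torch.randn(4, 300, 80)                          # (batch, seq_length, dimension)
    input_lengths = torch.full((4,), 300, dtype=torch.long)   # (batch)

    outputs, output_lengths = encoder(inputs, input_lengths)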

class openspeech.encoders.transformer_transducer_encoder.TransformerTransducerEncoderLayer(model_dim: int = 512, d_ff: int = 2048, num_heads: int = 8, dropout: float = 0.1)[source]

Repeated layers common to audio encoders and label encoders

Parameters
  • model_dim (int) – the number of features in the encoder (default : 512)

  • d_ff (int) – the number of features in the feed forward layers (default : 2048)

  • num_heads (int) – the number of heads in the multi-head attention (default: 8)

  • dropout (float) – dropout probability of encoder layer (default: 0.1)

Inputs: inputs, self_attn_mask
  • inputs: Audio feature or label feature

  • self_attn_mask: Self attention mask to use in multi-head attention

Returns: outputs, attn_distribution

(Tensor, Tensor)

  • outputs (torch.FloatTensor): Tensor containing higher (audio, label) feature values

  • attn_distribution (torch.FloatTensor): Attention distribution in multi-head attention

forward(inputs: torch.Tensor, self_attn_mask: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs through the encoder layer.

Parameters
  • inputs – An input sequence passed to the encoder layer. (batch, seq_length, dimension)

  • self_attn_mask – Self attention mask to cover up padding (batch, seq_length, seq_length)

Returns

  • outputs (Tensor): (batch, seq_length, dimension)

  • attn_distribution (Tensor): (batch, seq_length, seq_length)

Return type

(Tensor, Tensor)
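
Example:

A minimal sketch of one shared encoder layer applied to (batch, seq_length, model_dim) features; a padding mask may be passed via self_attn_mask.

    import torch
    from openspeech.encoders.transformer_transducer_encoder import TransformerTransducerEncoderLayer

    layer = TransformerTransducerEncoderLayer(model_dim=512, d_ff=2048, num_heads=8, dropout=0.1)

    features = torch.randn(4, 100, 512)                  # audio or label features
    outputs, attn_distribution = layer(features)         # self_attn_mask defaults to None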