Encoders
Openspeech Encoder
class openspeech.encoders.openspeech_encoder.OpenspeechEncoder
Base interface of the Openspeech encoders.
- Inputs: inputs, input_lengths
inputs (torch.FloatTensor): An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension)
input_lengths (torch.LongTensor): The length of each input sequence. (batch)
forward(inputs: torch.Tensor, input_lengths: torch.Tensor)
Forward propagate inputs for encoder training.
- Inputs: inputs, input_lengths
inputs (torch.FloatTensor): An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension)
input_lengths (torch.LongTensor): The length of each input sequence. (batch)
Conformer Encoder
class openspeech.encoders.conformer_encoder.ConformerEncoder(num_classes: int, input_dim: int = 80, encoder_dim: int = 512, num_layers: int = 17, num_attention_heads: int = 8, feed_forward_expansion_factor: int = 4, conv_expansion_factor: int = 2, input_dropout_p: float = 0.1, feed_forward_dropout_p: float = 0.1, attention_dropout_p: float = 0.1, conv_dropout_p: float = 0.1, conv_kernel_size: int = 31, half_step_residual: bool = True, joint_ctc_attention: bool = True)
Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively. Conformer achieves the best of both worlds by combining convolutional neural networks and Transformers to model both the local and global dependencies of an audio sequence in a parameter-efficient way.
- Parameters
num_classes (int) – Number of classes
input_dim (int, optional) – Dimension of input vector
encoder_dim (int, optional) – Dimension of conformer encoder
num_layers (int, optional) – Number of conformer blocks
num_attention_heads (int, optional) – Number of attention heads
feed_forward_expansion_factor (int, optional) – Expansion factor of feed forward module
conv_expansion_factor (int, optional) – Expansion factor of conformer convolution module
input_dropout_p (float, optional) – Probability of input dropout
feed_forward_dropout_p (float, optional) – Probability of feed forward module dropout
attention_dropout_p (float, optional) – Probability of attention module dropout
conv_dropout_p (float, optional) – Probability of conformer convolution module dropout
conv_kernel_size (int or tuple, optional) – Size of the convolving kernel
half_step_residual (bool) – Flag indicating whether to use half-step residual connections
joint_ctc_attention (bool, optional) – Flag indicating whether to use joint CTC-attention
- Inputs: inputs, input_lengths
inputs (batch, time, dim): Tensor containing input vector
input_lengths (batch): list of sequence input lengths
- Returns: outputs, output_lengths
outputs (batch, out_channels, time): Tensor produced by the conformer encoder.
output_lengths (batch): list of sequence output lengths
- Reference:
Anmol Gulati et al: Conformer: Convolution-augmented Transformer for Speech Recognition https://arxiv.org/abs/2005.08100
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
Forward propagate inputs for encoder training.
- Parameters
inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension)
input_lengths (torch.LongTensor) – The length of each input sequence. (batch)
- Returns
(Tensor, Tensor, Tensor)
outputs: The output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)
encoder_logits: Log probabilities of the encoder outputs, passed to the CTC loss. Returns None if joint_ctc_attention is False.
output_lengths: The lengths of the encoder outputs. (batch)
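- Example:
A minimal usage sketch based on the class and forward() signatures documented above. The batch size, sequence length, vocabulary size, and the reduced encoder_dim/num_layers/num_attention_heads are illustrative assumptions, not recommended settings.
import torch
from openspeech.encoders.conformer_encoder import ConformerEncoder

# Hypothetical sizes, chosen only to keep the sketch small.
batch_size, seq_length, input_dim, num_classes = 3, 200, 80, 10

encoder = ConformerEncoder(
    num_classes=num_classes,
    input_dim=input_dim,
    encoder_dim=144,         # default is 512
    num_layers=2,            # default is 17
    num_attention_heads=4,   # default is 8
)

inputs = torch.randn(batch_size, seq_length, input_dim)                   # (batch, seq_length, dimension)
input_lengths = torch.full((batch_size,), seq_length, dtype=torch.long)   # (batch)

# Unpacked in the order documented above; encoder_logits feeds the CTC loss
# because joint_ctc_attention defaults to True.
outputs, encoder_logits, output_lengths = encoder(inputs, input_lengths)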
ContextNet Encoder
class openspeech.encoders.contextnet_encoder.ContextNetEncoder(num_classes: int, model_size: str = 'medium', input_dim: int = 80, num_layers: int = 5, kernel_size: int = 5, num_channels: int = 256, output_dim: int = 640, joint_ctc_attention: bool = False)
ContextNetEncoder passes the input through 23 convolution blocks to produce higher-level feature representations.
- Parameters
num_classes (int) – Number of classes
model_size (str, optional) – Size of the model[‘small’, ‘medium’, ‘large’] (default : ‘medium’)
input_dim (int, optional) – Dimension of input vector (default : 80)
num_layers (int, optional) – The number of convolutional layers (default : 5)
kernel_size (int, optional) – Value of convolution kernel size (default : 5)
num_channels (int, optional) – The number of channels in the convolution filter (default: 256)
output_dim (int, optional) – Dimension of encoder output vector (default: 640)
joint_ctc_attention (bool, optional) – Flag indicating whether to use joint CTC-attention
- Inputs: inputs, input_lengths
inputs: Parsed audio batch. FloatTensor of size (batch, seq_length, dimension)
input_lengths: Tensor representing the sequence length of each input. (batch)
- Returns: outputs, encoder_logits, output_lengths
outputs: Encoder output. FloatTensor of size (batch, seq_length, dimension)
encoder_logits: Log probabilities of the encoder outputs, passed to the CTC loss. Returns None if joint_ctc_attention is False.
output_lengths: Tensor representing the lengths of the encoder outputs. (batch)
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
Forward propagate inputs for the audio encoder.
- Parameters
inputs (torch.FloatTensor) – Parsed audio batch. FloatTensor of size (batch, seq_length, dimension)
input_lengths (torch.LongTensor) – Tensor representing the sequence length of each input. LongTensor of size (batch)
- Returns
outputs (torch.FloatTensor): Encoder output. FloatTensor of size (batch, seq_length, dimension)
encoder_logits (torch.FloatTensor): Log probabilities of the encoder outputs, passed to the CTC loss. Returns None if joint_ctc_attention is False.
output_lengths (torch.LongTensor): Tensor representing the lengths of the encoder outputs. LongTensor of size (batch)
- Return type
(Tensor, Tensor, Tensor)
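- Example:
A minimal usage sketch following the signatures above. Sizes are illustrative assumptions; with the default joint_ctc_attention=False, the second returned value is None.
import torch
from openspeech.encoders.contextnet_encoder import ContextNetEncoder

batch_size, seq_length, input_dim, num_classes = 3, 200, 80, 10  # hypothetical sizes

encoder = ContextNetEncoder(num_classes=num_classes, model_size='medium', input_dim=input_dim)

inputs = torch.randn(batch_size, seq_length, input_dim)                   # (batch, seq_length, dimension)
input_lengths = torch.full((batch_size,), seq_length, dtype=torch.long)   # (batch)

# encoder_logits is None because joint_ctc_attention defaults to False.
outputs, encoder_logits, output_lengths = encoder(inputs, input_lengths)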
Convolutional LSTM Encoder
class openspeech.encoders.convolutional_lstm_encoder.ConvolutionalLSTMEncoder(input_dim: int, num_classes: int = None, hidden_state_dim: int = 512, dropout_p: float = 0.3, num_layers: int = 3, bidirectional: bool = True, rnn_type: str = 'lstm', extractor: str = 'vgg', conv_activation: str = 'hardtanh', joint_ctc_attention: bool = False)
Converts low-level speech signals into higher-level features with a convolutional extractor.
- Parameters
input_dim (int) – dimension of input vector
num_classes (int) – number of classes
hidden_state_dim (int) – the number of features in the encoder hidden state h
num_layers (int, optional) – number of recurrent layers (default: 3)
bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)
extractor (str) – type of CNN extractor (default: vgg)
conv_activation (str) – activation function of convolutional extractor (default: hardtanh)
rnn_type (str, optional) – type of RNN cell (default: lstm)
dropout_p (float, optional) – dropout probability of encoder (default: 0.3)
joint_ctc_attention (bool, optional) – flag indicating whether to use joint CTC-attention
- Inputs: inputs, input_lengths
inputs: list of sequences, whose length is the batch size and within which each sequence is a list of tokens
input_lengths: list of sequence lengths
- Returns: encoder_outputs, encoder_log_probs, output_lengths
encoder_outputs: tensor containing the encoded features of the input sequence
encoder_log_probs: tensor containing log probabilities for the encoder-only (CTC) loss
output_lengths: list of sequence lengths produced by the encoder
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
Forward propagate inputs for encoder training.
- Parameters
inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension)
input_lengths (torch.LongTensor) – The length of each input sequence. (batch)
- Returns
outputs: The output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)
encoder_logits: Log probabilities of the encoder outputs, passed to the CTC loss. Returns None if joint_ctc_attention is False.
encoder_output_lengths: The lengths of the encoder outputs. (batch)
- Return type
(Tensor, Tensor, Tensor)
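- Example:
A minimal usage sketch following the signatures above. Sizes are illustrative assumptions; joint_ctc_attention is enabled here so that encoder_logits is produced for a CTC branch.
import torch
from openspeech.encoders.convolutional_lstm_encoder import ConvolutionalLSTMEncoder

batch_size, seq_length, input_dim, num_classes = 3, 400, 80, 10  # hypothetical sizes

encoder = ConvolutionalLSTMEncoder(
    input_dim=input_dim,
    num_classes=num_classes,
    hidden_state_dim=256,      # default is 512
    joint_ctc_attention=True,
)

inputs = torch.randn(batch_size, seq_length, input_dim)                   # (batch, seq_length, dimension)
input_lengths = torch.full((batch_size,), seq_length, dtype=torch.long)   # (batch)

# The convolutional extractor subsamples the time axis, so
# encoder_output_lengths are shorter than input_lengths.
outputs, encoder_logits, encoder_output_lengths = encoder(inputs, input_lengths)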
Convolutional Transformer Encoder
class openspeech.encoders.convolutional_transformer_encoder.ConvolutionalTransformerEncoder(num_classes: int, input_dim: int, extractor: str = 'vgg', d_model: int = 512, d_ff: int = 2048, num_layers: int = 6, num_heads: int = 8, dropout_p: float = 0.3, conv_activation: str = 'relu', joint_ctc_attention: bool = False)
The ConvolutionalTransformerEncoder is composed of a convolutional extractor followed by a stack of N identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.
- Parameters
input_dim – dimension of feature vector
extractor (str) – convolutional extractor
d_model – dimension of model (default: 512)
d_ff – dimension of feed forward network (default: 2048)
num_layers – number of encoder layers (default: 6)
num_heads – number of attention heads (default: 8)
dropout_p (float, optional) – probability of dropout (default: 0.3)
conv_activation (str, optional) – activation function of convolutional extractor (default: relu)
joint_ctc_attention (bool, optional) – flag indicating whether to use joint CTC-attention (default: False)
- Inputs:
inputs: list of sequences, whose length is the batch size and within which each sequence is a list of tokens
input_lengths: list of sequence lengths
- Returns
outputs: The output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)
encoder_logits: Log probabilities of the encoder outputs, passed to the CTC loss. Returns None if joint_ctc_attention is False.
output_lengths: The lengths of the encoder outputs. (batch)
- Return type
(Tensor, Tensor, Tensor)
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
Forward propagate inputs for encoder training.
- Parameters
inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension)
input_lengths (torch.LongTensor) – The length of each input sequence. (batch)
- Returns
outputs: The output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)
encoder_logits: Log probabilities of the encoder outputs, passed to the CTC loss. Returns None if joint_ctc_attention is False.
output_lengths: The lengths of the encoder outputs. (batch)
- Return type
(Tensor, Tensor, Tensor)
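- Example:
A minimal usage sketch following the signatures above. Sizes and the reduced d_model/num_layers are illustrative assumptions.
import torch
from openspeech.encoders.convolutional_transformer_encoder import ConvolutionalTransformerEncoder

batch_size, seq_length, input_dim, num_classes = 3, 400, 80, 10  # hypothetical sizes

encoder = ConvolutionalTransformerEncoder(
    num_classes=num_classes,
    input_dim=input_dim,
    d_model=256,     # default is 512
    num_layers=2,    # default is 6
)

inputs = torch.randn(batch_size, seq_length, input_dim)                   # (batch, seq_length, dimension)
input_lengths = torch.full((batch_size,), seq_length, dtype=torch.long)   # (batch)

# encoder_logits is None because joint_ctc_attention defaults to False.
outputs, encoder_logits, output_lengths = encoder(inputs, input_lengths)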
DeepSpeech2
class openspeech.encoders.deepspeech2.DeepSpeech2(input_dim: int, num_classes: int, rnn_type='gru', num_rnn_layers: int = 5, rnn_hidden_dim: int = 512, dropout_p: float = 0.1, bidirectional: bool = True, activation: str = 'hardtanh')
DeepSpeech2 is a set of speech recognition models based on Baidu DeepSpeech2. DeepSpeech2 is trained with CTC loss.
- Parameters
input_dim (int) – dimension of input vector
num_classes (int) – number of classes
rnn_type (str, optional) – type of RNN cell (default: gru)
num_rnn_layers (int, optional) – number of recurrent layers (default: 5)
rnn_hidden_dim (int) – the number of features in the hidden state h
dropout_p (float, optional) – dropout probability (default: 0.1)
bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)
activation (str) – type of activation function (default: hardtanh)
- Inputs: inputs, input_lengths
inputs: list of sequences, whose length is the batch size and within which each sequence is a list of tokens
input_lengths: list of sequence lengths
- Returns
predicted_log_probs (torch.FloatTensor): Log probabilities of the model predictions.
output_lengths (torch.LongTensor): The lengths of the output tensor. (batch)
- Return type
(Tensor, Tensor)
- Reference:
Dario Amodei et al.: Deep Speech 2: End-to-End Speech Recognition in English and Mandarin https://arxiv.org/abs/1512.02595
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
Forward propagate inputs for encoder-only training.
- Parameters
inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension)
input_lengths (torch.LongTensor) – The length of each input sequence. (batch)
- Returns
predicted_log_probs (torch.FloatTensor): Log probabilities of the model predictions.
output_lengths (torch.LongTensor): The lengths of the output tensor. (batch)
- Return type
(Tensor, Tensor)
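- Example:
A minimal usage sketch following the signatures above. Sizes and the reduced num_rnn_layers/rnn_hidden_dim are illustrative assumptions.
import torch
from openspeech.encoders.deepspeech2 import DeepSpeech2

batch_size, seq_length, input_dim, num_classes = 3, 400, 80, 10  # hypothetical sizes

model = DeepSpeech2(input_dim=input_dim, num_classes=num_classes, num_rnn_layers=2, rnn_hidden_dim=256)

inputs = torch.randn(batch_size, seq_length, input_dim)                   # (batch, seq_length, dimension)
input_lengths = torch.full((batch_size,), seq_length, dtype=torch.long)   # (batch)

# Log probabilities over num_classes, suitable for a CTC criterion such as torch.nn.CTCLoss.
predicted_log_probs, output_lengths = model(inputs, input_lengths)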
Jasper
class openspeech.encoders.jasper.Jasper(configs: omegaconf.dictconfig.DictConfig, input_dim: int, num_classes: int)
Jasper (Just Another Speech Recognizer) is an ASR model composed of 54 layers, proposed by NVIDIA. Jasper achieved a sub-3% word error rate (WER) on the LibriSpeech dataset.
- Parameters
configs (DictConfig) – model configuration
input_dim (int) – dimension of input vector
num_classes (int) – number of classes
- Inputs: inputs, input_lengths
inputs: tensor contains input sequence vector
input_lengths: tensor contains sequence lengths
- Returns
outputs (torch.FloatTensor): Log probability of model predictions.
(batch, seq_length, num_classes)
output_lengths (torch.LongTensor): The length of output tensor
(batch)
- Return type
(Tensor, Tensor)
- Reference:
Jason Li. et al.: Jasper: An End-to-End Convolutional Neural Acoustic Model https://arxiv.org/pdf/1904.03288.pdf
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
Forward propagate inputs for encoder-only training.
- Parameters
inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension)
input_lengths (torch.LongTensor) – The length of each input sequence. (batch)
- Returns
outputs (torch.FloatTensor): Log probabilities of the model predictions. (batch, seq_length, num_classes)
output_lengths (torch.LongTensor): The lengths of the output tensor. (batch)
- Return type
(Tensor, Tensor)
LSTM Encoder
class openspeech.encoders.lstm_encoder.LSTMEncoder(input_dim: int, num_classes: int = None, hidden_state_dim: int = 512, dropout_p: float = 0.3, num_layers: int = 3, bidirectional: bool = True, rnn_type: str = 'lstm', joint_ctc_attention: bool = False)
Converts low-level speech signals into higher-level features.
- Parameters
input_dim (int) – dimension of input vector
num_classes (int) – number of classes
hidden_state_dim (int) – the number of features in the encoder hidden state h
num_layers (int, optional) – number of recurrent layers (default: 3)
bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)
rnn_type (str, optional) – type of RNN cell (default: lstm)
dropout_p (float, optional) – dropout probability of encoder (default: 0.3)
joint_ctc_attention (bool, optional) – flag indicating whether to use joint CTC-attention
- Inputs: inputs, input_lengths
inputs: list of sequences, whose length is the batch size and within which each sequence is a list of tokens
input_lengths: list of sequence lengths
- Returns
outputs: The output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)
encoder_logits: Log probabilities of the encoder outputs, passed to the CTC loss. Returns None if joint_ctc_attention is False.
encoder_output_lengths: The lengths of the encoder outputs. (batch)
- Return type
(Tensor, Tensor, Tensor)
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
Forward propagate inputs for encoder training.
- Parameters
inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension)
input_lengths (torch.LongTensor) – The length of each input sequence. (batch)
- Returns
outputs: The output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)
encoder_logits: Log probabilities of the encoder outputs, passed to the CTC loss. Returns None if joint_ctc_attention is False.
encoder_output_lengths: The lengths of the encoder outputs. (batch)
- Return type
(Tensor, Tensor, Tensor)
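- Example:
A minimal usage sketch following the signatures above. Sizes are illustrative assumptions; joint_ctc_attention is enabled so that encoder_logits is produced.
import torch
from openspeech.encoders.lstm_encoder import LSTMEncoder

batch_size, seq_length, input_dim, num_classes = 3, 200, 80, 10  # hypothetical sizes

encoder = LSTMEncoder(
    input_dim=input_dim,
    num_classes=num_classes,
    hidden_state_dim=256,      # default is 512
    joint_ctc_attention=True,
)

inputs = torch.randn(batch_size, seq_length, input_dim)                   # (batch, seq_length, dimension)
input_lengths = torch.full((batch_size,), seq_length, dtype=torch.long)   # (batch)

# Unpacked in the order documented above.
outputs, encoder_logits, encoder_output_lengths = encoder(inputs, input_lengths)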
QuartzNet
class openspeech.encoders.quartznet.QuartzNet(configs: omegaconf.dictconfig.DictConfig, input_dim: int, num_classes: int)
QuartzNet is a fully convolutional automatic speech recognition model. The model is composed of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers. It is trained with CTC loss.
- Parameters
configs (DictConfig) – model configuration
input_dim (int) – dimension of input vector
num_classes (int) – number of classes
- Inputs: inputs, input_lengths
inputs (torch.FloatTensor): An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension)
input_lengths (torch.LongTensor): The length of each input sequence. (batch)
- Returns
outputs (torch.FloatTensor): Log probability of model predictions.
(batch, seq_length, num_classes)
output_lengths (torch.LongTensor): The length of output tensor
(batch)
- Return type
(Tensor, Tensor)
- Reference:
Samuel Kriman et al.: QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions https://arxiv.org/abs/1910.10261.pdf
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
Forward propagate inputs for encoder-only training.
- Parameters
inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension)
input_lengths (torch.LongTensor) – The length of each input sequence. (batch)
- Returns
outputs (torch.FloatTensor): Log probabilities of the model predictions. (batch, seq_length, num_classes)
output_lengths (torch.LongTensor): The lengths of the output tensor. (batch)
- Return type
(Tensor, Tensor)
RNN Transducer Encoder
class openspeech.encoders.rnn_transducer_encoder.RNNTransducerEncoder(input_dim: int, hidden_state_dim: int, output_dim: int, num_layers: int, rnn_type: str = 'lstm', dropout_p: float = 0.2, bidirectional: bool = True)
Encoder of RNN-Transducer.
- Parameters
input_dim (int) – dimension of input vector
hidden_state_dim (int, optional) – hidden state dimension of encoders (default: 320)
output_dim (int, optional) – output dimension of encoders and decoders (default: 512)
num_layers (int, optional) – number of encoders layers (default: 4)
rnn_type (str, optional) – type of rnn cell (default: lstm)
dropout_p (float, optional) – dropout probability of encoder (default: 0.2)
bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)
- Inputs: inputs, input_lengths
inputs (torch.FloatTensor): An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension)
input_lengths (torch.LongTensor): The length of each input sequence. (batch)
- Returns
(Tensor, Tensor)
outputs (torch.FloatTensor): The output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)
hidden_states (torch.FloatTensor): The hidden states of the encoder. FloatTensor of size (batch, seq_length, dimension)
- Reference:
A Graves: Sequence Transduction with Recurrent Neural Networks https://arxiv.org/abs/1211.3711.pdf
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
Forward propagate inputs for encoder training.
- Parameters
inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension)
input_lengths (torch.LongTensor) – The length of each input sequence. (batch)
- Returns
(Tensor, Tensor)
outputs (torch.FloatTensor): The output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)
output_lengths (torch.LongTensor): The lengths of the output tensor. (batch)
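- Example:
A minimal usage sketch following the signatures above. Sizes are illustrative assumptions; the two returned tensors are unpacked as documented for forward() above.
import torch
from openspeech.encoders.rnn_transducer_encoder import RNNTransducerEncoder

batch_size, seq_length, input_dim = 3, 200, 80  # hypothetical sizes

encoder = RNNTransducerEncoder(input_dim=input_dim, hidden_state_dim=320, output_dim=512, num_layers=2)

inputs = torch.randn(batch_size, seq_length, input_dim)                   # (batch, seq_length, dimension)
input_lengths = torch.full((batch_size,), seq_length, dtype=torch.long)   # (batch)

# In an RNN-T model, outputs would typically be combined with a prediction
# network inside a joint network.
outputs, output_lengths = encoder(inputs, input_lengths)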
Transformer Encoder
class openspeech.encoders.transformer_encoder.TransformerEncoder(num_classes: int, input_dim: int = 80, d_model: int = 512, d_ff: int = 2048, num_layers: int = 6, num_heads: int = 8, dropout_p: float = 0.3, joint_ctc_attention: bool = False)
The TransformerEncoder is composed of a stack of N identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.
- Parameters
input_dim – dimension of feature vector
d_model – dimension of model (default: 512)
d_ff – dimension of feed forward network (default: 2048)
num_layers – number of encoder layers (default: 6)
num_heads – number of attention heads (default: 8)
dropout_p – probability of dropout (default: 0.3)
joint_ctc_attention (bool, optional) – flag indicating whether to use joint CTC-attention
- Inputs:
inputs: list of sequences, whose length is the batch size and within which each sequence is a list of tokens
input_lengths: list of sequence lengths
- Returns
outputs: The output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)
encoder_logits: Log probabilities of the encoder outputs, passed to the CTC loss. Returns None if joint_ctc_attention is False. (batch, seq_length, num_classes)
output_lengths: The lengths of the encoder outputs. (batch)
- Return type
(Tensor, Tensor, Tensor)
- Reference:
Ashish Vaswani et al.: Attention Is All You Need https://arxiv.org/abs/1706.03762
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
Forward propagate inputs for encoder training.
- Parameters
inputs (torch.FloatTensor) – An input sequence passed to the encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension)
input_lengths (torch.LongTensor) – The length of each input sequence. (batch)
- Returns
outputs: The output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)
encoder_logits: Log probabilities of the encoder outputs, passed to the CTC loss. Returns None if joint_ctc_attention is False. (batch, seq_length, num_classes)
output_lengths: The lengths of the encoder outputs. (batch)
- Return type
(Tensor, Tensor, Tensor)
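- Example:
A minimal usage sketch following the signatures above. Sizes and the reduced d_model/num_layers are illustrative assumptions; joint_ctc_attention is enabled so that encoder_logits is produced.
import torch
from openspeech.encoders.transformer_encoder import TransformerEncoder

batch_size, seq_length, input_dim, num_classes = 3, 200, 80, 10  # hypothetical sizes

encoder = TransformerEncoder(
    num_classes=num_classes,
    input_dim=input_dim,
    d_model=256,     # default is 512
    num_layers=2,    # default is 6
    joint_ctc_attention=True,
)

inputs = torch.randn(batch_size, seq_length, input_dim)                   # (batch, seq_length, dimension)
input_lengths = torch.full((batch_size,), seq_length, dtype=torch.long)   # (batch)

# Unpacked in the order documented above.
outputs, encoder_logits, output_lengths = encoder(inputs, input_lengths)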
class openspeech.encoders.transformer_encoder.TransformerEncoderLayer(d_model: int = 512, num_heads: int = 8, d_ff: int = 2048, dropout_p: float = 0.3)
TransformerEncoderLayer is made up of self-attention and a feed-forward network. This standard encoder layer is based on the paper "Attention Is All You Need".
- Parameters
d_model – dimension of model (default: 512)
num_heads – number of attention heads (default: 8)
d_ff – dimension of feed forward network (default: 2048)
dropout_p – probability of dropout (default: 0.3)
- Inputs:
inputs (torch.FloatTensor): input sequence of the transformer encoder layer
self_attn_mask (torch.BoolTensor): mask of self attention
- Returns
(Tensor, Tensor)
outputs (torch.FloatTensor): output of transformer encoder layer
attn (torch.FloatTensor): attention of transformer encoder layer
forward(inputs: torch.Tensor, self_attn_mask: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor]
Forward propagation of the transformer encoder layer.
- Inputs:
inputs (torch.FloatTensor): input sequence of the transformer encoder layer
self_attn_mask (torch.BoolTensor): mask of self attention
- Returns
outputs (torch.FloatTensor): output of the transformer encoder layer
attn (torch.FloatTensor): attention of the transformer encoder layer
- Return type
(Tensor, Tensor)
Transformer Transducer Encoder
class openspeech.encoders.transformer_transducer_encoder.TransformerTransducerEncoder(input_size: int = 80, model_dim: int = 512, d_ff: int = 2048, num_layers: int = 18, num_heads: int = 8, dropout: float = 0.1, max_positional_length: int = 5000)
Converts the audio signal into higher-level feature representations.
- Parameters
input_size (int) – dimension of input vector (default : 80)
model_dim (int) – the number of features in the audio encoder (default : 512)
d_ff (int) – the number of features in the feed forward layers (default : 2048)
num_layers (int) – the number of audio encoder layers (default: 18)
num_heads (int) – the number of heads in the multi-head attention (default: 8)
dropout (float) – dropout probability of audio encoder (default: 0.1)
max_positional_length (int) – Maximum length to use for positional encoding (default : 5000)
- Inputs: inputs, inputs_lens
inputs: Parsed audio batch
inputs_lens: Tensor of sequence lengths
- Returns
outputs (torch.FloatTensor): The output sequence of the audio encoder. FloatTensor of size (batch, seq_length, dimension)
input_lengths (torch.LongTensor): The lengths of the input tensor. (batch)
- Return type
(Tensor, Tensor)
- Reference:
Qian Zhang et al.: Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss https://arxiv.org/abs/2002.02562
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
Forward propagate inputs for the audio encoder.
- Parameters
inputs (torch.FloatTensor) – An input sequence passed to the audio encoder. Typically this will be a padded FloatTensor of size (batch, seq_length, dimension)
input_lengths (torch.LongTensor) – The length of each input sequence. (batch)
- Returns
outputs (torch.FloatTensor): The output sequence of the audio encoder. FloatTensor of size (batch, seq_length, dimension)
input_lengths (torch.LongTensor): The lengths of the input tensor. (batch)
- Return type
(Tensor, Tensor)
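- Example:
A minimal usage sketch following the signatures above. Sizes and the reduced model_dim/num_layers are illustrative assumptions; two tensors are returned, as documented for forward() above.
import torch
from openspeech.encoders.transformer_transducer_encoder import TransformerTransducerEncoder

batch_size, seq_length, input_size = 3, 200, 80  # hypothetical sizes

encoder = TransformerTransducerEncoder(input_size=input_size, model_dim=256, num_layers=2)

inputs = torch.randn(batch_size, seq_length, input_size)                  # (batch, seq_length, dimension)
input_lengths = torch.full((batch_size,), seq_length, dtype=torch.long)   # (batch)

outputs, output_lengths = encoder(inputs, input_lengths)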
class openspeech.encoders.transformer_transducer_encoder.TransformerTransducerEncoderLayer(model_dim: int = 512, d_ff: int = 2048, num_heads: int = 8, dropout: float = 0.1)
Repeated layers common to audio encoders and label encoders.
- Parameters
model_dim (int) – the number of features in the encoder (default : 512)
d_ff (int) – the number of features in the feed forward layers (default : 2048)
num_heads (int) – the number of heads in the multi-head attention (default: 8)
dropout (float) – dropout probability of encoder layer (default: 0.1)
- Inputs: inputs, self_attn_mask
inputs: Audio feature or label feature
self_attn_mask: Self attention mask to use in multi-head attention
- Returns: outputs, attn_distribution
(Tensor, Tensor)
outputs (torch.FloatTensor): Tensor containing higher (audio, label) feature values
attn_distribution (torch.FloatTensor): Attention distribution in multi-head attention
forward(inputs: torch.Tensor, self_attn_mask: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor]
Forward propagate inputs for the label encoder.
- Parameters
inputs – An input sequence passed to the encoder layer. (batch, seq_length, dimension)
self_attn_mask – Self-attention mask to cover up padding. (batch, seq_length, seq_length)
- Returns
outputs (Tensor): Tensor containing higher (audio, label) feature values. (batch, seq_length, dimension)
attn_distribution (Tensor): Attention distribution of the multi-head attention. (batch, seq_length, seq_length)
- Return type
(Tensor, Tensor)