Add Normalization
(sublayer: torch.nn.modules.module.Module, d_model: int = 512)
Add & Normalization layer proposed in “Attention Is All You Need”. The Transformer employs a residual connection around each of the two sub-layers (Multi-Head Attention and Feed-Forward), followed by layer normalization.
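The residual-then-normalize pattern described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `layer_norm`, `add_norm`, and the `double` stand-in sub-layer are hypothetical names, and the learnable gain and bias of a real layer normalization are omitted.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_norm(x, sublayer):
    """Residual connection around `sublayer`, followed by layer normalization."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


# Hypothetical stand-in for an attention or feed-forward sub-layer.
double = lambda x: [2.0 * v for v in x]
out = add_norm([1.0, 2.0, 3.0, 4.0], double)
```

The output has the same length as the input, which is what allows the residual sum in the first place.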
Additive Attention
(dim: int)
Applies an additive (Bahdanau) attention mechanism to the output features from the decoder. Additive attention was proposed in the paper “Neural Machine Translation by Jointly Learning to Align and Translate”.
- Parameters
dim (int) – dimension of model
- Inputs: query, key, value
query (batch_size, q_len, hidden_dim): tensor containing the output features from the decoder.
key (batch, k_len, d_model): tensor containing projection vector for encoders.
value (batch_size, v_len, hidden_dim): tensor containing features of the encoded input sequence.
- Returns: context, attn
context: tensor containing the context vector from attention mechanism.
attn: tensor containing the alignment from the encoder outputs.
BatchNorm ReLU RNN
(input_size: int, hidden_state_dim: int = 512, rnn_type: str = 'gru', bidirectional: bool = True, dropout_p: float = 0.1)
Recurrent neural network with batch normalization layer & ReLU activation function.
- Parameters
input_size (int) – size of input
hidden_state_dim (int) – the number of features in the hidden state h
rnn_type (str, optional) – type of RNN cell (default: gru)
bidirectional (bool, optional) – if True, the RNN is bidirectional (default: True)
dropout_p (float, optional) – dropout probability (default: 0.1)
- Inputs: inputs, input_lengths
inputs (batch, time, dim): Tensor containing input vectors
input_lengths: Tensor containing sequence lengths
- Returns: outputs
outputs: Tensor produced by the BNReluRNN module
Conformer Attention Module
(d_model: int, num_heads: int, dropout_p: float = 0.1)
Conformer employs multi-headed self-attention (MHSA) while integrating an important technique from Transformer-XL, the relative sinusoidal positional encoding scheme. The relative positional encoding allows the self-attention module to generalize better to different input lengths, and the resulting encoder is more robust to variance in utterance length. Conformer uses pre-norm residual units with dropout, which helps in training and regularizing deeper models.
- Parameters
- Inputs: inputs, mask
inputs (batch, time, dim): Tensor containing input vector
mask (batch, 1, time2) or (batch, time1, time2): Tensor containing indices to be masked
- Returns
Tensor produced by the relative multi-headed self-attention module.
- Return type
outputs (batch, time, dim)
(inputs: torch.Tensor, mask: Optional[torch.Tensor] = None) → torch.Tensor
Forward propagation of the Conformer’s multi-headed self-attention module.
Conformer Block
(encoder_dim: int = 512, num_attention_heads: int = 8, feed_forward_expansion_factor: int = 4, conv_expansion_factor: int = 2, feed_forward_dropout_p: float = 0.1, attention_dropout_p: float = 0.1, conv_dropout_p: float = 0.1, conv_kernel_size: int = 31, half_step_residual: bool = True)
A Conformer block contains two Feed Forward modules sandwiching the Multi-Headed Self-Attention module and the Convolution module. This sandwich structure is inspired by Macaron-Net, which proposes replacing the original feed-forward layer in the Transformer block with two half-step feed-forward layers, one before the attention layer and one after.
- Parameters
encoder_dim (int, optional) – Dimension of conformer encoders
num_attention_heads (int, optional) – Number of attention heads
feed_forward_expansion_factor (int, optional) – Expansion factor of feed forward module
conv_expansion_factor (int, optional) – Expansion factor of conformer convolution module
feed_forward_dropout_p (float, optional) – Probability of feed forward module dropout
attention_dropout_p (float, optional) – Probability of attention module dropout
conv_dropout_p (float, optional) – Probability of conformer convolution module dropout
conv_kernel_size (int or tuple, optional) – Size of the convolving kernel
half_step_residual (bool) – Flag indicating whether to use half-step residuals
- Inputs: inputs
inputs (batch, time, dim): Tensor containing input vector
- Returns: outputs
outputs (batch, time, dim): Tensor produced by the Conformer block.
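The sandwich structure and the effect of `half_step_residual` can be sketched as follows. This is an illustrative outline under stated assumptions, not the module's code: the function name `conformer_block` and the callables passed to it are hypothetical, each sub-module is treated as an opaque function on a single feature vector, and the closing layer normalization follows the Conformer paper rather than anything stated in this reference.

```python
def conformer_block(x, ffn1, attention, conv, ffn2, layer_norm, half_step_residual=True):
    """Apply the Macaron-style sandwich to one feature vector x (a list of floats)."""
    factor = 0.5 if half_step_residual else 1.0

    def residual(inputs, module, scale=1.0):
        # inputs + scale * module(inputs), element by element
        return [a + scale * b for a, b in zip(inputs, module(inputs))]

    x = residual(x, ffn1, factor)   # first feed-forward module
    x = residual(x, attention)      # multi-headed self-attention module
    x = residual(x, conv)           # convolution module
    x = residual(x, ffn2, factor)   # second feed-forward module
    return layer_norm(x)            # final normalization (assumed, per the paper)


# Hypothetical stand-ins: every sub-module simply returns a copy of its input.
identity = lambda v: list(v)
out = conformer_block([1.0, 2.0], identity, identity, identity, identity, identity)
print(out)  # [9.0, 18.0]
```

With identity stand-ins the half-step factor is easy to trace: the input is scaled by 1.5, 2, 2, and 1.5 in turn, giving a factor of 9.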
Conformer Convolution Module
(in_channels: int, kernel_size: int = 31, expansion_factor: int = 2, dropout_p: float = 0.1)
The Conformer convolution module starts with a pointwise convolution and a gated linear unit (GLU), followed by a single 1-D depthwise convolution layer. Batch normalization is applied just after the convolution to aid in training deep models.
- Parameters
- Inputs: inputs
inputs (batch, time, dim): Tensor containing input sequences
- Outputs: outputs
outputs (batch, time, dim): Tensor produced by the Conformer convolution module.
(inputs: torch.Tensor) → torch.Tensor
Forward propagation of the Conformer’s convolution module.
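The gated linear unit mentioned above halves the channel dimension, which is why a pointwise convolution with `expansion_factor = 2` can precede it without changing the module's width. The sketch below assumes the usual GLU definition (first half multiplied by the sigmoid of the second half, as in `torch.nn.GLU`); the helper name `glu` is hypothetical.

```python
import math


def glu(channels):
    """Gated linear unit over one frame: split channels in half, gate a with sigmoid(b)."""
    half = len(channels) // 2
    a, b = channels[:half], channels[half:]
    return [ai * (1.0 / (1.0 + math.exp(-bi))) for ai, bi in zip(a, b)]


# Four expanded channels collapse back to two after gating.
gated = glu([1.0, 2.0, 0.0, 0.0])
print(gated)  # [0.5, 1.0]
```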
Conformer Feed-Forward Module
(encoder_dim: int = 512, expansion_factor: int = 4, dropout_p: float = 0.1)
The Conformer feed-forward module follows pre-norm residual units and applies layer normalization within the residual unit and on the input before the first linear layer. The module also applies Swish activation and dropout, which help regularize the network.
- Parameters
- Inputs: inputs
inputs (batch, time, dim): Tensor containing input sequences
- Outputs: outputs
outputs (batch, time, dim): Tensor produced by the feed-forward module.
(inputs: torch.Tensor) → torch.Tensor
Forward propagation of the Conformer’s feed-forward module.
Conv2d Extractor
(input_dim: int, activation: str = 'hardtanh')
Provides the interface for convolutional extractors.
Do not use this class directly; use one of its subclasses. Subclasses must define the ‘self.conv’ attribute.
- Inputs: inputs, input_lengths
inputs (batch, time, dim): Tensor containing input vectors
input_lengths: Tensor containing sequence lengths
- Returns: outputs, output_lengths
outputs: Tensor produced by the convolution
output_lengths: Tensor containing sequence lengths produced by the convolution
(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
inputs: torch.FloatTensor (batch, time, dimension)
input_lengths: torch.IntTensor (batch)
Conv2d Subsampling
(input_dim: int, in_channels: int, out_channels: int, activation: str = 'relu')
Convolutional 2D subsampling (to 1/4 length).
- Parameters
- Inputs: inputs, input_lengths
inputs (batch, time, dim): Tensor containing sequence of inputs
input_lengths (batch): list of sequence input lengths
- Returns: outputs, output_lengths
outputs (batch, time, dim): Tensor produced by the convolution
output_lengths (batch): list of sequence output lengths
(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
inputs: torch.FloatTensor (batch, time, dimension)
input_lengths: torch.IntTensor (batch)
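To see how subsampling to roughly 1/4 length affects `output_lengths`, the standard convolution output-length formula can be applied once per layer. The sketch below is illustrative only: it assumes two stacked convolutions with kernel size 3, stride 2, and no padding, which this reference does not confirm, and the function names are hypothetical.

```python
def conv_output_length(length, kernel_size, stride, padding=0, dilation=1):
    """Output length of one convolution along the time axis (floor formula)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def subsampled_lengths(input_lengths, kernel_size=3, stride=2, num_layers=2):
    """Apply the length formula once per (assumed) convolution layer."""
    output_lengths = []
    for length in input_lengths:
        for _ in range(num_layers):
            length = conv_output_length(length, kernel_size, stride)
        output_lengths.append(length)
    return output_lengths


print(subsampled_lengths([100, 83, 64]))  # [24, 20, 15]
```

Under these assumptions each length lands slightly below a quarter of the input length, because the unpadded kernels trim the edges.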
Conv Group Shuffle
DeepSpeech2 Extractor
(input_dim: int, in_channels: int = 1, out_channels: int = 32, activation: str = 'hardtanh')
DeepSpeech2 extractor for automatic speech recognition, described in the paper “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin”.
- Parameters
- Inputs: inputs, input_lengths
inputs (batch, time, dim): Tensor containing input vectors
input_lengths: Tensor containing sequence lengths
- Returns: outputs, output_lengths
outputs: Tensor produced by the convolution
output_lengths: Tensor containing sequence lengths produced by the convolution
(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
inputs: torch.FloatTensor (batch, time, dimension)
input_lengths: torch.IntTensor (batch)
Depthwise Conv1d
(in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, padding: int = 0, bias: bool = False)
When groups == in_channels and out_channels == K * in_channels, where K is a positive integer, this operation is known in the literature as a depthwise convolution.
- Parameters
in_channels (int) – Number of channels in the input
out_channels (int) – Number of channels produced by the convolution
kernel_size (int) – Size of the convolving kernel
stride (int, optional) – Stride of the convolution. Default: 1
padding (int or tuple, optional) – Zero-padding added to both sides of the input. Default: 0
bias (bool, optional) – If True, adds a learnable bias to the output. Default: False
- Inputs: inputs
inputs (batch, in_channels, time): Tensor containing input vector
- Returns: outputs
outputs (batch, out_channels, time): Tensor produced by the depthwise 1-D convolution.
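The defining property of a depthwise convolution is that each input channel is filtered on its own, with no mixing across channels. A minimal pure-Python sketch for a single example, assuming no padding and no bias; `depthwise_conv1d` and its nested-list layout are hypothetical choices made for illustration:

```python
def depthwise_conv1d(inputs, weights, stride=1):
    """Depthwise 1-D convolution (cross-correlation) for one example.

    inputs:  list of in_channels rows, each a list over time.
    weights: for each input channel, a list of K filters; each filter is a list of taps.
    Returns K * in_channels output rows.
    """
    outputs = []
    for channel, filters in zip(inputs, weights):
        for taps in filters:
            k = len(taps)
            outputs.append([
                sum(channel[t + j] * taps[j] for j in range(k))
                for t in range(0, len(channel) - k + 1, stride)
            ])
    return outputs


signal = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
filters = [[[1.0, 1.0]], [[1.0, -1.0]]]  # K = 1 filter per channel, kernel size 2
result = depthwise_conv1d(signal, filters)
print(result)  # [[3.0, 5.0, 7.0], [-10.0, -10.0, -10.0]]
```

Each output row depends on exactly one input row, so the layer needs only out_channels × kernel_size weights.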
Dot-product Attention
(dim: int, scale: bool = True)
Scaled Dot-Product Attention proposed in “Attention Is All You Need”. Computes the dot products of the query with all keys, divides each by sqrt(dim), and applies a softmax function to obtain the weights on the values.
- Args: dim, mask
dim (int): dimension of attention
mask (torch.Tensor): tensor containing indices to be masked
- Inputs: query, key, value, mask
query (batch, q_len, d_model): tensor containing projection vector for the decoder.
key (batch, k_len, d_model): tensor containing projection vector for the encoder.
value (batch, v_len, d_model): tensor containing features of the encoded input sequence.
mask (-): tensor containing indices to be masked
- Returns: context, attn
context: tensor containing the context vector from attention mechanism.
attn: tensor containing the attention (alignment) from the encoder outputs.
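The computation described above (dot products, division by sqrt(dim), softmax, weighted sum of values) can be written out for a single unbatched example. This is a sketch under simplifying assumptions: nested Python lists stand in for tensors, masking is omitted, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(query, key, value, scale=True):
    """query: (q_len, dim), key: (k_len, dim), value: (k_len, dim_v) as nested lists."""
    dim = len(query[0])
    context, attn = [], []
    for q in query:
        scores = [sum(a * b for a, b in zip(q, k)) for k in key]
        if scale:
            scores = [s / math.sqrt(dim) for s in scores]
        weights = softmax(scores)
        attn.append(weights)
        context.append([
            sum(w * v[d] for w, v in zip(weights, value))
            for d in range(len(value[0]))
        ])
    return context, attn


context, attn = scaled_dot_product_attention(
    query=[[0.0, 0.0]],
    key=[[1.0, 0.0], [0.0, 1.0]],
    value=[[1.0, 2.0], [3.0, 4.0]],
)
print(context, attn)  # [[2.0, 3.0]] [[0.5, 0.5]]
```

A zero query scores every key equally, so the context is simply the mean of the value vectors.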
Jasper Block
(num_sub_blocks: int, in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, dilation: int = 1, bias: bool = True, dropout_p: float = 0.2, activation: str = 'relu')
A Jasper block consists of R Jasper sub-blocks.
- Parameters
num_sub_blocks (int) – number of sub-blocks
in_channels (int) – number of channels in the input feature
out_channels (int) – number of channels produced by the convolution
kernel_size (int) – size of the convolving kernel
stride (int) – stride of the convolution. (default: 1)
dilation (int) – spacing between kernel elements. (default: 1)
bias (bool) – if True, adds a learnable bias to the output. (default: True)
dropout_p (float) – probability of dropout
activation (str) – activation function
- Inputs: inputs, input_lengths, residual
inputs: tensor containing the input sequence vector
input_lengths: tensor containing the sequence lengths
residual: tensor containing the residual vector
- Returns: output, output_lengths
(torch.FloatTensor, torch.LongTensor)
output (torch.FloatTensor): tensor containing the output sequence vector
output_lengths (torch.LongTensor): tensor containing the output sequence lengths
(inputs: torch.Tensor, input_lengths: torch.Tensor, residual: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
Forward propagation of the Jasper block.
Jasper Sub Block
(in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, dilation: int = 1, padding: int = 0, bias: bool = False, dropout_p: float = 0.2, activation: str = 'relu')
Jasper sub-block applies the following operations: a 1D-convolution, batch norm, ReLU, and dropout.
- Parameters
in_channels (int) – number of channels in the input feature
out_channels (int) – number of channels produced by the convolution
kernel_size (int) – size of the convolving kernel
stride (int) – stride of the convolution. (default: 1)
dilation (int) – spacing between kernel elements. (default: 1)
padding (int) – zero-padding added to both sides of the input. (default: 0)
bias (bool) – if True, adds a learnable bias to the output. (default: False)
dropout_p (float) – probability of dropout
activation (str) – activation function
- Inputs: inputs, input_lengths, residual
inputs: tensor containing the input sequence vector
input_lengths: tensor containing the sequence lengths
residual: tensor containing the residual vector
- Returns: output, output_lengths
output (torch.FloatTensor): tensor containing the output sequence vector
output_lengths (torch.LongTensor): tensor containing the output sequence lengths
(inputs: torch.Tensor, input_lengths: torch.Tensor, residual: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor]
Forward propagation of the Jasper sub-block.
Location Aware Attention
(dim: int = 1024, attn_dim: int = 1024, smoothing: bool = False)
Applies a location-aware attention mechanism to the output features from the decoder. Location-aware attention was proposed in the paper “Attention-Based Models for Speech Recognition” and performs well in speech recognition tasks. The implementation follows the ClovaCall attention style.
- Parameters
- Inputs: query, value, last_attn
query (batch, q_len, hidden_dim): tensor containing the output features from the decoder.
value (batch, v_len, hidden_dim): tensor containing features of the encoded input sequence.
last_attn (batch_size, v_len): tensor containing the previous timestep’s attention (alignment)
- Returns: output, attn
output (batch, output_len, dimensions): tensor containing the features from the encoder outputs
attn (batch * num_heads, v_len): tensor containing the attention (alignment) from the encoder outputs.
- Reference:
Jan Chorowski et al.: Attention-Based Models for Speech Recognition.
Mask Conv1d
(in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, padding: int = 0, dilation: int = 1, groups: int = 1, bias: bool = False)
1D convolution with masking.
- Parameters
in_channels (int) – Number of channels in the input vector
out_channels (int) – Number of channels produced by the convolution
kernel_size (int) – Size of the convolving kernel
stride (int) – Stride of the convolution. Default: 1
padding (int) – Zero-padding added to both sides of the input. Default: 0
dilation (int) – Spacing between kernel elements. Default: 1
groups (int) – Number of blocked connections from input channels to output channels. Default: 1
bias (bool) – If True, adds a learnable bias to the output. Default: False
- Inputs: inputs, seq_lengths
inputs (torch.FloatTensor): The input of size (batch, dimension, time)
seq_lengths (torch.IntTensor): The actual length of each sequence in the batch
- Returns: output, seq_lengths
output: Masked output from the conv1d
seq_lengths: Sequence length of output from the conv1d
(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
inputs: (batch, dimension, time)
input_lengths: (batch)
Mask Conv2d
(sequential: torch.nn.modules.container.Sequential)
Masking Convolutional Neural Network
Adds padding to the output of the module based on the given lengths. This is to ensure that the results of the model do not change when batch sizes change during inference. Input needs to be in the shape of (batch_size, channel, hidden_dim, seq_len)
Refer to Copyright (c) 2017 Sean Naren MIT License
- Parameters
sequential (torch.nn.Sequential) – sequential container of convolution layers
- Inputs: inputs, seq_lengths
inputs (torch.FloatTensor): The input of size BxCxHxT
seq_lengths (torch.IntTensor): The actual length of each sequence in the batch
- Returns: output, seq_lengths
output: Masked output from the sequential
seq_lengths: Sequence length of output from the sequential
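The masking step itself amounts to zeroing every time frame at or beyond a sequence's true length, so that padded frames cannot influence later layers regardless of how the batch was padded. A minimal sketch of that one step, using nested lists in the (batch, channel, hidden_dim, seq_len) layout; the function name `mask_output` is hypothetical and the convolution layers are left out:

```python
def mask_output(outputs, seq_lengths):
    """Zero all frames t >= length for each sample in a (B, C, H, T) nested list."""
    return [
        [
            [[v if t < length else 0.0 for t, v in enumerate(row)] for row in channel]
            for channel in sample
        ]
        for sample, length in zip(outputs, seq_lengths)
    ]


batch = [
    [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]],  # sample 0: C = 1, H = 2, T = 3
    [[[7.0, 8.0, 9.0], [1.0, 1.0, 1.0]]],  # sample 1
]
masked = mask_output(batch, seq_lengths=[3, 1])
print(masked[1])  # [[[7.0, 0.0, 0.0], [1.0, 0.0, 0.0]]]
```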
Multi-Head Attention
(dim: int = 512, num_heads: int = 8)
Multi-Head Attention proposed in “Attention Is All You Need”. Instead of performing a single attention function with d_model-dimensional keys, values, and queries, it projects the queries, keys, and values h times with different, learned linear projections to d_head dimensions. Attention is applied to each projected version in parallel, and the results are concatenated and projected once again, resulting in the final values. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
- MultiHead(Q, K, V) = Concat(head_1, …, head_h) · W_o
where head_i = Attention(Q · W_q, K · W_k, V · W_v)
- Parameters
- Inputs: query, key, value, mask
query (batch, q_len, d_model): tensor containing projection vector for the decoder.
key (batch, k_len, d_model): tensor containing projection vector for the encoder.
value (batch, v_len, d_model): tensor containing features of the encoded input sequence.
mask (-): tensor containing indices to be masked
- Returns: output, attn
output (batch, output_len, dimensions): tensor containing the attended output features.
attn (batch * num_heads, v_len): tensor containing the attention (alignment) from the encoder outputs.
Pointwise Conv1d
(in_channels: int, out_channels: int, stride: int = 1, padding: int = 0, bias: bool = True)
A 1-D convolution with kernel size 1 is known in the literature as a pointwise convolution. It is often used to match dimensions.
- Parameters
in_channels (int) – Number of channels in the input
out_channels (int) – Number of channels produced by the convolution
stride (int, optional) – Stride of the convolution. Default: 1
padding (int or tuple, optional) – Zero-padding added to both sides of the input. Default: 0
bias (bool, optional) – If True, adds a learnable bias to the output. Default: True
- Inputs: inputs
inputs (batch, in_channels, time): Tensor containing input vector
- Returns: outputs
outputs (batch, out_channels, time): Tensor produced by the pointwise 1-D convolution.
Positional Encoding
(d_model: int = 512, max_len: int = 5000)
Positional Encoding proposed in “Attention Is All You Need”. Since the Transformer contains no recurrence and no convolution, positional information must be added for the model to make use of the order of the sequence.
- “Attention Is All You Need” uses sine and cosine functions of different frequencies:
PE_(pos, 2i) = sin(pos / power(10000, 2i / d_model))
PE_(pos, 2i+1) = cos(pos / power(10000, 2i / d_model))
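The two formulas translate directly into a table with one row per position. The sketch below is a plain-Python illustration rather than the module's implementation; `positional_encoding` is a hypothetical name, and the loop variable `even` plays the role of 2i.

```python
import math


def positional_encoding(max_len, d_model):
    """Build a (max_len, d_model) table of sinusoidal position encodings."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for even in range(0, d_model, 2):  # `even` is 2i in the formulas
            angle = pos / math.pow(10000, even / d_model)
            table[pos][even] = math.sin(angle)
            if even + 1 < d_model:
                table[pos][even + 1] = math.cos(angle)
    return table


pe = positional_encoding(max_len=3, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
```

Position 0 always encodes to alternating 0 and 1, since sin(0) = 0 and cos(0) = 1.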
Position-wise Feed-Forward
(d_model: int = 512, d_ff: int = 2048, dropout_p: float = 0.3)
Position-wise feed-forward network proposed in “Attention Is All You Need”. A fully connected feed-forward network that is applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between. Another way of describing this is as two convolutions with kernel size 1.
QuartzNet Block
(num_sub_blocks: int, in_channels: int, out_channels: int, kernel_size: int, bias: bool = True)
QuartzNet’s design is based on the Jasper architecture, which is a convolutional model trained with Connectionist Temporal Classification (CTC) loss. The main novelty in QuartzNet’s architecture is that QuartzNet replaced the 1D convolutions with 1D time-channel separable convolutions, an implementation of depthwise separable convolutions.
- Inputs: inputs, input_lengths
inputs (torch.FloatTensor): tensor containing the input sequence vector
input_lengths (torch.LongTensor): tensor containing the sequence lengths
- Returns: output, output_lengths
(torch.FloatTensor, torch.LongTensor)
output (torch.FloatTensor): tensor containing the output sequence vector
output_lengths (torch.LongTensor): tensor containing the output sequence lengths
(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
Forward propagation of the QuartzNet block.
QuartzNet Sub Block
(in_channels: int, out_channels: int, kernel_size: int, bias: bool = False, padding: int = 0, groups: int = 1)
QuartzNet sub-block applies the following operations: a 1D-convolution, batch norm, ReLU, and dropout.
- Parameters
in_channels (int) – number of channels in the input feature
out_channels (int) – number of channels produced by the convolution
kernel_size (int) – size of the convolving kernel
padding (int) – zero-padding added to both sides of the input. (default: 0)
bias (bool) – if True, adds a learnable bias to the output. (default: False)
- Inputs: inputs, input_lengths, residual
inputs: tensor containing the input sequence vector
input_lengths: tensor containing the sequence lengths
residual: tensor containing the residual vector
- Returns: output, output_lengths
output (torch.FloatTensor): tensor containing the output sequence vector
output_lengths (torch.LongTensor): tensor containing the output sequence lengths
Relative Multi-Head Attention
(dim: int = 512, num_heads: int = 16, dropout_p: float = 0.1)
Multi-head attention with relative positional encoding. This concept was proposed in “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context”.
- Parameters
- Inputs: query, key, value, pos_embedding, mask
query (batch, time, dim): Tensor containing query vector
key (batch, time, dim): Tensor containing key vector
value (batch, time, dim): Tensor containing value vector
pos_embedding (batch, time, dim): Positional embedding tensor
mask (batch, 1, time2) or (batch, time1, time2): Tensor containing indices to be masked
- Returns
Tensor produced by the relative multi-head attention module.
- Return type
Residual Connection Module
Time-Channel Separable Conv1d
(in_channels: int, out_channels: int, kernel_size: int = 1, padding: int = 0, groups: int = 1, bias: bool = True)
A time-channel separable convolution block has K × c_in + c_in × c_out weights in total. Since K is generally several times smaller than c_out, most weights are concentrated in the pointwise convolution part.
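A quick count makes the claim about weight concentration concrete. The numbers below (kernel size 33, 256 channels in and out) are illustrative values chosen for this sketch, not settings taken from this reference, and biases are ignored.

```python
def separable_conv_weights(kernel_size, in_channels, out_channels):
    """Weights in the depthwise part (K * c_in) and the pointwise part (c_in * c_out)."""
    depthwise = kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise, pointwise


def regular_conv_weights(kernel_size, in_channels, out_channels):
    """Weights in an ordinary 1-D convolution with the same shape."""
    return kernel_size * in_channels * out_channels


depthwise, pointwise = separable_conv_weights(33, 256, 256)
total = depthwise + pointwise
print(depthwise, pointwise, total)        # 8448 65536 73984
print(regular_conv_weights(33, 256, 256))  # 2162688
print(round(pointwise / total, 3))         # 0.886
```

Here the pointwise part holds about 89% of the weights, and the separable block uses roughly 29 times fewer weights than the ordinary convolution.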
Transformer Embedding
(num_embeddings: int, pad_id: int, d_model: int = 512)
Embedding layer. Similarly to other sequence transduction models, the Transformer uses learned embeddings to convert the input tokens and output tokens to vectors of dimension d_model. In the embedding layers, the Transformer multiplies those weights by sqrt(d_model).
- Parameters
- Inputs:
inputs (torch.FloatTensor): input of embedding layer
- Returns
output of embedding layer
- Return type
outputs (torch.FloatTensor)
(inputs: torch.Tensor) → torch.Tensor
Forward propagation of the embedding layer.
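The lookup-and-scale behaviour can be sketched with a small table. This is an illustration under assumptions, not the layer's code: the `embed` helper and the tiny table are hypothetical, and the all-zero row for the padding id mirrors how a padding index is commonly handled rather than anything stated here.

```python
import math


def embed(token_ids, table, d_model):
    """Look up each token's vector and scale it by sqrt(d_model)."""
    scale = math.sqrt(d_model)
    return [[weight * scale for weight in table[token]] for token in token_ids]


# Hypothetical 2-token vocabulary with d_model = 4; row 0 is the padding id.
table = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 3.0, 4.0],
]
vectors = embed([1, 0], table, d_model=4)
print(vectors)  # [[2.0, 4.0, 6.0, 8.0], [0.0, 0.0, 0.0, 0.0]]
```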
VGG Extractor
(input_dim: int, in_channels: int = 1, out_channels: int = (64, 128), activation: str = 'hardtanh')
VGG extractor for automatic speech recognition, described in the paper “Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM”.
- Parameters
- Inputs: inputs, input_lengths
inputs (batch, time, dim): Tensor containing input vectors
input_lengths: Tensor containing sequence lengths
- Returns: outputs, output_lengths
outputs: Tensor produced by the convolution
output_lengths: Tensor containing sequence lengths produced by the convolution
(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
inputs: torch.FloatTensor (batch, time, dimension)
input_lengths: torch.IntTensor (batch)
(in_features: int, out_features: int, bias: bool = True)
Wrapper class of torch.nn.Linear. Weights are initialized with Xavier initialization and biases are initialized to zeros.