Modules

Add Normalization

class openspeech.modules.add_normalization.AddNorm(sublayer: torch.nn.modules.module.Module, d_model: int = 512)[source]

Add & Normalization layer proposed in “Attention Is All You Need”. The Transformer employs a residual connection around each of the two sub-layers (multi-head attention and feed-forward), followed by layer normalization.

Additive Attention

class openspeech.modules.additive_attention.AdditiveAttention(dim: int)[source]

Applies an additive (Bahdanau) attention mechanism to the output features from the decoders. Additive attention was proposed in the “Neural Machine Translation by Jointly Learning to Align and Translate” paper.

Parameters

dim (int) – dimension of model

Inputs: query, key, value
  • query (batch_size, q_len, hidden_dim): tensor containing the output features from the decoders.

  • key (batch, k_len, d_model): tensor containing projection vector for encoders.

  • value (batch_size, v_len, hidden_dim): tensor containing features of the encoded input sequence.

Returns: context, attn
  • context: tensor containing the context vector from attention mechanism.

  • attn: tensor containing the alignment from the encoders outputs.
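
Example (a minimal usage sketch; the batch size, lengths, and single-step decoder query are illustrative assumptions):

    import torch
    from openspeech.modules.additive_attention import AdditiveAttention

    dim = 512
    attention = AdditiveAttention(dim=dim)

    query = torch.randn(4, 1, dim)    # (batch_size, q_len, hidden_dim): one decoder step
    key = torch.randn(4, 50, dim)     # (batch, k_len, d_model)
    value = torch.randn(4, 50, dim)   # (batch_size, v_len, hidden_dim)

    context, attn = attention(query, key, value)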

BatchNorm ReLU RNN

class openspeech.modules.batchnorm_relu_rnn.BNReluRNN(input_size: int, hidden_state_dim: int = 512, rnn_type: str = 'gru', bidirectional: bool = True, dropout_p: float = 0.1)[source]

Recurrent neural network with batch normalization layer & ReLU activation function.

Parameters
  • input_size (int) – size of input

  • hidden_state_dim (int) – the number of features in the hidden state h

  • rnn_type (str, optional) – type of RNN cell (default: gru)

  • bidirectional (bool, optional) – if True, becomes a bidirectional encoder (default: True)

  • dropout_p (float, optional) – dropout probability (default: 0.1)

Inputs: inputs, input_lengths
  • inputs (batch, time, dim): Tensor containing input vectors

  • input_lengths: Tensor containing sequence lengths

Returns: outputs
  • outputs: Tensor produced by the BNReluRNN module
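
Example (a minimal usage sketch; the feature size, batch shape, and lengths are illustrative assumptions):

    import torch
    from openspeech.modules.batchnorm_relu_rnn import BNReluRNN

    rnn = BNReluRNN(input_size=80, hidden_state_dim=512, rnn_type='gru',
                    bidirectional=True, dropout_p=0.1)

    inputs = torch.randn(4, 100, 80)                  # (batch, time, dim)
    input_lengths = torch.tensor([100, 90, 80, 70])   # per-utterance lengths

    outputs = rnn(inputs, input_lengths)              # per the Returns section above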

Conformer Attention Module

class openspeech.modules.conformer_attention_module.MultiHeadedSelfAttentionModule(d_model: int, num_heads: int, dropout_p: float = 0.1)[source]

Conformer employs multi-headed self-attention (MHSA) while integrating an important technique from Transformer-XL: the relative sinusoidal positional encoding scheme. Relative positional encoding allows the self-attention module to generalize better to different input lengths, and the resulting encoder is more robust to variance in utterance length. Conformer uses pre-norm residual units with dropout, which helps when training and regularizing deeper models.

Parameters
  • d_model (int) – The dimension of model

  • num_heads (int) – The number of attention heads.

  • dropout_p (float) – probability of dropout

Inputs: inputs, mask
  • inputs (batch, time, dim): Tensor containing input vector

  • mask (batch, 1, time2) or (batch, time1, time2): Tensor containing indices to be masked

Returns

Tensor produced by the relative multi-headed self-attention module.

Return type

  • outputs (batch, time, dim)

forward(inputs: torch.Tensor, mask: Optional[torch.Tensor] = None) → torch.Tensor[source]

Forward propagation of the Conformer multi-headed self-attention module.

Inputs: inputs, mask
  • inputs (batch, time, dim): Tensor containing input vector

  • mask (batch, 1, time2) or (batch, time1, time2): Tensor containing indices to be masked

Returns

Tensor produced by the relative multi-headed self-attention module.

Return type

  • outputs (batch, time, dim)
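
Example (a minimal usage sketch; shapes are illustrative assumptions, and the optional mask is omitted):

    import torch
    from openspeech.modules.conformer_attention_module import MultiHeadedSelfAttentionModule

    mhsa = MultiHeadedSelfAttentionModule(d_model=512, num_heads=8, dropout_p=0.1)

    inputs = torch.randn(4, 100, 512)   # (batch, time, dim)
    outputs = mhsa(inputs)              # (batch, time, dim)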

Conformer Block

class openspeech.modules.conformer_block.ConformerBlock(encoder_dim: int = 512, num_attention_heads: int = 8, feed_forward_expansion_factor: int = 4, conv_expansion_factor: int = 2, feed_forward_dropout_p: float = 0.1, attention_dropout_p: float = 0.1, conv_dropout_p: float = 0.1, conv_kernel_size: int = 31, half_step_residual: bool = True)[source]

The Conformer block contains two feed-forward modules sandwiching the multi-headed self-attention module and the convolution module. This sandwich structure is inspired by Macaron-Net, which proposes replacing the original feed-forward layer in the Transformer block with two half-step feed-forward layers, one before the attention layer and one after.

Parameters
  • encoder_dim (int, optional) – Dimension of conformer encoders

  • num_attention_heads (int, optional) – Number of attention heads

  • feed_forward_expansion_factor (int, optional) – Expansion factor of feed forward module

  • conv_expansion_factor (int, optional) – Expansion factor of conformer convolution module

  • feed_forward_dropout_p (float, optional) – Probability of feed forward module dropout

  • attention_dropout_p (float, optional) – Probability of attention module dropout

  • conv_dropout_p (float, optional) – Probability of conformer convolution module dropout

  • conv_kernel_size (int or tuple, optional) – Size of the convolving kernel

  • half_step_residual (bool) – Flag indicating whether to use half-step residual connections

Inputs: inputs
  • inputs (batch, time, dim): Tensor containing input vector

Returns: outputs
  • outputs (batch, time, dim): Tensor produced by the conformer block.
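
Example (a minimal usage sketch; shapes are illustrative assumptions and the documented defaults are kept):

    import torch
    from openspeech.modules.conformer_block import ConformerBlock

    block = ConformerBlock(encoder_dim=512, num_attention_heads=8)

    inputs = torch.randn(4, 100, 512)   # (batch, time, dim)
    outputs = block(inputs)             # (batch, time, dim)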

Conformer Convolution Module

class openspeech.modules.conformer_convolution_module.ConformerConvModule(in_channels: int, kernel_size: int = 31, expansion_factor: int = 2, dropout_p: float = 0.1)[source]

Conformer convolution module starts with a pointwise convolution and a gated linear unit (GLU). This is followed by a single 1-D depthwise convolution layer. Batchnorm is deployed just after the convolution to aid training deep models.

Parameters
  • in_channels (int) – Number of channels in the input

  • kernel_size (int or tuple, optional) – Size of the convolving kernel. Default: 31

  • dropout_p (float, optional) – probability of dropout

Inputs: inputs

inputs (batch, time, dim): Tensor containing input sequences

Outputs: outputs

outputs (batch, time, dim): Tensor produced by the conformer convolution module.

forward(inputs: torch.Tensor) → torch.Tensor[source]

Forward propagation of the Conformer convolution module.

Inputs: inputs

inputs (batch, time, dim): Tensor containing input sequences

Outputs: outputs

outputs (batch, time, dim): Tensor produced by the conformer convolution module.

Conformer Feed-Forward Module

class openspeech.modules.conformer_feed_forward_module.FeedForwardModule(encoder_dim: int = 512, expansion_factor: int = 4, dropout_p: float = 0.1)[source]

The Conformer feed-forward module follows pre-norm residual units and applies layer normalization within the residual unit and on the input before the first linear layer. The module also applies the Swish activation and dropout, which help regularize the network.

Parameters
  • encoder_dim (int) – Dimension of conformer encoders

  • expansion_factor (int) – Expansion factor of feed forward module.

  • dropout_p (float) – Ratio of dropout

Inputs: inputs
  • inputs (batch, time, dim): Tensor containing input sequences

Outputs: outputs
  • outputs (batch, time, dim): Tensor produced by the feed-forward module.

forward(inputs: torch.Tensor) → torch.Tensor[source]

Forward propagation of the Conformer feed-forward module.

Inputs: inputs
  • inputs (batch, time, dim): Tensor containing input sequences

Outputs: outputs
  • outputs (batch, time, dim): Tensor produced by the feed-forward module.

Conv2d Extractor

class openspeech.modules.conv2d_extractor.Conv2dExtractor(input_dim: int, activation: str = 'hardtanh')[source]

Provides the interface of a convolutional extractor.

Note

Do not use this class directly; use one of its subclasses, which define the ‘self.conv’ class variable.

Inputs: inputs, input_lengths
  • inputs (batch, time, dim): Tensor containing input vectors

  • input_lengths: Tensor containing sequence lengths

Returns: outputs, output_lengths
  • outputs: Tensor produced by the convolution

  • output_lengths: Tensor containing sequence lengths produced by the convolution

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

inputs: torch.FloatTensor (batch, time, dimension)
input_lengths: torch.IntTensor (batch)

Conv2d Subsampling

class openspeech.modules.conv2d_subsampling.Conv2dSubsampling(input_dim: int, in_channels: int, out_channels: int, activation: str = 'relu')[source]

Convolutional 2D subsampling (to 1/4 length)

Parameters
  • input_dim (int) – Dimension of input vector

  • in_channels (int) – Number of channels in the input vector

  • out_channels (int) – Number of channels produced by the convolution

  • activation (str) – Activation function

Inputs: inputs
  • inputs (batch, time, dim): Tensor containing sequence of inputs

  • input_lengths (batch): list of sequence input lengths

Returns: outputs, output_lengths
  • outputs (batch, time, dim): Tensor produced by the convolution

  • output_lengths (batch): list of sequence output lengths

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

inputs: torch.FloatTensor (batch, time, dimension)
input_lengths: torch.IntTensor (batch)
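
Example (a minimal usage sketch; feature size, channel count, and lengths are illustrative assumptions; the time axis is reduced to roughly 1/4 of its input length):

    import torch
    from openspeech.modules.conv2d_subsampling import Conv2dSubsampling

    subsampling = Conv2dSubsampling(input_dim=80, in_channels=1, out_channels=512)

    inputs = torch.randn(4, 128, 80)                    # (batch, time, dim)
    input_lengths = torch.tensor([128, 120, 100, 96])   # per-utterance lengths

    outputs, output_lengths = subsampling(inputs, input_lengths)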

Conv Base

class openspeech.modules.conv_base.BaseConv1d[source]

Base convolution module.

Conv Group Shuffle

class openspeech.modules.conv_group_shuffle.ConvGroupShuffle(groups, channels)[source]

Convolution group shuffle module.

DeepSpeech2 Extractor

class openspeech.modules.deepspeech2_extractor.DeepSpeech2Extractor(input_dim: int, in_channels: int = 1, out_channels: int = 32, activation: str = 'hardtanh')[source]

DeepSpeech2 extractor for automatic speech recognition described in “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin” paper - https://arxiv.org/abs/1512.02595

Parameters
  • input_dim (int) – Dimension of input vector

  • in_channels (int) – Number of channels in the input vector

  • out_channels (int) – Number of channels produced by the convolution

  • activation (str) – Activation function

Inputs: inputs, input_lengths
  • inputs (batch, time, dim): Tensor containing input vectors

  • input_lengths: Tensor containing sequence lengths

Returns: outputs, output_lengths
  • outputs: Tensor produced by the convolution

  • output_lengths: Tensor containing sequence lengths produced by the convolution

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

inputs: torch.FloatTensor (batch, time, dimension)
input_lengths: torch.IntTensor (batch)

Depthwise Conv1d

class openspeech.modules.depthwise_conv1d.DepthwiseConv1d(in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, padding: int = 0, bias: bool = False)[source]

When groups == in_channels and out_channels == K * in_channels, where K is a positive integer, this operation is termed in the literature a depthwise convolution.

Parameters
  • in_channels (int) – Number of channels in the input

  • out_channels (int) – Number of channels produced by the convolution

  • kernel_size (int or tuple) – Size of the convolving kernel

  • stride (int, optional) – Stride of the convolution. Default: 1

  • padding (int or tuple, optional) – Zero-padding added to both sides of the input. Default: 0

  • bias (bool, optional) – If True, adds a learnable bias to the output. Default: False

Inputs: inputs
  • inputs (batch, in_channels, time): Tensor containing input vector

Returns: outputs
  • outputs (batch, out_channels, time): Tensor produced by the depthwise 1-D convolution.
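
In plain PyTorch the same relation can be expressed with torch.nn.Conv1d by setting groups equal to in_channels (an illustrative sketch with assumed sizes, not the library's internal code):

    import torch
    import torch.nn as nn

    in_channels, K = 256, 2                   # K: positive integer channel multiplier
    out_channels = K * in_channels

    depthwise = nn.Conv1d(in_channels, out_channels, kernel_size=31,
                          padding=15, groups=in_channels, bias=False)

    x = torch.randn(4, in_channels, 100)      # (batch, in_channels, time)
    y = depthwise(x)                          # (batch, out_channels, time)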

Dot-product Attention

class openspeech.modules.dot_product_attention.DotProductAttention(dim: int, scale: bool = True)[source]

Scaled Dot-Product Attention proposed in “Attention Is All You Need”. Computes the dot products of the query with all keys, divides each by sqrt(dim), and applies a softmax function to obtain the weights on the values.

Args: dim, mask

dim (int): dimension of attention
mask (torch.Tensor): tensor containing indices to be masked

Inputs: query, key, value, mask
  • query (batch, q_len, d_model): tensor containing projection vector for decoders.

  • key (batch, k_len, d_model): tensor containing projection vector for encoders.

  • value (batch, v_len, d_model): tensor containing features of the encoded input sequence.

  • mask (-): tensor containing indices to be masked

Returns: context, attn
  • context: tensor containing the context vector from attention mechanism.

  • attn: tensor containing the attention (alignment) from the encoders outputs.
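
The computation described above can be sketched directly in plain PyTorch (an illustrative reference implementation, not the library's internal code):

    import torch
    import torch.nn.functional as F

    def scaled_dot_product_attention(query, key, value):
        # score = Q · K^T / sqrt(d_model); softmax over the key dimension
        d_model = query.size(-1)
        score = torch.bmm(query, key.transpose(1, 2)) / d_model ** 0.5
        attn = F.softmax(score, dim=-1)
        context = torch.bmm(attn, value)
        return context, attn

    query = torch.randn(4, 10, 512)    # (batch, q_len, d_model)
    key = torch.randn(4, 50, 512)      # (batch, k_len, d_model)
    value = torch.randn(4, 50, 512)    # (batch, v_len, d_model)
    context, attn = scaled_dot_product_attention(query, key, value)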

GLU

class openspeech.modules.glu.GLU(dim: int)[source]

The gating mechanism is called the Gated Linear Unit (GLU); it was first introduced for natural language processing in the paper “Language Modeling with Gated Convolutional Networks”.

Jasper Block

class openspeech.modules.jasper_block.JasperBlock(num_sub_blocks: int, in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, dilation: int = 1, bias: bool = True, dropout_p: float = 0.2, activation: str = 'relu')[source]

Jasper Block: the Jasper block consists of R Jasper sub-blocks.

Parameters
  • num_sub_blocks (int) – number of sub-blocks

  • in_channels (int) – number of channels in the input feature

  • out_channels (int) – number of channels produced by the convolution

  • kernel_size (int) – size of the convolving kernel

  • stride (int) – stride of the convolution. (default: 1)

  • dilation (int) – spacing between kernel elements. (default: 1)

  • bias (bool) – if True, adds a learnable bias to the output. (default: True)

  • dropout_p (float) – probability of dropout

  • activation (str) – activation function

Inputs: inputs, input_lengths, residual
  • inputs: tensor containing the input sequence vectors

  • input_lengths: tensor containing the sequence lengths

  • residual: tensor containing the residual vector

Returns: output, output_lengths

(torch.FloatTensor, torch.LongTensor)

  • output (torch.FloatTensor): tensor containing the output sequence vectors

  • output_lengths (torch.LongTensor): tensor containing the output sequence lengths

forward(inputs: torch.Tensor, input_lengths: torch.Tensor, residual: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagation of the Jasper block.

Inputs: inputs, input_lengths, residual
  • inputs: tensor containing the input sequence vectors

  • input_lengths: tensor containing the sequence lengths

  • residual: tensor containing the residual vector

Returns: output, output_lengths

(torch.FloatTensor, torch.LongTensor)

  • output (torch.FloatTensor): tensor containing the output sequence vectors

  • output_lengths (torch.LongTensor): tensor containing the output sequence lengths

Jasper Sub Block

class openspeech.modules.jasper_subblock.JasperSubBlock(in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, dilation: int = 1, padding: int = 0, bias: bool = False, dropout_p: float = 0.2, activation: str = 'relu')[source]

Jasper sub-block applies the following operations: a 1D-convolution, batch norm, ReLU, and dropout.

Parameters
  • in_channels (int) – number of channels in the input feature

  • out_channels (int) – number of channels produced by the convolution

  • kernel_size (int) – size of the convolving kernel

  • stride (int) – stride of the convolution. (default: 1)

  • dilation (int) – spacing between kernel elements. (default: 1)

  • padding (int) – zero-padding added to both sides of the input. (default: 0)

  • bias (bool) – if True, adds a learnable bias to the output. (default: False)

  • dropout_p (float) – probability of dropout

  • activation (str) – activation function

Inputs: inputs, input_lengths, residual
  • inputs: tensor containing the input sequence vectors

  • input_lengths: tensor containing the sequence lengths

  • residual: tensor containing the residual vector

Returns: output, output_lengths
  • output (torch.FloatTensor): tensor containing the output sequence vectors

  • output_lengths (torch.LongTensor): tensor containing the output sequence lengths

forward(inputs: torch.Tensor, input_lengths: torch.Tensor, residual: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagation of the Jasper sub-block.

Inputs: inputs, input_lengths, residual
  • inputs: tensor containing the input sequence vectors

  • input_lengths: tensor containing the sequence lengths

  • residual: tensor containing the residual vector

Returns: output, output_lengths
  • output (torch.FloatTensor): tensor containing the output sequence vectors

  • output_lengths (torch.LongTensor): tensor containing the output sequence lengths

Location Aware Attention

class openspeech.modules.location_aware_attention.LocationAwareAttention(dim: int = 1024, attn_dim: int = 1024, smoothing: bool = False)[source]

Applies a location-aware attention mechanism to the output features from the decoders. Location-aware attention was proposed in the “Attention-Based Models for Speech Recognition” paper. The location-aware attention mechanism performs well in speech recognition tasks. This implementation follows the attention style of ClovaCall.

Parameters
  • dim (int) – dimension of model

  • attn_dim (int) – dimension of attention

  • smoothing (bool) – flag indication whether to use smoothing or not.

Inputs: query, value, last_attn
  • query (batch, q_len, hidden_dim): tensor containing the output features from the decoders.

  • value (batch, v_len, hidden_dim): tensor containing features of the encoded input sequence.

  • last_attn (batch_size, v_len): tensor containing the previous timestep's attention (alignment)

Returns: output, attn
  • output (batch, output_len, dimensions): tensor containing the feature from encoders outputs

  • attn (batch * num_heads, v_len): tensor containing the attention (alignment) from the encoders outputs.

Reference:

Jan Chorowski et al.: Attention-Based Models for Speech Recognition. https://arxiv.org/abs/1506.07503

Mask

openspeech.modules.mask.get_attn_pad_mask(inputs, input_lengths, expand_length)[source]

Creates an attention pad mask; masked (padded) positions are set to 1.
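
The idea can be sketched in plain PyTorch as follows (an illustrative stand-in with assumed semantics, not the library's exact implementation):

    import torch

    def pad_mask_sketch(input_lengths, seq_len, expand_length):
        # positions at or beyond each sequence's length are marked (set to 1/True)
        positions = torch.arange(seq_len).unsqueeze(0)              # (1, seq_len)
        pad_mask = positions >= input_lengths.unsqueeze(1)          # (batch, seq_len)
        return pad_mask.unsqueeze(1).expand(-1, expand_length, -1)  # (batch, expand_length, seq_len)

    mask = pad_mask_sketch(torch.tensor([7, 5, 3]), seq_len=8, expand_length=8)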

Mask Conv1d

class openspeech.modules.mask_conv1d.MaskConv1d(in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, padding: int = 0, dilation: int = 1, groups: int = 1, bias: bool = False)[source]

1D convolution with masking

Parameters
  • in_channels (int) – Number of channels in the input vector

  • out_channels (int) – Number of channels produced by the convolution

  • kernel_size (int or tuple) – Size of the convolving kernel

  • stride (int) – Stride of the convolution. Default: 1

  • padding (int) – Zero-padding added to both sides of the input. Default: 0

  • dilation (int) – Spacing between kernel elements. Default: 1

  • groups (int) – Number of blocked connections from input channels to output channels. Default: 1

  • bias (bool) – If True, adds a learnable bias to the output. Default: False

Inputs: inputs, seq_lengths
  • inputs (torch.FloatTensor): The input of size (batch, dimension, time)

  • seq_lengths (torch.IntTensor): The actual length of each sequence in the batch

Returns: output, seq_lengths
  • output: Masked output from the conv1d

  • seq_lengths: Sequence length of output from the conv1d

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

inputs: (batch, dimension, time)
input_lengths: (batch)

Mask Conv2d

class openspeech.modules.mask_conv2d.MaskConv2d(sequential: torch.nn.modules.container.Sequential)[source]

Masking Convolutional Neural Network

Adds padding to the output of the module based on the given lengths. This is to ensure that the results of the model do not change when batch sizes change during inference. Input needs to be in the shape of (batch_size, channel, hidden_dim, seq_len)

Refer to https://github.com/SeanNaren/deepspeech.pytorch/blob/master/model.py Copyright (c) 2017 Sean Naren MIT License

Parameters

sequential (torch.nn.Sequential) – sequential container of convolution layers

Inputs: inputs, seq_lengths
  • inputs (torch.FloatTensor): The input of size BxCxHxT

  • seq_lengths (torch.IntTensor): The actual length of each sequence in the batch

Returns: output, seq_lengths
  • output: Masked output from the sequential

  • seq_lengths: Sequence length of output from the sequential

Multi-Head Attention

class openspeech.modules.multi_head_attention.MultiHeadAttention(dim: int = 512, num_heads: int = 8)[source]

Multi-Head Attention proposed in “Attention Is All You Need” Instead of performing a single attention function with d_model-dimensional keys, values, and queries, project the queries, keys and values h times with different, learned linear projections to d_head dimensions. These are concatenated and once again projected, resulting in the final values. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

MultiHead(Q, K, V) = Concat(head_1, …, head_h) · W_o

where head_i = Attention(Q · W_q, K · W_k, V · W_v)

Parameters
  • dim (int) – The dimension of model (default: 512)

  • num_heads (int) – The number of attention heads. (default: 8)

Inputs: query, key, value, mask
  • query (batch, q_len, d_model): tensor containing projection vector for decoders.

  • key (batch, k_len, d_model): tensor containing projection vector for encoders.

  • value (batch, v_len, d_model): tensor containing features of the encoded input sequence.

  • mask (-): tensor containing indices to be masked

Returns: output, attn
  • output (batch, output_len, dimensions): tensor containing the attended output features.

  • attn (batch * num_heads, v_len): tensor containing the attention (alignment) from the encoders outputs.
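
Example (a minimal usage sketch; shapes are illustrative assumptions, and the mask is omitted on the assumption that it is optional):

    import torch
    from openspeech.modules.multi_head_attention import MultiHeadAttention

    attention = MultiHeadAttention(dim=512, num_heads=8)

    query = torch.randn(4, 10, 512)    # (batch, q_len, d_model)
    key = torch.randn(4, 50, 512)      # (batch, k_len, d_model)
    value = torch.randn(4, 50, 512)    # (batch, v_len, d_model)

    output, attn = attention(query, key, value)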

Pointwise Conv1d

class openspeech.modules.pointwise_conv1d.PointwiseConv1d(in_channels: int, out_channels: int, stride: int = 1, padding: int = 0, bias: bool = True)[source]

A 1-D convolution with kernel size 1 is termed in the literature a pointwise convolution. This operation is often used to match dimensions.

Parameters
  • in_channels (int) – Number of channels in the input

  • out_channels (int) – Number of channels produced by the convolution

  • stride (int, optional) – Stride of the convolution. Default: 1

  • padding (int or tuple, optional) – Zero-padding added to both sides of the input. Default: 0

  • bias (bool, optional) – If True, adds a learnable bias to the output. Default: True

Inputs: inputs
  • inputs (batch, in_channels, time): Tensor containing input vector

Returns: outputs
  • outputs (batch, out_channels, time): Tensor produced by the pointwise 1-D convolution.
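
Example (a minimal usage sketch; channel counts and sequence length are illustrative assumptions):

    import torch
    from openspeech.modules.pointwise_conv1d import PointwiseConv1d

    conv = PointwiseConv1d(in_channels=256, out_channels=512)

    inputs = torch.randn(4, 256, 100)   # (batch, in_channels, time)
    outputs = conv(inputs)              # (batch, out_channels, time)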

Positional Encoding

class openspeech.modules.positional_encoding.PositionalEncoding(d_model: int = 512, max_len: int = 5000)[source]

Positional Encoding proposed in “Attention Is All You Need”. Since the Transformer contains no recurrence and no convolution, some positional information must be added so the model can make use of the order of the sequence.

“Attention Is All You Need” uses sine and cosine functions of different frequencies:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
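
These encodings can be precomputed with the standard recipe below (an illustrative sketch following the formulas above, not necessarily this class's internal code):

    import math
    import torch

    d_model, max_len = 512, 5000
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions: cosine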

Position-wise Feed-Forward

class openspeech.modules.positionwise_feed_forward.PositionwiseFeedForward(d_model: int = 512, d_ff: int = 2048, dropout_p: float = 0.3)[source]

Position-wise Feedforward Networks proposed in “Attention Is All You Need”. Fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between. Another way of describing this is as two convolutions with kernel size 1.

QuartzNet Block

class openspeech.modules.quartznet_block.QuartzNetBlock(num_sub_blocks: int, in_channels: int, out_channels: int, kernel_size: int, bias: bool = True)[source]

QuartzNet’s design is based on the Jasper architecture, which is a convolutional model trained with Connectionist Temporal Classification (CTC) loss. The main novelty in QuartzNet’s architecture is that QuartzNet replaced the 1D convolutions with 1D time-channel separable convolutions, an implementation of depthwise separable convolutions.

Inputs: inputs, input_lengths

inputs (torch.FloatTensor): tensor containing input sequence vectors
input_lengths (torch.LongTensor): tensor containing sequence lengths

Returns: output, output_lengths

(torch.FloatTensor, torch.LongTensor)

  • output (torch.FloatTensor): tensor containing the output sequence vectors

  • output_lengths (torch.LongTensor): tensor containing the output sequence lengths

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagation of the QuartzNet block.

Inputs: inputs, input_lengths

inputs (torch.FloatTensor): tensor containing input sequence vectors
input_lengths (torch.LongTensor): tensor containing sequence lengths

Returns: output, output_lengths

(torch.FloatTensor, torch.LongTensor)

  • output (torch.FloatTensor): tensor containing the output sequence vectors

  • output_lengths (torch.LongTensor): tensor containing the output sequence lengths

QuartzNet Sub Block

class openspeech.modules.quartznet_subblock.QuartzNetSubBlock(in_channels: int, out_channels: int, kernel_size: int, bias: bool = False, padding: int = 0, groups: int = 1)[source]

QuartzNet sub-block applies the following operations: a 1D-convolution, batch norm, ReLU, and dropout.

Parameters
  • in_channels (int) – number of channels in the input feature

  • out_channels (int) – number of channels produced by the convolution

  • kernel_size (int) – size of the convolving kernel

  • padding (int) – zero-padding added to both sides of the input. (default: 0)

  • bias (bool) – if True, adds a learnable bias to the output. (default: False)

Inputs: inputs, input_lengths, residual
  • inputs: tensor containing the input sequence vectors

  • input_lengths: tensor containing the sequence lengths

  • residual: tensor containing the residual vector

Returns: output, output_lengths
  • output (torch.FloatTensor): tensor containing the output sequence vectors

  • output_lengths (torch.LongTensor): tensor containing the output sequence lengths

Relative Multi-Head Attention

class openspeech.modules.relative_multi_head_attention.RelativeMultiHeadAttention(dim: int = 512, num_heads: int = 16, dropout_p: float = 0.1)[source]

Multi-head attention with relative positional encoding. This concept was proposed in “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context”.

Parameters
  • dim (int) – The dimension of model

  • num_heads (int) – The number of attention heads.

  • dropout_p (float) – probability of dropout

Inputs: query, key, value, pos_embedding, mask
  • query (batch, time, dim): Tensor containing query vector

  • key (batch, time, dim): Tensor containing key vector

  • value (batch, time, dim): Tensor containing value vector

  • pos_embedding (batch, time, dim): Positional embedding tensor

  • mask (batch, 1, time2) or (batch, time1, time2): Tensor containing indices to be masked

Returns

Tensor produced by the relative multi-head attention module.

Return type

  • outputs

Residual Connection Module

class openspeech.modules.residual_connection_module.ResidualConnectionModule(module: torch.nn.modules.module.Module, module_factor: float = 1.0, input_factor: float = 1.0)[source]

Residual Connection Module.

outputs = module(inputs) * module_factor + inputs * input_factor
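
The formula corresponds to the following minimal stand-in (illustrative only, not the library's implementation):

    import torch
    import torch.nn as nn

    class ResidualConnectionSketch(nn.Module):
        """outputs = module(inputs) * module_factor + inputs * input_factor"""

        def __init__(self, module: nn.Module, module_factor: float = 1.0, input_factor: float = 1.0):
            super().__init__()
            self.module = module
            self.module_factor = module_factor
            self.input_factor = input_factor

        def forward(self, inputs: torch.Tensor) -> torch.Tensor:
            return self.module(inputs) * self.module_factor + inputs * self.input_factor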

Swish

class openspeech.modules.swish.Swish[source]

Swish is a smooth, non-monotonic function that consistently matches or outperforms ReLU on deep networks applied to a variety of challenging domains such as image classification and machine translation.

Time-Channel Separable Conv1d

class openspeech.modules.time_channel_separable_conv1d.TimeChannelSeparableConv1d(in_channels: int, out_channels: int, kernel_size: int = 1, padding: int = 0, groups: int = 1, bias: bool = True)[source]

The total number of weights for a time-channel separable convolution block is K × c_in + c_in × c_out. Since K is generally several times smaller than c_out, most weights are concentrated in the pointwise convolution part.
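
For example, with illustrative sizes K = 33, c_in = 256, and c_out = 256, a standard 1-D convolution uses K × c_in × c_out = 2,162,688 weights, whereas the time-channel separable version uses K × c_in + c_in × c_out = 8,448 + 65,536 = 73,984 weights, most of them in the pointwise part.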

Transformer Embedding

class openspeech.modules.transformer_embedding.TransformerEmbedding(num_embeddings: int, pad_id: int, d_model: int = 512)[source]

Embedding layer. Similarly to other sequence transduction models, the Transformer uses learned embeddings to convert the input and output tokens to vectors of dimension d_model. In the embedding layers, the weights are multiplied by sqrt(d_model).

Parameters
  • num_embeddings (int) – size of the embedding dictionary (vocabulary size)

  • pad_id (int) – index of the pad token

  • d_model (int) – dimension of model

Inputs:

inputs (torch.LongTensor): input token indices for the embedding layer

Returns

output of embedding layer

Return type

outputs (torch.FloatTensor)

forward(inputs: torch.Tensor) → torch.Tensor[source]

Forward propagation of the embedding layer.

Inputs:

inputs (torch.LongTensor): input token indices for the embedding layer

Returns

output of embedding layer

Return type

outputs (torch.FloatTensor)
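
The scaling described above can be sketched in plain PyTorch (an illustrative stand-in with hypothetical sizes, not the library's internal code):

    import math
    import torch
    import torch.nn as nn

    num_embeddings, pad_id, d_model = 4000, 0, 512
    embedding = nn.Embedding(num_embeddings, d_model, padding_idx=pad_id)

    tokens = torch.randint(1, num_embeddings, (4, 20))   # (batch, seq_len) token indices
    embedded = embedding(tokens) * math.sqrt(d_model)    # scale embeddings by sqrt(d_model)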

VGG Extractor

class openspeech.modules.vgg_extractor.VGGExtractor(input_dim: int, in_channels: int = 1, out_channels: tuple = (64, 128), activation: str = 'hardtanh')[source]

VGG extractor for automatic speech recognition described in “Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM” paper - https://arxiv.org/pdf/1706.02737.pdf

Parameters
  • input_dim (int) – Dimension of input vector

  • in_channels (int) – Number of channels in the input image

  • out_channels (int or tuple) – Number of channels produced by the convolution

  • activation (str) – Activation function

Inputs: inputs, input_lengths
  • inputs (batch, time, dim): Tensor containing input vectors

  • input_lengths: Tensor containing sequence lengths

Returns: outputs, output_lengths
  • outputs: Tensor produced by the convolution

  • output_lengths: Tensor containing sequence lengths produced by the convolution

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

inputs: torch.FloatTensor (batch, time, dimension)
input_lengths: torch.IntTensor (batch)
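
Example (a minimal usage sketch; feature size, batch shape, and lengths are illustrative assumptions, and exact output shapes depend on the convolution stack):

    import torch
    from openspeech.modules.vgg_extractor import VGGExtractor

    extractor = VGGExtractor(input_dim=80)

    inputs = torch.randn(4, 128, 80)                    # (batch, time, dim)
    input_lengths = torch.tensor([128, 120, 100, 96])   # per-utterance lengths

    outputs, output_lengths = extractor(inputs, input_lengths)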

Wrapper

class openspeech.modules.wrapper.Linear(in_features: int, out_features: int, bias: bool = True)[source]

Wrapper class of torch.nn.Linear. Weights are initialized with Xavier initialization and biases are initialized to zeros.

class openspeech.modules.wrapper.Transpose(shape: tuple)[source]

Wrapper class of torch.transpose() for Sequential module.

class openspeech.modules.wrapper.View(shape: tuple, contiguous: bool = False)[source]

Wrapper class of torch.Tensor.view() for Sequential module.