Decoders
Openspeech Decoder
LSTM Attention Decoder
class openspeech.decoders.lstm_attention_decoder.LSTMAttentionDecoder(num_classes: int, max_length: int = 150, hidden_state_dim: int = 1024, pad_id: int = 0, sos_id: int = 1, eos_id: int = 2, attn_mechanism: str = 'multi-head', num_heads: int = 4, num_layers: int = 2, rnn_type: str = 'lstm', dropout_p: float = 0.3)
Converts higher-level features (from encoders) into output utterances by specifying a probability distribution over sequences of characters.
- Parameters
num_classes (int) – number of classes
hidden_state_dim (int) – the number of features in the decoder's hidden state h
num_layers (int, optional) – number of recurrent layers (default: 2)
rnn_type (str, optional) – type of RNN cell (default: lstm)
pad_id (int, optional) – index of the pad symbol (default: 0)
sos_id (int, optional) – index of the start of sentence symbol (default: 1)
eos_id (int, optional) – index of the end of sentence symbol (default: 2)
attn_mechanism (str, optional) – type of attention mechanism (default: multi-head)
num_heads (int, optional) – number of attention heads (default: 4)
dropout_p (float, optional) – dropout probability of the decoder (default: 0.3)
- Inputs: inputs, encoder_outputs, teacher_forcing_ratio
inputs (batch, seq_len, input_size): list of sequences whose length is the batch size, where each sequence is a list of token IDs. Used for teacher forcing when provided (default: None).
encoder_outputs (batch, seq_len, hidden_state_dim): tensor containing the outputs of the encoder. Used for the attention mechanism (default: None).
teacher_forcing_ratio (float): probability that teacher forcing will be used. A number is drawn uniformly from [0, 1) for every decoding step; if the sample is smaller than the given value, teacher forcing is used (default: 1.0).
- Returns: logits
logits (torch.FloatTensor): log probabilities of the model's predictions
forward(encoder_outputs: torch.Tensor, targets: Optional[torch.Tensor] = None, encoder_output_lengths: Optional[torch.Tensor] = None, teacher_forcing_ratio: float = 1.0) → torch.Tensor
Forward propagates encoder_outputs for training.
- Parameters
targets (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)
encoder_outputs (torch.FloatTensor) – An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)
encoder_output_lengths (torch.LongTensor) – The lengths of the encoder outputs, of size (batch)
teacher_forcing_ratio (float) – ratio of teacher forcing
- Returns
Log probabilities of the model's predictions.
- Return type
logits (torch.FloatTensor)
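A minimal usage sketch follows; the vocabulary size, batch size, and sequence lengths are illustrative assumptions, and only the constructor and forward signatures are taken from the documentation above:

    import torch
    from openspeech.decoders.lstm_attention_decoder import LSTMAttentionDecoder

    # Illustrative sizes (assumptions, not library requirements)
    batch, enc_len, tgt_len, num_classes = 4, 100, 20, 500
    hidden_state_dim = 1024  # must match the encoder output dimension

    decoder = LSTMAttentionDecoder(num_classes=num_classes, hidden_state_dim=hidden_state_dim)

    encoder_outputs = torch.randn(batch, enc_len, hidden_state_dim)
    encoder_output_lengths = torch.full((batch,), enc_len, dtype=torch.long)
    targets = torch.randint(3, num_classes, (batch, tgt_len))  # ids above pad/sos/eos

    # teacher_forcing_ratio=1.0: ground-truth targets drive every decoding step
    logits = decoder(
        encoder_outputs=encoder_outputs,
        targets=targets,
        encoder_output_lengths=encoder_output_lengths,
        teacher_forcing_ratio=1.0,
    )
    # logits: log probabilities over the vocabulary for each decoding step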
RNN Transducer Decoder
class openspeech.decoders.rnn_transducer_decoder.RNNTransducerDecoder(num_classes: int, hidden_state_dim: int, output_dim: int, num_layers: int, rnn_type: str = 'lstm', pad_id: int = 0, sos_id: int = 1, eos_id: int = 2, dropout_p: float = 0.2)
Decoder of RNN-Transducer.
- Parameters
num_classes (int) – number of classes
hidden_state_dim (int) – hidden state dimension of the decoder
output_dim (int) – output dimension of the encoder and decoder
num_layers (int) – number of decoder layers
rnn_type (str, optional) – type of RNN cell (default: lstm)
pad_id (int, optional) – index of the pad symbol (default: 0)
sos_id (int, optional) – index of the start of sentence symbol (default: 1)
eos_id (int, optional) – index of the end of sentence symbol (default: 2)
dropout_p (float, optional) – dropout probability of the decoder (default: 0.2)
- Inputs: inputs, input_lengths, hidden_states
inputs (torch.LongTensor): A target sequence passed to the decoder. LongTensor of size (batch, seq_length)
input_lengths (torch.LongTensor): The lengths of the input tensor, of size (batch)
hidden_states (torch.FloatTensor): A previous hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)
- Returns
decoder_outputs (torch.FloatTensor): An output sequence of the decoder. FloatTensor of size (batch, seq_length, dimension)
hidden_states (torch.FloatTensor): A hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)
- Return type
(Tensor, Tensor)
- Reference:
Alex Graves: Sequence Transduction with Recurrent Neural Networks https://arxiv.org/abs/1211.3711
forward(inputs: torch.Tensor, input_lengths: torch.Tensor = None, hidden_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor]
Forward propagates inputs (targets) for training.
- Inputs:
inputs (torch.LongTensor): An input sequence passed to the label encoder. Typically a padded LongTensor of size (batch, target_length)
input_lengths (torch.LongTensor): The lengths of the input tensor, of size (batch)
hidden_states (torch.FloatTensor): Previous hidden states.
- Returns
outputs (torch.FloatTensor): An output sequence of the decoder. FloatTensor of size (batch, seq_length, dimension)
hidden_states (torch.FloatTensor): A hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)
- Return type
(Tensor, Tensor)
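A minimal usage sketch (vocabulary size and tensor shapes are illustrative assumptions; the constructor and forward signatures follow the documentation above):

    import torch
    from openspeech.decoders.rnn_transducer_decoder import RNNTransducerDecoder

    batch, target_length, num_classes = 4, 20, 500

    decoder = RNNTransducerDecoder(
        num_classes=num_classes,
        hidden_state_dim=512,  # illustrative value; this argument is required
        output_dim=512,
        num_layers=1,
    )

    inputs = torch.randint(3, num_classes, (batch, target_length))  # label ids
    input_lengths = torch.full((batch,), target_length, dtype=torch.long)

    # Returns per-step label features and the RNN hidden state
    outputs, hidden_states = decoder(inputs, input_lengths)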
Transformer Decoder
class openspeech.decoders.transformer_decoder.TransformerDecoder(num_classes: int, d_model: int = 512, d_ff: int = 512, num_layers: int = 6, num_heads: int = 8, dropout_p: float = 0.3, pad_id: int = 0, sos_id: int = 1, eos_id: int = 2, max_length: int = 128)
The TransformerDecoder is composed of a stack of N identical layers. Each layer has three sub-layers: the first is a multi-head self-attention mechanism, the second is a multi-head attention mechanism over the encoder outputs, and the third is a feed-forward network.
- Parameters
num_classes – number of classes
d_model – dimension of model (default: 512)
d_ff – dimension of feed forward network (default: 512)
num_layers – number of layers (default: 6)
num_heads – number of attention heads (default: 8)
dropout_p – probability of dropout (default: 0.3)
pad_id (int, optional) – index of the pad symbol (default: 0)
sos_id (int, optional) – index of the start of sentence symbol (default: 1)
eos_id (int, optional) – index of the end of sentence symbol (default: 2)
max_length (int) – max decoding length (default: 128)
forward(encoder_outputs: torch.Tensor, targets: Optional[torch.LongTensor] = None, encoder_output_lengths: torch.Tensor = None, target_lengths: torch.Tensor = None, teacher_forcing_ratio: float = 1.0) → torch.Tensor
Forward propagates encoder_outputs for training.
- Parameters
targets (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)
encoder_outputs (torch.FloatTensor) – An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)
encoder_output_lengths (torch.LongTensor) – The lengths of the encoder outputs, of size (batch)
target_lengths (torch.LongTensor) – The lengths of the target sequences, of size (batch)
teacher_forcing_ratio (float) – ratio of teacher forcing
- Returns
Log probabilities of the model's predictions.
- Return type
logits (torch.FloatTensor)
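As with the LSTM decoder, a minimal training-time sketch (sizes are illustrative assumptions; signatures follow the documentation above):

    import torch
    from openspeech.decoders.transformer_decoder import TransformerDecoder

    batch, enc_len, tgt_len, num_classes = 4, 100, 20, 500
    d_model = 512  # must match the encoder output dimension

    decoder = TransformerDecoder(num_classes=num_classes, d_model=d_model)

    encoder_outputs = torch.randn(batch, enc_len, d_model)
    encoder_output_lengths = torch.full((batch,), enc_len, dtype=torch.long)
    targets = torch.randint(3, num_classes, (batch, tgt_len))
    target_lengths = torch.full((batch,), tgt_len, dtype=torch.long)

    logits = decoder(
        encoder_outputs=encoder_outputs,
        targets=targets,
        encoder_output_lengths=encoder_output_lengths,
        target_lengths=target_lengths,
        teacher_forcing_ratio=1.0,
    )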
class openspeech.decoders.transformer_decoder.TransformerDecoderLayer(d_model: int = 512, num_heads: int = 8, d_ff: int = 2048, dropout_p: float = 0.3)
DecoderLayer is made up of self-attention, multi-head attention over the encoder outputs, and a feed-forward network. This standard decoder layer is based on the paper “Attention Is All You Need”.
- Parameters
d_model – dimension of model (default: 512)
num_heads – number of attention heads (default: 8)
d_ff – dimension of feed forward network (default: 2048)
dropout_p – probability of dropout (default: 0.3)
- Inputs: inputs, encoder_outputs, self_attn_mask, encoder_attn_mask
inputs (torch.FloatTensor): input sequence of the transformer decoder layer
encoder_outputs (torch.FloatTensor): outputs of the encoder
self_attn_mask (torch.BoolTensor): mask for self-attention
encoder_attn_mask (torch.BoolTensor): mask for the encoder outputs
- Returns
outputs (torch.FloatTensor): output of the transformer decoder layer
self_attn (torch.FloatTensor): output of self-attention
encoder_attn (torch.FloatTensor): output of encoder attention
- Return type
(Tensor, Tensor, Tensor)
- Reference:
Ashish Vaswani et al.: Attention Is All You Need https://arxiv.org/abs/1706.03762
forward(inputs: torch.Tensor, encoder_outputs: torch.Tensor, self_attn_mask: Optional[torch.Tensor] = None, encoder_attn_mask: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
Forward propagates the transformer decoder layer.
- Inputs: inputs, encoder_outputs, self_attn_mask, encoder_attn_mask
inputs (torch.FloatTensor): input sequence of the transformer decoder layer
encoder_outputs (torch.FloatTensor): outputs of the encoder
self_attn_mask (torch.BoolTensor): mask for self-attention
encoder_attn_mask (torch.BoolTensor): mask for the encoder outputs
- Returns
outputs (torch.FloatTensor): output of the transformer decoder layer
self_attn (torch.FloatTensor): output of self-attention
encoder_attn (torch.FloatTensor): output of encoder attention
- Return type
(Tensor, Tensor, Tensor)
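A minimal sketch of a single decoder-layer call (all sizes are illustrative assumptions; the optional masks are omitted here):

    import torch
    from openspeech.decoders.transformer_decoder import TransformerDecoderLayer

    batch, tgt_len, enc_len, d_model = 4, 20, 100, 512

    layer = TransformerDecoderLayer(d_model=d_model, num_heads=8, d_ff=2048)

    inputs = torch.randn(batch, tgt_len, d_model)           # decoder-side features
    encoder_outputs = torch.randn(batch, enc_len, d_model)  # encoder-side features

    outputs, self_attn, encoder_attn = layer(inputs, encoder_outputs)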
Transformer Transducer Decoder
class openspeech.decoders.transformer_transducer_decoder.TransformerTransducerDecoder(num_classes: int, model_dim: int = 512, d_ff: int = 2048, num_layers: int = 2, num_heads: int = 8, dropout: float = 0.1, max_positional_length: int = 5000, pad_id: int = 0, sos_id: int = 1, eos_id: int = 2)
Converts label sequences into higher-level feature representations.
- Parameters
num_classes (int) – number of classes (size of the vocabulary)
model_dim (int) – the number of features in the label encoder (default: 512)
d_ff (int) – the number of features in the feed forward layers (default: 2048)
num_layers (int) – the number of label encoder layers (default: 2)
num_heads (int) – the number of heads in the multi-head attention (default: 8)
dropout (float) – dropout probability of the label encoder (default: 0.1)
max_positional_length (int) – maximum length used for positional encoding (default: 5000)
pad_id (int) – index of padding (default: 0)
sos_id (int) – index of the start of sentence (default: 1)
eos_id (int) – index of the end of sentence (default: 2)
- Inputs: inputs, input_lengths
inputs (torch.LongTensor): ground-truth label sequences, of size (batch, target_length)
input_lengths (torch.LongTensor): tensor of target lengths, of size (batch)
- Returns
outputs (torch.FloatTensor): An output sequence of the decoder. FloatTensor of size (batch, seq_length, dimension)
output_lengths (torch.LongTensor): The lengths of the output tensor, of size (batch)
- Return type
(Tensor, Tensor)
- Reference:
Qian Zhang et al.: Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss https://arxiv.org/abs/2002.02562
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
Forward propagates inputs through the label encoder.
- Parameters
inputs (torch.LongTensor) – An input sequence passed to the label encoder. Typically a padded LongTensor of size (batch, target_length)
input_lengths (torch.LongTensor) – The lengths of the input tensor, of size (batch)
- Returns
outputs (Tensor): An output sequence of the decoder. FloatTensor of size (batch, seq_length, dimension)
output_lengths (Tensor): The lengths of the output tensor, of size (batch)
- Return type
(Tensor, Tensor)
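A minimal usage sketch (vocabulary size and shapes are illustrative assumptions; signatures follow the documentation above):

    import torch
    from openspeech.decoders.transformer_transducer_decoder import TransformerTransducerDecoder

    batch, target_length, num_classes = 4, 20, 500

    decoder = TransformerTransducerDecoder(num_classes=num_classes, model_dim=512)

    inputs = torch.randint(3, num_classes, (batch, target_length))  # label ids
    input_lengths = torch.full((batch,), target_length, dtype=torch.long)

    # Returns per-step label features and their lengths
    outputs, output_lengths = decoder(inputs, input_lengths)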