Decoders

Openspeech Decoder

class openspeech.decoders.openspeech_decoder.OpenspeechDecoder[source]

Interface for Openspeech decoders.

count_parameters() → int[source]

Count the number of parameters of the decoder.

update_dropout(dropout_p: float) → None[source]

Update the dropout probability of the decoder.
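
A minimal usage sketch of the interface, assuming a concrete subclass such as the LSTMAttentionDecoder documented below:

    from openspeech.decoders.lstm_attention_decoder import LSTMAttentionDecoder

    decoder = LSTMAttentionDecoder(num_classes=10)

    print(decoder.count_parameters())      # total number of decoder parameters
    decoder.update_dropout(dropout_p=0.1)  # e.g. reduce dropout for fine-tuning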

LSTM Attention Decoder

class openspeech.decoders.lstm_attention_decoder.LSTMAttentionDecoder(num_classes: int, max_length: int = 150, hidden_state_dim: int = 1024, pad_id: int = 0, sos_id: int = 1, eos_id: int = 2, attn_mechanism: str = 'multi-head', num_heads: int = 4, num_layers: int = 2, rnn_type: str = 'lstm', dropout_p: float = 0.3)[source]

Converts higher level features (from encoders) into output utterances by specifying a probability distribution over sequences of characters.

Parameters
  • num_classes (int) – number of classes

  • hidden_state_dim (int) – the number of features in the decoder's hidden state h

  • num_layers (int, optional) – number of recurrent layers (default: 2)

  • rnn_type (str, optional) – type of RNN cell (default: lstm)

  • pad_id (int, optional) – index of the pad symbol (default: 0)

  • sos_id (int, optional) – index of the start of sentence symbol (default: 1)

  • eos_id (int, optional) – index of the end of sentence symbol (default: 2)

  • attn_mechanism (str, optional) – type of attention mechanism (default: multi-head)

  • num_heads (int, optional) – number of attention heads. (default: 4)

  • dropout_p (float, optional) – dropout probability of decoders (default: 0.3)

Inputs: inputs, encoder_outputs, teacher_forcing_ratio
  • inputs (batch, seq_len, input_size): list of sequences whose length is the batch size, where each sequence is a list of token IDs. Used for teacher forcing when provided (default: None).

  • encoder_outputs (batch, seq_len, hidden_state_dim): tensor containing the outputs of the encoder, used by the attention mechanism (default: None).

  • teacher_forcing_ratio (float): The probability that teacher forcing will be used. A random number is drawn uniformly from 0 to 1 for every decoding step, and if the sample is smaller than the given value, teacher forcing is used (default: 0).

Returns: logits
  • logits (torch.FloatTensor): log probabilities of the model's predictions

forward(encoder_outputs: torch.Tensor, targets: Optional[torch.Tensor] = None, encoder_output_lengths: Optional[torch.Tensor] = None, teacher_forcing_ratio: float = 1.0) → torch.Tensor[source]

Forward propagate encoder_outputs for training.

Parameters
  • targets (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)

  • encoder_outputs (torch.FloatTensor) – The output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • encoder_output_lengths – The lengths of the encoder outputs. (batch)

  • teacher_forcing_ratio (float) – ratio of teacher forcing

Returns

Log probability of model predictions.

Return type

  • logits (torch.FloatTensor)
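
A minimal forward-pass sketch with dummy tensors; the shapes follow the parameter descriptions above, and the batch/length values are placeholders:

    import torch
    from openspeech.decoders.lstm_attention_decoder import LSTMAttentionDecoder

    BATCH, ENC_LEN, TGT_LEN, NUM_CLASSES = 4, 100, 20, 10

    decoder = LSTMAttentionDecoder(num_classes=NUM_CLASSES, hidden_state_dim=1024)

    encoder_outputs = torch.randn(BATCH, ENC_LEN, 1024)        # dummy encoder features
    encoder_output_lengths = torch.full((BATCH,), ENC_LEN)     # all sequences at full length
    targets = torch.randint(0, NUM_CLASSES, (BATCH, TGT_LEN))  # dummy token IDs
    targets[:, 0] = 1                                          # sequences begin with sos_id (default: 1)

    logits = decoder(
        encoder_outputs=encoder_outputs,
        targets=targets,
        encoder_output_lengths=encoder_output_lengths,
        teacher_forcing_ratio=1.0,  # always feed ground-truth tokens during training
    )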

RNN Transducer Decoder

class openspeech.decoders.rnn_transducer_decoder.RNNTransducerDecoder(num_classes: int, hidden_state_dim: int, output_dim: int, num_layers: int, rnn_type: str = 'lstm', pad_id: int = 0, sos_id: int = 1, eos_id: int = 2, dropout_p: float = 0.2)[source]

Decoder of RNN-Transducer

Parameters
  • num_classes (int) – number of classes

  • hidden_state_dim (int) – hidden state dimension of decoders

  • output_dim (int) – output dimension of encoders and decoders

  • num_layers (int) – number of decoder layers

  • rnn_type (str, optional) – type of rnn cell (default: lstm)

  • pad_id (int, optional) – index of the pad symbol (default: 0)

  • sos_id (int, optional) – index of the start of sentence symbol (default: 1)

  • eos_id (int, optional) – index of the end of sentence symbol (default: 2)

  • dropout_p (float, optional) – dropout probability of decoders (default: 0.2)

Inputs: inputs, input_lengths, hidden_states
  • inputs (torch.LongTensor): A target sequence passed to the decoder. LongTensor of size (batch, seq_length)

  • input_lengths (torch.LongTensor): The length of the input tensor. (batch)

  • hidden_states (torch.FloatTensor): The previous hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)

Returns

  • decoder_outputs (torch.FloatTensor): The output sequence of the decoder. FloatTensor of size (batch, seq_length, dimension)

  • hidden_states (torch.FloatTensor): The hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)

Return type

(Tensor, Tensor)

Reference:

Alex Graves: Sequence Transduction with Recurrent Neural Networks https://arxiv.org/abs/1211.3711

forward(inputs: torch.Tensor, input_lengths: torch.Tensor = None, hidden_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs (targets) for training.

Inputs:

  • inputs (torch.LongTensor): An input sequence passed to the label encoder. Typically inputs will be a padded LongTensor of size (batch, target_length)

  • input_lengths (torch.LongTensor): The length of the input tensor. (batch)

  • hidden_states (torch.FloatTensor): Previous hidden states of the decoder.

Returns

  • outputs (torch.FloatTensor): The output sequence of the decoder. FloatTensor of size (batch, seq_length, dimension)

  • hidden_states (torch.FloatTensor): The hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)

Return type

(Tensor, Tensor)
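
A minimal forward-pass sketch for the prediction network with dummy label sequences (shapes per the descriptions above; the constructor values are placeholders):

    import torch
    from openspeech.decoders.rnn_transducer_decoder import RNNTransducerDecoder

    BATCH, TGT_LEN, NUM_CLASSES = 4, 20, 10

    decoder = RNNTransducerDecoder(
        num_classes=NUM_CLASSES,
        hidden_state_dim=512,
        output_dim=512,
        num_layers=1,
    )

    inputs = torch.randint(0, NUM_CLASSES, (BATCH, TGT_LEN))  # padded label sequence
    input_lengths = torch.full((BATCH,), TGT_LEN)

    outputs, hidden_states = decoder(inputs, input_lengths)
    # outputs: (batch, seq_length, output_dim); hidden_states can be passed
    # back in to continue decoding step by step.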

Transformer Decoder

class openspeech.decoders.transformer_decoder.TransformerDecoder(num_classes: int, d_model: int = 512, d_ff: int = 512, num_layers: int = 6, num_heads: int = 8, dropout_p: float = 0.3, pad_id: int = 0, sos_id: int = 1, eos_id: int = 2, max_length: int = 128)[source]

The TransformerDecoder is composed of a stack of N identical layers. Each layer has three sub-layers: the first is a multi-head self-attention mechanism, the second is a multi-head attention mechanism over the encoder outputs, and the third is a feed-forward network.

Parameters
  • num_classes – number of classes

  • d_model – dimension of model

  • d_ff – dimension of feed forward network

  • num_layers – number of layers

  • num_heads – number of attention heads

  • dropout_p – probability of dropout

  • pad_id (int, optional) – index of the pad symbol (default: 0)

  • sos_id (int, optional) – index of the start of sentence symbol (default: 1)

  • eos_id (int, optional) – index of the end of sentence symbol (default: 2)

  • max_length (int) – max decoding length

forward(encoder_outputs: torch.Tensor, targets: Optional[torch.LongTensor] = None, encoder_output_lengths: torch.Tensor = None, target_lengths: torch.Tensor = None, teacher_forcing_ratio: float = 1.0) → torch.Tensor[source]

Forward propagate encoder_outputs for training.

Parameters
  • targets (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)

  • encoder_outputs (torch.FloatTensor) – The output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)

  • encoder_output_lengths (torch.LongTensor) – The lengths of the encoder outputs. (batch)

  • teacher_forcing_ratio (float) – ratio of teacher forcing

Returns

Log probability of model predictions.

Return type

  • logits (torch.FloatTensor)
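
A minimal forward-pass sketch with dummy tensors, mirroring the parameter descriptions above:

    import torch
    from openspeech.decoders.transformer_decoder import TransformerDecoder

    BATCH, ENC_LEN, TGT_LEN, D_MODEL, NUM_CLASSES = 4, 100, 20, 512, 10

    decoder = TransformerDecoder(num_classes=NUM_CLASSES, d_model=D_MODEL)

    encoder_outputs = torch.randn(BATCH, ENC_LEN, D_MODEL)     # dummy encoder features
    encoder_output_lengths = torch.full((BATCH,), ENC_LEN)
    targets = torch.randint(0, NUM_CLASSES, (BATCH, TGT_LEN))  # dummy token IDs
    target_lengths = torch.full((BATCH,), TGT_LEN)

    logits = decoder(
        encoder_outputs=encoder_outputs,
        targets=targets,
        encoder_output_lengths=encoder_output_lengths,
        target_lengths=target_lengths,
        teacher_forcing_ratio=1.0,  # feed ground-truth tokens during training
    )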

class openspeech.decoders.transformer_decoder.TransformerDecoderLayer(d_model: int = 512, num_heads: int = 8, d_ff: int = 2048, dropout_p: float = 0.3)[source]

DecoderLayer is made up of self-attention, multi-head attention over the encoder outputs, and a feed-forward network. This standard decoder layer is based on the paper “Attention Is All You Need”.

Parameters
  • d_model – dimension of model (default: 512)

  • num_heads – number of attention heads (default: 8)

  • d_ff – dimension of feed forward network (default: 2048)

  • dropout_p – probability of dropout (default: 0.3)

Inputs:

  • inputs (torch.FloatTensor): input sequence of the transformer decoder layer

  • encoder_outputs (torch.FloatTensor): outputs of the encoder

  • self_attn_mask (torch.BoolTensor): mask for self-attention

  • encoder_attn_mask (torch.BoolTensor): mask for the encoder outputs

Returns

(Tensor, Tensor, Tensor)

  • outputs (torch.FloatTensor): output of transformer decoder layer

  • self_attn (torch.FloatTensor): output of self attention

  • encoder_attn (torch.FloatTensor): output of encoder attention

Reference:

Ashish Vaswani et al.: Attention Is All You Need https://arxiv.org/abs/1706.03762

forward(inputs: torch.Tensor, encoder_outputs: torch.Tensor, self_attn_mask: Optional[torch.Tensor] = None, encoder_attn_mask: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Forward propagate transformer decoder layer.

Inputs:

  • inputs (torch.FloatTensor): input sequence of the transformer decoder layer

  • encoder_outputs (torch.FloatTensor): outputs of the encoder

  • self_attn_mask (torch.BoolTensor): mask for self-attention

  • encoder_attn_mask (torch.BoolTensor): mask for the encoder outputs

Returns

  • outputs (torch.FloatTensor): output of the transformer decoder layer

  • self_attn (torch.FloatTensor): output of the self-attention

  • encoder_attn (torch.FloatTensor): output of the encoder attention

Return type

(Tensor, Tensor, Tensor)
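
A sketch of calling a single layer directly. The causal-mask construction below is an assumption (the expected mask shape and dtype may differ); in normal use TransformerDecoder builds and passes these masks itself:

    import torch
    from openspeech.decoders.transformer_decoder import TransformerDecoderLayer

    BATCH, TGT_LEN, ENC_LEN, D_MODEL = 4, 20, 100, 512

    layer = TransformerDecoderLayer(d_model=D_MODEL, num_heads=8)

    inputs = torch.randn(BATCH, TGT_LEN, D_MODEL)
    encoder_outputs = torch.randn(BATCH, ENC_LEN, D_MODEL)

    # Upper-triangular (causal) mask: position i may not attend to j > i.
    self_attn_mask = torch.triu(
        torch.ones(TGT_LEN, TGT_LEN, dtype=torch.bool), diagonal=1
    ).unsqueeze(0).expand(BATCH, -1, -1)

    outputs, self_attn, encoder_attn = layer(
        inputs, encoder_outputs, self_attn_mask=self_attn_mask
    )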

Transformer Transducer Decoder

class openspeech.decoders.transformer_transducer_decoder.TransformerTransducerDecoder(num_classes: int, model_dim: int = 512, d_ff: int = 2048, num_layers: int = 2, num_heads: int = 8, dropout: float = 0.1, max_positional_length: int = 5000, pad_id: int = 0, sos_id: int = 1, eos_id: int = 2)[source]

Converts label sequences into higher-level feature representations.

Parameters
  • num_classes (int) – the size of the vocabulary

  • model_dim (int) – the number of features in the label encoder (default: 512)

  • d_ff (int) – the number of features in the feed forward layers (default: 2048)

  • num_layers (int) – the number of label encoder layers (default: 2)

  • num_heads (int) – the number of heads in the multi-head attention (default: 8)

  • dropout (float) – dropout probability of label encoder (default: 0.1)

  • max_positional_length (int) – maximum length to use for positional encoding (default: 5000)

  • pad_id (int) – index of padding (default: 0)

  • sos_id (int) – index of the start of sentence (default: 1)

  • eos_id (int) – index of the end of sentence (default: 2)

Inputs: inputs, input_lengths
  • inputs: ground-truth label sequences for the batch

  • input_lengths: tensor of target lengths

Returns

(torch.FloatTensor, torch.LongTensor)

  • outputs (torch.FloatTensor): (batch, seq_length, dimension)

  • output_lengths (torch.LongTensor): (batch)

Reference:

Qian Zhang et al.: Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss https://arxiv.org/abs/2002.02562

forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]

Forward propagate inputs for the label encoder.

Parameters
  • inputs (torch.LongTensor) – An input sequence passed to the label encoder. Typically inputs will be a padded LongTensor of size (batch, target_length)

  • input_lengths (torch.LongTensor) – The length of the input tensor. (batch)

Returns

  • outputs (torch.FloatTensor): (batch, seq_length, dimension)

  • output_lengths (torch.LongTensor): (batch)

Return type

(Tensor, Tensor)
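
A minimal forward-pass sketch for the label encoder with dummy label sequences:

    import torch
    from openspeech.decoders.transformer_transducer_decoder import TransformerTransducerDecoder

    BATCH, TGT_LEN, NUM_CLASSES = 4, 20, 10

    decoder = TransformerTransducerDecoder(num_classes=NUM_CLASSES, model_dim=512)

    inputs = torch.randint(0, NUM_CLASSES, (BATCH, TGT_LEN))  # padded label sequence
    input_lengths = torch.full((BATCH,), TGT_LEN)

    outputs, output_lengths = decoder(inputs, input_lengths)
    # outputs: (batch, seq_length, model_dim); output_lengths: (batch)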