Decoders
Openspeech Decoder
LSTM Attention Decoder
class openspeech.decoders.lstm_attention_decoder.LSTMAttentionDecoder(num_classes: int, max_length: int = 150, hidden_state_dim: int = 1024, pad_id: int = 0, sos_id: int = 1, eos_id: int = 2, attn_mechanism: str = 'multi-head', num_heads: int = 4, num_layers: int = 2, rnn_type: str = 'lstm', dropout_p: float = 0.3)
Converts higher-level features (from encoders) into output utterances by specifying a probability distribution over sequences of characters.
- Parameters
num_classes (int) – number of classes
hidden_state_dim (int) – the number of features in the decoder's hidden state h
num_layers (int, optional) – number of recurrent layers (default: 2)
rnn_type (str, optional) – type of RNN cell (default: lstm)
pad_id (int, optional) – index of the pad symbol (default: 0)
sos_id (int, optional) – index of the start of sentence symbol (default: 1)
eos_id (int, optional) – index of the end of sentence symbol (default: 2)
attn_mechanism (str, optional) – type of attention mechanism (default: multi-head)
num_heads (int, optional) – number of attention heads (default: 4)
dropout_p (float, optional) – dropout probability of the decoder (default: 0.3)
- Inputs: inputs, encoder_outputs, teacher_forcing_ratio
inputs (batch, seq_len, input_size): list of sequences whose length is the batch size, where each sequence is a list of token IDs. Used for teacher forcing when provided (default: None).
encoder_outputs (batch, seq_len, hidden_state_dim): tensor containing the outputs of the encoder. Used for the attention mechanism (default: None).
teacher_forcing_ratio (float): probability that teacher forcing will be used. A number is drawn uniformly from [0, 1) for every decoding step; if the sample is smaller than the given value, teacher forcing is used (default: 1.0).
- Returns: logits
logits (torch.FloatTensor): log probabilities of the model's predictions
forward(encoder_outputs: torch.Tensor, targets: Optional[torch.Tensor] = None, encoder_output_lengths: Optional[torch.Tensor] = None, teacher_forcing_ratio: float = 1.0) → torch.Tensor
Forward propagates encoder_outputs for training.
- Parameters
targets (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)
encoder_outputs (torch.FloatTensor) – An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)
encoder_output_lengths (torch.LongTensor) – The lengths of the encoder outputs, of size (batch)
teacher_forcing_ratio (float) – ratio of teacher forcing
- Returns
Log probabilities of the model's predictions.
- Return type
logits (torch.FloatTensor)
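A minimal usage sketch follows; the vocabulary size, batch size, and sequence lengths are illustrative assumptions, and only the constructor and forward signatures are taken from the documentation above:

    import torch
    from openspeech.decoders.lstm_attention_decoder import LSTMAttentionDecoder

    # Illustrative sizes (assumptions, not library requirements)
    batch, enc_len, tgt_len, num_classes = 4, 100, 20, 500
    hidden_state_dim = 1024  # must match the encoder output dimension

    decoder = LSTMAttentionDecoder(num_classes=num_classes, hidden_state_dim=hidden_state_dim)

    encoder_outputs = torch.randn(batch, enc_len, hidden_state_dim)
    encoder_output_lengths = torch.full((batch,), enc_len, dtype=torch.long)
    targets = torch.randint(3, num_classes, (batch, tgt_len))  # ids above pad/sos/eos

    # teacher_forcing_ratio=1.0: ground-truth targets drive every decoding step
    logits = decoder(
        encoder_outputs=encoder_outputs,
        targets=targets,
        encoder_output_lengths=encoder_output_lengths,
        teacher_forcing_ratio=1.0,
    )
    # logits: log probabilities over the vocabulary for each decoding step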
RNN Transducer Decoder
class openspeech.decoders.rnn_transducer_decoder.RNNTransducerDecoder(num_classes: int, hidden_state_dim: int, output_dim: int, num_layers: int, rnn_type: str = 'lstm', pad_id: int = 0, sos_id: int = 1, eos_id: int = 2, dropout_p: float = 0.2)
Decoder of RNN-Transducer.
- Parameters
num_classes (int) – number of classes
hidden_state_dim (int) – hidden state dimension of the decoder
output_dim (int) – output dimension of the encoder and decoder
num_layers (int) – number of decoder layers
rnn_type (str, optional) – type of RNN cell (default: lstm)
pad_id (int, optional) – index of the pad symbol (default: 0)
sos_id (int, optional) – index of the start of sentence symbol (default: 1)
eos_id (int, optional) – index of the end of sentence symbol (default: 2)
dropout_p (float, optional) – dropout probability of the decoder (default: 0.2)
- Inputs: inputs, input_lengths, hidden_states
inputs (torch.LongTensor): A target sequence passed to the decoder. LongTensor of size (batch, seq_length)
input_lengths (torch.LongTensor): The lengths of the input tensor, of size (batch)
hidden_states (torch.FloatTensor): A previous hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)
- Returns
decoder_outputs (torch.FloatTensor): An output sequence of the decoder. FloatTensor of size (batch, seq_length, dimension)
hidden_states (torch.FloatTensor): A hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)
- Return type
(Tensor, Tensor)
- Reference:
Alex Graves: Sequence Transduction with Recurrent Neural Networks https://arxiv.org/abs/1211.3711
forward(inputs: torch.Tensor, input_lengths: torch.Tensor = None, hidden_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor]
Forward propagates inputs (targets) for training.
- Inputs:
inputs (torch.LongTensor): An input sequence passed to the label encoder. Typically a padded LongTensor of size (batch, target_length)
input_lengths (torch.LongTensor): The lengths of the input tensor, of size (batch)
hidden_states (torch.FloatTensor): Previous hidden states.
- Returns
outputs (torch.FloatTensor): An output sequence of the decoder. FloatTensor of size (batch, seq_length, dimension)
hidden_states (torch.FloatTensor): A hidden state of the decoder. FloatTensor of size (batch, seq_length, dimension)
- Return type
(Tensor, Tensor)
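A minimal usage sketch (vocabulary size and tensor shapes are illustrative assumptions; the constructor and forward signatures follow the documentation above):

    import torch
    from openspeech.decoders.rnn_transducer_decoder import RNNTransducerDecoder

    batch, target_length, num_classes = 4, 20, 500

    decoder = RNNTransducerDecoder(
        num_classes=num_classes,
        hidden_state_dim=512,  # illustrative value; this argument is required
        output_dim=512,
        num_layers=1,
    )

    inputs = torch.randint(3, num_classes, (batch, target_length))  # label ids
    input_lengths = torch.full((batch,), target_length, dtype=torch.long)

    # Returns per-step label features and the RNN hidden state
    outputs, hidden_states = decoder(inputs, input_lengths)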
Transformer Decoder
class openspeech.decoders.transformer_decoder.TransformerDecoder(num_classes: int, d_model: int = 512, d_ff: int = 512, num_layers: int = 6, num_heads: int = 8, dropout_p: float = 0.3, pad_id: int = 0, sos_id: int = 1, eos_id: int = 2, max_length: int = 128)
The TransformerDecoder is composed of a stack of N identical layers. Each layer has three sub-layers: the first is a multi-head self-attention mechanism, the second is a multi-head attention mechanism over the encoder outputs, and the third is a feed-forward network.
- Parameters
num_classes – number of classes
d_model – dimension of model (default: 512)
d_ff – dimension of feed forward network (default: 512)
num_layers – number of layers (default: 6)
num_heads – number of attention heads (default: 8)
dropout_p – probability of dropout (default: 0.3)
pad_id (int, optional) – index of the pad symbol (default: 0)
sos_id (int, optional) – index of the start of sentence symbol (default: 1)
eos_id (int, optional) – index of the end of sentence symbol (default: 2)
max_length (int) – max decoding length (default: 128)
forward(encoder_outputs: torch.Tensor, targets: Optional[torch.LongTensor] = None, encoder_output_lengths: torch.Tensor = None, target_lengths: torch.Tensor = None, teacher_forcing_ratio: float = 1.0) → torch.Tensor
Forward propagates encoder_outputs for training.
- Parameters
targets (torch.LongTensor) – A target sequence passed to the decoder. LongTensor of size (batch, seq_length)
encoder_outputs (torch.FloatTensor) – An output sequence of the encoder. FloatTensor of size (batch, seq_length, dimension)
encoder_output_lengths (torch.LongTensor) – The lengths of the encoder outputs, of size (batch)
target_lengths (torch.LongTensor) – The lengths of the target sequences, of size (batch)
teacher_forcing_ratio (float) – ratio of teacher forcing
- Returns
Log probabilities of the model's predictions.
- Return type
logits (torch.FloatTensor)
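As with the LSTM decoder, a minimal training-time sketch (sizes are illustrative assumptions; signatures follow the documentation above):

    import torch
    from openspeech.decoders.transformer_decoder import TransformerDecoder

    batch, enc_len, tgt_len, num_classes = 4, 100, 20, 500
    d_model = 512  # must match the encoder output dimension

    decoder = TransformerDecoder(num_classes=num_classes, d_model=d_model)

    encoder_outputs = torch.randn(batch, enc_len, d_model)
    encoder_output_lengths = torch.full((batch,), enc_len, dtype=torch.long)
    targets = torch.randint(3, num_classes, (batch, tgt_len))
    target_lengths = torch.full((batch,), tgt_len, dtype=torch.long)

    logits = decoder(
        encoder_outputs=encoder_outputs,
        targets=targets,
        encoder_output_lengths=encoder_output_lengths,
        target_lengths=target_lengths,
        teacher_forcing_ratio=1.0,
    )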
class openspeech.decoders.transformer_decoder.TransformerDecoderLayer(d_model: int = 512, num_heads: int = 8, d_ff: int = 2048, dropout_p: float = 0.3)
DecoderLayer is made up of self-attention, multi-head attention over the encoder outputs, and a feed-forward network. This standard decoder layer is based on the paper “Attention Is All You Need”.
- Parameters
d_model – dimension of model (default: 512)
num_heads – number of attention heads (default: 8)
d_ff – dimension of feed forward network (default: 2048)
dropout_p – probability of dropout (default: 0.3)
- Inputs: inputs, encoder_outputs, self_attn_mask, encoder_attn_mask
inputs (torch.FloatTensor): input sequence of the transformer decoder layer
encoder_outputs (torch.FloatTensor): outputs of the encoder
self_attn_mask (torch.BoolTensor): mask for self-attention
encoder_attn_mask (torch.BoolTensor): mask for the encoder outputs
- Returns
outputs (torch.FloatTensor): output of the transformer decoder layer
self_attn (torch.FloatTensor): output of self-attention
encoder_attn (torch.FloatTensor): output of encoder attention
- Return type
(Tensor, Tensor, Tensor)
- Reference:
Ashish Vaswani et al.: Attention Is All You Need https://arxiv.org/abs/1706.03762
forward(inputs: torch.Tensor, encoder_outputs: torch.Tensor, self_attn_mask: Optional[torch.Tensor] = None, encoder_attn_mask: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
Forward propagates the transformer decoder layer.
- Inputs: inputs, encoder_outputs, self_attn_mask, encoder_attn_mask
inputs (torch.FloatTensor): input sequence of the transformer decoder layer
encoder_outputs (torch.FloatTensor): outputs of the encoder
self_attn_mask (torch.BoolTensor): mask for self-attention
encoder_attn_mask (torch.BoolTensor): mask for the encoder outputs
- Returns
outputs (torch.FloatTensor): output of the transformer decoder layer
self_attn (torch.FloatTensor): output of self-attention
encoder_attn (torch.FloatTensor): output of encoder attention
- Return type
(Tensor, Tensor, Tensor)
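A minimal sketch of a single decoder-layer call (all sizes are illustrative assumptions; the optional masks are omitted here):

    import torch
    from openspeech.decoders.transformer_decoder import TransformerDecoderLayer

    batch, tgt_len, enc_len, d_model = 4, 20, 100, 512

    layer = TransformerDecoderLayer(d_model=d_model, num_heads=8, d_ff=2048)

    inputs = torch.randn(batch, tgt_len, d_model)           # decoder-side features
    encoder_outputs = torch.randn(batch, enc_len, d_model)  # encoder-side features

    outputs, self_attn, encoder_attn = layer(inputs, encoder_outputs)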
Transformer Transducer Decoder
class openspeech.decoders.transformer_transducer_decoder.TransformerTransducerDecoder(num_classes: int, model_dim: int = 512, d_ff: int = 2048, num_layers: int = 2, num_heads: int = 8, dropout: float = 0.1, max_positional_length: int = 5000, pad_id: int = 0, sos_id: int = 1, eos_id: int = 2)
Converts label sequences into higher-level feature representations.
- Parameters
num_classes (int) – number of classes (size of the vocabulary)
model_dim (int) – the number of features in the label encoder (default: 512)
d_ff (int) – the number of features in the feed forward layers (default: 2048)
num_layers (int) – the number of label encoder layers (default: 2)
num_heads (int) – the number of heads in the multi-head attention (default: 8)
dropout (float) – dropout probability of the label encoder (default: 0.1)
max_positional_length (int) – maximum length used for positional encoding (default: 5000)
pad_id (int) – index of padding (default: 0)
sos_id (int) – index of the start of sentence (default: 1)
eos_id (int) – index of the end of sentence (default: 2)
- Inputs: inputs, input_lengths
inputs (torch.LongTensor): ground-truth label sequences, of size (batch, target_length)
input_lengths (torch.LongTensor): tensor of target lengths, of size (batch)
- Returns
outputs (torch.FloatTensor): An output sequence of the decoder. FloatTensor of size (batch, seq_length, dimension)
output_lengths (torch.LongTensor): The lengths of the output tensor, of size (batch)
- Return type
(Tensor, Tensor)
- Reference:
Qian Zhang et al.: Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss https://arxiv.org/abs/2002.02562
forward(inputs: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]
Forward propagates inputs through the label encoder.
- Parameters
inputs (torch.LongTensor) – An input sequence passed to the label encoder. Typically a padded LongTensor of size (batch, target_length)
input_lengths (torch.LongTensor) – The lengths of the input tensor, of size (batch)
- Returns
outputs (Tensor): An output sequence of the decoder. FloatTensor of size (batch, seq_length, dimension)
output_lengths (Tensor): The lengths of the output tensor, of size (batch)
- Return type
(Tensor, Tensor)
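A minimal usage sketch (vocabulary size and shapes are illustrative assumptions; signatures follow the documentation above):

    import torch
    from openspeech.decoders.transformer_transducer_decoder import TransformerTransducerDecoder

    batch, target_length, num_classes = 4, 20, 500

    decoder = TransformerTransducerDecoder(num_classes=num_classes, model_dim=512)

    inputs = torch.randint(3, num_classes, (batch, target_length))  # label ids
    input_lengths = torch.full((batch,), target_length, dtype=torch.long)

    # Returns per-step label features and their lengths
    outputs, output_lengths = decoder(inputs, input_lengths)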