Criterion

Cross Entropy Loss

class openspeech.criterion.cross_entropy.cross_entropy.CrossEntropyLoss(configs: omegaconf.dictconfig.DictConfig, tokenizer: openspeech.tokenizers.tokenizer.Tokenizer)[source]

The negative log likelihood loss. It is useful to train a classification problem with C classes.

If provided, the optional argument weight should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set.

The input given through a forward call is expected to contain log-probabilities of each class. input has to be a Tensor of size either (minibatch, C) or (minibatch, C, d_1, d_2, ..., d_K) with K \geq 1 for the K-dimensional case (described later).

Obtaining log-probabilities in a neural network is easily achieved by adding a LogSoftmax layer in the last layer of your network. You may use CrossEntropyLoss instead, if you prefer not to add an extra layer.
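
A quick plain-PyTorch check of that equivalence (torch.nn classes, not the OpenSpeech wrapper):

>>> import torch
>>> import torch.nn as nn
>>>
>>> logits = torch.randn(3, 5)                      # raw, unnormalized scores of shape (minibatch, C)
>>> target = torch.randint(0, 5, (3,))
>>> nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)
>>> ce = nn.CrossEntropyLoss()(logits, target)      # same value, computed from raw scores
>>> torch.allclose(nll, ce)
True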

The target that this loss expects should be a class index in the range [0, C-1] where C = number of classes; if ignore_index is specified, this loss also accepts this class index (this index may not necessarily be in the class range).

The unreduced (i.e. with reduction set to 'none') loss can be described as:

\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = - w_{y_n} x_{n,y_n}, \quad w_{c} = \text{weight}[c] \cdot \mathbb{1}\{c \not= \text{ignore\_index}\},

where x is the input, y is the target, w is the weight, and N is the batch size. If reduction is not 'none' (default 'mean'), then

\ell(x, y) = \begin{cases} \sum_{n=1}^N \frac{1}{\sum_{n=1}^N w_{y_n}} l_n, & \text{if reduction} = \text{`mean';}\\ \sum_{n=1}^N l_n, & \text{if reduction} = \text{`sum'.} \end{cases}

Can also be used for higher dimension inputs, such as 2D images, by providing an input of size (minibatch, C, d_1, d_2, ..., d_K) with K \geq 1, where K is the number of dimensions, and a target of appropriate shape (see below). In the case of images, it computes NLL loss per-pixel.
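
For the K-dimensional (e.g. image) case, a short plain-PyTorch sketch of the per-pixel computation:

>>> import torch
>>> import torch.nn as nn
>>>
>>> N, C, H, W = 4, 5, 8, 8
>>> log_probs = nn.LogSoftmax(dim=1)(torch.randn(N, C, H, W, requires_grad=True))
>>> target = torch.randint(0, C, (N, H, W))         # one class index per pixel
>>> loss = nn.NLLLoss()(log_probs, target)          # per-pixel NLL, averaged by default
>>> loss.backward()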

Parameters
  • configs (DictConfig) – hydra configuration set

  • tokenizer (Tokenizer) – tokenizer is in charge of preparing the inputs for a model.

Inputs: logits, targets
  • logits (torch.FloatTensor): log-probability distribution over the classes, output by the model.

    The FloatTensor of size (batch, seq_length, num_classes)

  • targets (torch.LongTensor): ground-truth transcripts encoded as integer indices into the vocabulary.

    The LongTensor of size (batch, target_length)

Returns: loss
  • loss (float): loss for training

Examples:

>>> import torch
>>>
>>> # `configs` and `tokenizer` are assumed to be prepared elsewhere (hydra DictConfig, fitted tokenizer)
>>> criterion = CrossEntropyLoss(configs, tokenizer)
>>> B, T, C = 3, 128, 4   # batch size, sequence length, number of classes
>>> logits = torch.randn(B, T, C, requires_grad=True).log_softmax(dim=-1)
>>> targets = torch.randint(low=0, high=C, size=(B, T), dtype=torch.long)
>>> loss = criterion(logits, targets)
>>> loss.backward()

Cross Entropy Loss Configuration

class openspeech.criterion.cross_entropy.configuration.CrossEntropyLossConfigs(criterion_name: str = 'cross_entropy', reduction: str = 'mean')[source]

This is the configuration class to store the configuration of a CrossEntropyLoss.

It is used to initialize a CrossEntropyLoss criterion.

Configuration objects inherit from openspeech.dataclass.configs.OpenspeechDataclass.

Configurations:

  • criterion_name (str): name of criterion (default: cross_entropy)

  • reduction (str): reduction method of criterion (default: mean)
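
The criteria take a hydra DictConfig (see the Parameters above). As a standalone illustration, such a config can also be built directly from this dataclass with OmegaConf.structured; this is a sketch, not necessarily the project's usual hydra workflow:

>>> from omegaconf import OmegaConf
>>>
>>> configs = OmegaConf.structured(CrossEntropyLossConfigs(reduction="sum"))
>>> configs.criterion_name
'cross_entropy'
>>> configs.reduction
'sum'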

CTC Loss

class openspeech.criterion.ctc.ctc.CTCLoss(configs: omegaconf.dictconfig.DictConfig, tokenizer: openspeech.tokenizers.tokenizer.Tokenizer)[source]

The Connectionist Temporal Classification loss.

Calculates loss between a continuous (unsegmented) time series and a target sequence. CTCLoss sums over the probability of possible alignments of input to target, producing a loss value which is differentiable with respect to each input node. The alignment of input to target is assumed to be “many-to-one”, which limits the length of the target sequence such that it must be \leq the input length.

Parameters
  • configs (DictConfig) – hydra configuration set

  • tokenizer (Tokenizer) – tokenizer is in charge of preparing the inputs for a model.

Inputs: log_probs, targets, input_lengths, target_lengths
  • Log_probs: Tensor of size (T, N, C), where T = \text{input length}, N = \text{batch size}, and C = \text{number of classes (including blank)}. The logarithmized probabilities of the outputs (e.g. obtained with torch.nn.functional.log_softmax()).

  • Targets: Tensor of size (N, S) or (\operatorname{sum}(\text{target\_lengths})), where N = \text{batch size} and S = \text{max target length, if shape is } (N, S). It represents the target sequences. Each element in the target sequence is a class index, and the target index cannot be blank (default=0). In the (N, S) form, targets are padded to the length of the longest sequence and stacked. In the (\operatorname{sum}(\text{target\_lengths})) form, the targets are assumed to be un-padded and concatenated along one dimension.

  • Input_lengths: Tuple or tensor of size (N), where N = \text{batch size}. It represents the lengths of the inputs (each must be \leq T). The lengths are specified for each sequence to achieve masking under the assumption that sequences are padded to equal lengths.

  • Target_lengths: Tuple or tensor of size (N), where N = \text{batch size}. It represents the lengths of the targets. Lengths are specified for each sequence to achieve masking under the assumption that sequences are padded to equal lengths. If the target shape is (N,S), target_lengths are effectively the stop index s_n for each target sequence, such that target_n = targets[n,0:s_n] for each target in a batch. Lengths must each be \leq S. If the targets are given as a 1d tensor that is the concatenation of individual targets, the target_lengths must add up to the total length of the tensor.

Returns: loss
  • loss (float): loss for training

Examples:

>>> import torch
>>> import torch.nn as nn
>>>
>>> # Targets are to be padded
>>> T = 50      # Input sequence length
>>> C = 20      # Number of classes (including blank)
>>> N = 16      # Batch size
>>> S = 30      # Target sequence length of longest target in batch (padding length)
>>> S_min = 10  # Minimum target length, for demonstration purposes
>>>
>>> # Initialize random batch of input vectors, for *size = (T,N,C)
>>> input = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()
>>>
>>> # Initialize random batch of targets (0 = blank, 1:C = classes)
>>> target = torch.randint(low=1, high=C, size=(N, S), dtype=torch.long)
>>>
>>> input_lengths = torch.full(size=(N,), fill_value=T, dtype=torch.long)
>>> target_lengths = torch.randint(low=S_min, high=S, size=(N,), dtype=torch.long)
>>> ctc_loss = nn.CTCLoss()
>>> loss = ctc_loss(input, target, input_lengths, target_lengths)
>>> loss.backward()
>>>
>>>
>>> # Targets are to be un-padded
>>> T = 50      # Input sequence length
>>> C = 20      # Number of classes (including blank)
>>> N = 16      # Batch size
>>>
>>> # Initialize random batch of input vectors, for *size = (T,N,C)
>>> input = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()
>>> input_lengths = torch.full(size=(N,), fill_value=T, dtype=torch.long)
>>>
>>> # Initialize random batch of targets (0 = blank, 1:C = classes)
>>> target_lengths = torch.randint(low=1, high=T, size=(N,), dtype=torch.long)
>>> target = torch.randint(low=1, high=C, size=(sum(target_lengths),), dtype=torch.long)
>>> ctc_loss = CTCLoss(configs, tokenizer)  # configs and tokenizer are assumed to be prepared elsewhere
>>> loss = ctc_loss(input, target, input_lengths, target_lengths)
>>> loss.backward()

Reference:

A. Graves et al.: Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks: https://www.cs.toronto.edu/~graves/icml_2006.pdf

CTC Loss Configuration

class openspeech.criterion.ctc.configuration.CTCLossConfigs(criterion_name: str = 'ctc', reduction: str = 'mean', zero_infinity: bool = True)[source]

This is the configuration class to store the configuration of a CTCLoss.

It is used to initialize a CTCLoss criterion.

Configuration objects inherit from openspeech.dataclass.configs.OpenspeechDataclass.

Configurations:

  • criterion_name (str): name of criterion (default: ctc)

  • reduction (str): reduction method of criterion (default: mean)

  • zero_infinity (bool): whether to zero infinite losses and the associated gradients (default: True)

Joint CTC Cross Entropy Loss

class openspeech.criterion.joint_ctc_cross_entropy.joint_ctc_cross_entropy.JointCTCCrossEntropyLoss(configs: omegaconf.dictconfig.DictConfig, num_classes: int, tokenizer: openspeech.tokenizers.tokenizer.Tokenizer)[source]

Provides the joint CTC-CrossEntropy loss function. CTC loss is applied to the logits from the encoder and cross entropy loss to the logits from the decoder, which makes the encoder more robust (see the weighted-combination sketch in the Examples below).

Parameters
  • configs (DictConfig) – hydra configuration set

  • num_classes (int) – the number of classification classes

  • tokenizer (Tokenizer) – tokenizer is in charge of preparing the inputs for a model.

Inputs: encoder_logits, logits, output_lengths, targets, target_lengths
  • encoder_logits (torch.FloatTensor): log-probability distribution over the classes, output by the encoder.

    The FloatTensor of size (input_length, batch, num_classes)

  • logits (torch.FloatTensor): log-probability distribution over the classes, output by the model (decoder).

    The FloatTensor of size (batch, seq_length, num_classes)

  • output_lengths (torch.LongTensor): length of model’s outputs.

    The LongTensor of size (batch)

  • targets (torch.LongTensor): ground-truth transcripts encoded as integer indices into the vocabulary.

    The LongTensor of size (batch, target_length)

  • target_lengths (torch.LongTensor): length of targets.

    The LongTensor of size (batch)

Returns: loss, ctc_loss, cross_entropy_loss
  • loss (float): loss for training

  • ctc_loss (float): ctc loss for training

  • cross_entropy_loss (float): cross entropy loss for training
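
Examples:

A minimal sketch of how the two loss terms are combined, following the description above. It assumes plain torch.nn criteria and the ctc_weight / cross_entropy_weight options from the configuration class below; the actual OpenSpeech implementation may differ in details (e.g. the blank id taken from the tokenizer, label smoothing):

>>> import torch
>>> import torch.nn as nn
>>>
>>> ctc_weight, cross_entropy_weight = 0.3, 0.7
>>> T, B, C, U = 50, 4, 10, 12                       # input frames, batch size, classes, target length
>>>
>>> encoder_logits = torch.randn(T, B, C, requires_grad=True).log_softmax(dim=2)
>>> logits = torch.randn(B, U, C, requires_grad=True).log_softmax(dim=2)
>>> targets = torch.randint(1, C, (B, U), dtype=torch.long)
>>> output_lengths = torch.full((B,), T, dtype=torch.long)
>>> target_lengths = torch.full((B,), U, dtype=torch.long)
>>>
>>> ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)(encoder_logits, targets, output_lengths, target_lengths)
>>> cross_entropy_loss = nn.NLLLoss()(logits.reshape(-1, C), targets.reshape(-1))  # logits are log-probabilities
>>> loss = ctc_weight * ctc_loss + cross_entropy_weight * cross_entropy_loss
>>> loss.backward()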

Reference:

Suyoun Kim et al.: Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning: https://arxiv.org/abs/1609.06773

Joint CTC Cross Entropy Loss Configuration

class openspeech.criterion.joint_ctc_cross_entropy.configuration.JointCTCCrossEntropyLossConfigs(criterion_name: str = 'joint_ctc_cross_entropy', reduction: str = 'mean', ctc_weight: float = 0.3, cross_entropy_weight: float = 0.7, smoothing: float = 0.0, zero_infinity: bool = True)[source]

This is the configuration class to store the configuration of a JointCTCCrossEntropyLoss.

It is used to initialize a JointCTCCrossEntropyLoss criterion.

Configuration objects inherit from openspeech.dataclass.configs.OpenspeechDataclass.

Configurations:

  • criterion_name (str): name of criterion (default: joint_ctc_cross_entropy)

  • reduction (str): reduction method of criterion (default: mean)

  • ctc_weight (float): weight of ctc loss for training (default: 0.3)

  • cross_entropy_weight (float): weight of cross entropy loss for training (default: 0.7)

  • smoothing (float): label smoothing ratio (confidence = 1.0 - smoothing) (default: 0.0)

  • zero_infinity (bool): whether to zero infinite losses and the associated gradients (default: True)

Label Smoothed Cross Entropy Loss

class openspeech.criterion.label_smoothed_cross_entropy.label_smoothed_cross_entropy.LabelSmoothedCrossEntropyLoss(configs: omegaconf.dictconfig.DictConfig, num_classes: int, tokenizer: openspeech.tokenizers.tokenizer.Tokenizer)[source]

Label smoothed cross entropy loss function.

Parameters
  • configs (DictConfig) – hydra configuration set

  • num_classes (int) – the number of classification classes

  • tokenizer (Tokenizer) – tokenizer is in charge of preparing the inputs for a model.

Inputs: logits, targets
  • logits (torch.FloatTensor): log-probability distribution over the classes, output by the model.

    The FloatTensor of size (batch, seq_length, num_classes)

  • targets (torch.LongTensor): ground-truth transcripts encoded as integer indices into the vocabulary.

    The LongTensor of size (batch, target_length)

Returns: loss
  • loss (float): loss for training
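
Examples:

A minimal sketch of label smoothing on log-probability inputs, using a hypothetical helper label_smoothed_nll with confidence = 1.0 - smoothing as in the configuration class below; not necessarily the exact OpenSpeech implementation:

>>> import torch
>>>
>>> def label_smoothed_nll(log_probs, targets, smoothing=0.1):
...     # log_probs: (batch * seq_length, num_classes) log-probabilities
...     # targets:   (batch * seq_length,) class indices
...     num_classes = log_probs.size(-1)
...     smooth_target = torch.full_like(log_probs, smoothing / (num_classes - 1))
...     smooth_target.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
...     return torch.mean(torch.sum(-smooth_target * log_probs, dim=-1))
...
>>> logits = torch.randn(8, 5, requires_grad=True).log_softmax(dim=-1)
>>> targets = torch.randint(0, 5, (8,))
>>> loss = label_smoothed_nll(logits, targets)
>>> loss.backward()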

Label Smoothed Cross Entropy Loss Configuration

class openspeech.criterion.label_smoothed_cross_entropy.configuration.LabelSmoothedCrossEntropyLossConfigs(criterion_name: str = 'label_smoothed_cross_entropy', reduction: str = 'mean', smoothing: float = 0.1)[source]

This is the configuration class to store the configuration of a LabelSmoothedCrossEntropyLoss.

It is used to initialize a LabelSmoothedCrossEntropyLoss criterion.

Configuration objects inherit from openspeech.dataclass.configs.OpenspeechDataclass.

Configurations:

  • criterion_name (str): name of criterion (default: label_smoothed_cross_entropy)

  • reduction (str): reduction method of criterion (default: mean)

  • smoothing (float): label smoothing ratio (confidence = 1.0 - smoothing) (default: 0.1)

Perplexity

class openspeech.criterion.perplexity.perplexity.Perplexity(configs: omegaconf.dictconfig.DictConfig, tokenizer: openspeech.tokenizers.tokenizer.Tokenizer)[source]

Language model perplexity loss. Perplexity is the token-averaged likelihood; when the averaging options match, it is the exponential of the negative log-likelihood (illustrated in the Examples below).

Parameters
  • configs (DictConfig) – hydra configuration set

  • tokenizer (Tokenizer) – tokenizer is in charge of preparing the inputs for a model.

Inputs: logits, targets
  • logits (torch.FloatTensor): log-probability distribution over the classes, output by the model.

    The FloatTensor of size (batch, seq_length, num_classes)

  • targets (torch.LongTensor): ground-truth transcripts encoded as integer indices into the vocabulary.

    The LongTensor of size (batch, target_length)

Returns: loss
  • loss (float): loss for training
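
Examples:

A minimal plain-PyTorch sketch of the relation described above (token-averaged negative log-likelihood, exponentiated), not the OpenSpeech class itself:

>>> import torch
>>> import torch.nn as nn
>>>
>>> B, T, C = 4, 10, 30                              # batch, sequence length, num classes
>>> logits = torch.randn(B, T, C).log_softmax(dim=-1)
>>> targets = torch.randint(0, C, (B, T))
>>> nll = nn.NLLLoss()(logits.reshape(-1, C), targets.reshape(-1))  # token-averaged NLL
>>> perplexity = torch.exp(nll)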

Perplexity Configuration

class openspeech.criterion.perplexity.configuration.PerplexityLossConfigs(criterion_name: str = 'perplexity', reduction: str = 'mean')[source]

This is the configuration class to store the configuration of a Perplexity criterion.

It is used to initialize a Perplexity criterion.

Configuration objects inherit from openspeech.dataclass.configs.OpenspeechDataclass.

Configurations:

  • criterion_name (str): name of criterion (default: perplexity)

  • reduction (str): reduction method of criterion (default: mean)

Transducer Loss

class openspeech.criterion.transducer.transducer.TransducerLoss(configs: omegaconf.dictconfig.DictConfig, tokenizer: openspeech.tokenizers.tokenizer.Tokenizer)[source]

Compute path-aware regularization transducer loss.

Parameters
  • configs (DictConfig) – hydra configuration set

  • tokenizer (Tokenizer) – tokenizer is in charge of preparing the inputs for a model.

Inputs: logits, targets, input_lengths, target_lengths
  • logits (torch.FloatTensor): Input tensor with shape (N, T, U, V), where N is the minibatch size, T is the maximum number of input frames, U is the maximum number of output labels and V is the vocabulary of labels (including the blank).

  • targets (torch.IntTensor): Tensor with shape (N, U-1) representing the reference labels for all samples in the minibatch.

  • input_lengths (torch.IntTensor): Tensor with shape (N,) representing the number of frames for each sample in the minibatch.

  • target_lengths (torch.IntTensor): Tensor with shape (N,) representing the length of the transcription for each sample in the minibatch.

Returns: loss
  • loss (torch.FloatTensor): transducer loss
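
Examples:

A minimal shape-checking sketch, assuming configs and tokenizer have been prepared elsewhere (hydra DictConfig, fitted tokenizer) and that the criterion is called with the four inputs listed above:

>>> import torch
>>>
>>> N, T, U, V = 4, 80, 12, 30                       # batch, input frames, output labels, vocab size
>>> criterion = TransducerLoss(configs, tokenizer)
>>> logits = torch.randn(N, T, U, V, requires_grad=True).log_softmax(dim=-1)
>>> targets = torch.randint(1, V, (N, U - 1), dtype=torch.int)
>>> input_lengths = torch.full((N,), T, dtype=torch.int)
>>> target_lengths = torch.full((N,), U - 1, dtype=torch.int)
>>> loss = criterion(logits, targets, input_lengths, target_lengths)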

Reference:

A. Graves: Sequence Transduction with Recurrent Neural Networks: https://arxiv.org/abs/1211.3711.pdf

Transducer Loss Configuration

class openspeech.criterion.transducer.configuration.TransducerLossConfigs(criterion_name: str = 'transducer', reduction: str = 'mean', gather: bool = True)[source]

This is the configuration class to store the configuration of a TransducerLoss.

It is used to initialize a TransducerLoss criterion.

Configuration objects inherit from openspeech.dataclass.configs.OpenspeechDataclass.

Configurations:

  • criterion_name (str): name of criterion (default: transducer)

  • reduction (str): reduction method of criterion (default: mean)

  • gather (bool): reduce memory consumption (default: True)