Feature Transform
Load Audio
Spectrogram Feature Transform
class openspeech.data.audio.spectrogram.spectrogram.SpectrogramFeatureTransform(configs: omegaconf.dictconfig.DictConfig)
Create a spectrogram from an audio signal.
- Parameters
configs (DictConfig) – configuration set
- Returns
A spectrogram feature. The shape is (seq_length, num_mels)
- Return type
Tensor
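To illustrate the documented output shape, here is a minimal NumPy sketch of an STFT magnitude spectrogram. It is not openspeech's implementation: the helper name `magnitude_spectrogram`, the Hann window, and the `n_fft=320` / `hop_length=160` values (20 ms and 10 ms at 16 kHz, assuming the frame settings are in milliseconds) are assumptions made for the example.

```python
import numpy as np


def magnitude_spectrogram(signal, n_fft=320, hop_length=160):
    """Hypothetical helper: STFT magnitude with shape (seq_length, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    # Count only the full frames that fit in the signal (no padding).
    num_frames = 1 + (len(signal) - n_fft) // hop_length
    frames = np.stack(
        [signal[i * hop_length : i * hop_length + n_fft] * window for i in range(num_frames)]
    )
    # One-sided FFT of each windowed frame, then magnitude.
    return np.abs(np.fft.rfft(frames, axis=1))


signal = np.random.default_rng(0).standard_normal(16000)  # one second of noise at 16 kHz
spec = magnitude_spectrogram(signal)
print(spec.shape)
```

With these values the feature dimension is 320 // 2 + 1 = 161, which matches the default num_mels of the spectrogram configuration.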
Spectrogram Feature Transform Configuration
class openspeech.data.audio.spectrogram.configuration.SpectrogramConfigs(name: str = 'spectrogram', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 161, apply_spec_augment: bool = True, apply_noise_augment: bool = False, apply_time_stretch_augment: bool = False, apply_joining_augment: bool = False)
This is the configuration class that stores the configuration of a SpectrogramFeatureTransform. It is used to instantiate a SpectrogramFeatureTransform feature transform.
Configuration objects inherit from openspeech.dataclass.OpenspeechDataclass.
- Parameters
name (str) – name of feature transform. (default: spectrogram)
sample_rate (int) – sampling rate of audio (default: 16000)
frame_length (float) – frame length for spectrogram (default: 20.0)
frame_shift (float) – hop length between successive STFT windows (default: 10.0)
del_silence (bool) – flag indicating whether to delete silence or not (default: False)
num_mels (int) – size of the feature dimension of the output spectrogram. (default: 161)
apply_spec_augment (bool) – flag indicating whether to apply spec augment or not (default: True)
apply_noise_augment (bool) – flag indicating whether to apply noise augment or not (default: False)
apply_time_stretch_augment (bool) – flag indicating whether to apply time stretch augment or not (default: False)
apply_joining_augment (bool) – flag indicating whether to apply audio joining augment or not (default: False)
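The defaults above can be mirrored in a small stand-in dataclass to see how the frame settings relate to sample counts. This is a sketch, not the openspeech class: `SpectrogramConfigsSketch` and `frame_sizes_in_samples` are hypothetical names, and reading frame_length and frame_shift as milliseconds is an assumption.

```python
from dataclasses import dataclass, replace


@dataclass
class SpectrogramConfigsSketch:
    """Stand-in mirroring the documented fields; not the openspeech class itself."""
    name: str = "spectrogram"
    sample_rate: int = 16000
    frame_length: float = 20.0
    frame_shift: float = 10.0
    del_silence: bool = False
    num_mels: int = 161
    apply_spec_augment: bool = True
    apply_noise_augment: bool = False
    apply_time_stretch_augment: bool = False
    apply_joining_augment: bool = False


def frame_sizes_in_samples(cfg):
    # Assumption: frame_length and frame_shift are given in milliseconds.
    window = int(round(cfg.sample_rate * 0.001 * cfg.frame_length))
    hop = int(round(cfg.sample_rate * 0.001 * cfg.frame_shift))
    return window, hop


default_cfg = SpectrogramConfigsSketch()
# Override a single field, e.g. to switch augmentation off for evaluation.
eval_cfg = replace(default_cfg, apply_spec_augment=False)

window, hop = frame_sizes_in_samples(default_cfg)
print(window, hop, window // 2 + 1)
```

Under the millisecond assumption the window spans 320 samples, and 320 // 2 + 1 = 161 one-sided FFT bins is consistent with the default num_mels of 161.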
Mel-Spectrogram Feature Transform
class openspeech.data.audio.melspectrogram.melspectrogram.MelSpectrogramFeatureTransform(configs: omegaconf.dictconfig.DictConfig)
Create a mel-spectrogram from a raw audio signal. This is a composition of Spectrogram and MelScale.
- Parameters
configs (DictConfig) – configuration set
- Returns
A mel-spectrogram feature. The shape is (seq_length, num_mels)
- Return type
Tensor
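The "Spectrogram then MelScale" composition can be sketched in NumPy as a power spectrogram multiplied by a bank of triangular mel filters. The helper names, the HTK-style mel formula, and the 320-point FFT are assumptions for illustration; the library's own filter construction may differ.

```python
import numpy as np


def hz_to_mel(hz):
    # HTK-style mel scale (one common convention, assumed here).
    return 2595.0 * np.log10(1.0 + np.asarray(hz) / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel) / 2595.0) - 1.0)


def mel_filterbank(sample_rate=16000, n_fft=320, num_mels=80):
    """Triangular filters with shape (num_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    # num_mels + 2 points equally spaced on the mel scale give the filter edges.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), num_mels + 2))
    lower, center, upper = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (fft_freqs - lower) / (center - lower)
    falling = (upper - fft_freqs) / (upper - center)
    # Each row is a triangle: rising slope, falling slope, zero elsewhere.
    return np.maximum(0.0, np.minimum(rising, falling))


power_spec = np.random.default_rng(0).random((99, 161))  # (seq_length, n_fft // 2 + 1)
mel_spec = power_spec @ mel_filterbank().T                # (seq_length, num_mels)
print(mel_spec.shape)
```

With a 320-point FFT the bins are 50 Hz apart, so the narrowest low-frequency filters can come out empty. A larger FFT size avoids this.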
Mel-Spectrogram Feature Transform Configuration
class openspeech.data.audio.melspectrogram.configuration.MelSpectrogramConfigs(name: str = 'melspectrogram', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 80, apply_spec_augment: bool = True, apply_noise_augment: bool = False, apply_time_stretch_augment: bool = False, apply_joining_augment: bool = False)
This is the configuration class that stores the configuration of a MelSpectrogramFeatureTransform. It is used to instantiate a MelSpectrogramFeatureTransform feature transform.
Configuration objects inherit from openspeech.dataclass.OpenspeechDataclass.
- Parameters
name (str) – name of feature transform. (default: melspectrogram)
sample_rate (int) – sampling rate of audio (default: 16000)
frame_length (float) – frame length for spectrogram (default: 20.0)
frame_shift (float) – hop length between successive STFT windows (default: 10.0)
del_silence (bool) – flag indicating whether to delete silence or not (default: False)
num_mels (int) – the number of mel filter banks. (default: 80)
apply_spec_augment (bool) – flag indicating whether to apply spec augment or not (default: True)
apply_noise_augment (bool) – flag indicating whether to apply noise augment or not (default: False)
apply_time_stretch_augment (bool) – flag indicating whether to apply time stretch augment or not (default: False)
apply_joining_augment (bool) – flag indicating whether to apply audio joining augment or not (default: False)
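The apply_spec_augment flag presumably refers to SpecAugment-style masking, where random frequency bands and time spans of the feature are blanked out during training. The sketch below shows the general idea only. The function name, the mask widths, and the single mask per axis are hypothetical and are not taken from openspeech.

```python
import numpy as np


def spec_augment_sketch(feature, freq_mask_width=10, time_mask_width=20, rng=None):
    """Zero out one random frequency band and one random time span.

    feature has shape (seq_length, num_mels); a masked copy is returned.
    """
    rng = rng if rng is not None else np.random.default_rng()
    seq_length, num_mels = feature.shape
    masked = feature.copy()

    # Frequency mask: f consecutive feature bins starting at f0.
    f = int(rng.integers(0, min(freq_mask_width, num_mels) + 1))
    f0 = int(rng.integers(0, num_mels - f + 1))
    masked[:, f0 : f0 + f] = 0.0

    # Time mask: t consecutive frames starting at t0.
    t = int(rng.integers(0, min(time_mask_width, seq_length) + 1))
    t0 = int(rng.integers(0, seq_length - t + 1))
    masked[t0 : t0 + t, :] = 0.0
    return masked


feature = np.ones((100, 80))  # dummy (seq_length, num_mels) feature
augmented = spec_augment_sketch(feature, rng=np.random.default_rng(0))
print(augmented.shape, int((augmented == 0).sum()))
```

The input is left untouched and the output keeps the (seq_length, num_mels) shape, so a mask of this kind can be applied after any of the feature transforms in this section.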
Filter-Bank Feature Transform
class openspeech.data.audio.filter_bank.filter_bank.FilterBankFeatureTransform(configs: omegaconf.dictconfig.DictConfig)
Create filter-bank (fbank) features from a raw audio signal. This matches the input/output of Kaldi’s compute-fbank-feats.
- Parameters
configs (DictConfig) – hydra configuration set
- Inputs:
signal (np.ndarray): signal from audio file.
- Returns
An fbank feature identical to what Kaldi would output. The shape is (seq_length, num_mels)
- Return type
Tensor
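Kaldi's compute-fbank-feats takes its frame length and frame shift in milliseconds and, by default (snip_edges=true), keeps only frames that fit entirely inside the signal. Assuming the transform follows that convention, seq_length can be estimated as in the sketch below; the helper name `kaldi_num_frames` is hypothetical.

```python
def kaldi_num_frames(num_samples, sample_rate=16000, frame_length=20.0, frame_shift=10.0):
    """Frame count under Kaldi's snip_edges=true convention (lengths in milliseconds)."""
    window = int(round(sample_rate * 0.001 * frame_length))
    shift = int(round(sample_rate * 0.001 * frame_shift))
    if num_samples < window:
        return 0  # the signal is too short for even one full frame
    return 1 + (num_samples - window) // shift


print(kaldi_num_frames(16000))  # one second of audio at 16 kHz
```

With the documented defaults, one second of 16 kHz audio gives 1 + (16000 - 320) // 160 = 99 frames.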
Filter-Bank Feature Transform Configuration
class openspeech.data.audio.filter_bank.configuration.FilterBankConfigs(name: str = 'fbank', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 80, apply_spec_augment: bool = True, apply_noise_augment: bool = False, apply_time_stretch_augment: bool = False, apply_joining_augment: bool = False)
This is the configuration class that stores the configuration of a FilterBankFeatureTransform. It is used to instantiate a FilterBankFeatureTransform feature transform.
Configuration objects inherit from openspeech.dataclass.configs.OpenspeechDataclass.
- Parameters
name (str) – name of feature transform. (default: fbank)
sample_rate (int) – sampling rate of audio (default: 16000)
frame_length (float) – frame length for spectrogram (default: 20.0)
frame_shift (float) – hop length between successive STFT windows (default: 10.0)
del_silence (bool) – flag indicating whether to delete silence or not (default: False)
num_mels (int) – the number of mel filter-bank bins. (default: 80)
apply_spec_augment (bool) – flag indicating whether to apply spec augment or not (default: True)
apply_noise_augment (bool) – flag indicating whether to apply noise augment or not (default: False)
apply_time_stretch_augment (bool) – flag indicating whether to apply time stretch augment or not (default: False)
apply_joining_augment (bool) – flag indicating whether to apply audio joining augment or not (default: False)
MFCC Feature Transform
class openspeech.data.audio.mfcc.mfcc.MFCCFeatureTransform(configs: omegaconf.dictconfig.DictConfig)
Create the mel-frequency cepstral coefficients (MFCC) from an audio signal.
By default, this calculates the MFCC on the dB-scaled mel spectrogram. This is not the textbook implementation, but is implemented here for consistency with librosa.
The output depends on the maximum value in the input spectrogram, and so may return different values for an audio clip split into snippets vs. a full clip.
- Parameters
configs (DictConfig) – configuration set
- Returns
An MFCC feature. The shape is (seq_length, num_mels)
- Return type
Tensor
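The dependence on the input's maximum value comes from the dB conversion, which clips everything more than a fixed range below the loudest value. The sketch below is modeled on the clipping that librosa applies in its power-to-dB conversion. The function name is hypothetical, while the `amin` and `top_db` parameter names follow librosa's.

```python
import numpy as np


def power_to_db_sketch(power, amin=1e-10, top_db=80.0):
    """dB conversion with clipping relative to the input's own maximum."""
    log_spec = 10.0 * np.log10(np.maximum(amin, power))
    # Everything more than top_db below the loudest value is clipped.
    return np.maximum(log_spec, log_spec.max() - top_db)


# A clip whose first half is quiet and whose second half is loud (power values).
full_clip = np.concatenate([np.full(50, 1e-9), np.full(50, 1.0)])
quiet_snippet = full_clip[:50]

db_full = power_to_db_sketch(full_clip)
db_snippet = power_to_db_sketch(quiet_snippet)
print(db_full[0], db_snippet[0])
```

The same quiet sample comes out at about -80 dB when processed as part of the full clip (clipped relative to the loud half) and at about -90 dB when the quiet snippet is processed on its own.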
MFCC Feature Transform Configuration
class openspeech.data.audio.mfcc.configuration.MFCCConfigs(name: str = 'mfcc', sample_rate: int = 16000, frame_length: float = 20.0, frame_shift: float = 10.0, del_silence: bool = False, num_mels: int = 40, apply_spec_augment: bool = True, apply_noise_augment: bool = False, apply_time_stretch_augment: bool = False, apply_joining_augment: bool = False)
This is the configuration class that stores the configuration of a MFCCFeatureTransform. It is used to instantiate a MFCCFeatureTransform feature transform.
Configuration objects inherit from openspeech.dataclass.OpenspeechDataclass.
- Parameters
name (str) – name of feature transform. (default: mfcc)
sample_rate (int) – sampling rate of audio (default: 16000)
frame_length (float) – frame length for spectrogram (default: 20.0)
frame_shift (float) – hop length between successive STFT windows (default: 10.0)
del_silence (bool) – flag indicating whether to delete silence or not (default: False)
num_mels (int) – the number of MFC coefficients to retain. (default: 40)
apply_spec_augment (bool) – flag indicating whether to apply spec augment or not (default: True)
apply_noise_augment (bool) – flag indicating whether to apply noise augment or not (default: False)
apply_time_stretch_augment (bool) – flag indicating whether to apply time stretch augment or not (default: False)
apply_joining_augment (bool) – flag indicating whether to apply audio joining augment or not (default: False)
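"Retaining" coefficients means keeping only the first few outputs of a discrete cosine transform taken over the log-mel bands of each frame. A minimal NumPy sketch of that step follows. The helper name and the 128 mel bands are hypothetical, and the orthonormal DCT-II scaling is an assumption rather than a documented property of this transform.

```python
import numpy as np


def dct_ii_matrix(num_coeffs, num_inputs):
    """First num_coeffs rows of the orthonormal DCT-II matrix, shape (num_coeffs, num_inputs)."""
    n = np.arange(num_inputs)
    k = np.arange(num_coeffs)[:, None]
    basis = np.cos(np.pi / num_inputs * (n + 0.5) * k) * np.sqrt(2.0 / num_inputs)
    basis[0] /= np.sqrt(2.0)  # orthonormal scaling of the constant row
    return basis


# Dummy dB-scaled mel spectrogram: (seq_length, mel bands), sizes chosen for illustration.
log_mel = np.random.default_rng(0).standard_normal((99, 128))
mfcc = log_mel @ dct_ii_matrix(40, 128).T  # keep the first 40 coefficients per frame
print(mfcc.shape)
```

The result has shape (seq_length, 40), which lines up with the (seq_length, num_mels) return shape when num_mels is 40.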