LibriSpeech
class openspeech.datasets.librispeech.lit_data_module.LightningLibriSpeechDataModule(*args: Any, **kwargs: Any)
PyTorch Lightning data module for the LibriSpeech dataset. LibriSpeech is a corpus of approximately 1,000 hours of read English speech, sampled at 16 kHz, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project and has been carefully segmented and aligned.

Parameters
    configs (DictConfig) – configuration set
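As a quick orientation, here is a minimal construction sketch. The config keys shown are hypothetical placeholders, not the actual openspeech schema; the authoritative field names come from openspeech's own Hydra/OmegaConf configuration files.

    from omegaconf import OmegaConf
    from openspeech.datasets.librispeech.lit_data_module import LightningLibriSpeechDataModule

    # Hypothetical keys for illustration only -- consult the openspeech
    # configs for the real schema (dataset paths, batch size, tokenizer, ...).
    configs = OmegaConf.create({
        "dataset": {"dataset_path": "/path/to/LibriSpeech"},
        "trainer": {"batch_size": 32},
    })

    data_module = LightningLibriSpeechDataModule(configs)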
prepare_data() → openspeech.tokenizers.tokenizer.Tokenizer
Prepare the LibriSpeech data.

Returns
    The tokenizer, which is in charge of preparing the inputs for a model.

Return type
    Tokenizer
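A short usage sketch, assuming the data module constructed above: prepare_data() readies the corpus and hands back the tokenizer used to encode transcripts.

    # Returns the Tokenizer that prepares model inputs (e.g. encodes
    # transcripts); typically called once, before training starts.
    tokenizer = data_module.prepare_data()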
setup(stage: Optional[str] = None, tokenizer: openspeech.tokenizers.tokenizer.Tokenizer = None) → None
Split the dataset into train, valid, and test sets.
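Continuing the sketch, setup() can be called with the tokenizer returned by prepare_data(), matching the signature above; afterwards the per-split dataloaders can be requested.

    # Split into train/valid/test, then fetch the dataloaders.
    data_module.setup(stage=None, tokenizer=tokenizer)
    train_loader = data_module.train_dataloader()
    val_loader = data_module.val_dataloader()
    test_loader = data_module.test_dataloader()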
test_dataloader() → torch.utils.data.dataloader.DataLoader
Implement one or multiple PyTorch DataLoaders for testing.

The dataloader you return will not be called every epoch unless you set the Trainer's reload_dataloaders_every_epoch flag to True.

For data processing use the following pattern:

- download in prepare_data()
- process and split in setup()

However, the above are only necessary for distributed processing.

Warning: do not assign state in prepare_data().

Note: Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Returns
    Single or multiple PyTorch DataLoaders.

Example:

    def test_dataloader(self):
        transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.5,), (1.0,)),
        ])
        dataset = MNIST(root='/path/to/mnist/', train=False,
                        transform=transform, download=True)
        loader = torch.utils.data.DataLoader(
            dataset=dataset,
            batch_size=self.batch_size,
            shuffle=False,
        )
        return loader

    # can also return multiple dataloaders
    def test_dataloader(self):
        return [loader_a, loader_b, ..., loader_n]

Note: If you don't need a test dataset and a test_step(), you don't need to implement this method.

Note: In the case where you return multiple test dataloaders, the test_step() will have an argument dataloader_idx which matches the order here.
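In practice you rarely call test_dataloader() directly; Lightning's Trainer fetches it from the data module. A hedged sketch, where model stands in for any compatible LightningModule:

    import pytorch_lightning as pl

    trainer = pl.Trainer()
    # The Trainer pulls test_dataloader() from the data module internally.
    trainer.test(model, datamodule=data_module)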
train_dataloader() → torch.utils.data.dataloader.DataLoader
Implement one or more PyTorch DataLoaders for training.

Returns
    Either a single PyTorch DataLoader or a collection of these (list, dict, nested lists and dicts). In the case of multiple dataloaders, see the PyTorch Lightning documentation page on multiple training dataloaders.

The dataloader you return will not be called every epoch unless you set the Trainer's reload_dataloaders_every_epoch flag to True.

For data processing use the following pattern:

- download in prepare_data()
- process and split in setup()

However, the above are only necessary for distributed processing.

Warning: do not assign state in prepare_data().

The relevant hook order is:

- fit()
- …
- prepare_data()
- setup()
- train_dataloader()

Note: Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Example:

    # single dataloader
    def train_dataloader(self):
        transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.5,), (1.0,)),
        ])
        dataset = MNIST(root='/path/to/mnist/', train=True,
                        transform=transform, download=True)
        loader = torch.utils.data.DataLoader(
            dataset=dataset,
            batch_size=self.batch_size,
            shuffle=True,
        )
        return loader

    # multiple dataloaders, return as list
    def train_dataloader(self):
        mnist = MNIST(...)
        cifar = CIFAR(...)
        mnist_loader = torch.utils.data.DataLoader(
            dataset=mnist, batch_size=self.batch_size, shuffle=True,
        )
        cifar_loader = torch.utils.data.DataLoader(
            dataset=cifar, batch_size=self.batch_size, shuffle=True,
        )
        # each batch will be a list of tensors: [batch_mnist, batch_cifar]
        return [mnist_loader, cifar_loader]

    # multiple dataloaders, return as dict
    def train_dataloader(self):
        mnist = MNIST(...)
        cifar = CIFAR(...)
        mnist_loader = torch.utils.data.DataLoader(
            dataset=mnist, batch_size=self.batch_size, shuffle=True,
        )
        cifar_loader = torch.utils.data.DataLoader(
            dataset=cifar, batch_size=self.batch_size, shuffle=True,
        )
        # each batch will be a dict of tensors: {'mnist': batch_mnist, 'cifar': batch_cifar}
        return {'mnist': mnist_loader, 'cifar': cifar_loader}
val_dataloader() → torch.utils.data.dataloader.DataLoader
Implement one or multiple PyTorch DataLoaders for validation.

The dataloader you return will not be called every epoch unless you set the Trainer's reload_dataloaders_every_epoch flag to True.

It's recommended that all data downloads and preparation happen in prepare_data().

Note: Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Returns
    Single or multiple PyTorch DataLoaders.

Examples:

    def val_dataloader(self):
        transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.5,), (1.0,)),
        ])
        dataset = MNIST(root='/path/to/mnist/', train=False,
                        transform=transform, download=True)
        loader = torch.utils.data.DataLoader(
            dataset=dataset,
            batch_size=self.batch_size,
            shuffle=False,
        )
        return loader

    # can also return multiple dataloaders
    def val_dataloader(self):
        return [loader_a, loader_b, ..., loader_n]

Note: If you don't need a validation dataset and a validation_step(), you don't need to implement this method.

Note: In the case where you return multiple validation dataloaders, the validation_step() will have an argument dataloader_idx which matches the order here.
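Putting the pieces together, an end-to-end sketch of how the data module is consumed by a Lightning Trainer. This assumes prepare_data()/setup() are driven manually, as the signatures above suggest; openspeech's own training entry points may orchestrate this differently, and model is a placeholder for any LightningModule built around this tokenizer's vocabulary.

    import pytorch_lightning as pl

    data_module = LightningLibriSpeechDataModule(configs)
    tokenizer = data_module.prepare_data()
    data_module.setup(tokenizer=tokenizer)

    # fit() pulls the train and validation dataloaders from the
    # data module automatically; `model` is a hypothetical LightningModule.
    trainer = pl.Trainer(max_epochs=1)
    trainer.fit(model, datamodule=data_module)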