AISHELL
class openspeech.datasets.aishell.lit_data_module.LightningAIShellDataModule(*args: Any, **kwargs: Any)

    Lightning data module for AIShell-1. The corpus includes a training set, a development set, and a test set. The training set contains 120,098 utterances from 340 speakers; the development set contains 14,326 utterances from 40 speakers; the test set contains 7,176 utterances from 20 speakers. For each speaker, around 360 utterances (about 26 minutes of speech) are released.

    Parameters
        configs (DictConfig) – configuration set.
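    A minimal usage sketch, assuming a Hydra/OmegaConf configuration. The field names below (dataset paths, batch size, workers) are illustrative placeholders, not the exact openspeech configuration schema; consult openspeech's configuration dataclasses for the real keys.

        from omegaconf import OmegaConf
        from openspeech.datasets.aishell.lit_data_module import LightningAIShellDataModule

        # Hypothetical configuration values; the real openspeech schema may differ.
        configs = OmegaConf.create({
            "dataset": {
                "dataset_path": "/path/to/AISHELL-1",          # root of the downloaded corpus
                "manifest_file_path": "aishell_manifest.txt",  # manifest written/read by prepare_data()
            },
            "trainer": {
                "batch_size": 16,
                "num_workers": 4,
            },
        })

        data_module = LightningAIShellDataModule(configs)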
prepare_data()

    Prepares the AISHELL manifest file. If the manifest file does not exist, it is generated.

    Returns
        tokenizer is in charge of preparing the inputs for a model.

    Return type
        tokenizer (Tokenizer)
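    A hedged sketch of calling prepare_data(), continuing from the data_module constructed above. It assumes the configured paths point at a downloaded AISHELL-1 corpus; the returned tokenizer is the one later passed to setup().

        # Builds the manifest file if it is missing and returns the tokenizer
        # that prepares transcript inputs for the model.
        tokenizer = data_module.prepare_data()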
setup(stage: Optional[str] = None, tokenizer: openspeech.tokenizers.tokenizer.Tokenizer = None)

    Splits the dataset into train and valid datasets for training.
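    A sketch of the remaining flow, assuming the data_module and tokenizer from the sketches above: after setup() has created the splits, the standard LightningDataModule hooks below can be used directly or handed to a pytorch_lightning.Trainer.

        data_module.setup(stage=None, tokenizer=tokenizer)

        train_loader = data_module.train_dataloader()
        val_loader = data_module.val_dataloader()

        # Inspect one batch to see the exact structure produced by
        # openspeech's dataset and collate function (not documented here).
        batch = next(iter(train_loader))
        print(type(batch))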
test_dataloader() → torch.utils.data.dataloader.DataLoader

    Implement one or multiple PyTorch DataLoaders for testing.

    The dataloader you return will not be called every epoch unless you set Trainer.reload_dataloaders_every_epoch to True.

    For data processing use the following pattern:

        - download in prepare_data()
        - process and split in setup()

    However, the above are only necessary for distributed processing.

    Warning: do not assign state in prepare_data.

    Note: Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

    Returns
        Single or multiple PyTorch DataLoaders.

    Example:

        def test_dataloader(self):
            transform = transforms.Compose([transforms.ToTensor(),
                                            transforms.Normalize((0.5,), (1.0,))])
            dataset = MNIST(root='/path/to/mnist/', train=False,
                            transform=transform, download=True)
            loader = torch.utils.data.DataLoader(
                dataset=dataset,
                batch_size=self.batch_size,
                shuffle=False
            )
            return loader

        # can also return multiple dataloaders
        def test_dataloader(self):
            return [loader_a, loader_b, ..., loader_n]

    Note: If you don’t need a test dataset and a test_step(), you don’t need to implement this method.

    Note: In the case where you return multiple test dataloaders, the test_step() will have an argument dataloader_idx which matches the order here.
train_dataloader() → torch.utils.data.dataloader.DataLoader

    Implement one or more PyTorch DataLoaders for training.

    Returns
        Either a single PyTorch DataLoader or a collection of these (list, dict, nested lists and dicts). In the case of multiple dataloaders, please see this page.

    The dataloader you return will not be called every epoch unless you set Trainer.reload_dataloaders_every_epoch to True.

    For data processing use the following pattern:

        - download in prepare_data()
        - process and split in setup()

    However, the above are only necessary for distributed processing.

    Warning: do not assign state in prepare_data.

    - fit()
    - …

    Note: Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

    Example:

        # single dataloader
        def train_dataloader(self):
            transform = transforms.Compose([transforms.ToTensor(),
                                            transforms.Normalize((0.5,), (1.0,))])
            dataset = MNIST(root='/path/to/mnist/', train=True,
                            transform=transform, download=True)
            loader = torch.utils.data.DataLoader(
                dataset=dataset,
                batch_size=self.batch_size,
                shuffle=True
            )
            return loader

        # multiple dataloaders, return as list
        def train_dataloader(self):
            mnist = MNIST(...)
            cifar = CIFAR(...)
            mnist_loader = torch.utils.data.DataLoader(
                dataset=mnist,
                batch_size=self.batch_size,
                shuffle=True
            )
            cifar_loader = torch.utils.data.DataLoader(
                dataset=cifar,
                batch_size=self.batch_size,
                shuffle=True
            )
            # each batch will be a list of tensors: [batch_mnist, batch_cifar]
            return [mnist_loader, cifar_loader]

        # multiple dataloaders, return as dict
        def train_dataloader(self):
            mnist = MNIST(...)
            cifar = CIFAR(...)
            mnist_loader = torch.utils.data.DataLoader(
                dataset=mnist,
                batch_size=self.batch_size,
                shuffle=True
            )
            cifar_loader = torch.utils.data.DataLoader(
                dataset=cifar,
                batch_size=self.batch_size,
                shuffle=True
            )
            # each batch will be a dict of tensors: {'mnist': batch_mnist, 'cifar': batch_cifar}
            return {'mnist': mnist_loader, 'cifar': cifar_loader}
val_dataloader() → torch.utils.data.dataloader.DataLoader

    Implement one or multiple PyTorch DataLoaders for validation.

    The dataloader you return will not be called every epoch unless you set Trainer.reload_dataloaders_every_epoch to True.

    It’s recommended that all data downloads and preparation happen in prepare_data().

    Note: Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

    Returns
        Single or multiple PyTorch DataLoaders.

    Examples:

        def val_dataloader(self):
            transform = transforms.Compose([transforms.ToTensor(),
                                            transforms.Normalize((0.5,), (1.0,))])
            dataset = MNIST(root='/path/to/mnist/', train=False,
                            transform=transform, download=True)
            loader = torch.utils.data.DataLoader(
                dataset=dataset,
                batch_size=self.batch_size,
                shuffle=False
            )
            return loader

        # can also return multiple dataloaders
        def val_dataloader(self):
            return [loader_a, loader_b, ..., loader_n]

    Note: If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.

    Note: In the case where you return multiple validation dataloaders, the validation_step() will have an argument dataloader_idx which matches the order here.