Data Loaders

Dataset

class mammoth.inputters.dataset.ParallelCorpus(src_file, tgt_file, src_vocab, tgt_vocab, transforms, device='cpu', stride=None, offset=None, is_train=False, task=None)[source]

Bases: torch.utils.data.dataset.IterableDataset

Torch-style iterable dataset over a parallel (source–target) corpus.
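
A minimal usage sketch (the file paths, vocab objects, and empty transform dict below are placeholder assumptions for illustration, not part of the documented API):

    from mammoth.inputters.dataset import ParallelCorpus

    # Hypothetical paths and vocabs, for illustration only.
    corpus = ParallelCorpus(
        src_file='data/train.src',   # one source sentence per line
        tgt_file='data/train.tgt',   # parallel target sentences
        src_vocab=src_vocab,         # mammoth.inputters.vocab.Vocab instances
        tgt_vocab=tgt_vocab,
        transforms={},               # no on-the-fly transforms in this sketch
        is_train=True,
    )

    # As an IterableDataset, it is consumed by iteration.
    for example in corpus:
        ...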

Data loading

class mammoth.inputters.dataloader.DynamicDatasetIter(task_queue_manager, opts, corpora_info, transforms_cls, vocabs_dict, is_train, batch_type, batch_size, batch_size_multiple, data_type='text', pool_size=2048, n_buckets=1024, skip_empty_level='warning')[source]

Bases: object

Yields batches from one or more plain-text corpora.

Parameters
  • corpora (dict[str, ParallelCorpus]) – collection of corpora to iterate over;

  • corpora_info (dict[str, dict]) – metadata for each corpus, keyed to match corpora;

  • transforms (dict[str, Transform]) – transforms that may be used by the corpora;

  • fields (dict[str, Field]) – fields dict used to convert corpora into tensors;

  • is_train (bool) – True when generating data for training;

  • batch_type (str) – unit in which batch_size is counted, choices=[tokens, sents];

  • batch_size (int) – number of examples in a batch;

  • batch_size_multiple (int) – make the batch size a multiple of this;

  • data_type (str) – input data type; currently only text is supported;

  • pool_size (int) – accumulate this number of examples in a dynamic dataset;

  • skip_empty_level (str) – severity level when encountering an empty line;

  • stride (int) – iterate data files with this stride;

  • offset (int) – iterate data files with this offset.

Variables
  • dataset_adapter (DatasetAdapter) – adapter that organizes raw corpus examples into tensors;

  • mixer (MixingStrategy) – the strategy for mixing (iterating over) the corpora.

classmethod from_opts(task_queue_manager, transforms_cls, vocabs_dict, opts, is_train)[source]

Initialize DynamicDatasetIter with options parsed from opts.
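
A hedged construction sketch: the task_queue_manager, transforms_cls, vocabs_dict, and opts objects are assumed to come from MAMMOTH's task setup and option parsing, and the exact structure of each yielded item is not specified by these docs:

    # Assumed to be prepared earlier by option parsing / task setup.
    train_iter = DynamicDatasetIter.from_opts(
        task_queue_manager=task_queue_manager,
        transforms_cls=transforms_cls,   # transform classes, keyed by name
        vocabs_dict=vocabs_dict,         # vocabs per (side, language)
        opts=opts,
        is_train=True,
    )

    for batch in train_iter:
        ...  # feed batches to the training loop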

class mammoth.inputters.dataloader.LookAheadBucketing(dataset, look_ahead_size, n_buckets, batch_size, bucket_fn, numel_fn)[source]

Bases: object

bucket_is_empty(s_idx: int, t_idx: int) → bool[source]

Check whether the bucket at (s_idx, t_idx) is empty.

is_empty() → bool[source]

Check whether all buckets are empty.

maybe_replenish()[source]

Attempt to pull one more example from the dataset into the reservoir.
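
A sketch of plugging a dataset into the bucketing reservoir. The bucket_fn and numel_fn callables below are guesses at plausible implementations (bucket examples on a grid of source/target lengths, and count tokens for token-based batch sizing); the example structure they receive is an assumption, not taken from these docs:

    # Hypothetical helpers; the real expected signatures may differ.
    def bucket_fn(example):
        # Place each example on a 2-D grid keyed by source/target length.
        return len(example['src']), len(example['tgt'])

    def numel_fn(example):
        # Token count, so token-based batch sizes can be respected.
        return len(example['src']) + len(example['tgt'])

    bucketing = LookAheadBucketing(
        dataset=corpus,        # e.g. a ParallelCorpus
        look_ahead_size=2048,  # how many examples to buffer before batching
        n_buckets=1024,
        batch_size=4096,
        bucket_fn=bucket_fn,
        numel_fn=numel_fn,
    )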

class mammoth.inputters.dataloader.InferenceBatcher(dataset, batch_size)[source]

Bases: object

Iterator for inference
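
A minimal sketch, assuming the dataset yields ready-to-batch examples:

    batcher = InferenceBatcher(corpus, batch_size=32)
    for batch in batcher:
        ...  # run the model on each batch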

Vocab

class mammoth.inputters.vocab.Vocab(path, items=None, tag='', size=None, specials=[])[source]

Bases: object

classmethod merge(*vocabs, size=None)[source]

Merge several vocabs into one, optionally capped at size entries.
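
A hedged sketch of loading and merging vocabularies; the file format (one token per line) and the effect of the size cap are assumptions for illustration:

    from mammoth.inputters.vocab import Vocab

    # Hypothetical vocab files.
    vocab_en = Vocab('vocabs/en.vocab', tag='en')
    vocab_de = Vocab('vocabs/de.vocab', tag='de')

    # Merge into a joint vocab, keeping at most `size` entries.
    joint = Vocab.merge(vocab_en, vocab_de, size=32000)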