
How to create a torchtext.data.TabularDataset directly from a list or dict

A torchtext.data.TabularDataset can be created from a TSV, JSON, or CSV file and then used to build a vocabulary from GloVe, FastText, or any other embeddings. My requirement, however, is to create a torchtext.data.TabularDataset directly from a list or a dict.

Current implementation, which reads TSV files:

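# Imports assumed by this snippet (the code itself runs inside a class method).
from torchtext import data
from torchtext.vocab import GloVe

# Fields: raw ids, tokenized question text, and non-sequential labels.
self.RAW = data.RawField()
self.TEXT = data.Field(batch_first=True)
self.LABEL = data.Field(sequential=False, unk_token=None)

# Load the three splits from TSV files under .data/quora.
self.train, self.dev, self.test = data.TabularDataset.splits(
    path='.data/quora',
    train='train.tsv',
    validation='dev.tsv',
    test='test.tsv',
    format='tsv',
    fields=[('label', self.LABEL),
            ('q1', self.TEXT),
            ('q2', self.TEXT),
            ('id', self.RAW)])

# Build the vocabularies; TEXT is initialized with pretrained GloVe vectors.
self.TEXT.build_vocab(self.train, self.dev, self.test, vectors=GloVe(name='840B', dim=300))
self.LABEL.build_vocab(self.train)

# Bucket examples by the lengths of both questions.
sort_key = lambda x: data.interleave_keys(len(x.q1), len(x.q2))

self.train_iter, self.dev_iter, self.test_iter = \
    data.BucketIterator.splits((self.train, self.dev, self.test),
                               batch_sizes=[args.batch_size] * 3,
                               device=args.gpu,
                               sort_key=sort_key)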

This is the current working code for reading data from a file. To create the dataset directly from a list or dict, I tried built-in functions such as Example.fromdict and Example.fromlist, but when execution reaches the last for loop it throws an error: AttributeError: 'BucketIterator' object has no attribute 'q1'
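For reference, the kind of attempt described above can be sketched as follows. This is a minimal sketch, not the asker's actual code: the helper names records_to_rows and dataset_from_rows and the sample records are hypothetical, and it assumes a torchtext release that still ships the Field/Example API (torchtext.data in older releases, torchtext.legacy.data in the releases that include it). Only Example.fromlist and data.Dataset are used from torchtext.

```python
def records_to_rows(records, columns):
    """Turn a list of dicts into a list of lists, ordered like `columns`."""
    return [[record[name] for name in columns] for record in records]


def dataset_from_rows(rows, fields):
    """Build a torchtext Dataset from in-memory rows.

    `fields` is a list of (name, Field) pairs in the same order as the
    values of each row. Assumption: the legacy Field/Example API is present.
    """
    try:
        from torchtext.legacy import data  # releases that ship torchtext.legacy
    except ImportError:
        from torchtext import data         # older releases
    # Each row is passed through every field's preprocessing (tokenization).
    examples = [data.Example.fromlist(row, fields) for row in rows]
    return data.Dataset(examples, fields)


# Hypothetical sample data with the same columns as the TSV files above.
records = [
    {"label": "1", "q1": "how do i learn python",
     "q2": "how can i learn python", "id": "0"},
    {"label": "0", "q1": "what is pytorch",
     "q2": "where is paris", "id": "1"},
]
columns = ["label", "q1", "q2", "id"]
rows = records_to_rows(records, columns)
```

With the fields list from the snippet above, dataset_from_rows(rows, fields) should return a dataset that can be handed to build_vocab and BucketIterator in place of a TabularDataset. All values are kept as strings here, as they would be when read from a TSV file.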

Arjun Sankarlal asked Oct 29 '18 13:10



1 Answer

I had to write my own class that inherits from the Dataset class, with a few modifications to the code of the torchtext.data.TabularDataset class.

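from torchtext import data
from torchtext.data import Example


class TabularDataset_From_List(data.Dataset):
    """Like TabularDataset, but takes its examples from an in-memory list."""

    def __init__(self, input_list, format, fields, skip_header=False, **kwargs):
        # Pick the Example constructor that matches the format of each item.
        make_example = {
            'json': Example.fromJSON, 'dict': Example.fromdict,
            'tsv': Example.fromTSV, 'csv': Example.fromCSV}[format.lower()]

        # Parse every item of the list instead of every line of a file.
        examples = [make_example(item, fields) for item in input_list]

        # For dict/JSON input, fields is a dict; flatten it to a list
        # because Dataset expects a list of (name, field) pairs.
        if make_example in (Example.fromdict, Example.fromJSON):
            fields, field_dict = [], fields
            for field in field_dict.values():
                if isinstance(field, list):
                    fields.extend(field)
                else:
                    fields.append(field)

        super(TabularDataset_From_List, self).__init__(examples, fields, **kwargs)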

    @classmethod
    def splits(cls, path=None, root='.data', train=None, validation=None,
               test=None, **kwargs):
        # train, validation and test are in-memory lists, not file names.
        # path and root are kept only so the signature matches
        # TabularDataset.splits; nothing is read from or downloaded to disk.
        train_data = None if train is None else cls(
            train, **kwargs)
        val_data = None if validation is None else cls(
            validation, **kwargs)
        test_data = None if test is None else cls(
            test, **kwargs)
        return tuple(d for d in (train_data, val_data, test_data)
                     if d is not None)
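
If subclassing is not an option, a different workaround (not the approach above) is to write the in-memory rows to a temporary TSV file and let the stock TabularDataset.splits read it. The sketch below uses only the standard library; the helper name write_tsv and the sample rows are hypothetical, and it assumes the values contain no tabs, quotes, or newlines.

```python
import csv
import os
import tempfile


def write_tsv(rows, path):
    """Write rows (lists of strings) to `path`, one tab-separated line each."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)


# Hypothetical sample rows in the column order label, q1, q2, id.
train_rows = [
    ["1", "how do i learn python", "how can i learn python", "0"],
    ["0", "what is pytorch", "where is paris", "1"],
]

# Write the rows to a fresh temporary directory.
tmp_dir = tempfile.mkdtemp()
train_path = os.path.join(tmp_dir, "train.tsv")
write_tsv(train_rows, train_path)

# Read the first line back to confirm the on-disk layout.
with open(train_path, encoding="utf-8") as handle:
    first_line = handle.readline().rstrip("\n")
```

The directory tmp_dir can then be passed as path, with train='train.tsv' and format='tsv', to data.TabularDataset.splits exactly as in the question. The cost is an extra round trip through the file system, which the subclass above avoids.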

Arjun Sankarlal answered Sep 21 '22 20:09