A torchtext.data.TabularDataset can be created from a TSV/JSON/CSV file and then used to build a vocabulary from GloVe, FastText, or any other embeddings. My requirement, however, is to create a torchtext.data.TabularDataset directly, either from a list or from a dict.
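For concreteness, the kind of in-memory input I want to load looks like this (the field names mirror the TSV columns; the values themselves are made up for illustration):

```python
# Hypothetical in-memory rows mirroring the TSV columns (label, q1, q2, id).
rows_as_dicts = [
    {'label': '1', 'q1': 'How do I learn Python?',
     'q2': 'What is the best way to learn Python?', 'id': '101'},
    {'label': '0', 'q1': 'Is the sky blue?',
     'q2': 'How do planes fly?', 'id': '102'},
]

# The same data as plain lists, one inner list per row:
rows_as_lists = [[r['label'], r['q1'], r['q2'], r['id']] for r in rows_as_dicts]
print(rows_as_lists[0][0])  # prints 1
```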
Here is my current working implementation, which reads TSV files:
self.RAW = data.RawField()
self.TEXT = data.Field(batch_first=True)
self.LABEL = data.Field(sequential=False, unk_token=None)
self.train, self.dev, self.test = data.TabularDataset.splits(
    path='.data/quora',
    train='train.tsv',
    validation='dev.tsv',
    test='test.tsv',
    format='tsv',
    fields=[('label', self.LABEL),
            ('q1', self.TEXT),
            ('q2', self.TEXT),
            ('id', self.RAW)])
self.TEXT.build_vocab(self.train, self.dev, self.test, vectors=GloVe(name='840B', dim=300))
self.LABEL.build_vocab(self.train)
sort_key = lambda x: data.interleave_keys(len(x.q1), len(x.q2))
self.train_iter, self.dev_iter, self.test_iter = \
    data.BucketIterator.splits((self.train, self.dev, self.test),
                               batch_sizes=[args.batch_size] * 3,
                               device=args.gpu,
                               sort_key=sort_key)
This code works when reading data from files. To create the dataset directly from a list/dict instead, I tried the built-in constructors Example.fromdict and Example.fromlist, but then, in the final loop over the batches, it throws: AttributeError: 'BucketIterator' object has no attribute 'q1'.
I ended up writing my own class that inherits from data.Dataset, with a few modifications borrowed from the torchtext.data.TabularDataset class:
# Imports for the legacy torchtext API (torchtext.legacy.data in newer releases).
from torchtext import data
from torchtext.data import Example


class TabularDataset_From_List(data.Dataset):

    def __init__(self, input_list, format, fields, skip_header=False, **kwargs):
        # Pick the Example constructor matching the format of each row.
        make_example = {
            'json': Example.fromJSON, 'dict': Example.fromdict,
            'tsv': Example.fromTSV, 'csv': Example.fromCSV}[format.lower()]

        examples = [make_example(item, fields) for item in input_list]

        # For dict-like formats, fields is a dict mapping input keys to
        # (name, field) pairs; flatten it into the list the Dataset expects.
        if make_example in (Example.fromdict, Example.fromJSON):
            fields, field_dict = [], fields
            for field in field_dict.values():
                if isinstance(field, list):
                    fields.extend(field)
                else:
                    fields.append(field)

        super(TabularDataset_From_List, self).__init__(examples, fields, **kwargs)

    @classmethod
    def splits(cls, path=None, root='.data', train=None, validation=None,
               test=None, **kwargs):
        if path is None:
            path = cls.download(root)
        # Here train/validation/test are in-memory lists, not file names.
        train_data = None if train is None else cls(train, **kwargs)
        val_data = None if validation is None else cls(validation, **kwargs)
        test_data = None if test is None else cls(test, **kwargs)
        return tuple(d for d in (train_data, val_data, test_data)
                     if d is not None)
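For intuition, here is a minimal pure-Python sketch of the fromdict-style mapping that the class above relies on. This is a simplified stand-in, not the real torchtext code: the actual Example.fromdict also runs each field's preprocessing on the value.

```python
class SimpleExample:
    """Simplified stand-in for torchtext's Example.fromdict (illustration only)."""

    @classmethod
    def fromdict(cls, data, fields):
        ex = cls()
        # fields maps each input key to an (attribute_name, field) pair,
        # e.g. {'q1': ('q1', TEXT)}; the real version also preprocesses values.
        for key, (name, field) in fields.items():
            setattr(ex, name, data[key])
        return ex


row = {'label': '1', 'q1': 'hello world', 'q2': 'hi there', 'id': '7'}
fields = {k: (k, None) for k in row}  # None stands in for a Field object
ex = SimpleExample.fromdict(row, fields)
print(ex.q1)  # prints hello world
```

Each row dict becomes one example object whose attributes (ex.q1, ex.q2, ...) are what the BucketIterator later exposes on each batch.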