A torchtext.data.TabularDataset can be created from a TSV/JSON/CSV file and then used to build a vocabulary from GloVe, FastText, or any other embeddings. My requirement, however, is to create a torchtext.data.TabularDataset directly, either from a list or from a dict.
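For concreteness, the kind of in-memory input I want to load looks like this (the field names mirror the TSV columns; the values themselves are made up for illustration):

```python
# Hypothetical in-memory rows mirroring the TSV columns (label, q1, q2, id).
rows_as_dicts = [
    {'label': '1', 'q1': 'How do I learn Python?',
     'q2': 'What is the best way to learn Python?', 'id': '101'},
    {'label': '0', 'q1': 'Is the sky blue?',
     'q2': 'How do planes fly?', 'id': '102'},
]

# The same data as plain lists, one inner list per row:
rows_as_lists = [[r['label'], r['q1'], r['q2'], r['id']] for r in rows_as_dicts]
print(rows_as_lists[0][0])  # prints 1
```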
Here is my current working implementation, which reads TSV files:
self.RAW = data.RawField()
self.TEXT = data.Field(batch_first=True)
self.LABEL = data.Field(sequential=False, unk_token=None)
self.train, self.dev, self.test = data.TabularDataset.splits(
    path='.data/quora',
    train='train.tsv',
    validation='dev.tsv',
    test='test.tsv',
    format='tsv',
    fields=[('label', self.LABEL),
            ('q1', self.TEXT),
            ('q2', self.TEXT),
            ('id', self.RAW)])
self.TEXT.build_vocab(self.train, self.dev, self.test, vectors=GloVe(name='840B', dim=300))
self.LABEL.build_vocab(self.train)
sort_key = lambda x: data.interleave_keys(len(x.q1), len(x.q2))
self.train_iter, self.dev_iter, self.test_iter = \
    data.BucketIterator.splits((self.train, self.dev, self.test),
                               batch_sizes=[args.batch_size] * 3,
                               device=args.gpu,
                               sort_key=sort_key)
This code works when reading data from files. To create the dataset directly from a list/dict instead, I tried the built-in constructors Example.fromdict and Example.fromlist, but then, in the final loop over the batches, it throws: AttributeError: 'BucketIterator' object has no attribute 'q1'.
I ended up writing my own class that inherits from data.Dataset, with a few modifications borrowed from the torchtext.data.TabularDataset class:
# Imports for the legacy torchtext API (torchtext.legacy.data in newer releases).
from torchtext import data
from torchtext.data import Example


class TabularDataset_From_List(data.Dataset):

    def __init__(self, input_list, format, fields, skip_header=False, **kwargs):
        # Pick the Example constructor matching the format of each row.
        make_example = {
            'json': Example.fromJSON, 'dict': Example.fromdict,
            'tsv': Example.fromTSV, 'csv': Example.fromCSV}[format.lower()]

        examples = [make_example(item, fields) for item in input_list]

        # For dict-like formats, fields is a dict mapping input keys to
        # (name, field) pairs; flatten it into the list the Dataset expects.
        if make_example in (Example.fromdict, Example.fromJSON):
            fields, field_dict = [], fields
            for field in field_dict.values():
                if isinstance(field, list):
                    fields.extend(field)
                else:
                    fields.append(field)

        super(TabularDataset_From_List, self).__init__(examples, fields, **kwargs)

    @classmethod
    def splits(cls, path=None, root='.data', train=None, validation=None,
               test=None, **kwargs):
        if path is None:
            path = cls.download(root)
        # Here train/validation/test are in-memory lists, not file names.
        train_data = None if train is None else cls(train, **kwargs)
        val_data = None if validation is None else cls(validation, **kwargs)
        test_data = None if test is None else cls(test, **kwargs)
        return tuple(d for d in (train_data, val_data, test_data)
                     if d is not None)
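For intuition, here is a minimal pure-Python sketch of the fromdict-style mapping that the class above relies on. This is a simplified stand-in, not the real torchtext code: the actual Example.fromdict also runs each field's preprocessing on the value.

```python
class SimpleExample:
    """Simplified stand-in for torchtext's Example.fromdict (illustration only)."""

    @classmethod
    def fromdict(cls, data, fields):
        ex = cls()
        # fields maps each input key to an (attribute_name, field) pair,
        # e.g. {'q1': ('q1', TEXT)}; the real version also preprocesses values.
        for key, (name, field) in fields.items():
            setattr(ex, name, data[key])
        return ex


row = {'label': '1', 'q1': 'hello world', 'q2': 'hi there', 'id': '7'}
fields = {k: (k, None) for k in row}  # None stands in for a Field object
ex = SimpleExample.fromdict(row, fields)
print(ex.q1)  # prints hello world
```

Each row dict becomes one example object whose attributes (ex.q1, ex.q2, ...) are what the BucketIterator later exposes on each batch.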