Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to convert a list of strings into a tensor in pytorch?

I am working on classification problem in which I have a list of strings as class labels and I want to convert them into a tensor. So far I have tried converting the list of strings into a numpy array using the np.array function provided by the numpy module.

truth = torch.from_numpy(np.array(truths))

but I am getting the following error.

RuntimeError: can't convert a given np.ndarray to a tensor - it has an invalid type. The only supported types are: double, float, int64, int32, and uint8.

Can anybody suggest an alternative approach? Thanks

like image 406
deepayan das Avatar asked Jun 18 '17 17:06

deepayan das


People also ask

How do you convert a list into a PyTorch tensor?

To convert a Python list to a tensor, we are going to use the tf. convert_to_tensor() function and this function will help the user to convert the given object into a tensor. In this example, the object can be a Python list and by using the function will return a tensor.

What does torch As_tensor do?

as_tensor. Converts data into a tensor, sharing data and preserving autograd history if possible.

How do you convert to tensors?

a NumPy array is created by using the np. array() method. The NumPy array is converted to tensor by using tf. convert_to_tensor() method.


2 Answers

Unfortunately, you can't right now. And I don't think it is a good idea since it will make PyTorch clumsy. A popular workaround could convert it into numeric types using sklearn.

Here is a short example:

from sklearn import preprocessing
import torch

labels = ['cat', 'dog', 'mouse', 'elephant', 'pandas']
le = preprocessing.LabelEncoder()
targets = le.fit_transform(labels)
# targets: array([0, 1, 2, 3])

targets = torch.as_tensor(targets)
# targets: tensor([0, 1, 2, 3])

Since you may need the conversion between true labels and transformed labels, it is good to store the variable le.

like image 173
Tengerye Avatar answered Sep 21 '22 20:09

Tengerye


The trick is first to find out max length of a word in the list, and then at the second loop populate the tensor with zeros padding. Note that utf8 strings take two bytes per char.

In[]
import torch

words = ['שלום', 'beautiful', 'world']
max_l = 0
ts_list = []
for w in words:
    ts_list.append(torch.ByteTensor(list(bytes(w, 'utf8'))))
    max_l = max(ts_list[-1].size()[0], max_l)

w_t = torch.zeros((len(ts_list), max_l), dtype=torch.uint8)
for i, ts in enumerate(ts_list):
    w_t[i, 0:ts.size()[0]] = ts
w_t

Out[]
tensor([[215, 169, 215, 156, 215, 149, 215, 157,   0],
        [ 98, 101,  97, 117, 116, 105, 102, 117, 108],
        [119, 111, 114, 108, 100,   0,   0,   0,   0]], dtype=torch.uint8)
like image 34
Serge Tochilov Avatar answered Sep 21 '22 20:09

Serge Tochilov