Filling torch tensor with zeros after certain index

Tags: python, nlp, pytorch

Given a 3D tensor, say batch x sentence length x embedding dim:

a = torch.rand((10, 1000, 96))

and an array (or tensor) of the actual length of each sentence:

lengths = torch.randint(1000, (10,))

which outputs tensor([ 370., 502., 652., 859., 545., 964., 566., 576., 1000., 803.])

How can I fill tensor 'a' with zeros after a certain index along dimension 1 (sentence length), according to the tensor 'lengths'?

I want something like this:

a[:, lengths:, :] = 0

One way of doing it (slow if the batch size is large):

for i_batch in range(10):
    a[i_batch, lengths[i_batch]:, :] = 0

asked Aug 18 '19 20:08

D V

1 Answer

You can do it using a binary mask.
Using lengths as column indices into mask, we mark where each sequence ends (note that mask is one column wider than a.size(1), so that full-length sequences can be marked too).
Using cumsum(), we then set every entry of mask from the sequence length onward to 1.

# lengths must be an integer (long) tensor to be usable as an index
mask = torch.zeros(a.shape[0], a.shape[1] + 1, dtype=a.dtype, device=a.device)
mask[(torch.arange(a.shape[0]), lengths)] = 1  # mark the end of each sequence
mask = mask.cumsum(dim=1)[:, :-1]  # remove the superfluous column
a = a * (1. - mask[..., None])     # zero every position from lengths[i] onward in row i
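
As a minimal sketch, the snippet above can be wrapped in a function and checked against the loop from the question. The function name zero_after_lengths, the small tensor sizes, and the cast of lengths to long are assumptions made for illustration only; they are not part of the original answer.

```python
import torch


def zero_after_lengths(a, lengths):
    """Return a copy of `a` with a[i, lengths[i]:, :] set to zero (cumsum mask)."""
    # Assumption: lengths may arrive as floats, so cast to an integer index tensor.
    lengths = lengths.to(device=a.device, dtype=torch.long)
    rows = torch.arange(a.shape[0], device=a.device)
    # One extra column so that lengths[i] == a.shape[1] is still a valid index.
    mask = torch.zeros(a.shape[0], a.shape[1] + 1, dtype=a.dtype, device=a.device)
    mask[rows, lengths] = 1
    mask = mask.cumsum(dim=1)[:, :-1]
    return a * (1. - mask[..., None])


torch.manual_seed(0)
a = torch.rand(4, 6, 3)
lengths = torch.tensor([0, 2, 6, 5])  # includes an empty and a full-length sequence
out = zero_after_lengths(a, lengths)

# Reference result using the per-row loop from the question.
expected = a.clone()
for i in range(a.shape[0]):
    expected[i, lengths[i]:, :] = 0

matches_loop = torch.equal(out, expected)
print(matches_loop)
```

The check covers the two edge cases the extra mask column exists for: a length of 0 (everything zeroed) and a length equal to the sequence dimension (nothing zeroed).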

For example, take a.shape = (10, 5, 96) and lengths = [1, 2, 1, 1, 3, 0, 4, 4, 1, 3].
After assigning 1 at the column given by lengths in each row, mask looks like:

mask = 
tensor([[0., 1., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0.],
        [1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 1., 0.],
        [0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0.]])

After cumsum (and dropping the extra last column) you get:

mask = 
tensor([[0., 1., 1., 1., 1.],
        [0., 0., 1., 1., 1.],
        [0., 1., 1., 1., 1.],
        [0., 1., 1., 1., 1.],
        [0., 0., 0., 1., 1.],
        [1., 1., 1., 1., 1.],
        [0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 1.],
        [0., 1., 1., 1., 1.],
        [0., 0., 0., 1., 1.]])
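
The two masks shown above can be reproduced with a few lines. This is only a sketch: the names marks and valid_counts are introduced here for illustration and do not appear in the answer.

```python
import torch

lengths = torch.tensor([1, 2, 1, 1, 3, 0, 4, 4, 1, 3])
batch, seq_len = lengths.numel(), 5

marks = torch.zeros(batch, seq_len + 1)
marks[torch.arange(batch), lengths] = 1      # a single 1 per row, at column lengths[i]
mask = marks.cumsum(dim=1)[:, :-1]           # ones from lengths[i] onward, extra column dropped

# The number of zeros in each row of the final mask is the number of kept positions.
valid_counts = (1 - mask).sum(dim=1).long()

print(marks)
print(mask)
print(valid_counts)
```

If the mask is built correctly, valid_counts should equal lengths.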

Note that mask has zeros exactly where the valid sequence entries are and ones from each sequence's length onward. Multiplying a by 1 - mask therefore gives exactly what you want.
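
For comparison, here is a sketch of a different way to build the same "keep" mask. It is not the cumsum method described above: it compares position indices against lengths using broadcasting. The names positions, keep and a_masked are assumptions for illustration.

```python
import torch

torch.manual_seed(0)
a = torch.rand(10, 5, 96)
lengths = torch.tensor([1, 2, 1, 1, 3, 0, 4, 4, 1, 3])

# positions has shape (seq_len,); broadcasting against (batch, 1) gives (batch, seq_len).
positions = torch.arange(a.shape[1], device=a.device)
keep = positions[None, :] < lengths[:, None].to(a.device)  # True where the entry is valid

# keep plays the role of (1 - mask); add a trailing axis to broadcast over the embedding dim.
a_masked = a * keep[..., None].to(a.dtype)

# Reference result using the per-row loop from the question.
expected = a.clone()
for i in range(a.shape[0]):
    expected[i, lengths[i]:, :] = 0

print(torch.equal(a_masked, expected))
```

This variant needs no extra column and no indexed assignment, at the cost of departing from the cumsum construction the answer explains.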

Enjoy ;)

answered Sep 23 '22 20:09

Shai
