How to fill in rows with repeating data in pandas?

Tags:

python

pandas

dataframe

In R, when adding new data of unequal length to a data frame, the values repeat to fill the data frame:

df <- data.frame(first=c(1,2,3,4,5,6))
df$second <- c(1,2,3)

yielding:

  first second
1     1      1
2     2      2
3     3      3
4     4      1
5     5      2
6     6      3

However, pandas requires equal index lengths.

How do I "fill in" repeating data in pandas like I can in R?

244

asked Jan 11 '14 22:01

Amyunimus

7 Answers

The cycle function from itertools is good for repeating a common pattern: it loops over the sequence endlessly, so you can draw exactly one value per row.

from itertools import cycle

seq = cycle([1, 2, 3])
df['Seq'] = [next(seq) for count in range(df.shape[0])]

164

answered Oct 02 '22 19:10

Meow

There seems to be no elegant way. This is the workaround I figured out: build a repeated list slightly longer than the original DataFrame, then left-join it on the index so the surplus rows are dropped.

import pandas
df = pandas.DataFrame(range(100), columns=['first'])
repeat_arr = [1, 2, 3]
# Integer division (//) keeps the repeat count an int on Python 3.
n_repeats = len(df) // len(repeat_arr) + 1
df = df.join(pandas.DataFrame(repeat_arr * n_repeats, columns=['second']))

answered Oct 02 '22 19:10

Yeqing Zhang

import pandas as pd
import numpy as np

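def put(df, column, values):
    # np.put needs a real ndarray, so fill a buffer and then assign it.
    arr = np.zeros(len(df), dtype=np.asarray(values).dtype)
    np.put(arr, np.arange(len(df)), values)
    df[column] = arr

df = pd.DataFrame({'first':range(1, 8)})
put(df, 'second', [1,2,3])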

yields

   first  second
0      1       1
1      2       2
2      3       3
3      4       1
4      5       2
5      6       3
6      7       1

Not particularly beautiful, but one "feature" it possesses is that you do not have to worry if the length of the DataFrame is a multiple of the length of the repeated values. np.put repeats the values as necessary.
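A related NumPy option, not part of the original answer, is np.resize, which fills a longer output with repeated copies of its input. A minimal sketch, assuming the same `first`/`second` column names as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'first': range(1, 8)})

# np.resize repeats the input as often as needed to reach the requested
# length, so len(df) does not have to be a multiple of the pattern length.
df['second'] = np.resize([1, 2, 3], len(df))
print(df)
```

This builds the column in one step, so no placeholder column has to be created first.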

My first answer was:

import itertools as IT
df['second'] = list(IT.islice(IT.cycle([1,2,3]), len(df)))

but it turns out this is significantly slower:

In [312]: df = pd.DataFrame({'first':range(10**6)})

In [313]: %timeit df['second'] = list(IT.islice(IT.cycle([1,2,3]), len(df)))
10 loops, best of 3: 143 ms per loop

In [316]: %timeit df['second'] = 0; np.put(df['second'], np.arange(len(df)), [1,2,3])
10 loops, best of 3: 27.9 ms per loop

answered Oct 02 '22 20:10

unutbu

How general of a solution are you looking for? I tried to make this a little less hard-coded:

import numpy as np
import pandas 

df = pandas.DataFrame(np.arange(1,7), columns=['first'])

base = [1, 2, 3]
# Integer division keeps the count an int; assumes len(df) is a multiple of len(base).
df['second'] = base * (df.shape[0] // len(base))
print(df.to_string())


   first  second
0      1       1
1      2       2
2      3       3
3      4       1
4      5       2
5      6       3
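The multiplication above only lines up when the frame length is an exact multiple of len(base). One way to cover the other case, sketched here with np.tile and a hypothetical seven-row frame, is to over-generate the pattern and trim it:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'first': np.arange(1, 8)})  # 7 rows, not a multiple of 3
base = [1, 2, 3]

# Ceiling division: the number of whole copies needed to cover every row.
reps = -(-len(df) // len(base))

# Tile the pattern, then trim the excess so the lengths match exactly.
df['second'] = np.tile(base, reps)[:len(df)]
print(df.to_string())
```

The last row receives 1 here, because the pattern restarts after the second full copy.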

answered Oct 02 '22 18:10

Paul H

In my case I needed to repeat the values without knowing the length of the sub-list, i.e. checking the length of every group. This was my solution:

import numpy as np
import pandas 

df = pandas.DataFrame(['a','a','a','b','b','b','b'], columns=['first'])

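# One list of 0..n-1 per group (not named `list`, which would shadow the builtin).
ranges = df.groupby('first')['first'].apply(lambda x: list(range(len(x)))).tolist()
# Flatten in group order; this assumes the rows are already sorted by group.
loop = [val for sublist in ranges for val in sublist]
df['second'] = loop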

df
  first  second
0     a       0
1     a       1
2     a       2
3     b       0
4     b       1
5     b       2
6     b       3
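pandas also has a built-in for this per-group counter: GroupBy.cumcount numbers the rows of each group from 0. A short sketch on the same data, which should also work when the groups are not contiguous:

```python
import pandas as pd

df = pd.DataFrame(['a', 'a', 'a', 'b', 'b', 'b', 'b'], columns=['first'])

# cumcount() numbers the rows within each group starting at 0 and returns
# a Series aligned on the original index, so it can be assigned directly.
df['second'] = df.groupby('first').cumcount()
print(df)
```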

answered Oct 02 '22 18:10

Daniele

Probably inefficient, but here's sort of a pure pandas solution.

import numpy as np
import pandas as pd

base = [1,2,3]
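df = pd.DataFrame(data=None, index=np.arange(10), columns=["filler"])
# Seed the first len(base) rows; .loc avoids chained assignment.
df.loc[df.index[:len(base)], "filler"] = base

df["tmp"] = np.arange(len(df)) % len(base)
# A stable sort keeps each seeded row first in its group before forward-filling.
df["filler"] = df.sort_values("tmp", kind="mergesort")["filler"].ffill()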
print(df)

answered Oct 02 '22 20:10

SBM

You might want to try using the power of modulo (%). You can take the value (or index) of first and use the length of second as the modulus to get the value (or index) you're looking for. Something like:

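import pandas
df = pandas.DataFrame([0,1,2,3,4,5], columns=['first'])
sec = [0,1,2]
# x % len(sec) is a position in sec; it equals the value only because sec is [0, 1, 2].
df['second'] = df['first'].apply(lambda x: x % len(sec) )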
print(df)
   first  second
0      0       0
1      1       1
2      2       2
3      3       0
4      4       1
5      5       2

I hope that helps.
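Note that the modulo yields positions into sec rather than looked-up values; the two coincide above only because sec is [0, 1, 2]. A sketch of the lookup step, using a hypothetical pattern ['x', 'y', 'z'] and row positions instead of the values of first:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'first': [10, 20, 30, 40, 50, 60]})
sec = np.array(['x', 'y', 'z'])  # hypothetical pattern to repeat

# Row position modulo the pattern length gives an index into sec ...
positions = np.arange(len(df)) % len(sec)

# ... and integer-array indexing turns those indices into repeated values.
df['second'] = sec[positions]
print(df)
```

Working from row positions rather than from the values of first means the result does not depend on what the first column happens to contain.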

answered Oct 02 '22 18:10

JDenman6
