How to fill in rows with repeating data in pandas?

In R, when adding new data of unequal length to a data frame, the values repeat to fill the data frame:

df <- data.frame(first=c(1,2,3,4,5,6))
df$second <- c(1,2,3)

yielding:

  first second
1     1      1
2     2      2
3     3      3
4     4      1
5     5      2
6     6      3

However, pandas requires equal index lengths.
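For reference, a minimal sketch of the failure on the pandas side, assuming a reasonably recent pandas release. The frame mirrors the R example above; the `recycled` flag is only there for illustration, and the exact error text may differ between versions.

```python
import pandas as pd

df = pd.DataFrame({"first": [1, 2, 3, 4, 5, 6]})

try:
    # Unlike R, pandas does not recycle a shorter sequence to fit the frame.
    df["second"] = [1, 2, 3]
    recycled = True
except ValueError as exc:
    recycled = False
    print(f"ValueError: {exc}")
```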

How do I "fill in" repeating data in pandas like I can in R?

Asked by Amyunimus on Jan 11 '14 at 22:01



7 Answers

The cycle function from itertools is a good fit for repeating a fixed pattern.

from itertools import cycle

seq = cycle([1, 2, 3])
df['Seq'] = [next(seq) for count in range(df.shape[0])]
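The snippet above assumes df already exists. A self-contained sketch of the same idea follows; the helper name `recycle` and the sample frame are hypothetical and not part of the answer.

```python
from itertools import cycle

import pandas as pd


def recycle(values, n):
    """Return the first n items of `values` repeated end to end (hypothetical helper)."""
    seq = cycle(values)
    return [next(seq) for _ in range(n)]


# Sample frame with 7 rows, deliberately not a multiple of the pattern length.
df = pd.DataFrame({"first": range(1, 8)})
df["Seq"] = recycle([1, 2, 3], len(df))
print(df)
```

Because the list is cut off at exactly len(df) items, the frame length does not need to be a multiple of the pattern length.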
Answered by Meow on Oct 02 '22 at 19:10


There seems to be no elegant way. This is the workaround I just figured out: create a repeating list slightly longer than the original DataFrame, then left-join it, so that the extra rows are dropped.

import pandas
df = pandas.DataFrame(range(100), columns=['first'])
repeat_arr = [1, 2, 3]
# Integer division (//) keeps the repeat count an int on Python 3.
df = df.join(pandas.DataFrame(repeat_arr * (len(df) // len(repeat_arr) + 1),
    columns=['second']))
Answered by Yeqing Zhang on Oct 02 '22 at 19:10


import pandas as pd
import numpy as np

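def put(df, column, values):
    # np.put needs a real ndarray, so fill one and then assign it as the column.
    arr = np.zeros(len(df), dtype=np.asarray(values).dtype)
    np.put(arr, np.arange(len(df)), values)
    df[column] = arr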

df = pd.DataFrame({'first':range(1, 8)})    
put(df, 'second', [1,2,3])

yields

   first  second
0      1       1
1      2       2
2      3       3
3      4       1
4      5       2
5      6       3
6      7       1

Not particularly beautiful, but one "feature" it possesses is that you do not have to worry if the length of the DataFrame is a multiple of the length of the repeated values. np.put repeats the values as necessary.
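A related NumPy call, np.resize, recycles values in the same way and builds the array in one step. This is a swapped-in alternative rather than part of the answer above; the sample frame is the same 7-row one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"first": range(1, 8)})

# np.resize fills the requested length with repeated copies of the input,
# so len(df) does not have to be a multiple of the pattern length.
df["second"] = np.resize([1, 2, 3], len(df))
print(df)
```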


My first answer was:

import itertools as IT
df['second'] = list(IT.islice(IT.cycle([1,2,3]), len(df)))

but it turns out this is significantly slower:

In [312]: df = pd.DataFrame({'first':range(10**6)})

In [313]: %timeit df['second'] = list(IT.islice(IT.cycle([1,2,3]), len(df)))
10 loops, best of 3: 143 ms per loop

In [316]: %timeit df['second'] = 0; np.put(df['second'], np.arange(len(df)), [1,2,3])
10 loops, best of 3: 27.9 ms per loop
Answered by unutbu on Oct 02 '22 at 20:10


How general of a solution are you looking for? I tried to make this a little less hard-coded:

import numpy as np
import pandas 

df = pandas.DataFrame(np.arange(1,7), columns=['first'])

base = [1, 2, 3]
# Integer division (//) is needed on Python 3; this assumes len(df) is a multiple of len(base).
df['second'] = base * (df.shape[0] // len(base))
print(df.to_string())


   first  second
0      1       1
1      2       2
2      3       3
3      4       1
4      5       2
5      6       3
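List multiplication on its own only fits when the frame length is an exact multiple of the pattern length. One possible way to lift that restriction is to round the repeat count up and slice off the excess; this is a sketch, and the names `n` and `repeats` are illustrative.

```python
import pandas as pd

# 7 rows, deliberately not a multiple of len(base).
df = pd.DataFrame({"first": range(1, 8)})

base = [1, 2, 3]
n = len(df)
repeats = -(-n // len(base))  # ceiling division without importing math

# Build a list at least n long, then trim it to exactly n items.
df["second"] = (base * repeats)[:n]
print(df.to_string())
```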
Answered by Paul H on Oct 02 '22 at 18:10


In my case I needed to repeat the values without knowing the length of the sub-list, i.e. checking the length of every group. This was my solution:

import numpy as np
import pandas 

df = pandas.DataFrame(['a','a','a','b','b','b','b'], columns=['first'])

# Avoid naming this "list", which would shadow the built-in.
groups = df.groupby('first').apply(lambda x: range(len(x))).tolist()
loop = [val for sublist in groups for val in sublist]
df['second'] = loop

df
  first  second
0     a       0
1     a       1
2     a       2
3     b       0
4     b       1
5     b       2
6     b       3
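For this per-group counter, pandas also provides GroupBy.cumcount, which numbers the rows within each group starting from 0. A minimal sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({"first": ["a", "a", "a", "b", "b", "b", "b"]})

# cumcount numbers rows within each group: 0, 1, 2, ... per distinct key.
df["second"] = df.groupby("first").cumcount()
print(df)
```

The result is aligned on the original index, so it should also work when the rows of a group are not contiguous.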
Answered by Daniele on Oct 02 '22 at 18:10


Probably inefficient, but here's sort of a pure pandas solution.

import numpy as np
import pandas as pd

base = [1,2,3]
df = pd.DataFrame(data = None,index = np.arange(10),columns = ["filler"])
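# Use .loc rather than chained indexing so the assignment reaches df itself.
df.loc[df.index[:len(base)], "filler"] = base

df["tmp"] = np.arange(len(df)) % len(base)
# A stable sort keeps the seeded row first within each "tmp" group before forward-filling.
df["filler"] = df.sort_values("tmp", kind="stable")["filler"].ffill()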
print(df)
Answered by SBM on Oct 02 '22 at 20:10


You might want to try the modulo operator (%). Take the value (or index) of first modulo the length of second to get the position of the value you're looking for. Something like:

import pandas

df = pandas.DataFrame([0,1,2,3,4,5], columns=['first'])
sec = [0,1,2]
# Look the value up in sec so this also works when sec is not simply 0, 1, 2.
df['second'] = df['first'].apply(lambda x: sec[x % len(sec)])
print(df)
   first  second
0      0       0
1      1       1
2      2       2
3      3       0
4      4       1
5      5       2
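The modulo example above works from the values in first. When those values are not consecutive integers starting at 0, a position-based variant may be safer. A minimal sketch, in which the sample data and the name `positions` are illustrative assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"first": [10, 20, 30, 40, 50]})
sec = np.array(["x", "y", "z"])

# Row positions 0..n-1 wrapped around the length of sec.
positions = np.arange(len(df)) % len(sec)

# NumPy fancy indexing picks sec[0], sec[1], sec[2], sec[0], ...
df["second"] = sec[positions]
print(df)
```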

I hope that helps.

Answered by JDenman6 on Oct 02 '22 at 18:10