In R, when adding new data of unequal length to a data frame, the values repeat to fill the data frame:
df <- data.frame(first=c(1,2,3,4,5,6))
df$second <- c(1,2,3)
yielding:
first second
1 1 1
2 2 2
3 3 3
4 4 1
5 5 2
6 6 3
However, pandas requires equal index lengths.
How do I "fill in" repeating data in pandas like I can in R?
The pandas.DataFrame.duplicated() method is used to find duplicate rows in a DataFrame. It returns a boolean Series that marks each row as a duplicate or not.
Pandas Series: repeat() function. The repeat() function repeats the elements of a Series. It returns a new Series in which each element of the current Series is repeated consecutively a given number of times. The repeats argument gives the number of repetitions for each element and should be a non-negative integer.
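To make the distinction concrete, here is a minimal sketch of Series.repeat() on made-up data. Note that it repeats each element consecutively, which is not the same as the cyclic recycling R performs in the question.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# repeat() duplicates each element in place, so equal values end up
# next to each other instead of the pattern 1, 2, 3 starting over.
repeated = s.repeat(2).reset_index(drop=True)
print(repeated.tolist())  # [1, 1, 2, 2, 3, 3]
```

So repeat() on its own gives 1, 1, 2, 2, 3, 3 rather than 1, 2, 3, 1, 2, 3.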
In R, the easiest way to repeat rows is with the rep() function. This function selects one or more observations from a data frame and creates one or more copies of them. Alternatively, you can use the slice() function from the dplyr package to repeat rows.
Fill Data in an Empty Pandas DataFrame by Appending Rows. First, create an empty DataFrame with column names and then append rows one by one. The DataFrame.append() method could also append rows, but it was deprecated in pandas 1.4 and removed in pandas 2.0; pandas.concat() is the replacement. When creating an empty DataFrame with column names and row indices, we can fill data in rows using the .loc indexer.
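As a small illustration of the .loc approach, the sketch below starts from an empty frame and adds rows by assigning to labels that do not exist yet. The column names and row values are made up for the example.

```python
import pandas as pd

# Start with an empty frame that only has column names (illustrative names)
df = pd.DataFrame(columns=['first', 'second'])

rows = [[1, 1], [2, 2], [3, 3]]
for label, row in enumerate(rows):
    # Assigning to a label that is not in the index yet adds a new row
    df.loc[label] = row

print(len(df))
```

Growing a frame row by row is slow for large data, so this is best kept for small cases.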
The cycle function from itertools is good for repeating a common pattern.
from itertools import cycle
seq = cycle([1, 2, 3])
df['Seq'] = [next(seq) for count in range(df.shape[0])]
There seems to be no elegant way. This is the workaround I just figured out: create a repeating list slightly longer than the original DataFrame, then left-join it to the frame.
import pandas
df = pandas.DataFrame(range(100), columns=['first'])
repeat_arr = [1, 2, 3]
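# Use integer division (//): in Python 3, / returns a float, and a list cannot be multiplied by a float
df = df.join(pandas.DataFrame(repeat_arr * (len(df) // len(repeat_arr) + 1),
                              columns=['second']))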
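import pandas as pd
import numpy as np
def put(df, column, values):
    # Fill a plain NumPy array first: in recent pandas versions np.put rejects a Series
    arr = np.zeros(len(df), dtype=np.asarray(values).dtype)
    # np.put repeats `values` when it is shorter than the index array
    np.put(arr, np.arange(len(df)), values)
    df[column] = arr
df = pd.DataFrame({'first':range(1, 8)})
put(df, 'second', [1,2,3])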
yields
first second
0 1 1
1 2 2
2 3 3
3 4 1
4 5 2
5 6 3
6 7 1
Not particularly beautiful, but one "feature" it possesses is that you do not have to worry if the length of the DataFrame is a multiple of the length of the repeated values. np.put repeats the values as necessary.
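As an alternative that skips the preallocation step, np.resize can build the repeated array directly. This is only a sketch on the same 7-row example, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'first': range(1, 8)})  # 7 rows, not a multiple of 3

# np.resize fills the requested length with repeated copies of the input,
# cutting the last repetition short when the length is not a multiple.
df['second'] = np.resize([1, 2, 3], len(df))
print(df['second'].tolist())  # [1, 2, 3, 1, 2, 3, 1]
```

Like np.put, this does not require the frame length to be a multiple of the pattern length.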
My first answer was:
import itertools as IT
df['second'] = list(IT.islice(IT.cycle([1,2,3]), len(df)))
but it turns out this is significantly slower:
In [312]: df = pd.DataFrame({'first':range(10**6)})
In [313]: %timeit df['second'] = list(IT.islice(IT.cycle([1,2,3]), len(df)))
10 loops, best of 3: 143 ms per loop
In [316]: %timeit df['second'] = 0; np.put(df['second'], np.arange(N), [1,2,3])
10 loops, best of 3: 27.9 ms per loop
How general of a solution are you looking for? I tried to make this a little less hard-coded:
import numpy as np
import pandas
df = pandas.DataFrame(np.arange(1,7), columns=['first'])
base = [1, 2, 3]
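# Integer division (//) is required in Python 3; this assumes len(df) is an exact multiple of len(base)
df['second'] = base * (df.shape[0] // len(base))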
print(df.to_string())
first second
0 1 1
1 2 2
2 3 3
3 4 1
4 5 2
5 6 3
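The list multiplication above only lines up when the frame length is an exact multiple of the pattern length. One way to relax that, sketched here on a 7-row frame chosen for illustration, is to round the repeat count up and slice off the excess.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 8), columns=['first'])  # 7 rows
base = [1, 2, 3]

# Ceiling division: enough whole copies of base to cover every row
reps = -(-len(df) // len(base))

# Build the over-long list, then trim it to the frame length
df['second'] = (base * reps)[:len(df)]
print(df['second'].tolist())  # [1, 2, 3, 1, 2, 3, 1]
```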
In my case I needed to repeat the values without knowing the length of the sub-list, i.e. checking the length of every group. This was my solution:
import numpy as np
import pandas
df = pandas.DataFrame(['a','a','a','b','b','b','b'], columns=['first'])
# Avoid naming the variable `list`, which would shadow the built-in
groups = df.groupby('first').apply(lambda x: range(len(x))).tolist()
loop = [val for sublist in groups for val in sublist]
df['second']=loop
df
first second
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 b 2
6 b 3
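For this per-group counter, pandas also offers GroupBy.cumcount(), which numbers the rows within each group starting from 0. A minimal sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'first': ['a', 'a', 'a', 'b', 'b', 'b', 'b']})

# cumcount() numbers the rows of each group 0, 1, 2, ... in their
# original order and returns a Series aligned with df's index.
df['second'] = df.groupby('first').cumcount()
print(df['second'].tolist())  # [0, 1, 2, 0, 1, 2, 3]
```

Because the result is aligned on the index, it should also work when the groups are not stored contiguously.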
Probably inefficient, but here's sort of a pure pandas solution.
import numpy as np
import pandas as pd
base = [1,2,3]
df = pd.DataFrame(data = None,index = np.arange(10),columns = ["filler"])
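df.loc[:len(base) - 1, "filler"] = base  # .loc avoids chained assignment; label slices are inclusive
df["tmp"] = np.arange(len(df)) % len(base)
# A stable sort keeps the seeded rows first within each tmp group, so ffill can propagate them
df["filler"] = df.sort_values("tmp", kind="mergesort")["filler"].ffill()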
print(df)
You might want to try using the power of modulo (%). You can take the value (or index) of first and use the length of second as the modulus to get the value (or index) you're looking for. Something like:
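import pandas
df = pandas.DataFrame([0,1,2,3,4,5], columns=['first'])
sec = [0,1,2]
# Look the value up in sec so the result stays correct when sec is not simply 0..n-1
df['second'] = df['first'].apply(lambda x: sec[x % len(sec)])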
print(df)
first second
0 0 0
1 1 1
2 2 2
3 3 0
4 4 1
5 5 2
I hope that helps.
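The same modulo idea can be applied to row positions instead of the values in first, which helps when first is not a simple 0..n-1 counter. The sketch below uses made-up values for both columns and NumPy integer-array indexing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'first': [10, 20, 30, 40, 50, 60]})
sec = np.array([7, 8, 9])

# Row position modulo the pattern length gives an index into sec
positions = np.arange(len(df)) % len(sec)

# Indexing with an integer array picks sec[0], sec[1], sec[2], sec[0], ...
df['second'] = sec[positions]
print(df['second'].tolist())  # [7, 8, 9, 7, 8, 9]
```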