I have data like this:
import pandas as pd
data = [
['2019-01-01', 'a',0],
['2019-01-02', 'b',0],
['2019-01-03', 'c',0],
['2019-01-04', 'd',0],
['2019-01-05', 'a',0],
['2019-01-05', 'd',0],
['2019-01-06', 'd',0],
['2019-01-07', 'f',0],
['2019-01-08', 'c',0],
['2019-01-08', 'b',0],
['2019-01-08', 'g',0],
['2019-01-08', 'h',0],
['2019-01-09', 'q',0],
['2019-01-09', 'b',0],
['2019-01-09', 'y',0],
['2019-01-10', 'd',0],
['2019-01-11', 'z',0],
['2019-01-11', 'x',0],
['2019-01-11', 'c',0],
['2019-01-12', 'y',0],
['2019-01-13', 'x',0],
['2019-01-13', 'q',0],
['2019-01-14', 't',0],
['2019-01-15', 'i',0]]
df = pd.DataFrame(data, columns=['Date', 'Column1', 'Column2'])
Date Column1 Column2
0 2019-01-01 a 0
1 2019-01-02 b 0
2 2019-01-03 c 0
3 2019-01-04 d 0
4 2019-01-05 a 0
5 2019-01-05 d 0
6 2019-01-06 d 0
7 2019-01-07 f 0
8 2019-01-08 c 0
9 2019-01-08 b 0
10 2019-01-08 g 0
11 2019-01-08 h 0
12 2019-01-09 q 0
13 2019-01-09 b 0
14 2019-01-09 y 0
15 2019-01-10 d 0
16 2019-01-11 z 0
17 2019-01-11 x 0
18 2019-01-11 c 0
19 2019-01-12 y 0
20 2019-01-13 x 0
21 2019-01-13 q 0
22 2019-01-14 t 0
23 2019-01-15 i 0
My goal is to look at each element of Column1 and set Column2 to 1 if that element has already appeared earlier in Column1.
I wrote code like this:
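for i in range(len(df)):
    # compare row i against every earlier row (range(0, i-1) would skip row i-1)
    for j in range(i):
        if df.Column1[i] == df.Column1[j]:
            # .loc avoids chained assignment on df.Column2[i]
            df.loc[i, 'Column2'] = 1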
And I got the result I wanted.
Date Column1 Column2
0 2019-01-01 a 0
1 2019-01-02 b 0
2 2019-01-03 c 0
3 2019-01-04 d 0
4 2019-01-05 a 1
5 2019-01-05 d 1
6 2019-01-06 d 1
7 2019-01-07 f 0
8 2019-01-08 c 1
9 2019-01-08 b 1
10 2019-01-08 g 0
11 2019-01-08 h 0
12 2019-01-09 q 0
13 2019-01-09 b 1
14 2019-01-09 y 0
15 2019-01-10 d 1
16 2019-01-11 z 0
17 2019-01-11 x 0
18 2019-01-11 c 1
19 2019-01-12 y 1
20 2019-01-13 x 1
21 2019-01-13 q 1
22 2019-01-14 t 0
23 2019-01-15 i 0
But when I run this code on 100,000 rows of data, it is very slow.
Is there a faster way to do this, or a different approach to the problem?
Thanks for any answers.
You can group by Column1, take the cumcount, and then clip the upper limit to 1:
df['Column2'] = df.groupby("Column1").cumcount().clip(upper=1)
However, it is even more concise to use Series.duplicated here:
df['Column2'] = df['Column1'].duplicated().astype(int)
print(df)
Date Column1 Column2
0 2019-01-01 a 0
1 2019-01-02 b 0
2 2019-01-03 c 0
3 2019-01-04 d 0
4 2019-01-05 a 1
5 2019-01-05 d 1
6 2019-01-06 d 1
7 2019-01-07 f 0
8 2019-01-08 c 1
9 2019-01-08 b 1
10 2019-01-08 g 0
11 2019-01-08 h 0
12 2019-01-09 q 0
13 2019-01-09 b 1
14 2019-01-09 y 0
15 2019-01-10 d 1
16 2019-01-11 z 0
17 2019-01-11 x 0
18 2019-01-11 c 1
19 2019-01-12 y 1
20 2019-01-13 x 1
21 2019-01-13 q 1
22 2019-01-14 t 0
23 2019-01-15 i 0
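As a quick sanity check, the two one-liners above can be compared on a small frame. This is only a sketch: the six-row sample and the variable names via_cumcount and via_duplicated are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Small hypothetical sample, not the asker's full data
df = pd.DataFrame({"Column1": ["a", "b", "a", "c", "b", "a"]})

# Approach 1: position of each row within its group, capped at 1
via_cumcount = df.groupby("Column1").cumcount().clip(upper=1)

# Approach 2: duplicated() marks every occurrence after the first
# (keep="first" is the default)
via_duplicated = df["Column1"].duplicated().astype(int)

flags = via_duplicated.tolist()
print(flags)  # [0, 0, 1, 0, 1, 1]

# Both approaches should produce the same flags
assert via_cumcount.tolist() == flags
```

Both expressions give the same flags here; duplicated() reads more directly as "has this value been seen before".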
You can use .groupby on Column1 with the "cumcount" transform:
df["Column2"] = (
    df.groupby("Column1", sort=False)["Column1"]
    .transform("cumcount")
    .gt(0)
    .astype(int)
)
print(df)
Prints:
Date Column1 Column2
0 2019-01-01 a 0
1 2019-01-02 b 0
2 2019-01-03 c 0
3 2019-01-04 d 0
4 2019-01-05 a 1
5 2019-01-05 d 1
6 2019-01-06 d 1
7 2019-01-07 f 0
8 2019-01-08 c 1
9 2019-01-08 b 1
10 2019-01-08 g 0
11 2019-01-08 h 0
12 2019-01-09 q 0
13 2019-01-09 b 1
14 2019-01-09 y 0
15 2019-01-10 d 1
16 2019-01-11 z 0
17 2019-01-11 x 0
18 2019-01-11 c 1
19 2019-01-12 y 1
20 2019-01-13 x 1
21 2019-01-13 q 1
22 2019-01-14 t 0
23 2019-01-15 i 0
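For intuition about why these answers scale better than the nested loop in the question, here is a minimal pure-Python sketch of the same idea: one pass over the values, remembering what has been seen in a set. The helper name flag_repeats is hypothetical, and this is not a claim about how pandas implements duplicated or cumcount internally.

```python
def flag_repeats(values):
    """Return 1 for each value already seen earlier in the sequence, else 0."""
    seen = set()
    flags = []
    for value in values:
        # set membership is a constant-time check on average,
        # so the whole pass is linear instead of quadratic
        flags.append(1 if value in seen else 0)
        seen.add(value)
    return flags


# First eight Column1 values from the question
column1 = ["a", "b", "c", "d", "a", "d", "d", "f"]
result = flag_repeats(column1)
print(result)  # [0, 0, 0, 0, 1, 1, 1, 0]
```

The nested loop compares each row with every earlier row, which grows quadratically with the number of rows, while a single pass with a set does roughly one lookup per row.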