
Why is my code with a nested for loop so slow, and is there a faster way?

I have data like this:

import pandas as pd

data = [
        ['2019-01-01', 'a',0],
        ['2019-01-02', 'b',0],
        ['2019-01-03', 'c',0],
        ['2019-01-04', 'd',0],
        ['2019-01-05', 'a',0],
        ['2019-01-05', 'd',0],
        ['2019-01-06', 'd',0],
        ['2019-01-07', 'f',0],
        ['2019-01-08', 'c',0],
        ['2019-01-08', 'b',0],
        ['2019-01-08', 'g',0],
        ['2019-01-08', 'h',0],
        ['2019-01-09', 'q',0],
        ['2019-01-09', 'b',0],
        ['2019-01-09', 'y',0],
        ['2019-01-10', 'd',0],
        ['2019-01-11', 'z',0],
        ['2019-01-11', 'x',0],
        ['2019-01-11', 'c',0],
        ['2019-01-12', 'y',0],
        ['2019-01-13', 'x',0],
        ['2019-01-13', 'q',0],
        ['2019-01-14', 't',0],
        ['2019-01-15', 'i',0]]
  
df = pd.DataFrame(data, columns=['Date', 'Column1', 'Column2'])

          Date Column1  Column2
0   2019-01-01       a        0
1   2019-01-02       b        0
2   2019-01-03       c        0
3   2019-01-04       d        0
4   2019-01-05       a        0
5   2019-01-05       d        0
6   2019-01-06       d        0
7   2019-01-07       f        0
8   2019-01-08       c        0
9   2019-01-08       b        0
10  2019-01-08       g        0
11  2019-01-08       h        0
12  2019-01-09       q        0
13  2019-01-09       b        0
14  2019-01-09       y        0
15  2019-01-10       d        0
16  2019-01-11       z        0
17  2019-01-11       x        0
18  2019-01-11       c        0
19  2019-01-12       y        0
20  2019-01-13       x        0
21  2019-01-13       q        0
22  2019-01-14       t        0
23  2019-01-15       i        0

My goal is to go through each element of Column1 and set Column2 to 1 if that element has already appeared earlier in Column1.

I wrote this code:

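for i in range(len(df)):
    # Compare row i with every earlier row (0 .. i-1).
    # range(0, i-1) would skip the row directly before i.
    for j in range(i):
        if df.Column1[i] == df.Column1[j]:
            # .loc avoids chained assignment (df.Column2[i] = 1)
            df.loc[i, 'Column2'] = 1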

And I got the result I wanted.


          Date Column1  Column2
0   2019-01-01       a        0
1   2019-01-02       b        0
2   2019-01-03       c        0
3   2019-01-04       d        0
4   2019-01-05       a        1
5   2019-01-05       d        1
6   2019-01-06       d        1
7   2019-01-07       f        0
8   2019-01-08       c        1
9   2019-01-08       b        1
10  2019-01-08       g        0
11  2019-01-08       h        0
12  2019-01-09       q        0
13  2019-01-09       b        1
14  2019-01-09       y        0
15  2019-01-10       d        1
16  2019-01-11       z        0
17  2019-01-11       x        0
18  2019-01-11       c        1
19  2019-01-12       y        1
20  2019-01-13       x        1
21  2019-01-13       q        1
22  2019-01-14       t        0
23  2019-01-15       i        0

But when I run this code on 100,000 rows of data, it runs very slowly.

Is there a faster way to do this, or a different approach to this problem?

Thanks for any answers.

asked May 20 '21 16:05 by SamuelBourgeois



2 Answers

You can group by Column1, take the cumcount, and then clip the result to an upper limit of 1:

df['Column2'] = df.groupby("Column1").cumcount().clip(upper=1)
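To see why the clip is needed, it may help to look at the intermediate values. This is a minimal sketch on a made-up five-row frame (the name demo and its values are illustrative assumptions, not the data from the question). cumcount numbers each row within its group starting from 0, so a first occurrence gets 0 and later occurrences get 1, 2, and so on; clip(upper=1) then caps everything above 1.

```python
import pandas as pd

# Hypothetical toy data, chosen so that 'a' appears three times.
demo = pd.DataFrame({"Column1": ["a", "b", "a", "c", "a"]})

# Position of each row within its Column1 group, starting at 0.
counts = demo.groupby("Column1").cumcount()
print(counts.tolist())   # [0, 0, 1, 0, 2]

# Cap at 1: 0 stays 0 (first occurrence), anything larger becomes 1.
flags = counts.clip(upper=1)
print(flags.tolist())    # [0, 0, 1, 0, 1]
```

Without the clip, the third 'a' would be flagged with 2 rather than 1.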

However, it is even more concise to use Series.duplicated here:

df['Column2'] = df['Column1'].duplicated().astype(int)

print(df)

          Date Column1  Column2
0   2019-01-01       a        0
1   2019-01-02       b        0
2   2019-01-03       c        0
3   2019-01-04       d        0
4   2019-01-05       a        1
5   2019-01-05       d        1
6   2019-01-06       d        1
7   2019-01-07       f        0
8   2019-01-08       c        1
9   2019-01-08       b        1
10  2019-01-08       g        0
11  2019-01-08       h        0
12  2019-01-09       q        0
13  2019-01-09       b        1
14  2019-01-09       y        0
15  2019-01-10       d        1
16  2019-01-11       z        0
17  2019-01-11       x        0
18  2019-01-11       c        1
19  2019-01-12       y        1
20  2019-01-13       x        1
21  2019-01-13       q        1
22  2019-01-14       t        0
23  2019-01-15       i        0
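As a quick sanity check, the sketch below compares duplicated() against an explicit "look at every earlier row" loop on a small made-up frame (demo and its values are assumptions for illustration, and include two adjacent duplicates). It relies on the default keep="first", which leaves the first occurrence unmarked and marks every later one.

```python
import pandas as pd

# Hypothetical toy data; includes two adjacent duplicates ('d', 'd').
demo = pd.DataFrame({"Column1": ["a", "d", "d", "b", "a"]})

# Reference result: compare each row against every earlier row.
expected = []
for i in range(len(demo)):
    earlier = demo["Column1"].iloc[:i]
    expected.append(int((earlier == demo["Column1"].iloc[i]).any()))

# duplicated() defaults to keep="first": the first occurrence is False,
# every later occurrence is True.
fast = demo["Column1"].duplicated().astype(int).tolist()

print(expected)  # [0, 0, 1, 0, 1]
print(fast)      # [0, 0, 1, 0, 1]
```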
answered Sep 22 '22 17:09 by anky


You can use .groupby on Column1 with a "cumcount" transform, then flag every row whose count is greater than 0:

df["Column2"] = (
    df.groupby("Column1", sort=False)["Column1"]
    .transform("cumcount")
    .gt(0)
    .astype(int)
)
print(df)

Prints:

          Date Column1  Column2
0   2019-01-01       a        0
1   2019-01-02       b        0
2   2019-01-03       c        0
3   2019-01-04       d        0
4   2019-01-05       a        1
5   2019-01-05       d        1
6   2019-01-06       d        1
7   2019-01-07       f        0
8   2019-01-08       c        1
9   2019-01-08       b        1
10  2019-01-08       g        0
11  2019-01-08       h        0
12  2019-01-09       q        0
13  2019-01-09       b        1
14  2019-01-09       y        0
15  2019-01-10       d        1
16  2019-01-11       z        0
17  2019-01-11       x        0
18  2019-01-11       c        1
19  2019-01-12       y        1
20  2019-01-13       x        1
21  2019-01-13       q        1
22  2019-01-14       t        0
23  2019-01-15       i        0
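As a rough illustration of why these approaches scale better than the nested loop: comparing every row with every earlier row takes about n²/2 comparisons, which is roughly 5 × 10⁹ for 100,000 rows, whereas remembering the values seen so far needs only one lookup per row. The plain-Python sketch below shows that idea; flag_repeats is a hypothetical helper name used only for this example, and it is not a description of how pandas implements these methods.

```python
def flag_repeats(values):
    """Return 1 for each value already seen earlier in the list, else 0."""
    seen = set()   # values encountered so far
    flags = []
    for value in values:
        # Set membership is a constant-time check on average,
        # so the whole pass is linear in len(values).
        flags.append(1 if value in seen else 0)
        seen.add(value)
    return flags


print(flag_repeats(["a", "b", "a", "c", "b"]))  # [0, 0, 1, 0, 1]
```

With the frame from the question, this could be applied as df["Column2"] = flag_repeats(df["Column1"]), though the vectorized pandas calls above are usually the more idiomatic choice.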
answered Sep 23 '22 17:09 by Andrej Kesely