Why is my code with a nested for loop so slow? Is there a faster way?

Tags: performance, python, for-loop, pandas, dataframe

I have data like this:

import pandas as pd

data = [
        ['2019-01-01', 'a',0],
        ['2019-01-02', 'b',0],
        ['2019-01-03', 'c',0],
        ['2019-01-04', 'd',0],
        ['2019-01-05', 'a',0],
        ['2019-01-05', 'd',0],
        ['2019-01-06', 'd',0],
        ['2019-01-07', 'f',0],
        ['2019-01-08', 'c',0],
        ['2019-01-08', 'b',0],
        ['2019-01-08', 'g',0],
        ['2019-01-08', 'h',0],
        ['2019-01-09', 'q',0],
        ['2019-01-09', 'b',0],
        ['2019-01-09', 'y',0],
        ['2019-01-10', 'd',0],
        ['2019-01-11', 'z',0],
        ['2019-01-11', 'x',0],
        ['2019-01-11', 'c',0],
        ['2019-01-12', 'y',0],
        ['2019-01-13', 'x',0],
        ['2019-01-13', 'q',0],
        ['2019-01-14', 't',0],
        ['2019-01-15', 'i',0]]
  
df = pd.DataFrame(data, columns = ['Date', 'Column1','Column2'])

          Date Column1  Column2
0   2019-01-01       a        0
1   2019-01-02       b        0
2   2019-01-03       c        0
3   2019-01-04       d        0
4   2019-01-05       a        0
5   2019-01-05       d        0
6   2019-01-06       d        0
7   2019-01-07       f        0
8   2019-01-08       c        0
9   2019-01-08       b        0
10  2019-01-08       g        0
11  2019-01-08       h        0
12  2019-01-09       q        0
13  2019-01-09       b        0
14  2019-01-09       y        0
15  2019-01-10       d        0
16  2019-01-11       z        0
17  2019-01-11       x        0
18  2019-01-11       c        0
19  2019-01-12       y        0
20  2019-01-13       x        0
21  2019-01-13       q        0
22  2019-01-14       t        0
23  2019-01-15       i        0

My goal is to look at each element of Column1 and set Column2 to 1 if that element has already appeared earlier in Column1.

I wrote code like this:

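for i in range(len(df)):
    # Compare row i with every earlier row (0 .. i-1).
    for j in range(i):
        if df.Column1[i] == df.Column1[j]:
            # .loc avoids chained assignment on the DataFrame.
            df.loc[i, 'Column2'] = 1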

And I got the result I wanted.


          Date Column1  Column2
0   2019-01-01       a        0
1   2019-01-02       b        0
2   2019-01-03       c        0
3   2019-01-04       d        0
4   2019-01-05       a        1
5   2019-01-05       d        1
6   2019-01-06       d        1
7   2019-01-07       f        0
8   2019-01-08       c        1
9   2019-01-08       b        1
10  2019-01-08       g        0
11  2019-01-08       h        0
12  2019-01-09       q        0
13  2019-01-09       b        1
14  2019-01-09       y        0
15  2019-01-10       d        1
16  2019-01-11       z        0
17  2019-01-11       x        0
18  2019-01-11       c        1
19  2019-01-12       y        1
20  2019-01-13       x        1
21  2019-01-13       q        1
22  2019-01-14       t        0
23  2019-01-15       i        0

But when I run this code on 100,000 rows of data, it runs very slowly.

Is there a way to do this faster, or is there a different approach to this problem?

Thanks for any answers.

asked May 20 '21 16:05

SamuelBourgeois

2 Answers

You can group by Column1, take the cumcount, and then clip the upper limit to 1:

df['Column2'] = df.groupby("Column1").cumcount().clip(upper=1)
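
To see why clipping works, it may help to look at the intermediate counts. The following is a minimal sketch on a small made-up sample (the `sample` frame is an assumption for illustration, not the data from the question):

```python
import pandas as pd

# Hypothetical small sample, not the full data from the question.
sample = pd.DataFrame({"Column1": ["a", "b", "a", "a", "b", "c"]})

# cumcount numbers the rows inside each group: 0 for the first occurrence,
# 1 for the second, 2 for the third, and so on.
counts = sample.groupby("Column1").cumcount()
print(counts.tolist())  # [0, 0, 1, 2, 1, 0]

# Clipping at 1 turns every count above 0 into 1, leaving a 0/1 flag.
flags = counts.clip(upper=1)
print(flags.tolist())   # [0, 0, 1, 1, 1, 0]
```

Any row whose count is above 0 is a repeat, so the clipped value is exactly the flag the question asks for.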

However, it is even more concise to use Series.duplicated here, which by default marks every occurrence after the first:

df['Column2'] = df['Column1'].duplicated().astype(int)

print(df)

          Date Column1  Column2
0   2019-01-01       a        0
1   2019-01-02       b        0
2   2019-01-03       c        0
3   2019-01-04       d        0
4   2019-01-05       a        1
5   2019-01-05       d        1
6   2019-01-06       d        1
7   2019-01-07       f        0
8   2019-01-08       c        1
9   2019-01-08       b        1
10  2019-01-08       g        0
11  2019-01-08       h        0
12  2019-01-09       q        0
13  2019-01-09       b        1
14  2019-01-09       y        0
15  2019-01-10       d        1
16  2019-01-11       z        0
17  2019-01-11       x        0
18  2019-01-11       c        1
19  2019-01-12       y        1
20  2019-01-13       x        1
21  2019-01-13       q        1
22  2019-01-14       t        0
23  2019-01-15       i        0
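
As for why the original loop is slow: it compares each row with every earlier row, which is on the order of n²/2 comparisons, or roughly 5 billion for 100,000 rows. A single pass that remembers the values seen so far needs only one set lookup per row. The sketch below illustrates that idea in plain Python and checks it against duplicated; the helper name `flag_repeats` is hypothetical, and the sample is just the first eight Column1 values from the question.

```python
import pandas as pd

def flag_repeats(values):
    """Return 1 for each value already seen earlier in the sequence, else 0."""
    seen = set()
    flags = []
    for value in values:
        # Set membership is a constant-time check on average.
        flags.append(1 if value in seen else 0)
        seen.add(value)
    return flags

# First eight Column1 values from the question.
column1 = pd.Series(["a", "b", "c", "d", "a", "d", "d", "f"])

loop_flags = flag_repeats(column1)
vectorized_flags = column1.duplicated().astype(int).tolist()

print(loop_flags)                      # [0, 0, 0, 0, 1, 1, 1, 0]
print(loop_flags == vectorized_flags)  # True
```

The vectorized call does the same kind of single pass internally, without the Python-level loop overhead.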

answered Sep 22 '22 17:09

anky

You can use .groupby on Column1 with the "cumcount" transform, then flag every count greater than 0:

df["Column2"] = (
    df.groupby("Column1", sort=False)["Column1"]
    .transform("cumcount")
    .gt(0)
    .astype(int)
)
print(df)

Prints:

          Date Column1  Column2
0   2019-01-01       a        0
1   2019-01-02       b        0
2   2019-01-03       c        0
3   2019-01-04       d        0
4   2019-01-05       a        1
5   2019-01-05       d        1
6   2019-01-06       d        1
7   2019-01-07       f        0
8   2019-01-08       c        1
9   2019-01-08       b        1
10  2019-01-08       g        0
11  2019-01-08       h        0
12  2019-01-09       q        0
13  2019-01-09       b        1
14  2019-01-09       y        0
15  2019-01-10       d        1
16  2019-01-11       z        0
17  2019-01-11       x        0
18  2019-01-11       c        1
19  2019-01-12       y        1
20  2019-01-13       x        1
21  2019-01-13       q        1
22  2019-01-14       t        0
23  2019-01-15       i        0
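
For a rough feel of how the vectorized approaches behave at the size mentioned in the question, here is a sketch on 100,000 rows. The data is randomly generated (an assumption, since the real data is not shown), and the timings will vary by machine, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data: 100,000 random integer keys standing in for Column1.
big = pd.DataFrame({"Column1": rng.integers(0, 50_000, size=100_000)})

start = time.perf_counter()
via_duplicated = big["Column1"].duplicated().astype(int)
t_duplicated = time.perf_counter() - start

start = time.perf_counter()
via_cumcount = big.groupby("Column1", sort=False).cumcount().gt(0).astype(int)
t_cumcount = time.perf_counter() - start

# Both approaches should flag exactly the same rows.
same = bool((via_duplicated.to_numpy() == via_cumcount.to_numpy()).all())

print(f"duplicated: {t_duplicated:.4f}s, cumcount: {t_cumcount:.4f}s")
print("results match:", same)
```

Both calls run as a single vectorized pass rather than a row-by-row Python loop, which is where the speedup over the nested loop comes from.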

answered Sep 23 '22 17:09

Andrej Kesely
