 

Is there a faster way to separate duplicate and different data from a CSV using Python?

I have a dataframe containing millions of rows. Suppose this is the dataframe, named mydataframe:

filename | #insert-1 | #insert-2 | #delete-1 | #delete-2
---------------------------------------------------------
A        |         4 |         4 |         3 |         3
B        |         3 |         5 |         2 |         2
C        |         5 |         5 |         6 |         7
D        |         2 |         2 |         3 |         3
E        |         4 |         5 |         5 |         3
---------------------------------------------------------

I need to split the rows based on whether the insert or delete counts differ, and save each group to its own CSV file. In other words, if a file has a different number between #insert-1 and #insert-2, or between #delete-1 and #delete-2, it should go into different.csv; otherwise, it should go into same.csv.

The expected result for different.csv:

filename | #insert-1 | #insert-2 | #delete-1 | #delete-2
---------------------------------------------------------
B        |         3 |         5 |         2 |         2
C        |         5 |         5 |         6 |         7
E        |         4 |         5 |         5 |         3
---------------------------------------------------------

And for same.csv:

filename | #insert-1 | #insert-2 | #delete-1 | #delete-2
---------------------------------------------------------
A        |         4 |         4 |         3 |         3
D        |         2 |         2 |         3 |         3
---------------------------------------------------------

This is my code so far:

import csv

fields = list(mydataframe.columns)  # header row for the output files

df_different = []
df_same = []
# Compare the insert and delete counts row by row (slow for large data)
for row in range(len(mydataframe)):
    ins_1 = mydataframe.iloc[row, 1]
    ins_2 = mydataframe.iloc[row, 2]
    del_1 = mydataframe.iloc[row, 3]
    del_2 = mydataframe.iloc[row, 4]
    if (ins_1 != ins_2) or (del_1 != del_2):
        df_different.append(mydataframe.iloc[row])
    else:
        df_same.append(mydataframe.iloc[row])

with open('different.csv', 'w') as diffcsv:
    writer = csv.writer(diffcsv, delimiter=',')
    writer.writerow(fields)
    for item in df_different:
        writer.writerow(item)

with open('same.csv', 'w') as samecsv:
    writer = csv.writer(samecsv, delimiter=',')
    writer.writerow(fields)
    for item in df_same:
        writer.writerow(item)

The code works correctly, but when the dataset is very large (millions of rows) it takes a very long time to finish (more than 3 hours). Is there a way to make it faster? Thank you.

asked Dec 17 '22 by YusufUMS

1 Answer

Avoid iterating over rows; that's pretty slow. Instead, vectorize the comparison operation:

# Boolean mask: True where both insert counts and both delete counts match
same_mask = (mydataframe["#insert-1"] == mydataframe["#insert-2"]) & \
            (mydataframe["#delete-1"] == mydataframe["#delete-2"])
mydataframe.loc[same_mask].to_csv("same.csv", index=False)
mydataframe.loc[~same_mask].to_csv("different.csv", index=False)
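Here same_mask is a boolean Series computed in one pass over the columns, and ~same_mask is its inverse, so every row lands in exactly one of the two files.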

For a dataframe of 1M rows, this takes me only a few seconds.
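For reference, here is a minimal self-contained version of this approach, using the sample table from the question as data:

import pandas as pd

# Sample data taken from the question's table
mydataframe = pd.DataFrame({
    "filename":  ["A", "B", "C", "D", "E"],
    "#insert-1": [4, 3, 5, 2, 4],
    "#insert-2": [4, 5, 5, 2, 5],
    "#delete-1": [3, 2, 6, 3, 5],
    "#delete-2": [3, 2, 7, 3, 3],
})

# True where both the insert counts and the delete counts match
same_mask = (mydataframe["#insert-1"] == mydataframe["#insert-2"]) & \
            (mydataframe["#delete-1"] == mydataframe["#delete-2"])

mydataframe.loc[same_mask].to_csv("same.csv", index=False)        # A, D
mydataframe.loc[~same_mask].to_csv("different.csv", index=False)  # B, C, E

If the CSV is ever too large to load into memory at once, the same mask can be applied chunk by chunk. This is just a sketch, assuming the data is read from a hypothetical input.csv:

import pandas as pd

# Hypothetical file name and chunk size; adjust to your data
first = True
for chunk in pd.read_csv("input.csv", chunksize=1_000_000):
    mask = (chunk["#insert-1"] == chunk["#insert-2"]) & \
           (chunk["#delete-1"] == chunk["#delete-2"])
    # Overwrite on the first chunk, append (without headers) afterwards
    chunk.loc[mask].to_csv("same.csv", mode="w" if first else "a",
                           header=first, index=False)
    chunk.loc[~mask].to_csv("different.csv", mode="w" if first else "a",
                            header=first, index=False)
    first = False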

answered Dec 28 '22 by DSM