I have a dataframe containing millions of rows. Suppose it is named mydataframe:
filename | #insert-1 | #insert-2 | #delete-1 | #delete-2
---------------------------------------------------------
A | 4 | 4 | 3 | 3
B | 3 | 5 | 2 | 2
C | 5 | 5 | 6 | 7
D | 2 | 2 | 3 | 3
E | 4 | 5 | 5 | 3
---------------------------------------------------------
I need to separate the files based on whether the insert or delete counts differ, and save those rows into a new CSV file named different.csv. The rest of the data, where the insert and delete counts are the same, should be saved into a separate CSV file called same.csv. In other words, if a file has a different number between #insert-1 and #insert-2, or between #delete-1 and #delete-2, save it in different.csv; otherwise, save it in same.csv.
The expected result:
different.csv:
filename | #insert-1 | #insert-2 | #delete-1 | #delete-2
---------------------------------------------------------
B | 3 | 5 | 2 | 2
C | 5 | 5 | 6 | 7
E | 4 | 5 | 5 | 3
---------------------------------------------------------
same.csv:
filename | #insert-1 | #insert-2 | #delete-1 | #delete-2
---------------------------------------------------------
A | 4 | 4 | 3 | 3
D | 2 | 2 | 3 | 3
---------------------------------------------------------
This is my code so far:
import csv

fields = mydataframe.columns
df_different = []
df_same = []
for row in range(len(mydataframe)):
    ins_1 = mydataframe.iloc[row, 1]
    ins_2 = mydataframe.iloc[row, 2]
    del_1 = mydataframe.iloc[row, 3]
    del_2 = mydataframe.iloc[row, 4]
    if (ins_1 != ins_2) or (del_1 != del_2):
        df_different.append(mydataframe.iloc[row])
    else:
        df_same.append(mydataframe.iloc[row])

with open('different.csv', 'w') as diffcsv:
    writer = csv.writer(diffcsv, delimiter=',')
    writer.writerow(fields)
    for item in df_different:
        writer.writerow(item)

with open('same.csv', 'w') as samecsv:
    writer = csv.writer(samecsv, delimiter=',')
    writer.writerow(fields)
    for item in df_same:
        writer.writerow(item)
The code works, but when the dataset is very large (I have millions of rows) it takes a very long time (more than 3 hours) to run. My question is whether there is a way to make it faster. Thank you.
Avoid iterating over rows; that's pretty slow. Instead, vectorize the comparison operation:
same_mask = (df["#insert-1"] == df["#insert-2"]) & (df["#delete-1"] == df["#delete-2"])
df.loc[same_mask].to_csv("same.csv", index=False)
df.loc[~same_mask].to_csv("different.csv", index=False)
For a dataframe of 1M rows, this takes me only a few seconds.
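As a quick sanity check, here is the mask applied to a small dataframe built from the sample table in the question (the data below is just that sample, reconstructed for illustration):

```python
import pandas as pd

# Sample data from the question's table.
df = pd.DataFrame({
    "filename": ["A", "B", "C", "D", "E"],
    "#insert-1": [4, 3, 5, 2, 4],
    "#insert-2": [4, 5, 5, 2, 5],
    "#delete-1": [3, 2, 6, 3, 5],
    "#delete-2": [3, 2, 7, 3, 3],
})

# Rows where both pairs of counts match go to same.csv; the rest to different.csv.
same_mask = (df["#insert-1"] == df["#insert-2"]) & (df["#delete-1"] == df["#delete-2"])
print(df.loc[same_mask, "filename"].tolist())   # ['A', 'D']
print(df.loc[~same_mask, "filename"].tolist())  # ['B', 'C', 'E']
```

This matches the expected split in the question: A and D end up in same.csv, while B, C, and E end up in different.csv.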