I've hit a wall with a data analysis project I'm working on.
Essentially, if I have example CSV 'A':
id | item_num
A123 | 1
A123 | 2
B456 | 1
And I have example CSV 'B':
id | description
A123 | Mary had a...
A123 | ...little lamb.
B456 | ...Its fleece...
If I perform a merge using Pandas, it ends up like this:
id | item_num | description
A123 | 1 | Mary had a...
A123 | 2 | Mary had a...
A123 | 1 | ...little lamb.
A123 | 2 | ...little lamb.
B456 | 1 | Its fleece...
How could I instead make it become:
id | item_num | description
A123 | 1 | Mary had a...
A123 | 2 | ...little lamb...
B456 | 1 | Its fleece...
This is my code:
import pandas as pd
# Import CSVs
first = pd.read_csv("../PATH_TO_CSV/A.csv")
print("Imported first CSV: " + str(first.shape))
second = pd.read_csv("../PATH_TO_CSV/B.csv")
print("Imported second CSV: " + str(second.shape))
# Merge the two DataFrames on their shared column(s).
# (DataFrame.append was removed in pandas 2.0, so assign the merge result directly.)
result = pd.merge(first, second)
print("Merged CSVs... resulting DataFrame is: " + str(result.shape))
# Let's do a "dedupe" to deal with an issue in how Pandas handles datetime merges.
# I read about an issue where, if datetime is involved, duplicate entries will be created.
result = result.drop_duplicates()
print("Deduping... resulting DataFrame is: " + str(result.shape))
# Save to another CSV
result.to_csv("EXPORT.csv", index=False)
print("Saved to file.")
I would really appreciate any help - I'm very stuck! And I'm dealing with 20,000+ rows.
Thanks.
Edit: my post was marked as a potential duplicate. It's not, as I'm not necessarily trying to add a column - I'm just trying to prevent the description from being multiplied by the number of item_num values attributed to a particular id.
UPDATE, 6/21:
How could I do the merge if the two DFs looked like this instead? Example CSV 'A':
id | item_num | other_col
A123 | 1 | lorem ipsum
A123 | 2 | dolor sit
A123 | 3 | amet, consectetur
B456 | 1 | lorem ipsum
And I have example CSV 'B':
id | item_num | description
A123 | 1 | Mary had a...
A123 | 2 | ...little lamb.
B456 | 1 | ...Its fleece...
So I end up with:
id | item_num | other_col | description
A123 | 1 | lorem ipsum | Mary had a...
A123 | 2 | dolor sit | ...little lamb.
B456 | 1 | lorem ipsum | ...Its fleece...
Meaning the row with item_num 3 (the one with "amet, consectetur" in other_col) is dropped, because CSV 'B' has no matching row.
I'd do it this way:
In [135]: result = A.merge(B.assign(item_num=B.groupby('id').cumcount()+1))
In [136]: result
Out[136]:
id item_num description
0 A123 1 Mary had a...
1 A123 2 ...little lamb.
2 B456 1 ...Its fleece...
Explanation: we can create a "virtual" item_num column in the B DF for joining:
In [137]: B.assign(item_num=B.groupby('id').cumcount()+1)
Out[137]:
id description item_num
0 A123 Mary had a... 1
1 A123 ...little lamb. 2
2 B456 ...Its fleece... 1
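For anyone who wants to try this outside an IPython session, here is a minimal self-contained sketch of the same idea. The toy frames are built by hand from the example data in the question rather than read from CSV, and the names B_numbered, A2, B2 and result2 are made up for illustration. One assumption to be aware of: cumcount numbers rows in the order they appear in B, so this only lines up correctly if B's rows for each id are already in item_num order. The second half is a guess at the 6/21 update: since B already has an item_num column there, a plain inner merge on both keys should be enough.

```python
import pandas as pd

# Toy frames mirroring example CSVs 'A' and 'B' from the question.
A = pd.DataFrame({"id": ["A123", "A123", "B456"], "item_num": [1, 2, 1]})
B = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# Number B's rows within each id (1, 2, ...), in their existing order,
# so every description can be matched to exactly one item_num.
B_numbered = B.assign(item_num=B.groupby("id").cumcount() + 1)

# Join on both keys explicitly; this avoids the per-id cross product.
result = A.merge(B_numbered, on=["id", "item_num"], how="inner")
print(result)

# Update case (assumption: B already carries item_num, so no cumcount is needed).
A2 = pd.DataFrame({
    "id": ["A123", "A123", "A123", "B456"],
    "item_num": [1, 2, 3, 1],
    "other_col": ["lorem ipsum", "dolor sit", "amet, consectetur", "lorem ipsum"],
})
B2 = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "item_num": [1, 2, 1],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# An inner join keeps only (id, item_num) pairs present in both frames,
# so the A123 / 3 row is left out.
result2 = A2.merge(B2, on=["id", "item_num"], how="inner")
print(result2)
```

If rows from A without a description should be kept instead of dropped, how="left" would keep them with NaN in the description column.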