I have made a dataframe using a function I created
data = generate_xml()
I then make a subset of the dataframe based on the column names which in this case are called WalmartIDS and ASINS. Below is also an example of what the dataframe looks like
walmartIDS = data.loc[:,['WalmartIDS','ASINS']]
>>
WalmartIDS ASINS
602 20511489 B077BS6737
603 10311487 B077BMHVG7
604 10311302 B077BRTYCS
605 152381151 B077YW9PTQ
606 The-Holiday-Aisle-Projection-Kaleidoscope-Spider-Airblown-Inflatable- Halloween-Decoration-THDA5581.html B076CNN6K5
607 51409868 B0756DMVSC
608 51410962 B0756FKLCV
609 51411020 B0756F3F6J
610 51411529 B0756FDM74
611 915505165 B076W25SDZ
612 400796633 B076VM75ZF
As you can see sometimes bad data will get into the WalmartIDS column. So I want to filter this out by deleting all of the rows in the newly created walmartIDS dataframe where the WalmartIDS column contains characters other than integers. I don't want to alter the data version of the data frame because it is the raw data.
walmartIDS[walmartIDS.WalmartIDS != '^[-+]?[0-9]+$']
However, the above solution doesn't seem to do anything, and I can in fact still see the bad data, (in the example this is row 606) where it should have been deleted.
What is the proper way to do this?
Test data:
data = StringIO("""
Walmart IDS ASINS
602 20511489 B077BS6737
603 10311487 B077BMHVG7
604 10311302 B077BRTYCS
605 152381151 B077YW9PTQ
606 The-Holiday-Aisle-Projection-Kaleidoscope-Spider-Airblown-Inflatable-Halloween-Decoration-THDA5581.html B076CNN6K5
607 51409868 B0756DMVSC
608 51410962 B0756FKLCV
609 51411020 B0756F3F6J
610 51411529 B0756FDM74
611 915505165 B076W25SDZ
612 400796633 B076VM75ZF
""")
Create df and make a copy:
df = pd.read_table(data, delim_whitespace=True)
df2 = df
Convert IDS to numeric and drop rows with na:
df2['IDS'] = pd.to_numeric(df2['IDS'], errors="coerce")
df2.dropna(how="any", inplace=True)
print(df2)
Walmart IDS ASINS
0 602 20511489.0 B077BS6737
1 603 10311487.0 B077BMHVG7
2 604 10311302.0 B077BRTYCS
3 605 152381151.0 B077YW9PTQ
5 607 51409868.0 B0756DMVSC
6 608 51410962.0 B0756FKLCV
7 609 51411020.0 B0756F3F6J
8 610 51411529.0 B0756FDM74
9 611 915505165.0 B076W25SDZ
10 612 400796633.0 B076VM75ZF
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With