Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas Integer Filtering

I have made a dataframe using a function I created

data = generate_xml()

I then make a subset of the dataframe based on the column names which in this case are called WalmartIDS and ASINS. Below is also an example of what the dataframe looks like

walmartIDS = data.loc[:,['WalmartIDS','ASINS']]

>>
    WalmartIDS  ASINS
602 20511489    B077BS6737
603 10311487    B077BMHVG7
604 10311302    B077BRTYCS
605 152381151   B077YW9PTQ
606 The-Holiday-Aisle-Projection-Kaleidoscope-Spider-Airblown-Inflatable-            Halloween-Decoration-THDA5581.html B076CNN6K5
607 51409868    B0756DMVSC
608 51410962    B0756FKLCV
609 51411020    B0756F3F6J
610 51411529    B0756FDM74
611 915505165   B076W25SDZ
612 400796633   B076VM75ZF

As you can see sometimes bad data will get into the WalmartIDS column. So I want to filter this out by deleting all of the rows in the newly created walmartIDS dataframe where the WalmartIDS column contains characters other than integers. I don't want to alter the data version of the data frame because it is the raw data.

walmartIDS[walmartIDS.WalmartIDS != '^[-+]?[0-9]+$']

However, the above solution doesn't seem to do anything, and I can in fact still see the bad data, (in the example this is row 606) where it should have been deleted.

What is the proper way to do this?

like image 620
Makenna Sophia Herl Avatar asked Jan 01 '23 23:01

Makenna Sophia Herl


1 Answers

Make a copy, convert to numeric, drop na:

Test data:

data = StringIO("""

Walmart  IDS         ASINS
602 20511489    B077BS6737
603 10311487    B077BMHVG7
604 10311302    B077BRTYCS
605 152381151   B077YW9PTQ
606 The-Holiday-Aisle-Projection-Kaleidoscope-Spider-Airblown-Inflatable-Halloween-Decoration-THDA5581.html   B076CNN6K5
607 51409868    B0756DMVSC
608 51410962    B0756FKLCV
609 51411020    B0756F3F6J
610 51411529    B0756FDM74
611 915505165   B076W25SDZ
612 400796633   B076VM75ZF

""")

Create df and make a copy:

df = pd.read_table(data, delim_whitespace=True)

df2 = df

Convert IDS to numeric and drop rows with na:

df2['IDS'] = pd.to_numeric(df2['IDS'], errors="coerce")

df2.dropna(how="any", inplace=True)

print(df2)

   Walmart          IDS       ASINS
0       602   20511489.0  B077BS6737
1       603   10311487.0  B077BMHVG7
2       604   10311302.0  B077BRTYCS
3       605  152381151.0  B077YW9PTQ
5       607   51409868.0  B0756DMVSC
6       608   51410962.0  B0756FKLCV
7       609   51411020.0  B0756F3F6J
8       610   51411529.0  B0756FDM74
9       611  915505165.0  B076W25SDZ
10      612  400796633.0  B076VM75ZF
like image 191
Dodge Avatar answered Jan 04 '23 12:01

Dodge