Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: dataframes won't merge

I have two data frames (that can be found here and here) below:

df= pd.read_csv('Thesis/ExternalData/naics_conversion_data/SIC2CRPCats.csv', \
                engine='python', sep=r'\s{2,}', encoding='utf-8_sig')

I have only provided the code for reading in df because it has some unique formatting issues.

df.dtypes

SICcode     object
Catcode     object
Category    object
SICname     object
MultSIC     object
dtype: object

merged.dtypes

2012 NAICS Code     float64
2002to2007 NAICS    float64
SICcode              object
dtype: object

df.columns.tolist()
['SICcode', 'Catcode', 'Category', 'SICname', 'MultSIC']

merged.columns.tolist()
['2012 NAICS Code', '2002to2007 NAICS', 'SICcode']

df.head(3)

    SICcode     Catcode     Category                          SICname   MultSIC
0   111         A1500   Wheat, corn, soybeans and cash grain    Wheat   X
1   112         A1600   Other commodities (incl rice, peanuts)  Rice    X
2   115         A1500   Wheat, corn, soybeans and cash grain    Corn    X

merged.sort_values('SICcode')

    2012 NAICS Code     2002to2007 NAICS    SICcode
89  212210                       212210     1011
93  212234                       212234     1021
92  212231                       212231     1031
90  212221                       212221     1041
91  212222                       212222     1044
96  212299                       212299     1061
94  212234                       212234     1061
119 213114                       213114     1081
1770    541360                   541360     1081
233     238910                   238910     1081
95  212291                       212291     1094
97  212299                       212299     1099
3   111140                       111140     111
6   111160                       111160     112
4   111150                       111150     115
0   111110                       111110     116

I am trying to merge them together with this code: merged=pd.merge(merged,df, how='right', on='SICcode')

Which results in this:

2012 NAICS Code        0
2002to2007 NAICS       0
SICcode             1007
Catcode              991
Category            1007
SICname             1007
MultSIC              906
dtype: int64

I suspect that the issues lies with the formatting of df but I do not know how to describe (I have heard the term white space, maybe that is relevant for this scenario) or remedy the issue. Does anyone have an idea on this?

like image 426
Collective Action Avatar asked Feb 24 '26 10:02

Collective Action


1 Answers

I believe this is the cause of your problem:

In [47]: merged[merged.SICcode == 'Aux']
Out[47]:
      2012 NAICS Code  2002to2007 NAICS SICcode
1828         551114.0          551114.0     Aux

resulting in different data types:

In [61]: df.dtypes
Out[61]:
SICcode      int64
Catcode     object
Category    object
SICname     object
MultSIC     object
dtype: object

In [62]: merged.dtypes
Out[62]:
2012 NAICS Code     float64
2002to2007 NAICS    float64
SICcode              object
dtype: object

In [63]: df.SICcode.unique()
Out[63]: array([ 111,  112,  115, ..., 9711, 9721, 9999], dtype=int64)

In [64]: merged.SICcode.head(10).unique()
Out[64]: array(['116', '119', '111', '115', '112', '139'], dtype=object)

So you can do it this way:

url = 'https://raw.githubusercontent.com/108michael/ms_thesis/master/SIC2CRPCats.csv'
df = pd.read_csv(url, engine='python', sep=r'\s{2,}', encoding='utf-8_sig')

url='https://raw.githubusercontent.com/108michael/ms_thesis/master/test.merge'
merged = pd.read_csv(url, index_col=0)

# clearing data
merged.SICcode = pd.to_numeric(merged.SICcode, errors='coerce')

mrg = df.merge(merged, on='SICcode', how='left')

mrg.head()

Output:

In [51]: mrg.head()
Out[51]:
   SICcode Catcode                                       Category  \
0      111   A1500           Wheat, corn, soybeans and cash grain
1      112   A1600  Other commodities (incl rice, peanuts, honey)
2      115   A1500           Wheat, corn, soybeans and cash grain
3      116   A1500           Wheat, corn, soybeans and cash grain
4      119   A1500           Wheat, corn, soybeans and cash grain

            SICname MultSIC  2012 NAICS Code  2002to2007 NAICS
0             Wheat       X         111140.0          111140.0
1              Rice       X         111160.0          111160.0
2              Corn       X         111150.0          111150.0
3          Soybeans       X         111110.0          111110.0
4  Cash grains, NEC       X         111120.0          111120.0
like image 179
MaxU - stop WAR against UA Avatar answered Feb 25 '26 22:02

MaxU - stop WAR against UA



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!