Dataframe Merge in Pandas

Tags: python, pandas

For some reason, I cannot get this merge to work correctly.

This Dataframe (rspars) has 2,000+ rows...

    rsparid  f1mult  f2mult  f3mult
 0        1   0.318   0.636   0.810
 1        2   0.348   0.703   0.893
 2        3   0.384   0.777   0.000
 3        4   0.296   0.590   0.911
 4        5   0.231   0.458   0.690
 5        6   0.275   0.546   0.839
 6        7   0.248   0.486   0.731
 7        8   0.430   0.873   0.000
 8        9   0.221   0.438   0.655
 9       11   0.204   0.399   0.593

When I try to join the above to this DataFrame (datalines) on the rsparid column...

            line_track  line_race rsparid
 line_date                               
 2013-03-23         TP         10    1400
 2013-02-23         GP          7     634
 2013-01-01         GP          7    1508
 2012-11-11        AQU          5      96
 2012-10-11        BEL          2     161

Using this...

 df = pd.merge(datalines, rspars, how='left', on='rsparid')

I get blanks..

   line_track  line_race rsparid  f1mult  f2mult  f3mult
 0         TP         10    1400     NaN     NaN     NaN
 1         TP         10    1400     NaN     NaN     NaN
 2         TP         10    1400     NaN     NaN     NaN
 3         GP          7     634     NaN     NaN     NaN
 4         GP         10     634     NaN     NaN     NaN

Note, the "datalines" DataFrame can have thousands more rows than rspars, hence the left join. I must be doing something wrong?

I also tried it this way...

 df = datalines.merge(rspars, how='left', on='rsparid')

EXAMPLE #2

I dropped the data down to a few rows...

rspars:

    rsparid  f1mult  f2mult  f3mult
 0     1400   0.216   0.435   0.656

datalines:

   rsparid
 0    1400
 1     634
 2    1508
 3      96
 4     161
 5    1011
 6    1007
 7     518
 8    1955
 9     678

Merging...

 datalines.merge(rspars, how='left', on='rsparid')

Output...

   rsparid  f1mult  f2mult  f3mult
 0    1400     NaN     NaN     NaN
 1     634     NaN     NaN     NaN
 2    1508     NaN     NaN     NaN
 3      96     NaN     NaN     NaN
 4     161     NaN     NaN     NaN
 5    1011     NaN     NaN     NaN
 6    1007     NaN     NaN     NaN
 7     518     NaN     NaN     NaN
 8    1955     NaN     NaN     NaN
 9     678     NaN     NaN     NaN

TravisVOX asked Oct 04 '22 03:10


1 Answer

The NaNs mean the two frames have no values of rsparid in common. This can be tricky to spot when you merge on columns whose values look the same in the repr but are stored with different dtypes.

The repr of small DataFrames with strings (of integers) or integers looks the same and no dtype information is printed when frames are small. You can get this information (and more) for small frames by calling the DataFrame.info() method, like so: df.info(). This will give you a nice summary of what's in the DataFrame and what the dtypes of its columns are:

In [205]: datalines_int = pd.DataFrame({'rsparid': [1400, 634, 1508, 96, 161, 1011, 1007, 518, 1955, 678]})

In [206]: datalines_str = pd.DataFrame({'rsparid': [str(i) for i in [1400, 634, 1508, 96, 161, 1011, 1007, 518, 1955, 678]]})

In [207]: datalines_int
Out[207]:
   rsparid
0     1400
1      634
2     1508
3       96
4      161
5     1011
6     1007
7      518
8     1955
9      678

In [208]: datalines_str
Out[208]:
  rsparid
0    1400
1     634
2    1508
3      96
4     161
5    1011
6    1007
7     518
8    1955
9     678

In [209]: datalines_int.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
rsparid    10  non-null values
dtypes: int64(1)

In [210]: datalines_str.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
rsparid    10  non-null values
dtypes: object(1)
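
If you only care about the key column, a shorter check is to look at its dtype directly instead of reading the whole info() summary. This is a minimal sketch, assuming a recent pandas imported as pd; the two frames are rebuilt so the snippet runs on its own, and the variable names int_is_integer and str_is_integer are just illustrative.

```python
import pandas as pd

ids = [1400, 634, 1508, 96, 161, 1011, 1007, 518, 1955, 678]

# Same values, stored once as integers and once as strings
datalines_int = pd.DataFrame({'rsparid': ids})
datalines_str = pd.DataFrame({'rsparid': [str(i) for i in ids]})

# The dtype of the key column is what merge actually compares on
print(datalines_int['rsparid'].dtype)
print(datalines_str['rsparid'].dtype)

# A boolean check that does not depend on how the dtype is printed
int_is_integer = pd.api.types.is_integer_dtype(datalines_int['rsparid'])
str_is_integer = pd.api.types.is_integer_dtype(datalines_str['rsparid'])
print(int_is_integer, str_is_integer)
```

Running the same check on the key column of both frames you intend to merge should show whether they disagree.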

NOTE: There is a slight difference between the two reprs here, most likely because of how numeric columns are padded. The point is that nobody working interactively would spot it unless they were specifically looking for it.
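
Once the dtypes are known to differ, one possible fix is to cast the key column on one side so both match, then merge. The sketch below is only an illustration: the question does not show the dtypes, so it is an assumption that datalines holds rsparid as strings while rspars holds integers (it could be the other way round in the real data). It reuses the small frames from EXAMPLE #2 and uses the indicator option of merge to count how many rows found a match.

```python
import pandas as pd

rspars = pd.DataFrame({
    'rsparid': [1400],
    'f1mult': [0.216],
    'f2mult': [0.435],
    'f3mult': [0.656],
})

# Assumption: the ids in datalines were read in as strings
datalines = pd.DataFrame({'rsparid': ['1400', '634', '1508', '96', '161']})

# Cast the key so both frames store rsparid as int64
datalines['rsparid'] = datalines['rsparid'].astype('int64')

# indicator=True adds a _merge column: 'both' for matches, 'left_only' otherwise
merged = datalines.merge(rspars, how='left', on='rsparid', indicator=True)
print(merged)

matched = int((merged['_merge'] == 'both').sum())
print(matched)
```

If the string side contains values that are not clean integers, astype will raise, which is itself a useful signal that the key column needs cleaning before the merge.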

Phillip Cloud answered Oct 12 '22 11:10