For some reason, I cannot get this merge to work correctly.
This Dataframe (rspars) has 2,000+ rows...
rsparid f1mult f2mult f3mult
0 1 0.318 0.636 0.810
1 2 0.348 0.703 0.893
2 3 0.384 0.777 0.000
3 4 0.296 0.590 0.911
4 5 0.231 0.458 0.690
5 6 0.275 0.546 0.839
6 7 0.248 0.486 0.731
7 8 0.430 0.873 0.000
8 9 0.221 0.438 0.655
9 11 0.204 0.399 0.593
When trying to join the above to a table based on the rsparid
columns to this Dataframe...
line_track line_race rsparid
line_date
2013-03-23 TP 10 1400
2013-02-23 GP 7 634
2013-01-01 GP 7 1508
2012-11-11 AQU 5 96
2012-10-11 BEL 2 161
Using this...
df = pd.merge(datalines, rspars, how='left', on='rsparid')
I get blanks..
line_track line_race rsparid f1mult f2mult f3mult
0 TP 10 1400 NaN NaN NaN
1 TP 10 1400 NaN NaN NaN
2 TP 10 1400 NaN NaN NaN
3 GP 7 634 NaN NaN NaN
4 GP 10 634 NaN NaN NaN
Note, the "datalines" column can have thousands more rows than the rspars, thus the left join. I must be doing something wrong?
I also tried it this way...
df = datalines.merge(rspars, how='left', on='rsparid')
EXAMPLE #2
I dropped the data down to a few rows...
rspars:
rsparid f1mult f2mult f3mult
0 1400 0.216 0.435 0.656
datalines:
rsparid
0 1400
1 634
2 1508
3 96
4 161
5 1011
6 1007
7 518
8 1955
9 678
Merging...
datalines.merge(rspars, how='left', on='rsparid')
Output...
rsparid f1mult f2mult f3mult
0 1400 NaN NaN NaN
1 634 NaN NaN NaN
2 1508 NaN NaN NaN
3 96 NaN NaN NaN
4 161 NaN NaN NaN
5 1011 NaN NaN NaN
6 1007 NaN NaN NaN
7 518 NaN NaN NaN
8 1955 NaN NaN NaN
9 678 NaN NaN NaN
The NaN
s mean they have no values in rsparid
in common. This can be tricky when merging things that may look the same when they repr
The repr of small DataFrames
with strings (of integers) or integers looks the same and no dtype
information is printed when frames are small. You can get this information (and more) for small frames by calling the DataFrame.info()
method, like so: df.info()
. This will give you a nice summary of what's in the DataFrame
and what the dtype
s of its columns are:
In [205]: datalines_int = DataFrame({'rsparid':[1400,634,1508,96,161,1011,1007,518,1955,678]})
In [206]: datalines_str = DataFrame({'rsparid':map(str,[1400,634,1508,96,161,1011,1007,518,1955,678])})
In [207]: datalines_int
Out[207]:
rsparid
0 1400
1 634
2 1508
3 96
4 161
5 1011
6 1007
7 518
8 1955
9 678
In [208]: datalines_str
Out[208]:
rsparid
0 1400
1 634
2 1508
3 96
4 161
5 1011
6 1007
7 518
8 1955
9 678
In [209]: datalines_int.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
rsparid 10 non-null values
dtypes: int64(1)
In [210]: datalines_str.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
rsparid 10 non-null values
dtypes: object(1)
NOTE: You'll notice a slight difference in the repr
s here, most likely because of padding of numeric DataFrame
s. Point is, no one would really be able to see that using this interactively, unless they were specifically looking for the difference.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With