I'm afraid I do not quite understand the merging capabilities of pandas, although I much prefer python over R for now.
In R, I have always been able to merge dataframes very easily as follows:
> merge(test,e2s, all.x=T)
Gene Mutation Chromosome Entrez
1 AGRN p.R451H chr1 375790
2 C1orf170 p.V663A/V683A chr1 84808
3 HES4 p.R44S chr1 57801
4 ISG15 p.S83N chr1 9636
5 PLEKHN1 p.S476P/S511P/S563P/S76P chr1 84069
However, I have been unable to reconstruct this in pandas with merge(how="left,right,inner,outer").. For example:
Outer yields a union, which makes sense:
x = test.merge(e2s, how="outer")
In [133]: x.shape
Out[133]: (46271, 4)
But inner yields an empty dataframe, even though Entrez_Gene_Id
has been merged successfully:
In [143]: x = test.merge(e2s, how="inner")
In [144]: x
Out[144]:
Empty DataFrame
Columns: [Gene, Mutation, Chromosome, Entrez_Gene_Id]
Index: []
[0 rows x 4 columns]
The intersection should contain one row with the gene : HES4
. Is there some sort of string matching I need to turn on for this?:
e2s:
57794 SUGP1
57795 BRINP2
57796 DKFZP761C1711
57798 GATAD1
57799 RAB40C
57801 HES4
57804 POLD4
57805 CCAR2
57817 HAMP
test:
Gene Mutation Chromosome
0 PLEKHN1 p.S476P/S511P/S563P/S76P chr1
1 C1orf170 p.V663A/V683A chr1
2 HES4 p.R44S chr1
3 ISG15 p.S83N chr1
4 AGRN p.R451H chr1
5 RNF223 p.P242H chr1
Update:
As far as I know the columns are labelled so that they should merge fine, I only want to merge by the Gene
column and keep all test rows:
In [148]: e2s.columns
Out[148]: Index([u'Gene', u'Entrez_Gene_Id'], dtype='object')
In [149]: test.columns
Out[149]: Index([u'Gene', u'Mutation', u'Chromosome'], dtype='object')
This was done by explicitly renaming the dataframes:
e2s.rename(columns={"Gene":u'Gene',"Entrez_Gene_Id":u'Entrez_Gene_Id'}, inplace=True)
to dict:
{u'Chromosome': {0: u'chr1',
1: u'chr1',
2: u'chr1',
3: u'chr1',
4: u'chr1',
5: u'chr1'},
u'Gene': {0: u'PLEKHN1',
1: u'C1orf170',
2: u'HES4',
3: u'ISG15',
4: u'AGRN',
5: u'RNF223'},
u'Mutation': {0: u'p.S476P/S511P/S563P/S76P',
1: u'p.V663A/V683A',
2: u'p.R44S',
3: u'p.S83N',
4: u'p.R451H',
5: u'p.P242H'}}
{u'Entrez_Gene_Id': {14118: u'SUGP1',
14119: u'BRINP2',
14120: u'DKFZP761C1711',
14121: u'GATAD1',
14122: u'RAB40C',
14123: u'HES4',
14124: u'POLD4',
14125: u'CCAR2',
14126: u'HAMP'},
u'Gene': {14118: 57794,
14119: 57795,
14120: 57796,
14121: 57798,
14122: 57799,
14123: 57801,
14124: 57804,
14125: 57805,
14126: 57817}}
Perhaps you haven't labelled the columns (this is needed otherwise how can you know which columns to use to match against!)
It works fine if they are both are frames with labelled columns:
In [11]: e2s
Out[11]:
number Gene
0 57794 SUGP1
1 57795 BRINP2
2 57796 DKFZP761C1711
3 57798 GATAD1
4 57799 RAB40C
5 57801 HES4
6 57804 POLD4
7 57805 CCAR2
8 57817 HAMP
In [12]: test
Out[12]:
Gene Mutation Chromosome
0 PLEKHN1 p.S476P/S511P/S563P/S76P chr1
1 C1orf170 p.V663A/V683A chr1
2 HES4 p.R44S chr1
3 ISG15 p.S83N chr1
4 AGRN p.R451H chr1
5 RNF223 p.P242H chr1
In [13]: e2s.merge(test)
Out[13]:
number Gene Mutation Chromosome
0 57801 HES4 p.R44S chr1
In [14]: test.merge(e2s)
Out[14]:
Gene Mutation Chromosome number
0 HES4 p.R44S chr1 57801
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With