
assign in pandas pipeline

Tags: python, pandas

Say I have the following DataFrame of raw input data and want to process it with a chain of pandas method calls (a "pipeline"). In particular, I want to drop and rename columns and add a new column derived from an existing one.

    Gene stable ID  Gene name   Gene type   miRBase accession   miRBase ID
0   ENSG00000274494 MIR6832     miRNA       MI0022677           hsa-mir-6832
1   ENSG00000283386 MIR4659B    miRNA       MI0017291           hsa-mir-4659b
2   ENSG00000221456 MIR1202     miRNA       MI0006334           hsa-mir-1202
3   ENSG00000199102 MIR302C     miRNA       MI0000773           hsa-mir-302c

At the moment I do the following (which works):

tmp_df = df.\
         drop("Gene type", axis=1).\
         rename(columns = {
            "Gene stable ID": "ENSG",
            "Gene name": "gene_name",
            "miRBase accession": "MI",
            "miRBase ID": "mirna_name"
         })

result = tmp_df.assign(species = tmp_df.mirna_name.str[:3])

result:

    ENSG            gene_name   MI          mirna_name      species
0   ENSG00000274494 MIR6832     MI0022677   hsa-mir-6832    hsa
1   ENSG00000283386 MIR4659B    MI0017291   hsa-mir-4659b   hsa
2   ENSG00000221456 MIR1202     MI0006334   hsa-mir-1202    hsa
3   ENSG00000199102 MIR302C     MI0000773   hsa-mir-302c    hsa

Is it possible to put the assign call directly into the 'pipeline'? Having to create an additional temporary variable feels cumbersome, and I don't know how to reference the renamed column ('mirna_name') from inside the chain.

Gregor Sturm asked Jun 19 '17 12:06


3 Answers

You can use pipe:

tmp_df = df.\
         drop("Gene type", axis=1).\
         rename(columns = {
            "Gene stable ID": "ENSG",
            "Gene name": "gene_name",
            "miRBase accession": "MI",
            "miRBase ID": "mirna_name"
         }).\
         pipe(lambda x: x.assign(species = x.mirna_name.str[:3]))

tmp_df
Out[365]: 
              ENSG gene_name         MI     mirna_name species
0  ENSG00000274494   MIR6832  MI0022677   hsa-mir-6832     hsa
1  ENSG00000283386  MIR4659B  MI0017291  hsa-mir-4659b     hsa
2  ENSG00000221456   MIR1202  MI0006334   hsa-mir-1202     hsa
3  ENSG00000199102   MIR302C  MI0000773   hsa-mir-302c     hsa
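pipe also accepts a named function, which can be handy when the same step is reused in several chains. The following is only a sketch: add_species and its source and width parameters are hypothetical names made up for this illustration, and the two-row frame is a cut-down version of the question's data.

```python
import pandas as pd


def add_species(frame, source="mirna_name", width=3):
    """Hypothetical helper: copy the first `width` characters of `source` into a new 'species' column."""
    return frame.assign(species=frame[source].str[:width])


# Cut-down sample based on the question's data
df = pd.DataFrame({"miRBase ID": ["hsa-mir-6832", "hsa-mir-302c"]})

# pipe passes the intermediate frame as the first argument
result = (
    df.rename(columns={"miRBase ID": "mirna_name"})
      .pipe(add_species)
)

# Extra keyword arguments given to pipe are forwarded to the function
wider = (
    df.rename(columns={"miRBase ID": "mirna_name"})
      .pipe(add_species, width=7)
)
```

Wrapping the chain in parentheses is an alternative to the trailing backslashes used above; both forms are valid Python.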

As @Tom pointed out, this can also be done without using pipe in this case:

df.\
         drop("Gene type", axis=1).\
         rename(columns = {
            "Gene stable ID": "ENSG",
            "Gene name": "gene_name",
            "miRBase accession": "MI",
            "miRBase ID": "mirna_name"
         }).\
         assign(species = lambda x: x.mirna_name.str[:3])
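For anyone who wants to try this without the original file, here is a self-contained sketch that rebuilds the four sample rows from the question and runs the same chain. It assumes a pandas version recent enough to accept drop(columns=...); the parenthesized layout is just a stylistic alternative to the backslashes.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "Gene stable ID": ["ENSG00000274494", "ENSG00000283386",
                       "ENSG00000221456", "ENSG00000199102"],
    "Gene name": ["MIR6832", "MIR4659B", "MIR1202", "MIR302C"],
    "Gene type": ["miRNA"] * 4,
    "miRBase accession": ["MI0022677", "MI0017291", "MI0006334", "MI0000773"],
    "miRBase ID": ["hsa-mir-6832", "hsa-mir-4659b", "hsa-mir-1202", "hsa-mir-302c"],
})

result = (
    df.drop(columns="Gene type")
      .rename(columns={
          "Gene stable ID": "ENSG",
          "Gene name": "gene_name",
          "miRBase accession": "MI",
          "miRBase ID": "mirna_name",
      })
      # The lambda receives the frame as it exists at this point in the chain,
      # so the renamed column is already available.
      .assign(species=lambda x: x["mirna_name"].str[:3])
)
```

The callable passed to assign is evaluated against the intermediate DataFrame, which is why no temporary variable is needed.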

Allen answered Nov 01 '22 05:11


result = df.drop("Gene type", axis=1).\
     rename(columns = {
        "Gene stable ID": "ENSG",
        "Gene name": "gene_name",
        "miRBase accession": "MI",
        "miRBase ID": "mirna_name"
     }).assign(species = df['miRBase ID'].str[:3])

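Here df['miRBase ID'] refers to the column under its original name in the original df, not to the renamed column in the chain. This works because assign aligns the Series on the index, so it gives the right result as long as no earlier step in the chain changes the index or the values of that column.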
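One caveat with passing a Series taken from the original df: assign matches it to the intermediate frame by index label, not by position. The sketch below uses made-up three-row data (the "mmu" entry is invented for illustration) to show how the two approaches can diverge once the chain resets the index.

```python
import pandas as pd

# Made-up sample; the "mmu" row is invented so the mismatch is visible
df = pd.DataFrame({
    "miRBase ID": ["hsa-mir-6832", "mmu-mir-1202", "hsa-mir-302c"],
})

# A chain that reorders the rows and then discards the old index labels
chained = (
    df.sort_values("miRBase ID", ascending=False)
      .reset_index(drop=True)
      .rename(columns={"miRBase ID": "mirna_name"})
)

# Series from the original frame: aligned on the old labels 0, 1, 2
from_original = chained.assign(species=df["miRBase ID"].str[:3])

# Callable: evaluated on the intermediate frame, so rows always match
from_lambda = chained.assign(species=lambda x: x["mirna_name"].str[:3])
```

In from_original the species values no longer correspond to the mirna_name in the same row, while from_lambda stays consistent, which is one reason to prefer the callable form inside a chain.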

sowmya answered Nov 01 '22 04:11


I found pandas-ply, which introduces a magic symbol X for that purpose:

import pandas as pd 
from pandas_ply import X, install_ply
install_ply(pd)

df\
     .drop("Gene type", axis=1)\
     .rename(columns = {
        "Gene stable ID": "ENSG",
        "Gene name": "gene_name",
        "miRBase accession": "MI",
        "miRBase ID": "mirna_name"
     })\
     .ply_select("*", species = X.mirna_name.str[:3])

It would be nice to have this in native pandas, though.

Gregor Sturm answered Nov 01 '22 04:11