I have a table of data for a variety of genomic positions. The positions are represented as 3-tuples ('chromosome', 'strand', position) that I've turned into a multi-index. My goal is to look up various information about each position (for example, the gene name) and add it to the table. I can do the lookup itself with pybedtools.
import pandas as pd

df = pd.DataFrame(data={'A': range(1, 8), 'B': range(1, 8), 'C': range(1, 8)},
                  index=pd.MultiIndex.from_tuples([('chrom1', '-', 1234), ('chrom1', '+', 5678),
                      ('chrom1', '+', 9876), ('chrom2', '+', 13579), ('chrom2', '+', 8497), ('chrom2', '-', 98765),
                      ('chrom2', '-', 76856)]))
df.index.rename(['chrom', 'strand', 'abs_pos'], inplace=True)
                      A  B  C
chrom  strand abs_pos
chrom1 -      1234    1  1  1
       +      5678    2  2  2
              9876    3  3  3
chrom2 +      13579   4  4  4
              8497    5  5  5
       -      98765   6  6  6
              76856   7  7  7
My issue is with adding columns to a data frame with a multi-index. This seems straightforward without a multi-index (see "pandas - add new column to dataframe from dictionary").
I have a dictionary of the look up information with 3-tuple keys corresponding to the multi-index. How can I add this data as a new column?
gene_d = {('chrom1', '-', 1234): 'geneA', ('chrom1', '+', 5678): 'geneB',
          ('chrom1', '+', 9876): 'geneC', ('chrom2', '+', 13579): 'geneD',
          ('chrom2', '+', 8497): 'geneE', ('chrom2', '-', 98765): 'geneF',
          ('chrom2', '-', 76856): 'geneG'}
I've tried map, but can't seem to figure out how to get it to work with a multi-index to yield the following:
                            A  B  C
chrom  strand abs_pos gene
chrom1 -      1234    geneA 1  1  1
       +      5678    geneB 2  2  2
              9876    geneC 3  3  3
chrom2 +      13579   geneD 4  4  4
              8497    geneE 5  5  5
       -      98765   geneF 6  6  6
              76856   geneG 7  7  7
pandas MultiIndex to columns: use the DataFrame.reset_index() function to convert MultiIndex (multi-level index) levels to columns. With the default setting drop=False, the index values are kept as columns and the DataFrame gets a new integer index starting from zero.
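As a small illustration of reset_index(), here is a sketch on a three-row frame shaped like the one in the question. The variable names flat and dropped are only for illustration.

```python
import pandas as pd

# A small frame with the same index layout as in the question.
index = pd.MultiIndex.from_tuples(
    [("chrom1", "-", 1234), ("chrom1", "+", 5678), ("chrom2", "+", 13579)],
    names=["chrom", "strand", "abs_pos"],
)
df = pd.DataFrame({"A": [1, 2, 3]}, index=index)

# drop=False (the default): index levels become ordinary columns
# and the frame gets a fresh integer index starting at zero.
flat = df.reset_index()

# drop=True: the index levels are discarded instead of kept as columns.
dropped = df.reset_index(drop=True)

print(flat)
print(dropped)
```

Once the levels are plain columns, the usual single-index column techniques apply, and set_index(['chrom', 'strand', 'abs_pos']) restores the multi-index afterwards.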
You can slice a MultiIndex by providing multiple indexers. You can provide any of the selectors as if you are indexing by label, see Selection by Label, including slices, lists of labels, labels, and boolean indexers. You can use slice(None) to select all the contents of that level.
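A minimal sketch of that kind of selection on the frame from the question, using slice(None) for the levels that should not be restricted. The index is sorted first as a precaution, since some MultiIndex slicing requires a sorted index. The names chrom1_rows and plus_rows are only for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("chrom1", "-", 1234), ("chrom1", "+", 5678), ("chrom1", "+", 9876),
     ("chrom2", "+", 13579), ("chrom2", "+", 8497), ("chrom2", "-", 98765),
     ("chrom2", "-", 76856)],
    names=["chrom", "strand", "abs_pos"],
)
df = pd.DataFrame({"A": range(1, 8)}, index=index).sort_index()

# All rows on chrom1, any strand, any position.
chrom1_rows = df.loc[("chrom1", slice(None), slice(None)), :]

# All rows on the '+' strand, any chromosome, any position.
plus_rows = df.loc[(slice(None), "+", slice(None)), :]

print(chrom1_rows)
print(plus_rows)
```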
A vectorized approach:
df['gene'] = df.index #you get the index as tuple
df['gene'] = df['gene'].map(gene_d)
df = df.set_index('gene', append=True)
Resulting df:
                            A  B  C
chrom  strand abs_pos gene
chrom1 -      1234    geneA 1  1  1
       +      5678    geneB 2  2  2
              9876    geneC 3  3  3
chrom2 +      13579   geneD 4  4  4
              8497    geneE 5  5  5
       -      98765   geneF 6  6  6
              76856   geneG 7  7  7
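If some positions might be missing from the lookup dictionary, a plain Python variant of the same idea is to iterate over the index (which yields one tuple per row) and use dict.get with a fallback. This is only a sketch: the 'unknown' placeholder and the shortened data are assumptions for illustration, not part of the original question.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3]},
    index=pd.MultiIndex.from_tuples(
        [("chrom1", "-", 1234), ("chrom1", "+", 5678), ("chrom2", "+", 13579)],
        names=["chrom", "strand", "abs_pos"],
    ),
)

# Lookup with one position deliberately left out.
gene_d = {("chrom1", "-", 1234): "geneA", ("chrom1", "+", 5678): "geneB"}

# Iterating a MultiIndex yields one tuple per row, matching the dict keys.
# 'unknown' is a hypothetical placeholder for positions without a gene.
df["gene"] = [gene_d.get(key, "unknown") for key in df.index]

# Move the new column into the index as a fourth level.
result = df.set_index("gene", append=True)
print(result)
```

This is not vectorized, so it trades some speed for explicit control over missing keys.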
Make gene_d into a dataframe:
df1 = pd.DataFrame.from_dict(gene_d, orient='index').rename(columns={0:'gene'})
Give it a multi-index:
df1.index = pd.MultiIndex.from_tuples(df1.index)
Concatenate with original df:
new_df = pd.concat([df, df1], axis=1).sort_values('A')
Do some clean up:
new_df.index.rename(['chrom','strand','abs_pos'], inplace=True)
new_df = new_df.set_index('gene', append=True)  # set_index returns a new frame, so assign the result
new_df
                            A  B  C
chrom  strand abs_pos gene
chrom1 -      1234    geneA 1  1  1
       +      5678    geneB 2  2  2
              9876    geneC 3  3  3
chrom2 +      13579   geneD 4  4  4
              8497    geneE 5  5  5
       -      98765   geneF 6  6  6
              76856   geneG 7  7  7
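A shorter variant of the same alignment idea, sketched here under the assumption that the index tuples are unique: a dictionary with tuple keys becomes a Series with a MultiIndex, which can then be reindexed to the row order of df. The name genes and the shortened data are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3]},
    index=pd.MultiIndex.from_tuples(
        [("chrom1", "-", 1234), ("chrom1", "+", 5678), ("chrom2", "+", 13579)],
        names=["chrom", "strand", "abs_pos"],
    ),
)

# Same keys as the index, deliberately in a different order.
gene_d = {
    ("chrom2", "+", 13579): "geneD",
    ("chrom1", "-", 1234): "geneA",
    ("chrom1", "+", 5678): "geneB",
}

# Tuple keys produce a Series indexed by a three-level MultiIndex;
# reindex puts the values into the row order of df.
genes = pd.Series(gene_d).reindex(df.index)

new_df = df.assign(gene=genes.to_numpy()).set_index("gene", append=True)
print(new_df)
```

Because the values are aligned by key rather than by position, the order of entries in the dictionary does not matter.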