The problem
I would like to split a column from a pandas dataframe into 2 columns, in the percentage column (see below), each entry starts with a capitalised alphabet character, I would like to split the 'Percentage' column immediately after this letter, with the new column labelled 'Amino Acid'.
Current Code:
import pandas as pd
df = pd.read_csv('foo.csv')
df['Amino Acid'], df['Percentage'] = zip(*df['Percentage'].map(lambda x: x.split('[^a-zA-Z]')))
df.to_csv('bar.csv',index=False)
Example of input data
+-----------------------------+-------+-----+-----------+---------------------------------------------------------------------------------------------+
| Species | ID | OGT | DB | Percentage |
+-----------------------------+-------+-----+-----------+---------------------------------------------------------------------------------------------+
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | E is 8.333003365670164% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | R is 6.310991522830762% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | A is 10.22668778459711% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
+-----------------------------+-------+-----+-----------+---------------------------------------------------------------------------------------------+
Example of desired output
+-----------------------------+-------+-----+-----------+------------+--------------------------------------------------------------------------------------------+
| Species | ID | OGT | DB | Amino Acid | Percentage |
+-----------------------------+-------+-----+-----------+------------+--------------------------------------------------------------------------------------------+
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | E | is 8.333003365670164% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | R | is 6.310991522830762% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
| Halogeometricum borinquense | 60847 | 37 | ATCC/DSMZ | A | is 10.22668778459711% in ./archaea/GCF_000337855.1/GCF_000337855.1_ASM33785v1_protein.faa |
+-----------------------------+-------+-----+-----------+------------+--------------------------------------------------------------------------------------------+
You can extract the first letter directly:
df['Amino Acid'] = df['Percentage'].str[0]
df['Percentage'] = df['Percentage'].str[1:]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With