Python pandas return value from other column

I have a file "specieslist.txt" which contains the following information:

Bacillus,genus
Borrelia,genus
Burkholderia,genus
Campylobacter,genus

Now, I want Python to look up a value in the first column (in this example "Campylobacter") and return the corresponding value from the second column ("genus"). I wrote the following code:

import pandas as pd

species_import = 'Campylobacter'
df = pd.read_csv('specieslist.txt', header=None, names=['species', 'level'])
# Filter the rows whose species matches; the result is a DataFrame, not a scalar
result = df.loc[df['species'] == species_import]
print(result['level'])

However, my code returns too much; I only want "genus":

3    genus
Name: level, dtype: object
Gravel asked Dec 09 '25 17:12


2 Answers

You can select the first value of the Series with iat:

species_import = 'Campylobacter'
out = df.loc[df['species'] == species_import, 'level'].iat[0]
# alternative
# out = df.loc[df['species'] == species_import, 'level'].values[0]
print(out)
genus
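
To see why the original code printed more than "genus", it may help to compare the filtered Series with the scalar taken from it. The sketch below is self-contained: as an assumption, it reads the four sample rows from an in-memory string standing in for specieslist.txt rather than from disk.

```python
import io
import pandas as pd

# Assumption: in-memory stand-in for specieslist.txt, using the rows from the question.
sample = "Bacillus,genus\nBorrelia,genus\nBurkholderia,genus\nCampylobacter,genus\n"
df = pd.read_csv(io.StringIO(sample), header=None, names=['species', 'level'])

species_import = 'Campylobacter'

# Boolean-mask selection returns a Series, so the row label (3),
# the column name and the dtype are all printed along with the value.
matches = df.loc[df['species'] == species_import, 'level']
print(type(matches).__name__)   # Series

# .iat[0] takes the first element out of that Series as a plain string.
out = matches.iat[0]
print(out)                      # genus
```

Printing `matches` reproduces the output from the question, while `out` holds only the string.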

A more robust solution also handles the case where nothing matches and an empty Series is returned; it then returns 'no match'. As @jpp noted in a comment, this solution is better only when you have a large Series and the matched value is expected to be near the top:

species_import = 'Campylobacter'
out = next(iter(df.loc[df['species'] == species_import, 'level']), 'no match')
print (out)
genus
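
The fallback behaviour of next(iter(...), default) can be checked on a small frame. This is a minimal sketch, assuming the sample rows from the question are built inline; the helper name first_level is hypothetical and not part of pandas.

```python
import pandas as pd

# Assumption: the sample rows from the question, built inline instead of read from disk.
df = pd.DataFrame({
    'species': ['Bacillus', 'Borrelia', 'Burkholderia', 'Campylobacter'],
    'level': ['genus'] * 4,
})

def first_level(species, default='no match'):
    """Return the first matching level, or `default` when nothing matches."""
    matches = df.loc[df['species'] == species, 'level']
    # An iterator over an empty Series is exhausted immediately,
    # so next() falls back to `default` instead of raising.
    return next(iter(matches), default)

print(first_level('Campylobacter'))  # genus
print(first_level('aaa'))            # no match
```

The boolean mask is still evaluated for every row here, so this form mainly guards against an empty result.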

EDIT:

An idea from the comments (thanks, @jpp): wrap the lookup in a function and catch the IndexError raised for an empty Series:

def get_first_val(val):
    try:
        return df.loc[df['species'] == val, 'level'].iat[0]
    except IndexError:
        return 'no match'

print (get_first_val(species_import))
genus

print (get_first_val('aaa'))
no match

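EDIT:

Timings on a larger DataFrame, with the matched value at the top ('a') and at the bottom ('b'):

import numpy as np

df = pd.DataFrame({'species':['a'] * 10000 + ['b'], 'level':np.arange(10001)})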

def get_first_val(val):
    try:
        return df.loc[df['species'] == val, 'level'].iat[0]
    except IndexError:
        return 'no match'


In [232]: %timeit next(iter(df.loc[df['species'] == 'a', 'level']), 'no match')
1.3 ms ± 33.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [233]: %timeit (get_first_val('a'))
1.1 ms ± 21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)



In [235]: %timeit (get_first_val('b'))
1.48 ms ± 206 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [236]: %timeit next(iter(df.loc[df['species'] == 'b', 'level']), 'no match')
1.24 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
jezrael answered Dec 12 '25 06:12


Performance of various methods, to demonstrate when it is useful to use next(...).

import numpy as np
import pandas as pd

n = 10**6
df = pd.DataFrame({'species': ['b']+['a']*n, 'level': np.arange(n+1)})

def get_first_val(val):
    try:
        return df.loc[df['species'] == val, 'level'].iat[0]
    except IndexError:
        return 'no match'

%timeit next(iter(df.loc[df['species'] == 'b', 'level']), 'no match')     # 123 ms per loop
%timeit get_first_val('b')                                                # 125 ms per loop
%timeit next(idx for idx, val in enumerate(df['species']) if val == 'b')  # 20.3 µs per loop
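
Note that the third timing returns the position of the first match, not its level. The following is a sketch of how the same early-exit idea could return the value itself, with a default when nothing matches. The helper name first_level_lazy is hypothetical, and the frame is a smaller version of the one above so the example runs quickly.

```python
import numpy as np
import pandas as pd

n = 10**4  # assumption: smaller than the benchmark frame, to keep the example quick
df = pd.DataFrame({'species': ['b'] + ['a'] * n, 'level': np.arange(n + 1)})

def first_level_lazy(val, default='no match'):
    """Scan rows in order and stop at the first row whose species equals `val`."""
    pairs = zip(df['species'], df['level'])
    # The generator is consumed only until the first match is found.
    return next((level for species, level in pairs if species == val), default)

print(first_level_lazy('b'))    # 0
print(first_level_lazy('zzz'))  # no match
```

The scan stops early only when a match sits near the top. For a value that is absent or near the bottom it walks every row in Python, which is likely to be slower than the vectorised mask.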
jpp answered Dec 12 '25 05:12


