Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to combine queries with a single external variable using Pandas

I am trying to accept a variable input of many search terms seperated by commas via html form (@search) and query 2 columns of a dataframe.

Each column query works on its own but I cannot get them to work together in a and/or way.

First column query:

filtered = df.query ('`Drug Name` in @search')

Second column query:

filtered = df.query ('BP.str.contains(@search, na=False)', engine='python')

edit combining like this:

filtered = df.query ("('`Drug Name` in @search') and ('BP.str.contains(@search, na=False)', engine='python')")

Gives the following error, highlighting the python identifier in the engine argument

SyntaxError: Python keyword not valid identifier in numexpr query

edit 2

The dataframe is read from an excel file, with columns: Drug Name (containing a single drug name), BP, U&E (with long descriptive text entries)

The search terms will be input via html form:

search = request.values.get('searchinput').replace(" ","").split(',')

as a list of drugs which a patient may be on sometimes with the addition of specific conditions relating to medication use. sample user input:

Captopril, Paracetamol, kidney disease, chronic

I want the list to be checked against specific drug names and also to check other columns such as BP and U&E for any mention of the search terms.

edit 3

Apologies, but trying to implement the answers given is giving me stacks of errors. What I have below is giving me 90% of what I'm after, letting me search both columns including the whole contents of 'BP'. But I can only search a single term via the terminal, if I # out and swap the lines which collect the use input (taking it from the html form as apposed to the terminal) I get:

TypeError: unhashable type: 'list'

@app.route('/', methods=("POST", "GET"))

    def html_table():
        searchterms = []
        #searchterms = request.values.get('searchinput').replace(" ","").split(',')
        searchterms = input("Enter drug...")   
        filtered = df.query('`Drug Name` in @searchterms | BP.str.contains(@searchterms, na=False)', engine='python')
        return render_template('drugsafety.html', tables=[filtered.to_html(classes='data')], titles=['na', 'Drug List'])

<form action="" method="post">
  <p><label for="search">Search</label>
  <input type="text" name="searchinput"></p>        
  <p><input type="submit"></p>
</form>

Sample data

The contents of the BP column can be quite long, descriptive and variable but an example is:

Every 12 months – Patients with CKD every 3 to 6 months.

Drug Name         BP                            U&E
Perindopril       Every 12 months               Not needed
Alendronic Acid   Not needed                    Every 12 months
Allopurinol       Whilst titrating - 3 months   Not needed

With this line:

searchterms = request.values.get('searchinput')

Entering 'months' into the html form outputs:

1   Perindopril  Every 12 months                Not needed 
14  Allopurinol  Whilst titrating – 3 months    Not needed

All good.

Entering 'Alendronic Acid' into the html form outputs:

13  Alendronic Acid Not needed  Every 12 months

Also good, but entering 'Perindopril, Allopurinol' returns nothing.

If I change the line to:

searchterms = request.values.get('searchinput').replace(" ","").split(',')

I get TypeError: unhashable type: 'list' when the page reloads.

However - If I then change:

filtered = df.query('`Drug Name` in @searchterms | BP.str.contains(@searchterms, na=False)', engine='python')

to:

filtered = df.query('`Drug Name` in @searchterms')

Then the unhashable type error goes and entering 'Perindopril, Allopurinol' returns:

1   Perindopril   Every 12 months                   Not needed
14  Allopurinol   Whilst titrating – Every 3 months Not needed

But I'm now no longer searching the BP column for the searchterms.

Just thought that maybe its because searchterms is a list '[]' changed it t oa tuple '()' Didn't change anything.

Any help is much appreciated.

like image 797
Dave Avatar asked May 03 '21 22:05

Dave


3 Answers

I am assuming you want to query 2 columns and want to return the row if any of the query matches.

In this line, the issue is that engine=python is inside query.

filtered = df.query ("('`Drug Name` in @search') and ('BP.str.contains(@search, na=False)', engine='python')")

It should be

df.query("BP.str.contains(@search, na=False)", engine='python')

If you do searchterms = request.values.get('searchinput').replace(" ","").split(','), it converts your string to list of words which will cause Unhashable type list error because str.contains expects str as input.

What you can do is use regex to search for search terms in list, it will look something like this:

df.query("BP.str.contains('|'.join(@search), na=False, regex=True)", engine='python')

What this does is it searches for all the individual words using regex. ('|'.join(@search) will be "searchterm_1|search_term2|..." and "|" is used to represent or in regex, so it looks for searchterm_1 or searchterm_2 in BP column value)

To combine the outputs of both queries, you can run those separately and concatenate the results

pd.concat([df.query("`Drug Name` in @search", engine='python'),df.query("BP.str.contains('|'.join(@search), na=False, regex=True)", engine='python')])

Also any string based matching will require your strings to match perfectly, including case. so you can maybe lowercase everything in dataframe and query. Similarly for space separated words, this will remove spaces.

if you do searchterms = request.values.get('searchinput').replace(" ","").split(',') on Every 12 months, it will get converted to "Every12months". so you can maybe remove the .replace() part and just use searchterms = request.values.get('searchinput').split(',')

like image 50
SajanGohil Avatar answered Oct 18 '22 17:10

SajanGohil


Use sets. You can change the text columns to sets and check for intersection with the input. The rest is pure pandas. I never use .query because it is slow.

# change your search from list to set
search = set(request.values.get('searchinput').replace(" ","").split(','))

filtered = df.loc[(df['Drug Name'].str.split().map(lambda x: set(x).intersection(search)))
                  & (df['BP'].str.split().map(lambda x: set(x).intersection(search)))]

print(filtered)

Demo:

import pandas as pd

search = set(["apple", "banana", "orange"])
df = pd.DataFrame({
    "Drug Name": ["I am eating an apple", "We are banana", "nothing is here"],
    "BP": ["apple is good", "nothing is here", "nothing is there"],
    "Just": [1, 2, 3]
})


filtered = df.loc[(df['Drug Name'].str.split().map(lambda x: set(x).intersection(search)))
                  & (df['BP'].str.split().map(lambda x: set(x).intersection(search)))]

print(filtered)

#               Drug Name             BP  Just
# 0  I am eating an apple  apple is good     1

Updated:

I would want the results to also show We are banana, nothing is here and 2

That requires or which is Pandas' | instead of and which Pandas' $

filtered = df.loc[(df['Drug Name'].str.split().map(lambda x: set(x).intersection(search)))
                  | (df['BP'].str.split().map(lambda x: set(x).intersection(search)))]

print(filtered)

#                Drug Name               BP  Just
# 0  I am eating an apple    apple is good       1
# 1          We are banana   nothing is here     2
like image 33
Prayson W. Daniel Avatar answered Oct 18 '22 17:10

Prayson W. Daniel


If you want to search for text in all columns, you can first join all columns, and then check for search terms in each row using str.contains and the regular expression pattern that matches at least one of the terms (term1|term2|...|termN). I've also added flags=re.IGNORECASE to make the search case-insensitive:

# search function
def search(searchterms):
    return df.loc[df.apply(' '.join, axis=1)    # join text in all columns
        .str.contains(                          # check if it contains
            '|'.join([                          # regex pattern
                x.strip()                       # strip spaces
                for x in searchterms.split(',') # split by ','
            ]), flags=re.IGNORECASE)]           # case-insensitive

# test search terms
for s in ['Alendronic Acid', 'months', 'Perindopril, Allopurinol']:
    print(f'Search terms: "{s}"')
    print(search(s))
    print('-'*70)

Output:

Search terms: "Alendronic Acid"
         Drug Name          BP              U&E
1  Alendronic Acid  Not needed  Every 12 months
----------------------------------------------------------------------
Search terms: "months"
         Drug Name                           BP              U&E
0      Perindopril              Every 12 months       Not needed
1  Alendronic Acid                   Not needed  Every 12 months
2      Allopurinol  Whilst titrating - 3 months       Not needed
----------------------------------------------------------------------
Search terms: "Perindopril, Allopurinol"
     Drug Name                           BP         U&E
0  Perindopril              Every 12 months  Not needed
2  Allopurinol  Whilst titrating - 3 months  Not needed
----------------------------------------------------------------------

P.S. If you want to limit search to specific columns, here's a version that does that (with the default of searching all columns for convenience):

# search function
def search(searchterms, cols=None):
    
    # search columns (if None, searches in all columns)
    if cols is None:
        cols = df.columns
        
    return df.loc[df[cols].apply(' '.join, axis=1) # join text in cols
        .str.contains(                             # check if it contains
            '|'.join([                             # regex pattern
                x.strip()                          # remove spaces
                for x in searchterms.split(',')    # split by ','
            ]), flags=re.IGNORECASE)]              # make search case-insensitive

Now if I search for months only in Drug Name and BP, it will not return Alendronic Acid where months is only found in U&E:

search('months', ['Drug Name', 'BP'])

Output:

     Drug Name                           BP         U&E
0  Perindopril              Every 12 months  Not needed
2  Allopurinol  Whilst titrating - 3 months  Not needed
like image 2
perl Avatar answered Oct 18 '22 17:10

perl