
Find the difference between strings for each two rows of pandas data.frame

I am new to Python and have been struggling with this for some time. I have a file that looks like this:

    name   seq
1   a1     bbb
2   a2     bbc
3   b1     fff
4   b2     fff
5   c1     aaa
6   c2     acg

where name is the name of the string and seq is the string itself. I would like a new column (or a new data frame) that gives the number of differences between each non-overlapping pair of consecutive rows. For example, I want the number of differences between the sequences for [a1-a2], then [b1-b2], and lastly [c1-c2].

So I need something like this:

    name   seq   diff  
1   a1     bbb    NA   
2   a2     bbc    1
3   b1     fff    NA
4   b2     fff    0
5   c1     aaa    NA
6   c2     acg    2

Any help is highly appreciated

LDT asked Apr 12 '20 14:04



4 Answers

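It looks like you want the jaccard distance of the pairs of strings. Here's one way using groupby and scipy.spatial.distance.jaccard. Note that jaccard is defined for boolean vectors and normally returns a proportion between 0 and 1, so the counts in the output depend on how the installed SciPy version handles string input; check the values on your own setup: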

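from scipy.spatial.distance import jaccard

# Group rows by the first letter of name, so a1/a2, b1/b2 and c1/c2 form pairs
g = df.groupby(df.name.str[0])

# For each pair, emit NaN for the first row and the distance for the second
df['diff'] = [sim for _, seqs in g.seq for sim in
              [float('nan'), jaccard(*map(list,seqs))]]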

print(df)

  name  seq  diff
1   a1  bbb   NaN
2   a2  bbc   1.0
3   b1  fff   NaN
4   b2  fff   0.0
5   c1  aaa   NaN
6   c2  acg   2.0
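Since the goal is a count of differing positions, a variant that does not depend on SciPy may be easier to reason about. The following is a minimal sketch, not the method used in this answer: it assumes every group holds exactly two equal-length sequences, and count_mismatches is a hypothetical helper introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question (index labels 1..6)
df = pd.DataFrame(
    {"name": ["a1", "a2", "b1", "b2", "c1", "c2"],
     "seq": ["bbb", "bbc", "fff", "fff", "aaa", "acg"]},
    index=range(1, 7),
)

def count_mismatches(a, b):
    """Hypothetical helper: number of positions at which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

# Start with NaN everywhere, then fill in the second row of each pair
diff = pd.Series(np.nan, index=df.index)
for _, seqs in df.groupby(df["name"].str[0])["seq"]:
    first, second = seqs  # assumption: exactly two rows per group
    diff.loc[seqs.index[-1]] = count_mismatches(first, second)

df["diff"] = diff
print(df)
```

Unpacking `first, second = seqs` raises a ValueError if a group does not contain exactly two rows, which makes a violated assumption visible instead of silently producing a wrong count.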
yatu answered Oct 23 '22 08:10


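Alternative with Levenshtein distance (this uses the third-party Levenshtein package; for strings of equal length, the edit distance is never larger than the position-by-position count, but it can be smaller when characters are shifted):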

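import Levenshtein

# Group key: first letter of name
s = df['name'].str[0]
# Compute the distance between the first and last seq of each group, then
# attach it to the last row of the group; all other rows get NaN
out = df.assign(Diff=s.drop_duplicates(keep='last').map(df.groupby(s)['seq']
                    .apply(lambda x: Levenshtein.distance(x.iloc[0],x.iloc[-1]))))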

  name  seq  Diff
1   a1  bbb   NaN
2   a2  bbc   1.0
3   b1  fff   NaN
4   b2  fff   0.0
5   c1  aaa   NaN
6   c2  acg   2.0
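To see how edit distance and a position-by-position count can disagree, here is a small pure-Python sketch. It is an illustration only: levenshtein below is a textbook dynamic-programming version written for this example, not the implementation in the Levenshtein package, and mismatches is a hypothetical helper.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) via row-by-row DP."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def mismatches(a, b):
    """Hypothetical helper: count of positions at which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

# The two measures agree on the question's data ...
for first, second in [("bbb", "bbc"), ("fff", "fff"), ("aaa", "acg")]:
    print(first, second, levenshtein(first, second), mismatches(first, second))

# ... but diverge when characters are shifted rather than substituted
print(levenshtein("abcd", "bcda"), mismatches("abcd", "bcda"))
```

For the sample sequences both numbers coincide, so either measure reproduces the requested column; the choice only matters if the real data can contain shifted sequences.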
anky answered Oct 23 '22 07:10


As a first step I recreated your data with:

#!/usr/bin/env python3
import pandas as pd

# Setup
data = {'name': {1: 'a1', 2: 'a2', 3: 'b1', 4: 'b2', 5: 'c1', 6: 'c2'}, 'seq': {1: 'bbb', 2: 'bbc', 3: 'fff', 4: 'fff', 5: 'aaa', 6: 'acg'}}
df = pd.DataFrame(data)

Solution

You could iterate over the dataframe and compare the seq value from the previous iteration with the current one. To compare the two strings (stored in the seq column of your dataframe), you can sum over a simple generator expression, as in this function:

def diff_letters(a, b):
    # Count the positions at which a and b differ (assumes len(b) >= len(a))
    return sum(a[i] != b[i] for i in range(len(a)))

Iteration over the Dataframe rows

diff = ['NA']

row_iterator = df.iterrows()
_, last = next(row_iterator)

# Iterate over the df and populate a list with the results of the comparison
for i, row in row_iterator:
    if i % 2 == 0:
        # even index label: second row of a pair, compare with the previous row
        diff.append(diff_letters(last['seq'],row['seq']))
    else:
        # odd index label: first row of a pair, append the NA placeholder
        diff.append("NA")
    last = row
df['diff'] = diff

The result looks like this:

  name  seq diff
1   a1  bbb   NA
2   a2  bbc    1
3   b1  fff   NA
4   b2  fff    0
5   c1  aaa   NA
6   c2  acg    2
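The loop above stores the string "NA", which makes the column object-typed, and its `i % 2` test relies on the index labels running 1 to 6. A possible alternative, sketched under the assumption that rows are already ordered in consecutive pairs, pairs rows by position and uses a real NaN; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data from the question (index labels 1..6)
df = pd.DataFrame(
    {"name": ["a1", "a2", "b1", "b2", "c1", "c2"],
     "seq": ["bbb", "bbc", "fff", "fff", "aaa", "acg"]},
    index=range(1, 7),
)

# Previous row's sequence for every row (NaN for the very first row)
prev = df["seq"].shift()

# True for the second row of each pair, based on position rather than labels
is_second = np.arange(len(df)) % 2 == 1

# Count differing positions on second rows only; first rows get NaN
df["diff"] = [
    sum(x != y for x, y in zip(p, s)) if second else np.nan
    for p, s, second in zip(prev, df["seq"], is_second)
]
print(df)
```

Because the missing values are NaN rather than strings, the resulting column is numeric and works directly with methods such as `sum()` or `mean()`.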
Björn answered Oct 23 '22 07:10


Check this one

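import numpy as np
import pandas as pd

data = {'name': ['a1', 'a2', 'b1', 'b2', 'c1', 'c2'],
        'seq': ['bbb', 'bbc', 'fff', 'fff', 'aaa', 'acg']}

df = pd.DataFrame(data, columns=['name', 'seq'])
df['diff'] = np.nan  # the first row of each pair stays NaN

# Walk the rows two at a time: i is the first row of a pair, i + 1 the second
i = 0
while i < len(df) - 1:
    item = df.at[i, 'seq']
    diffCntr = 0
    # Count characters of the second string that do not occur anywhere in the first.
    # Note: this is a membership test, not a position-by-position comparison.
    for j in df.at[i + 1, 'seq']:
        if item.find(j) < 0:
            diffCntr += 1
    df.at[i + 1, 'diff'] = diffCntr
    i += 2
df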

The result is this:

    name seq    diff
0   a1   bbb    NaN
1   a2   bbc    1.0
2   b1   fff    NaN
3   b2   fff    0.0
4   c1   aaa    NaN
5   c2   acg    2.0
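The inner loop in this answer checks whether each character of the second string occurs anywhere in the first, which is not the same as comparing position by position. The sketch below contrasts the two; both function names are hypothetical and introduced only for this comparison.

```python
def membership_count(first, second):
    """Characters of `second` that do not occur anywhere in `first`
    (the test used by the loop in this answer)."""
    return sum(first.find(ch) < 0 for ch in second)

def positional_count(first, second):
    """Positions at which the two strings differ."""
    return sum(x != y for x, y in zip(first, second))

# Same results on the question's sample pairs
for first, second in [("bbb", "bbc"), ("fff", "fff"), ("aaa", "acg")]:
    print(first, second, membership_count(first, second), positional_count(first, second))

# Different results once the same characters appear in a different order
print(membership_count("ab", "ba"), positional_count("ab", "ba"))
```

If the sequences can contain the same characters in a different order, the positional version is probably closer to what the question asks for.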
Rola answered Oct 23 '22 08:10