Find the difference between strings for each two rows of pandas data.frame

I am new to Python, and I have been struggling with this for some time. I have a file that looks like this:

    name   seq
1   a1     bbb
2   a2     bbc
3   b1     fff
4   b2     fff
5   c1     aaa
6   c2     acg

where name is the name of the string and seq is the string itself. I would like a new column, or a new data frame, that gives the number of differences between each consecutive, non-overlapping pair of rows. For example, I want the number of differences between the sequences for [a1-a2], then [b1-b2], and lastly [c1-c2].

So I need something like this:

    name   seq   diff  
1   a1     bbb    NA   
2   a2     bbc    1
3   b1     fff    NA
4   b2     fff    0
5   c1     aaa    NA
6   c2     acg    2

Any help is highly appreciated.

asked Apr 12 '20 14:04

LDT

4 Answers

It looks like you want the Jaccard distance between each pair of strings. Here's one way, using groupby and scipy.spatial.distance.jaccard:

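from scipy.spatial.distance import jaccard

# Group rows by the first letter of the name, so a1/a2, b1/b2 and c1/c2 pair up
g = df.groupby(df.name.str[0])

# For each pair, emit NaN for the first row and the distance for the second
df['diff'] = [sim for _, seqs in g.seq for sim in
              [float('nan'), jaccard(*map(list, seqs))]]

print(df)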

  name  seq  diff
1   a1  bbb   NaN
2   a2  bbc   1.0
3   b1  fff   NaN
4   b2  fff   0.0
5   c1  aaa   NaN
6   c2  acg   2.0
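
To make the grouping step above easier to follow, here is a minimal sketch of what grouping by the first letter of name produces. The variable names key and pairs are illustrative and not part of the answer, and the data frame is rebuilt from the question's table.

```python
import pandas as pd

# Data rebuilt from the question, with the same 1-based index
df = pd.DataFrame(
    {"name": ["a1", "a2", "b1", "b2", "c1", "c2"],
     "seq": ["bbb", "bbc", "fff", "fff", "aaa", "acg"]},
    index=range(1, 7),
)

# The group key is the first character of each name: a, a, b, b, c, c
key = df["name"].str[0]

# Iterating a grouped column yields (key, Series) tuples, one per group
pairs = {letter: seqs.tolist() for letter, seqs in df.groupby(key)["seq"]}
print(pairs)
```

Each group holds exactly the two strings to compare. Note that the list comprehension in the answer assigns its results back by position, so it relies on the rows already being ordered by group, as they are in the question.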

answered Oct 23 '22 08:10

yatu

Alternative with Levenshtein distance:

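import Levenshtein

# Group key: the first letter of each name (a, b, c)
s = df['name'].str[0]
# Compute the distance per group, then map it onto the last row of each group
out = df.assign(Diff=s.drop_duplicates(keep='last').map(df.groupby(s)['seq']
                    .apply(lambda x: Levenshtein.distance(x.iloc[0], x.iloc[-1]))))
print(out)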

  name  seq  Diff
1   a1  bbb   NaN
2   a2  bbc   1.0
3   b1  fff   NaN
4   b2  fff   0.0
5   c1  aaa   NaN
6   c2  acg   2.0
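
Levenshtein distance and a position-by-position count agree on the question's data, but they are not the same measure in general. The sketch below illustrates this in pure Python. The functions hamming and edit_distance are hypothetical helpers written for this example (a textbook dynamic-programming version, not the Levenshtein package's implementation).

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Textbook Levenshtein distance via dynamic programming (illustrative)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]


# The pairs from the question give the same result under both measures
for first, second in [("bbb", "bbc"), ("fff", "fff"), ("aaa", "acg")]:
    print(first, second, hamming(first, second), edit_distance(first, second))

# A shifted string shows where they diverge
print(hamming("abc", "bca"), edit_distance("abc", "bca"))
```

For "abc" and "bca", every position differs, yet one deletion plus one insertion is enough to turn one string into the other. If the sequences can contain shifts like this, the choice between the two measures matters.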

answered Oct 23 '22 07:10

anky

As a first step I recreated your data with:

#!/usr/bin/env python3
import pandas as pd

# Setup
data = {'name': {1: 'a1', 2: 'a2', 3: 'b1', 4: 'b2', 5: 'c1', 6: 'c2'}, 'seq': {1: 'bbb', 2: 'bbc', 3: 'fff', 4: 'fff', 5: 'aaa', 6: 'acg'}}
df = pd.DataFrame(data)

Solution

You could iterate over the dataframe and compare the seq value from the previous iteration with the current one. To compare the two strings (stored in the seq column of your dataframe), you can apply a simple generator expression, as in this function:

def diff_letters(a, b):
    # Count the positions at which the two strings differ
    return sum(x != y for x, y in zip(a, b))

Iteration over the DataFrame rows

diff = ['NA']

row_iterator = df.iterrows()
_, last = next(row_iterator)

# Iterate over the df and populate a list with the results of the comparison
for i, row in row_iterator:
    if i % 2 == 0:
        diff.append(diff_letters(last['seq'],row['seq']))
    else:
        # for odd row numbers append NA value
        diff.append("NA")
    last = row
df['diff'] = diff

The result looks like this:

  name  seq diff
1   a1  bbb   NA
2   a2  bbc    1
3   b1  fff   NA
4   b2  fff    0
5   c1  aaa   NA
6   c2  acg    2
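
The loop above decides which rows get a value from the parity of the index label, which works because the index runs from 1 to 6. As a hedged alternative sketch, the same pairing can be done by row position with shift(), so it does not depend on the index labels. The names prev and is_second are illustrative, and diff_letters is restated so the block is self-contained.

```python
import numpy as np
import pandas as pd


def diff_letters(a, b):
    # Count the positions at which the two strings differ
    return sum(x != y for x, y in zip(a, b))


df = pd.DataFrame(
    {"name": ["a1", "a2", "b1", "b2", "c1", "c2"],
     "seq": ["bbb", "bbc", "fff", "fff", "aaa", "acg"]},
    index=range(1, 7),
)

# seq of the row above; the first row has no predecessor and gets NaN
prev = df["seq"].shift()

# True at positions 1, 3, 5, i.e. the second row of each pair
is_second = np.arange(len(df)) % 2 == 1

# Compare only on the second row of each pair, leave NaN elsewhere
df["diff"] = [diff_letters(p, s) if second else np.nan
              for p, s, second in zip(prev, df["seq"], is_second)]
print(df)
```

This variant stores real NaN values rather than the string "NA", so the diff column stays numeric.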

answered Oct 23 '22 07:10

Björn

Check this one:

import numpy as np
import pandas as pd

data = {'name': ['a1', 'a2', 'b1', 'b2', 'c1', 'c2'],
        'seq': ['bbb', 'bbc', 'fff', 'fff', 'aaa', 'acg']}

df = pd.DataFrame(data, columns=['name', 'seq'])
df['diff'] = np.nan  # the first row of each pair stays NaN

i = 0
while i < len(df) - 1:
    item = df.at[i, 'seq']
    # Count characters of the second string that do not occur anywhere in the first
    diffCntr = 0
    for j in df.at[i + 1, 'seq']:
        if item.find(j) < 0:
            diffCntr += 1
    df.at[i + 1, 'diff'] = diffCntr
    i += 2
print(df)

The result is this:

    name seq    diff
0   a1   bbb    NaN
1   a2   bbc    1.0
2   b1   fff    NaN
3   b2   fff    0.0
4   c1   aaa    NaN
5   c2   acg    2.0
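
The find-based test above checks whether each character of the second string occurs anywhere in the first, which is not the same as comparing position by position. The sketch below shows where the two counts agree and where they diverge. Both function names are hypothetical helpers written for this illustration.

```python
def missing_chars(first, second):
    """Count characters of `second` that occur nowhere in `first`."""
    return sum(first.find(ch) < 0 for ch in second)


def positional_diff(first, second):
    """Count the positions at which the two strings differ."""
    return sum(x != y for x, y in zip(first, second))


# On the question's data both counts agree
for first, second in [("bbb", "bbc"), ("fff", "fff"), ("aaa", "acg")]:
    print(first, second, missing_chars(first, second), positional_diff(first, second))

# With reordered characters they do not
print(missing_chars("ab", "ba"), positional_diff("ab", "ba"))
```

If the order of characters in seq matters, the positional count is probably the safer choice.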

answered Oct 23 '22 08:10

Rola


Tags: python, string, pandas, difference
