I am new to Python and have been struggling with this for some time. I have a file that looks like this:
name seq
1 a1 bbb
2 a2 bbc
3 b1 fff
4 b2 fff
5 c1 aaa
6 c2 acg
where name is the name of the string and seq is the string itself. I would like a new column, or a new data frame, that gives the number of differences between each non-overlapping pair of consecutive rows. For example, I want the number of differences between the sequences named [a1-a2], then [b1-b2], and lastly [c1-c2].
So I need something like this:
name seq diff
1 a1 bbb NA
2 a2 bbc 1
3 b1 fff NA
4 b2 fff 0
5 c1 aaa NA
6 c2 acg 2
Any help is highly appreciated
The difference between rows or columns of a pandas DataFrame is found with the diff() method. The axis parameter decides whether the difference is calculated between rows or between columns. When the periods parameter is positive, each result is the current row minus the row that many positions before it.
The equals() function is used to test whether two Pandas objects contain the same elements. This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.
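As a small illustration of the equals() behaviour described above, here is a sketch with made-up Series (the values are assumptions, not the data from the question). It shows that equals() answers only "identical or not" with a single boolean, so on its own it cannot count how many positions differ.

```python
import numpy as np
import pandas as pd

# Hypothetical example data, chosen only to illustrate equals()
a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([1.0, np.nan, 3.0])
c = pd.Series([1.0, np.nan, 4.0])

same = a.equals(b)       # NaNs in the same position count as equal
different = a.equals(c)  # the last element differs
print(same, different)
```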
The diff() method returns a DataFrame with the difference between the values for each row and, by default, the previous row. Which row to compare with can be specified with the periods parameter.
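It looks like you want the Jaccard distance between the pairs of strings. Here's one way using groupby and scipy.spatial.distance.jaccard. Note that jaccard is defined for boolean vectors and normally returns a value between 0 and 1, so the whole-number counts in the output below may depend on how the installed SciPy and NumPy versions treat string input:
from scipy.spatial.distance import jaccard
# Group rows by the first character of the name (a1/a2 -> 'a', and so on)
g = df.groupby(df['name'].str[0])
# For each pair, emit NaN for the first row and the distance for the second
df['diff'] = [sim for _, seqs in g['seq'] for sim in
              [float('nan'), jaccard(*map(list, seqs))]]
print(df)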
name seq diff
1 a1 bbb NaN
2 a2 bbc 1.0
3 b1 fff NaN
4 b2 fff 0.0
5 c1 aaa NaN
6 c2 acg 2.0
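Alternative with Levenshtein distance. This is an edit distance that also counts insertions and deletions, so it matches the position-by-position count for these examples but can differ for other strings:
import Levenshtein  # third-party package
# Group key: the first character of the name
s = df['name'].str[0]
# Compute one distance per group, then map it onto the last row of each group only
out = df.assign(Diff=s.drop_duplicates(keep='last').map(
    df.groupby(s)['seq'].apply(lambda x: Levenshtein.distance(x.iloc[0], x.iloc[-1]))))
print(out)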
name seq Diff
1 a1 bbb NaN
2 a2 bbc 1.0
3 b1 fff NaN
4 b2 fff 0.0
5 c1 aaa NaN
6 c2 acg 2.0
As a first step I recreated your data with:
#!/usr/bin/env python3
import pandas as pd
# Setup
data = {'name': {1: 'a1', 2: 'a2', 3: 'b1', 4: 'b2', 5: 'c1', 6: 'c2'}, 'seq': {1: 'bbb', 2: 'bbc', 3: 'fff', 4: 'fff', 5: 'aaa', 6: 'acg'}}
df = pd.DataFrame(data)
Solution
You could iterate over the dataframe and compare the seq value of the previous iteration with the current one. To compare the two strings (stored in the seq column of your dataframe) you can use a simple generator expression, as in this function:
def diff_letters(a, b):
    # Count the positions at which the two equal-length strings differ
    return sum(a[i] != b[i] for i in range(len(a)))
Iteration over the Dataframe rows
diff = ['NA']
row_iterator = df.iterrows()
_, last = next(row_iterator)
# Iterate over the remaining rows and populate a list with the comparison results
for i, row in row_iterator:
    if i % 2 == 0:
        # even index label: second row of a pair, compare it with the previous row
        diff.append(diff_letters(last['seq'], row['seq']))
    else:
        # odd index label: first row of a pair, so append an NA value
        diff.append("NA")
    last = row
df['diff'] = diff
Result looks like this
name seq diff
1 a1 bbb NA
2 a2 bbc 1
3 b1 fff NA
4 b2 fff 0
5 c1 aaa NA
6 c2 acg 2
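The same pairing can also be sketched without iterrows by slicing the seq column into "first of pair" and "second of pair" rows. This is only an illustration under assumptions: the rows are already ordered so that each pair is adjacent, both strings in a pair have equal length, and count_mismatches is a hypothetical helper name introduced here.

```python
import numpy as np
import pandas as pd

# Recreate the example data with the same 1-based index as the question
df = pd.DataFrame(
    {"name": ["a1", "a2", "b1", "b2", "c1", "c2"],
     "seq": ["bbb", "bbc", "fff", "fff", "aaa", "acg"]},
    index=range(1, 7),
)

def count_mismatches(a, b):
    """Count positions where two equal-length strings differ (hypothetical helper)."""
    return sum(x != y for x, y in zip(a, b))

# Pair rows by position: rows 0 and 1, rows 2 and 3, and so on
firsts = df["seq"].iloc[0::2]
seconds = df["seq"].iloc[1::2]

# Start with NaN everywhere, then fill in only the second row of each pair
df["diff"] = np.nan
df.loc[seconds.index, "diff"] = [
    count_mismatches(a, b) for a, b in zip(firsts, seconds)
]
print(df)
```

Because the rows are paired by position rather than by index label, this variant does not depend on whether the index starts at 0 or 1.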
Check this one
import numpy as np
import pandas as pd
data = {'name': ['a1', 'a2', 'b1', 'b2', 'c1', 'c2'],
        'seq': ['bbb', 'bbc', 'fff', 'fff', 'aaa', 'acg']}
df = pd.DataFrame(data, columns=['name', 'seq'])
# Every row starts as NaN; only the second row of each pair gets a count
df['diff'] = np.nan
i = 0
# Walk the rows two at a time: row i is the first of a pair, row i+1 the second
while i < len(df) - 1:
    first = df.at[i, 'seq']
    second = df.at[i + 1, 'seq']
    # Count the positions at which the two strings differ
    df.at[i + 1, 'diff'] = sum(a != b for a, b in zip(first, second))
    i += 2
print(df)
The result is this:
name seq diff
0 a1 bbb NaN
1 a2 bbc 1.0
2 b1 fff NaN
3 b2 fff 0.0
4 c1 aaa NaN
5 c2 acg 2.0