My question is how to calculate the greatest distance between any two strings that belong to the same group. Each line in my file starts with a 'group number' followed by a long string. For each group, I want to know the greatest distance between any two of its strings. Below is the kind of file I'm working with (the strings have been shortened). Notice the groups aren't necessarily in order, and some of my groups only have one string associated with them, so I would want to just skip over those (group '3' in the example below):
0 GCAGACGGGUGAGUAACGCGUGGGAACGUACCAUUUGCUACGGAAUAACUCAGG
0 GCAGACGGGUGAGUAACGCGUGGGAACGUACCAUUUGCUACGGAAUAACUCAGG
1 CGAACGGGUGAGUAACACGUGGGCAAUCUGCCCUGCACUCUGGGACAAGCCCUG
1 CGAACGGGUGAGUAACACGUGGGCAAUCUGCCCUGCACUCUGGGACAAGCCCUG
1 CGAACGGGUGAGUAACACGUGGGCAAUCUGCCCUGCACUCUGGGACAAGCCCUG
2 GCCCUUCGGGGUACUCGAGUGGCGAACGGGUGAGUAACACGUGGGUGAUCUGCC
2 GCCCUUCGGGGUACUCGAGUGGCGAACGGGUGAGUAACACGUGGGUGAUCUGCC
2 GCCCUUCGGGGUACUCGAGUGGCGAACGGGUGAGUAACACGUGGGUGAUCUGCC
0 GCAGACGGGUGAGUAACGCGUGGGAACGUACCAUUUGCUACGGAAUAACUCAGG
0 GCAGACGGGUGAGUAACGCGUGGGAACGUACCAUUUGCUACGGAAUAACUCAGG
3 GCAGACGGGUGAGUAACAAAAAGGAACGUACCAUUUGCUACGGAAUAACUCAGG
I want to produce output that looks something like this:
Group0 = 0
Group1 = 1.2
Group2 = 2.1
Average = 1.1
This output would give me the group number followed by the greatest difference for that group, and then the overall average of those greatest differences across all groups (again skipping the groups with only one string associated with them).
My real file has about 5000 groups, and the strings I'm comparing are ~400 characters long.
I think I could start solving this by looking at this Question, but I'm not sure how to calculate percent differences only for strings in the same group, skip groups with only one string, and calculate the overall average percent difference across all the groups. Any help would be greatly appreciated; thank you very much for any ideas!
EDIT: Here are a few truncated lines from the file I'm working with. The 'group' numbers range from 0 to ~6000. The string of letters is actually 426 characters long. The file format is [number][a space][string of letters][end-of-line character]:
7 UGGCGAACGGGUGAGUAAC
35 GUGGGGAUUAGUGGCGAAC
50 AAACGAGAUGUAGCAAUAC
82 GGAGAGAGCUUGCUCUCUU
479 UCAGGAGCUUGCUCCUGU
46 CGAGGAGCUUGCUCCUUU
24 AACUGGGUCUAAUACCUU
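(For reference, lines in this format can be read into a mapping of group number to its list of strings with something like the sketch below; the filename is a placeholder.)
import collections

groups = collections.defaultdict(list)      # group number -> list of strings in that group
with open('sequences.txt') as f:
    for line in f:
        number, letters = line.split()      # [number][a space][string of letters]
        groups[int(number)].append(letters)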
You could also try to use difflib's SequenceMatcher from the standard library:
>>> import difflib
>>> from itertools import groupby, combinations
>>> def find_max_ratio(lines):
        lines = [row.split() for row in lines]              # the file should already break at each line break
        lines = [(int(row[0]), row[1]) for row in lines]
        lines = groupby(sorted(lines), lambda x: x[0])      # combine strings into their respective groups, sorting them first on int of first element
        group_max = dict()
        for group in lines:
            strings = list(group[1])                        # need to convert group[1] from iterator into list
            if len(strings) > 1:                            # if the number of strings is 1, then there is nothing to compare the string with in its group
                similarity = 1
                for line1, line2 in combinations(strings, 2):
                    s = difflib.SequenceMatcher(None, line1[1], line2[1])   # need to compare second element in each list and exclude the first element (which is the group number)
                    similarity = s.ratio() if s.ratio() < similarity else similarity
                group_max[line1[0]] = 1 - similarity        # gives difference ratio
        return group_max
>>> t = open('test.txt')
>>> print find_max_ratio(t) # it appears that your examples don't have any differences
{0: 0, 1: 0, 2: 0}
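To make the numbers concrete: ratio() returns a similarity between 0 and 1 (2 * matched characters / total length of both strings), so 1 - ratio() is what the function above records as the difference. A quick illustration on two made-up strings that differ only in their last character:
>>> s = difflib.SequenceMatcher(None, 'GCAGAC', 'GCAGAU')
>>> s.ratio()          # 2 * 5 matching characters / 12 total characters
0.8333333333333334
>>> 1 - s.ratio()      # the corresponding difference ratio
0.16666666666666663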
You can then calculate the average as follows:
>>> max_ratios = find_max_ratio(open('test.txt'))   # reopen the file; the earlier call already consumed it
>>> average = sum(max_ratios.values())/float(len(max_ratios))
>>> average
0.0 # there are no differences in your test data above
EDIT: Writing to a file
>>> output = sorted(max_ratios.items(), key=lambda x: x[1], reverse=True)   # sorting by descending ratios
>>> with open('test2.txt', 'w') as f:                                       # a new file name
        f.write('\n'.join([str(group) + ': ' + str(ratio) for group, ratio in output])
                + '\n\nAverage: ' + str(average))
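With the sample data above (where the strings within each group are identical), test2.txt would end up containing something like this (the order of the all-zero ratios is arbitrary):
0: 0
1: 0
2: 0

Average: 0.0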
EDIT 2: Adding minimum difference
You can add the minimum difference to your result (here in the form of a tuple (<max_difference>, <min_difference>)) like this:
def find_maxmin_ratios(lines):
    lines = [row.split() for row in lines]              # the file should already break at each line break
    lines = [(int(row[0]), row[1]) for row in lines]
    lines = groupby(sorted(lines), lambda x: x[0])      # combine strings into their respective groups, sorting them first on int of first element
    group_minmax = dict()
    for index, group in lines:
        strings = list(group)                           # need to convert group from iterator into list
        if len(strings) > 1:                            # if the number of strings is 1, then there is nothing to compare the string with in its group
            max_similarity = 1
            min_similarity = 0
            for line1, line2 in combinations(strings, 2):
                s = difflib.SequenceMatcher(None, line1[1], line2[1])   # need to compare second element in each list and exclude the first element (which is the group number)
                max_similarity = s.ratio() if s.ratio() < max_similarity else max_similarity
                min_similarity = s.ratio() if s.ratio() > min_similarity else min_similarity
            group_minmax[index] = (1 - max_similarity, 1 - min_similarity)   # gives max difference ratio and then min difference ratio
    return group_minmax
Then you can find the respective averages like this:
>>> t = open('test.txt')
>>> maxmin_ratios = find_maxmin_ratios(t)
>>> maxmin_ratios
{0: (0, 0.0), 1: (0, 0.0), 2: (0, 0.0)} # again, no differences in your test data
>>> average_max = sum([maxmin[0] for maxmin in maxmin_ratios.values()])/float(len(maxmin_ratios))
>>> average_min = sum([maxmin[1] for maxmin in maxmin_ratios.values()])/float(len(maxmin_ratios))
>>> average_max, average_min
(0.0, 0.0) # no differences in your test data
EDIT 3: Optimization concerns
Finally, in light of your last comment, I'm not sure if you will be able to optimize this function too much in its present form. If your computer can't handle it, you may need to process smaller chunks of text and then compile the results at the end. difflib
doesn't require huge amounts of memory, but it DOES do a LOT of work. Your performance SHOULD be a lot better than mine (depending on your machine) because every line of mine was random. If your lines are more similar than dissimilar, you should do a lot better. Here are the results of cProfile on my machine for the following scenario (3.172 hours total):
text2.txt
- 9700 lines of text
- each line begins with one random number (1 to 10)
- each line has 400 random characters that follow the random number # if your data is not random, you should do CONSIDERABLY better than this
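For reference, a comparable random test file could be generated with something like the following sketch (not the script actually used; the filename and the uppercase alphabet are assumptions):
import random
import string

with open('text2.txt', 'w') as f:
    for _ in range(9700):                       # 9700 lines of text
        group = random.randint(1, 10)           # one random number, 1 to 10
        letters = ''.join(random.choice(string.ascii_uppercase) for _ in range(400))   # 400 random characters
        f.write('%d %s\n' % (group, letters))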
Note that the majority of the cumtime (the total time for a given function and all functions below it) was spent in difflib, which is outside of your control with the present function. In fact, the rest of the function takes very little time at all.
4581938093 function calls in 11422.852 seconds
Ordered by: tottime # the total time spent in a given function, excluding time spent in subfunctions
ncalls tottime percall cumtime percall filename:lineno(function)
81770876 8579.568 0 9919.636 0 difflib.py:350(find_longest_match)
-724102230 1268.238 0 1268.238 0 {method 'get' of 'dict' objects}
4700900 874.878 0 1143.419 0 difflib.py:306(__chain_b)
9401960 160.366 0 10183.511 0.001 difflib.py:460(get_matching_blocks)
2060343126 141.242 0 141.242 0 {method 'append' of 'list' objects}
1889761800 110.013 0 110.013 0 {method 'setdefault' of 'dict' objects}
81770876 32.433 0 55.41 0 <string>:8(__new__)
130877001 32.061 0 32.061 0 {built-in method __new__ of type object at 0x1E228030}
81770876 29.773 0 29.773 0 {method 'pop' of 'list' objects}
1 23.259 23.259 11422.852 11422.852 <pyshell#50>:1(find_maxmin_ratios)
49106125 21.45 0 33.218 0 <string>:12(_make)
9401960 20.539 0 10239.234 0.001 difflib.py:636(ratio)
335752019 17.719 0 17.719 0 {len}
9401960 17.607 0 30.829 0 {_functools.reduce}
4700900 16.778 0 49.996 0 {map}
230344786 16.42 0 16.42 0 {method '__contains__' of 'set' objects}
191093877 14.962 0 14.962 0 {method 'add' of 'set' objects}
98214517 13.222 0 13.222 0 difflib.py:658(<lambda>)
4700900 6.428 0 6.428 0 {method 'sort' of 'list' objects}
4700900 5.794 0 5.794 0 {method 'items' of 'dict' objects}
4700900 5.339 0 1148.758 0 difflib.py:261(set_seq2)
4700900 4.333 0 1160.351 0 difflib.py:154(__init__)
4700900 3.83 0 1156.018 0 difflib.py:223(set_seqs)
4700900 3.43 0 3.43 0 difflib.py:235(set_seq1)
9401960 3.162 0 3.162 0 difflib.py:41(_calculate_ratio)
9700 0.003 0 0.003 0 {method 'strip' of 'str' objects}
1 0.003 0.003 0.003 0.003 {sorted}
9700 0.001 0 0.001 0 <pyshell#50>:3(<lambda>)
1 0 0 11422.852 11422.852 <string>:1(<module>)
1 0 0 0 0 {method 'disable' of '_lsprof.Profiler' objects}
If your machine can handle it, I would just run this function and be prepared to wait two or three hours. A LOT is happening here in order to compare these strings character-by-character.
seq_file = open("sequences.txt", 'r')

# make a dict of groups; each group is a list of the sequences in that group
groups = {}
for item in seq_file.readlines():
    (group, sequence) = item.split()
    try:
        groups[group].append(sequence)
    except KeyError:
        groups[group] = [sequence]

# measure the distance from every seq in a group to every other seq in that group,
# keeping a record of the maximum found in each group. (It doesn't matter that we
# compare a sequence to itself during this process.)
max_distances = {}
for group_num, group_seqs in groups.iteritems():
    greatest_distance = 0
    for seq in group_seqs:
        for other_seq in group_seqs:
            greatest_distance = max(greatest_distance, levenshtein_distance(seq, other_seq))
    max_distances[group_num] = greatest_distance
    print "max for group %s is %s" % (group_num, greatest_distance)

# average maximum distance, across the groups
max_distance_list = max_distances.values()
av_max_dist = sum(max_distance_list) / float(len(max_distance_list))
... the link you provided shows how to do levenshtein_distance().
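In case that link ever goes stale, a standard dynamic-programming version of levenshtein_distance() looks roughly like this (one of several common variants, not necessarily the exact one from the linked answer):
def levenshtein_distance(s1, s2):
    # classic Wagner-Fischer edit distance, keeping only two rows in memory
    if len(s1) < len(s2):
        s1, s2 = s2, s1                       # make s1 the longer string
    previous = range(len(s2) + 1)             # distances from the empty prefix of s1
    for i, c1 in enumerate(s1):
        current = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous[j + 1] + 1
            deletions = current[j] + 1
            substitutions = previous[j] + (c1 != c2)
            current.append(min(insertions, deletions, substitutions))
        previous = current
    return previous[-1]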