Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to compare clusters?

Hopefully this can be done with python! I used two clustering programs on the same data and now have a cluster file from both. I reformatted the files so that they look like this:

Cluster 0:
Brucellaceae(10)
    Brucella(10)
        abortus(1)
        canis(1)
        ceti(1)
        inopinata(1)
        melitensis(1)
        microti(1)
        neotomae(1)
        ovis(1)
        pinnipedialis(1)
        suis(1)
Cluster 1:
    Streptomycetaceae(28)
        Streptomyces(28)
            achromogenes(1)
            albaduncus(1)
            anthocyanicus(1)

etc.

These files contain bacterial species info. So I have the cluster number (Cluster 0), then right below it 'family' (Brucellaceae) and the number of bacteria in that family (10). Under that is the genera found in that family (name followed by number, Brucella(10)) and finally the species in each genera (abortus(1), etc.).

My question: I have 2 files formatted in this way and want to write a program that will look for differences between the two. The only problem is that the two programs cluster in different ways, so two cluster may be the same, even if the actual "Cluster Number" is different (so the contents of Cluster 1 in one file might match Cluster 43 in the other file, the only different being the actual cluster number). So I need something to ignore the cluster number and focus on the cluster contents.

Is there any way I could compare these 2 files to examine the differences? Is it even possible? Any ideas would be greatly appreciated!

like image 871
Jen Avatar asked Jul 25 '13 19:07

Jen


2 Answers

Given:

file1 = '''Cluster 0:
 giant(2)
  red(2)
   brick(1)
   apple(1)
Cluster 1:
 tiny(3)
  green(1)
   dot(1)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')
file2 = '''Cluster 18:
 giant(2)
  red(2)
   brick(1)
   tomato(1)
Cluster 19:
 tiny(2)
  blue(2)
   flower(1)
   candy(1)'''.split('\n')

Is this what you need?

def parse_file(open_file):
    result = []

    for line in open_file:
        indent_level = len(line) - len(line.lstrip())
        if indent_level == 0:
            levels = ['','','']
        item = line.lstrip().split('(', 1)[0]
        levels[indent_level - 1] = item
        if indent_level == 3:
            result.append('.'.join(levels))
    return result

data1 = set(parse_file(file1))
data2 = set(parse_file(file2))

differences = [
    ('common elements', data1 & data2),
    ('missing from file2', data1 - data2),
    ('missing from file1', data2 - data1) ]

To see the differences:

for desc, items in differences:
    print desc
    print 
    for item in items:
        print '\t' + item
    print

prints

common elements

    giant.red.brick
    tiny.blue.candy
    tiny.blue.flower

missing from file2

    tiny.green.dot
    giant.red.apple

missing from file1

    giant.red.tomato
like image 90
Steven Rumbalski Avatar answered Sep 30 '22 18:09

Steven Rumbalski


So just for help, as I see lots of different answers in the comment, I'll give you a very, very simple implementation of a script that you can start from.

Note that this does not answer your full question but points you in one of the directions in the comments.

Normally if you have no experience I'd argue to go a head and read up on Python (which i'll do anyways, and i'll throw in a few links in the bottom of the answer)

On to the fun stuffs! :)

class Cluster(object):
  '''
  This is a class that will contain your information about the Clusters.
  '''
  def __init__(self, number):
    '''
    This is what some languages call a constructor, but it's not.
    This method initializes the properties with values from the method call.
    '''
    self.cluster_number = number
    self.family_name = None
    self.bacteria_name = None
    self.bacteria = []

#This part below isn't a part of the class, this is the actual script.
with open('bacteria.txt', 'r') as file:
  cluster = None
  clusters = []
  for index, line in enumerate(file):
    if line.startswith('Cluster'):
      cluster = Cluster(index)
      clusters.append(cluster)
    else:
      if not cluster.family_name:
        cluster.family_name = line
      elif not cluster.bacteria_name:
        cluster.bacteria_name = line
      else:
        cluster.bacteria.append(line)

I wrote this as dumb and overly simple as I could without any fancy stuff and for Python 2.7.2 You could copy this file into a .py file and run it directly from command line python bacteria.py for example.

Hope this helps a bit and don't hesitate to come by our Python chat room if you have any questions! :)

  • http://learnpythonthehardway.org/
  • http://www.diveintopython.net/
  • http://docs.python.org/2/tutorial/inputoutput.html
  • check if all elements in a list are identical
  • Retaining order while using Python's set difference
like image 44
Henrik Andersson Avatar answered Sep 30 '22 16:09

Henrik Andersson