
Storing data for reuse across function calls in Python

Tags: python

I have a project in which I run multiple data sets through a specific function that "cleans" them.

The cleaning function, in Misc.py, looks like this:

import sys

def clean(my_data):
    sys.stdout.write("Cleaning genes...\n")

    # FileIO is the custom file-reading class from lib/FileIO.py.
    synonyms = FileIO("raw_data/input_data", 3, header=False).openSynonyms()
    clean_data = {}

    # Loop over a snapshot of the keys, because entries are deleted inside the loop.
    for g in list(my_data):
        if g in synonyms:
            # Found a data point which appears in the synonym list.
            for synonym in synonyms[g]:
                if synonym in my_data:
                    del my_data[synonym]
                    clean_data[g] = synonym
                    sys.stdout.write("\t%s is also known as %s\n" % (g, clean_data[g]))
    return my_data

FileIO is a custom class I made to open files.

This function will be called many times throughout the program's life cycle. I want to avoid reading input_data on every call, since its contents are the same every time. I know that I can return the synonyms and pass them back in as an argument, like this:

def clean(my_data, synonyms=None):
    if synonyms is None:
        ...
    else:
        ...

But is there another, better-looking way of doing this?

My file structure is the following:

lib
    Misc.py
    FileIO.py
    __init__.py
    ...
raw_data
runme.py

In runme.py, I run from lib import * and then call the functions I made.

Is there a Pythonic way to do this, something like a 'memory' for the function?

Edit: the line synonyms = FileIO("raw_data/input_data", 3, header=False).openSynonyms() reads input_data and returns a collections.OrderedDict() that uses the 3rd column as the dictionary key.

The dictionary for the following dataset:

column1    column2    key    data
  ...        ...      A      B|E|Z
  ...        ...      B      F|W
  ...        ...      C      G|P
  ...

Will look like this:

OrderedDict([('A',['B','E','Z']), ('B',['F','W']), ('C',['G','P'])])

This tells my script that A is also known as B, E and Z, that B is also known as F and W, and so on.
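The mapping described above can be sketched roughly as follows. openSynonyms itself is not shown in the question, so this is only an assumption about what it does: parse_synonyms, its key_column parameter, and the whitespace-separated sample rows are hypothetical stand-ins.

```python
from collections import OrderedDict


def parse_synonyms(lines, key_column=3):
    """Build an ordered key -> list-of-synonyms mapping.

    Hypothetical stand-in for FileIO(...).openSynonyms(). Assumes
    whitespace-separated rows where the synonyms sit in the column
    right after the key and are separated by '|'.
    """
    synonyms = OrderedDict()
    for line in lines:
        fields = line.split()
        if len(fields) <= key_column:
            continue  # skip blank or incomplete rows
        key = fields[key_column - 1]  # key_column is 1-based
        synonyms[key] = fields[key_column].split("|")
    return synonyms


# Made-up rows mirroring the table in the question.
sample_rows = [
    "x1 y1 A B|E|Z",
    "x2 y2 B F|W",
    "x3 y3 C G|P",
]
synonyms = parse_synonyms(sample_rows)
print(synonyms)
```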

So these are the synonyms. Since the synonym list never changes during the life of the program, I want to read it only once and reuse it.

Pavlos Panteliadis asked Feb 05 '23 06:02


2 Answers

Use a class that defines the __call__ method. Objects of such a class can be called like functions, and they can store data between calls. Data that never changes is probably best loaded once by the constructor. An object built this way is known as a 'functor' or 'callable object'.

Example:

class Incrementer:
    def __init__(self, increment):
        self.increment = increment

    def __call__(self, number):
        return self.increment + number

incrementerBy1 = Incrementer(1)
incrementerBy2 = Incrementer(2)

print(incrementerBy1(3))
print(incrementerBy2(3))

Output:

4
5

[EDIT]

Note that you can combine the answer of @Tagc with my answer to create exactly what you're looking for: a 'function' with built-in memory.

Name your class Clean rather than DataCleaner, and name the instance clean. Name the method __call__ rather than clean.
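A minimal sketch of that combination might look like this. It is not the asker's actual code: the load_synonyms argument and fake_loader are hypothetical, used here so the example runs without the custom FileIO class. In the real project the loader would wrap FileIO(...).openSynonyms().

```python
import sys


class Clean(object):
    """Callable object that loads the synonyms once and reuses them."""

    def __init__(self, load_synonyms):
        # load_synonyms is a hypothetical zero-argument callable that
        # returns the synonym dictionary; it runs exactly once, here.
        self.synonyms = load_synonyms()

    def __call__(self, data):
        # Loop over a snapshot of the keys, since entries get deleted.
        for g in list(data):
            for synonym in self.synonyms.get(g, []):
                if synonym in data:
                    del data[synonym]
                    sys.stdout.write("\t%s is also known as %s\n" % (g, synonym))
        return data


load_count = [0]


def fake_loader():
    # Stand-in for reading raw_data/input_data; counts how often it runs.
    load_count[0] += 1
    return {"A": ["B", "E", "Z"], "B": ["F", "W"], "C": ["G", "P"]}


clean = Clean(fake_loader)  # the synonyms are read here, once
first = clean({"A": 1, "B": 2, "Q": 3})
second = clean({"C": 1, "G": 2})
```

The instance is used exactly like the original function, but every call after construction reuses self.synonyms instead of re-reading the file.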

Jacques de Hooge answered Feb 08 '23 17:02


Like a 'memory' for the function

You're halfway to rediscovering object-oriented programming.

Encapsulate the data cleaning logic in a class, such as DataCleaner. Make it so that instances read synonym data once when instantiated and then retain that information as part of their state. Have the class expose a clean method that operates on the data:

class FileIO(object):
    def __init__(self, file_path, some_num, header):
        pass

    def openSynonyms(self):
        return []

class DataCleaner(object):
    def __init__(self, synonym_file):
        self.synonyms = FileIO(synonym_file, 3, header=False).openSynonyms()

    def clean(self, data):
        for g in data:
            if g in self.synonyms:
                # ...
                pass

if __name__ == '__main__':
    dataCleaner = DataCleaner('raw_data/input_file')
    dataCleaner.clean('some data here')
    dataCleaner.clean('some more data here')

As a possible future optimisation, you can expand on this approach to use a factory method to create instances of DataCleaner which can cache instances based on the synonym file provided (so you don't need to do expensive recomputation every time for the same file).
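One way such a caching factory could look is sketched below. The for_file class method, the loader argument, and the file name raw_data/other_file are hypothetical and exist only for illustration. The loader stands in for FileIO(...).openSynonyms() so the example runs on its own.

```python
class DataCleaner(object):
    # Cache shared by all calls: synonym file path -> DataCleaner instance.
    _instances = {}

    def __init__(self, synonym_file, loader):
        self.synonym_file = synonym_file
        self.synonyms = loader(synonym_file)  # the expensive read happens here

    @classmethod
    def for_file(cls, synonym_file, loader):
        """Return the cached cleaner for this file, creating it on first use."""
        if synonym_file not in cls._instances:
            cls._instances[synonym_file] = cls(synonym_file, loader)
        return cls._instances[synonym_file]


loaded_paths = []


def fake_loader(path):
    # Hypothetical loader that records every path it is asked to read.
    loaded_paths.append(path)
    return {"A": ["B", "E", "Z"]}


a = DataCleaner.for_file("raw_data/input_file", fake_loader)
b = DataCleaner.for_file("raw_data/input_file", fake_loader)  # served from cache
c = DataCleaner.for_file("raw_data/other_file", fake_loader)
```

Here a and b are the same object, so the first file is read only once. The cache is keyed by the path string, so two different spellings of the same path would be loaded separately.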

Tagc answered Feb 08 '23 15:02