I have a project in which I run multiple data sets through a specific function that "cleans" them.
The cleaning function, in Misc.py, looks like this:
    import sys

    def clean(my_data):
        sys.stdout.write("Cleaning genes...\n")
        synonyms = FileIO("raw_data/input_data", 3, header=False).openSynonyms()
        clean_data = {}
        # Iterate over a copy of the keys, because entries are deleted below.
        for g in list(my_data):
            if g in synonyms:
                # Found a data point which appears in the synonym list.
                for synonym in synonyms[g]:
                    if synonym in my_data:
                        del my_data[synonym]
                        clean_data[g] = synonym
                        sys.stdout.write("\t%s is also known as %s\n" % (g, clean_data[g]))
        return my_data
FileIO is a custom class I made to open files.
My question: this function will be called many times throughout the program's life cycle. What I want is to avoid reading input_data on every call, since it's going to be the same every time. I know that I can just return it and pass it back in as an argument, like this:
    def clean(my_data, synonyms=None):
        if synonyms is None:
            ...
        else:
            ...
But is there another, better-looking way of doing this?
My file structure is the following:
    lib/
        Misc.py
        FileIO.py
        __init__.py
        ...
    raw_data/
    runme.py
In runme.py, I run "from lib import *" and call all the functions I made.
Is there a Pythonic way to go about this? Like a 'memory' for the function?
Edit: the line synonyms = FileIO("raw_data/input_data", 3, header=False).openSynonyms() returns a collections.OrderedDict built from input_data, using the 3rd column as the key of the dictionary.
The dictionary for the following dataset:

    column1  column2  key  data
    ...      ...      A    B|E|Z
    ...      ...      B    F|W
    ...      ...      C    G|P
    ...

will look like this:

    OrderedDict([('A',['B','E','Z']), ('B',['F','W']), ('C',['G','P'])])
This tells my script that A is also known as B, E, Z; that B is also known as F, W; and so on.
So these are the synonyms. Since the synonym list will never change during the life of the program, I want to read it once and reuse it.
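For illustration only, the mapping above could be built along the lines of the sketch below. This is not the actual FileIO.openSynonyms implementation (which is not shown here); parse_synonyms is a hypothetical name, and the sketch assumes whitespace-separated columns with the synonyms in the column right after the key, separated by "|".

```python
from collections import OrderedDict


def parse_synonyms(lines, key_column=3):
    """Hypothetical stand-in for FileIO(...).openSynonyms()."""
    synonyms = OrderedDict()
    for line in lines:
        fields = line.split()
        if len(fields) <= key_column:
            continue  # skip blank or incomplete rows
        key = fields[key_column - 1]  # key_column is 1-based
        # Assumption: the synonyms sit in the column after the key.
        synonyms[key] = fields[key_column].split("|")
    return synonyms


# Rows copied from the example dataset above.
rows = [
    "... ... A B|E|Z",
    "... ... B F|W",
    "... ... C G|P",
    "...",
]
synonyms = parse_synonyms(rows)
print(synonyms["A"])
```

The point is only that the result is an ordinary dictionary that does not depend on the data being cleaned, so it can be built once and kept around.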
Use a class with a __call__ method. You can call objects of such a class like functions, and the object keeps its data between calls. Data that stays the same for every call is probably best loaded once by the constructor. An object made this way is known as a 'functor' or 'callable object'.
Example:
    class Incrementer:
        def __init__(self, increment):
            self.increment = increment

        def __call__(self, number):
            return self.increment + number

    incrementerBy1 = Incrementer(1)
    incrementerBy2 = Incrementer(2)

    print(incrementerBy1(3))
    print(incrementerBy2(3))
Output:
4
5
[EDIT]
Note that you can combine the answer of @Tagc with my answer to create exactly what you're looking for: a 'function' with built-in memory.
Name your class Clean rather than DataCleaner, and name the instance clean. Name the method __call__ rather than clean.
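Put together, that could look roughly like the sketch below. It is only an outline under a few assumptions: load_synonyms and fake_loader are hypothetical stand-ins for FileIO("raw_data/input_data", 3, header=False).openSynonyms, and the cleaning loop is a simplified version of the one in the question.

```python
class Clean:
    """Callable object that loads the synonyms once and reuses them."""

    def __init__(self, load_synonyms):
        # load_synonyms is a hypothetical zero-argument callable that
        # returns the synonym dictionary; it runs only here, once.
        self.synonyms = load_synonyms()

    def __call__(self, data):
        # Iterate over a copy of the keys, because entries are deleted below.
        for g in list(data):
            for synonym in self.synonyms.get(g, []):
                if synonym in data:
                    del data[synonym]
        return data


load_count = 0


def fake_loader():
    # Hypothetical loader that counts how often the "file" is read.
    global load_count
    load_count += 1
    return {"A": ["B", "E", "Z"], "B": ["F", "W"], "C": ["G", "P"]}


clean = Clean(fake_loader)  # the synonyms are read here, once

print(clean({"A": 1, "B": 2, "C": 3}))  # {'A': 1, 'C': 3}
print(clean({"C": 1, "P": 2}))          # {'C': 1}
print(load_count)                       # 1
```

The calling code still writes clean(my_data), just like with a plain function, but the synonym data is loaded only when the clean object is created.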
"Like a 'memory' for the function"

Halfway to rediscovering object-oriented programming.
Encapsulate the data cleaning logic in a class, such as DataCleaner. Make it so that instances read synonym data once when instantiated and then retain that information as part of their state. Have the class expose a clean method that operates on the data:
    class FileIO(object):
        # Placeholder for the asker's custom FileIO class.
        def __init__(self, file_path, some_num, header):
            pass

        def openSynonyms(self):
            return []


    class DataCleaner(object):
        def __init__(self, synonym_file):
            # Read the synonyms once, when the instance is created.
            self.synonyms = FileIO(synonym_file, 3, header=False).openSynonyms()

        def clean(self, data):
            for g in data:
                if g in self.synonyms:
                    # ...
                    pass


    if __name__ == '__main__':
        dataCleaner = DataCleaner('raw_data/input_file')
        dataCleaner.clean('some data here')
        dataCleaner.clean('some more data here')
As a possible future optimisation, you can expand on this approach to use a factory method to create instances of DataCleaner, which can cache instances based on the synonym file provided (so you don't need to do expensive recomputation every time for the same file).
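A minimal sketch of that factory idea follows. The names for_file, loader and fake_loader are made up for this example, and loader stands in for FileIO(synonym_file, 3, header=False).openSynonyms() so that the snippet runs on its own.

```python
class DataCleaner(object):
    # Cache: synonym file path -> DataCleaner instance.
    _instances = {}

    def __init__(self, synonym_file, loader):
        # loader is a hypothetical callable that reads the synonym file.
        self.synonyms = loader(synonym_file)

    @classmethod
    def for_file(cls, synonym_file, loader):
        """Factory method: build one instance per file, then reuse it."""
        if synonym_file not in cls._instances:
            cls._instances[synonym_file] = cls(synonym_file, loader)
        return cls._instances[synonym_file]


reads = []


def fake_loader(path):
    # Record each simulated file read.
    reads.append(path)
    return {"A": ["B", "E", "Z"]}


first = DataCleaner.for_file("raw_data/input_file", fake_loader)
second = DataCleaner.for_file("raw_data/input_file", fake_loader)

print(first is second)  # True
print(len(reads))       # 1
```

The cache is keyed on the path string only, so if the file can change on disk while the program runs, the cached instance would go stale; for the use case in the question that should not matter.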