I am looking at a Python script that maintains a huge set of strings. The strings are full paths to log files produced by regression test runs. The number of files is so large that the script starts hitting the virtual address space limit.
Strings like these have little entropy, so I think compressing them would work nicely. The problem is that the compression libraries I considered (like the ones discussed at Python - Compress Ascii String) seem to compress each string individually, storing all the information required for decompression alongside it. Common sense suggests it would be much more efficient space-wise to compress the whole set of strings into a single global blob/dictionary and refer to individual strings using some sort of short references.
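For what it's worth, zlib supports exactly this idea: a preset dictionary shared by every compressed string, passed as the `zdict` argument of `zlib.compressobj()` (available since Python 3.3). A rough sketch, where the dictionary is simply a few made-up sample paths joined together (in practice you would build it from representative real paths):

```python
import zlib

# Hypothetical sample paths standing in for the real log file paths.
paths = [
    "tests/first/set1.log",
    "tests/first/set2.log",
    "tests/second/subsecond/set6.log",
]

# Shared preset dictionary: common substrings that the compressor and
# decompressor agree on up front (here, just the sample paths joined).
zdict = "".join(paths).encode()

def compress(s):
    # zdict is the sixth positional argument of zlib.compressobj()
    c = zlib.compressobj(9, zlib.DEFLATED, 15, 9,
                         zlib.Z_DEFAULT_STRATEGY, zdict)
    return c.compress(s.encode()) + c.flush()

def decompress(blob):
    d = zlib.decompressobj(15, zdict)
    return (d.decompress(blob) + d.flush()).decode()

blob = compress("tests/first/set1.log")
```

Each per-string blob stays small because the redundancy is carried once in `zdict` rather than repeated in every string's compressed output.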
Here is some (pseudo) code to clarify the idea:
class TestResult:
    def __init__(self):
        self._Id = None
        ....
        self.__appLogList = []
        return
    ....
    def setAppLogList(self, currentAppLogList):
        self.__appLogList = currentAppLogList
        # what I want to do here instead is
        # for each s in currentAppLogList
        #     add s to global compressed blob and
        #     store reference in __appLogList
    def getAppLogList(self):
        return self.__appLogList
        # what I want to do here instead is
        # currentAppLogList = []
        # for each ref in __appLogList
        #     decompress ref into string s and add s to currentAppLogList
        # return currentAppLogList

# for each test
#     add new TestResult to result map
#     setAppLogList
# for selected tests
#     getAppLogList and access logs
Is this possible to do using existing publicly available Python libraries?
This is a variation or extension of my previous answer. The primary difference is that it can "compress" the strings potentially much more by taking advantage of the fact that they're actually file paths with one or more common sub-components. This redundancy is eliminated by building a tree data structure from the paths, meaning that at any given level each unique component value is stored only once. The tree data structure itself is implemented with what is essentially a dictionary of dictionaries. Its creation ought to be relatively fast due to its use of the built-in reduce() function.
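The reduce() call at the heart of the tree construction is worth a moment: each step indexes one level deeper into the nested defaultdicts, creating any missing level on demand, and the final path component is then assigned at the deepest level. A minimal illustration in Python 3 syntax (where reduce lives in functools):

```python
import operator
from collections import defaultdict
from functools import reduce  # a built-in in Python 2, functools in Python 3

def Tree():
    return defaultdict(Tree)

tree = Tree()
path = ['tests', 'first', 'set1.log']

# Walk (and auto-create) the directory levels, then attach the file name:
reduce(operator.getitem, path[:-1], tree)[path[-1]] = None

# The line above is equivalent to: tree['tests']['first']['set1.log'] = None
```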
The updated version below has the property that, unlike the previous one, what's created can be pickled, which means it can be saved to a file and read back intact later.
Like my other answer, this isn't doing what you call "real" compression; however, I think something like that might be overkill for what you're trying to accomplish, since this approach solves your problem, is relatively quick, and would be much simpler to maintain, enhance, and extend later.
import collections
import cPickle as pickle  # use C version for performance
import operator
import os

class DefaultDict(collections.defaultdict):
    """ pickle-able defaultdict """
    def __reduce__(self):
        args = (self.default_factory,) if self.default_factory else tuple()
        return type(self), args, None, None, self.iteritems()

def Tree(): return DefaultDict(Tree)

class Leaf(object): pass

def print_tree(d, level=0, indent=' '*4):
    """ Tree structure pretty printer """
    if not d:
        print indent * level + '<empty>'
    else:
        for key, value in sorted(d.iteritems()):
            print indent * level + str(key)
            if isinstance(value, dict):
                print_tree(value, level+1, indent)
            elif not isinstance(value, Leaf):
                print indent * (level+1) + str(value)

# create Tree structure from paths
tree = Tree()
LEAF = Leaf()
with open('log_file_paths.txt') as inf:
    for line in inf:
        path = line.strip().split(os.sep)
        reduce(operator.getitem, path[:-1], tree)[path[-1]] = LEAF

print_tree(tree)

# save tree to a file
with open('file_tree.pk', 'wb') as outf:
    pickle.dump(tree, outf, pickle.HIGHEST_PROTOCOL)

# try reading it back in
del tree
with open('file_tree.pk', 'rb') as inf:
    tree = pickle.load(inf)

print_tree(tree)  # display reconstituted tree
So if the input file consisted of these file paths:
tests/first/set1.log
tests/first/set2.log
tests/first/subfirst/set3.log
tests/first/subfirst/set4.log
tests/second/set5.log
tests/second/subsecond/set6.log
tests/second/subsecond/subsubsecond/set7.log
tests/third/set8.log
Then the following tree structure is created in memory to hold them, and is what gets displayed (and later written to a file, read back, and redisplayed):
tests
    first
        set1.log
        set2.log
        subfirst
            set3.log
            set4.log
    second
        set5.log
        subsecond
            set6.log
            subsubsecond
                set7.log
    third
        set8.log
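If it helps, the original paths can also be regenerated from the tree with a small recursive generator. This isn't part of the answer's code, just a sketch in Python 3 syntax, with None standing in for the Leaf sentinel:

```python
import os
from collections import defaultdict

def Tree():
    return defaultdict(Tree)

def iter_paths(d, prefix=()):
    """Depth-first walk of the nested-dict tree, yielding full paths."""
    for key, value in sorted(d.items()):
        if isinstance(value, dict):          # an interior (directory) node
            for path in iter_paths(value, prefix + (key,)):
                yield path
        else:                                # a leaf marker: key is a file name
            yield os.sep.join(prefix + (key,))

# Rebuild a tiny tree by hand and walk it back out:
tree = Tree()
tree['tests']['first']['set1.log'] = None
tree['tests']['third']['set8.log'] = None
```

Since defaultdict is a dict subclass, the isinstance(value, dict) test distinguishes subtrees from leaf markers, mirroring the same check in print_tree() above.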