
compressing a huge set of similar strings

I am looking at a Python script that maintains a huge set of strings. The strings are actually full paths to log files from regression test runs. The number of files is so large that the script starts hitting the virtual address space limit.

Strings like these have little entropy, so I think compressing them would work nicely. The problem is that the compression libraries I have considered (like the ones discussed at Python - Compress Ascii String) seem to compress each string individually, storing all the information required for decompression alongside it. Common sense suggests it would be much more space-efficient to compress the whole set of strings into a single global blob/dictionary and refer to individual strings using some sort of short references.

Here is some (pseudo) code to clarify the idea:

class TestResult:
    def __init__(self):
        self._Id          = None
        ....
        self.__appLogList = []
        return
....
    def setAppLogList(self, currentAppLogList):
        self.__appLogList = currentAppLogList
        # what I want to do here instead is
        # for each s in currentAppLogList:
        #     add s to a global compressed blob and
        #     store the short reference in __appLogList

    def getAppLogList(self):
        return self.__appLogList
        # what I want to do here instead is
        # currentAppLogList = []
        # for each ref in __appLogList:
        #     decompress ref into string s and append s to currentAppLogList
        # return currentAppLogList

# for each test
# add new TestResult to result map
# setAppLogList

# for selected tests
# getAppLogList and access logs
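
To illustrate the reference mechanics (though not the compression itself), here is a minimal sketch of the kind of shared table I have in mind. StringTable and its methods are made-up names, and this version only deduplicates the strings rather than compressing them:

class StringTable(object):
    """ Hypothetical global table: each unique string is stored once
        and referred to by a small integer index. """
    def __init__(self):
        self._strings = []   # index -> string
        self._indexes = {}   # string -> index

    def add(self, s):
        """ Store s (if new) and return its short reference. """
        if s not in self._indexes:
            self._indexes[s] = len(self._strings)
            self._strings.append(s)
        return self._indexes[s]

    def get(self, ref):
        """ Recover the original string from a reference. """
        return self._strings[ref]

table = StringTable()
refs = [table.add(s) for s in ['tests/first/set1.log',
                               'tests/first/set2.log']]
print [table.get(r) for r in refs]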

Is this possible to do using existing publicly available Python libraries?


1 Answer

This is a variation or extension of my previous answer. The primary difference is that it can "compress" the strings potentially much more by taking advantage of the fact that they're actually file paths with possibly one or more common sub-components. This redundancy is eliminated by building a tree data structure from the paths, so that at any given level each unique component value is only stored once. The tree data structure itself is implemented as what is essentially a dictionary of dictionaries, and its creation ought to be relatively fast due to its use of the built-in reduce() function.

The updated version below now has the property that, unlike the previous one, what it creates can be pickled, which means it can be saved to a file and read back intact later.

Like my other answer, this isn't doing what you call "real" compression. However, I think something like that might be overkill for what you're trying to accomplish, since this approach would solve your problem, is relatively quick, and would be much simpler to maintain, enhance, and extend later.

import collections
import cPickle as pickle  # use C version for performance
import operator
import os

class DefaultDict(collections.defaultdict):
    """  pickle-able defaultdict """
    def __reduce__(self):
        args = (self.default_factory,) if self.default_factory else tuple()
        return type(self), args, None, None, self.iteritems()

def Tree(): return DefaultDict(Tree)
class Leaf(object): pass

def print_tree(d, level=0, indent=' '*4):
    """ Tree structure pretty printer """
    if not d:
        print indent * level + '<empty>'
    else:
        for key, value in sorted(d.iteritems()):
            print indent * level + str(key)
            if isinstance(value, dict):
                print_tree(value, level+1, indent)
            elif not isinstance(value, Leaf):
                print indent * (level+1) + str(value)

# create Tree structure from paths
tree = Tree()
LEAF = Leaf()
with open('log_file_paths.txt') as inf:
    for line in inf:
        path = line.strip().split(os.sep)
        # walk/create the directory levels, then mark the file name as a leaf
        reduce(operator.getitem, path[:-1], tree)[path[-1]] = LEAF
print_tree(tree)

# save tree to a file
with open('file_tree.pk', 'wb') as outf:
    pickle.dump(tree, outf, pickle.HIGHEST_PROTOCOL)

# try reading it back in
del tree
with open('file_tree.pk', 'rb') as inf:
    tree = pickle.load(inf)
print_tree(tree)  # display reconstituted tree
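
If the original path strings are ever needed back (the "decompression" direction), a small recursive generator can walk the tree and rejoin the components. This is a minimal sketch, not part of the code above; it assumes the tree and Leaf class shown:

def iter_paths(d, prefix=()):
    """ Yield the full path strings stored in a tree, depth-first. """
    for key, value in sorted(d.iteritems()):
        if isinstance(value, dict):     # interior node: recurse deeper
            for path in iter_paths(value, prefix + (key,)):
                yield path
        else:                           # Leaf marker: key is a file name
            yield os.sep.join(prefix + (key,))

for path in iter_paths(tree):
    print path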

So if the input file consisted of these file paths:

tests/first/set1.log
tests/first/set2.log
tests/first/subfirst/set3.log
tests/first/subfirst/set4.log
tests/second/set5.log
tests/second/subsecond/set6.log
tests/second/subsecond/subsubsecond/set7.log
tests/third/set8.log

The following tree structure is what gets created in memory and displayed (and later written, read back, and redisplayed) to hold them:

tests
    first
        set1.log
        set2.log
        subfirst
            set3.log
            set4.log
    second
        set5.log
        subsecond
            set6.log
            subsubsecond
                set7.log
    third
        set8.log
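
Testing whether a particular path was stored is also cheap, since it's just a chain of dictionary lookups. A minimal sketch (has_path is a hypothetical helper, not part of the code above):

def has_path(tree, path_string):
    """ Return True if path_string is a file stored in the tree. """
    node = tree
    for part in path_string.split(os.sep):
        if not isinstance(node, dict) or part not in node:
            return False                # plain membership test, so no
        node = node[part]               # nodes get auto-created here
    return isinstance(node, Leaf)

print has_path(tree, 'tests/first/set1.log')     # True
print has_path(tree, 'tests/first/nope.log')     # False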