I have a project where I am given a file and I need to extract the strings from it. Basically, think of the "strings" command in Linux, but I'm doing this in Python. The next condition is that the file is given to me as a stream (e.g. a string), so the obvious answer of using one of the subprocess functions to run strings isn't an option either.
I wrote this code:
def isStringChar(ch):
    if ord(ch) >= ord('a') and ord(ch) <= ord('z'): return True
    if ord(ch) >= ord('A') and ord(ch) <= ord('Z'): return True
    if ord(ch) >= ord('0') and ord(ch) <= ord('9'): return True
    if ch in ['/', '-', ':', '.', ',', '_', '$', '%', '\'', '(', ')', '[', ']', '<', '>', ' ']: return True
    # default out
    return False

def process(stream):
    dwStreamLen = len(stream)
    if dwStreamLen < 4: return None
    dwIndex = 0
    strString = ''
    for ch in stream:
        if isStringChar(ch) == False:
            if len(strString) > 4:
                #print strString
                strString = ''
        else:
            strString += ch
This technically works, but is WAY too slow. For instance, I was able to run the strings command on a 500 MB executable and it produced 300k worth of strings in less than 1 second. I ran the same file through the above code and it took 16 minutes.
Is there a library out there that will let me do this without the burden of Python's overhead?
Thanks!
One note on reading the data in the first place: the open() function opens a file in text mode by default. To open a file in binary mode, add 'b' to the mode parameter, so "rb" opens the file in binary mode for reading and "wb" opens it in binary mode for writing. Unlike text files, binary files are not human-readable; they can be anything from image files like JPEGs or GIFs, to audio files like MP3s, to binary document formats like Word or PDF. For an executable you want the raw bytes, so open it with "rb".
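As a minimal sketch (the filename is just a placeholder), that looks like:

# Minimal sketch: open the target file in binary mode so read() returns
# raw bytes. "some_executable" is a placeholder filename.
with open("some_executable", "rb") as f:
    data = f.read()       # raw bytes, not decoded text
    print(len(data))      # number of bytes read

The file object f can also be passed directly to the process() functions shown in the answers below, since they only call read() on it.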
Of similar speed to David Wolever's answer, using re, Python's regular expression library. The short story of optimisation is that the less code you write, the faster it is: a library function that loops is often implemented in C and will be faster than you can hope to be. The same goes for char in set() being faster than checking each character yourself. Python is the opposite of C in that respect.
import sys
import re

# Characters that count as "string" characters, and the minimum run length.
chars = r"A-Za-z0-9/\-:.,_$%'()[\]<> "
shortest_run = 4

# Match runs of at least shortest_run string characters.
regexp = '[%s]{%d,}' % (chars, shortest_run)
pattern = re.compile(regexp)

def process(stream):
    data = stream.read()
    return pattern.findall(data)

if __name__ == "__main__":
    for found_str in process(sys.stdin):
        print found_str
Working in 4k chunks would be clever, but is a bit trickier to get right on edge cases with re (where, say, two characters of a string sit at the end of one 4k block and the next two sit at the start of the next block). One way of handling that boundary is sketched below.
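A rough sketch of that chunked approach, assuming Python 3 and a stream opened in binary mode; the character class and minimum run length mirror the code above, while process_chunked and the 4096-byte chunk size are my own choices, not part of the original answer:

import re

# Sketch only: same character class and minimum run length as above,
# expressed as bytes patterns for a binary stream under Python 3.
chars = rb"A-Za-z0-9/\-:.,_$%'()[\]<> "
shortest_run = 4
pattern = re.compile(rb"[%s]{%d,}" % (chars, shortest_run))
trailing = re.compile(rb"[%s]*\Z" % chars)   # printable run touching the chunk end

def process_chunked(stream, chunk_size=4096):
    carry = b""                              # unfinished run from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = carry + chunk
        # Hold back any run of string characters that touches the end of
        # this chunk, since it may continue into the next chunk.
        tail = trailing.search(data).group()
        if tail:
            data = data[:-len(tail)]
        carry = tail                         # note: grows if the input is one huge printable run
        for found in pattern.findall(data):
            yield found
    # Flush whatever run was still open at end-of-stream.
    if len(carry) >= shortest_run:
        yield carry

Used as, say, for s in process_chunked(open(path, 'rb')): print(s.decode('ascii')) (path being a placeholder), this keeps memory roughly bounded by the chunk size plus whatever run is held over between chunks.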
At least one of your problems is that you're reading the entire stream into memory (… = len(stream)), and another is that your isStringChar function is very slow (function calls are relatively slow, and you're doing a lot of them).
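To get a feel for the cost of those per-character function calls, here is a small timeit comparison (my own sketch, not part of the answer; absolute numbers will vary by machine):

import string
import timeit

# Sketch: compare a per-character Python function call against a plain
# set membership test over the same sample data.
printable = set(string.printable)

def is_string_char(ch):
    # roughly the question's isStringChar(), condensed
    return ('a' <= ch <= 'z' or 'A' <= ch <= 'Z' or '0' <= ch <= '9'
            or ch in "/-:.,_$%'()[]<> ")

sample = "some sample text with\x00binary\x01junk\x02mixed in" * 1000

calls = timeit.timeit(lambda: [is_string_char(c) for c in sample], number=10)
sets = timeit.timeit(lambda: [c in printable for c in sample], number=10)
print("function calls: %.3fs, set membership: %.3fs" % (calls, sets))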
Better would be something like this:
import sys
import string

# Set membership is a fast C-level check, unlike a per-character function call.
printable = set(string.printable)

def process(stream):
    found_str = ""
    while True:
        # Read in 4k chunks instead of loading the whole stream into memory.
        data = stream.read(1024*4)
        if not data:
            break
        for char in data:
            if char in printable:
                found_str += char
            elif len(found_str) >= 4:
                yield found_str
                found_str = ""
            else:
                found_str = ""

if __name__ == "__main__":
    for found_str in process(sys.stdin):
        print found_str
This will be much faster because it reads the stream in 4k chunks instead of pulling the whole thing into memory, and because the char in printable set lookup runs in C rather than going through a Python function call for every character.
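To run it on the original file rather than stdin, something like the following should work (a sketch in the same Python 2 style as the answer above; the filename is a placeholder, and under Python 3 you would need to adapt the code to work on bytes):

# Sketch: feed process() a file opened in binary mode instead of stdin.
# "some_executable" is a placeholder filename.
with open("some_executable", "rb") as f:
    for found_str in process(f):
        print found_str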