I have a project where I am given a file and I need to extract the strings from it. Basically, think of the "strings" command in Linux, but I'm doing this in Python. The next condition is that the file is given to me as a stream (e.g. a string), so the obvious answer of using one of the subprocess functions to run strings isn't an option either.
I wrote this code:
def isStringChar(ch):
    if ord(ch) >= ord('a') and ord(ch) <= ord('z'): return True
    if ord(ch) >= ord('A') and ord(ch) <= ord('Z'): return True
    if ord(ch) >= ord('0') and ord(ch) <= ord('9'): return True
    if ch in ['/', '-', ':', '.', ',', '_', '$', '%', '\'', '(', ')', '[', ']', '<', '>', ' ']: return True
    # default out
    return False

def process(stream):
    dwStreamLen = len(stream)
    if dwStreamLen < 4: return None
    dwIndex = 0
    strString = ''
    for ch in stream:
        if isStringChar(ch) == False:
            if len(strString) > 4:
                #print strString
                strString = ''
        else:
            strString += ch
This technically works, but is WAY too slow. For instance, I was able to run the strings command on a 500 MB executable and it produced 300k worth of strings in less than 1 second. I ran the same file through the above code and it took 16 minutes.
Is there a library out there that will let me do this without the burden of Python's overhead?
Thanks!
One note on reading the data in the first place: the open() function opens a file in text mode by default. To open a file in binary mode, add 'b' to the mode parameter, so "rb" opens the file in binary mode for reading and "wb" opens it in binary mode for writing. Unlike text files, binary files are not human-readable; they can be anything from image files like JPEGs or GIFs, to audio files like MP3s, to binary document formats like Word or PDF. For an executable you want the raw bytes, so open it with "rb".
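As a minimal sketch (the filename is just a placeholder), that looks like:

# Minimal sketch: open the target file in binary mode so read() returns
# raw bytes. "some_executable" is a placeholder filename.
with open("some_executable", "rb") as f:
    data = f.read()       # raw bytes, not decoded text
    print(len(data))      # number of bytes read

The file object f can also be passed directly to the process() functions shown in the answers below, since they only call read() on it.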
Of similar speed to David Wolever's answer, using re, Python's regular expression library. The short story of optimisation is that the less code you write, the faster it is: a library function that loops is often implemented in C and will be faster than you can hope to be. The same goes for char in set() being faster than checking each character yourself. Python is the opposite of C in that respect.
import sys
import re

# Characters that count as "string" characters, and the minimum run length.
chars = r"A-Za-z0-9/\-:.,_$%'()[\]<> "
shortest_run = 4

# Match runs of at least shortest_run string characters.
regexp = '[%s]{%d,}' % (chars, shortest_run)
pattern = re.compile(regexp)

def process(stream):
    data = stream.read()
    return pattern.findall(data)

if __name__ == "__main__":
    for found_str in process(sys.stdin):
        print found_str
Working in 4k chunks would be clever, but is a bit trickier to get right on edge cases with re (where, say, two characters of a string sit at the end of one 4k block and the next two sit at the start of the next block). One way of handling that boundary is sketched below.
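A rough sketch of that chunked approach, assuming Python 3 and a stream opened in binary mode; the character class and minimum run length mirror the code above, while process_chunked and the 4096-byte chunk size are my own choices, not part of the original answer:

import re

# Sketch only: same character class and minimum run length as above,
# expressed as bytes patterns for a binary stream under Python 3.
chars = rb"A-Za-z0-9/\-:.,_$%'()[\]<> "
shortest_run = 4
pattern = re.compile(rb"[%s]{%d,}" % (chars, shortest_run))
trailing = re.compile(rb"[%s]*\Z" % chars)   # printable run touching the chunk end

def process_chunked(stream, chunk_size=4096):
    carry = b""                              # unfinished run from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = carry + chunk
        # Hold back any run of string characters that touches the end of
        # this chunk, since it may continue into the next chunk.
        tail = trailing.search(data).group()
        if tail:
            data = data[:-len(tail)]
        carry = tail                         # note: grows if the input is one huge printable run
        for found in pattern.findall(data):
            yield found
    # Flush whatever run was still open at end-of-stream.
    if len(carry) >= shortest_run:
        yield carry

Used as, say, for s in process_chunked(open(path, 'rb')): print(s.decode('ascii')) (path being a placeholder), this keeps memory roughly bounded by the chunk size plus whatever run is held over between chunks.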
At least one of your problems is that you're reading the entire stream into memory (… = len(stream)), and another is that your isStringChar function is very slow (function calls are relatively slow, and you're doing a lot of them).
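To get a feel for the cost of those per-character function calls, here is a small timeit comparison (my own sketch, not part of the answer; absolute numbers will vary by machine):

import string
import timeit

# Sketch: compare a per-character Python function call against a plain
# set membership test over the same sample data.
printable = set(string.printable)

def is_string_char(ch):
    # roughly the question's isStringChar(), condensed
    return ('a' <= ch <= 'z' or 'A' <= ch <= 'Z' or '0' <= ch <= '9'
            or ch in "/-:.,_$%'()[]<> ")

sample = "some sample text with\x00binary\x01junk\x02mixed in" * 1000

calls = timeit.timeit(lambda: [is_string_char(c) for c in sample], number=10)
sets = timeit.timeit(lambda: [c in printable for c in sample], number=10)
print("function calls: %.3fs, set membership: %.3fs" % (calls, sets))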
Better would be something like this:
import sys
import string

# Set membership is a fast C-level check, unlike a per-character function call.
printable = set(string.printable)

def process(stream):
    found_str = ""
    while True:
        # Read in 4k chunks instead of loading the whole stream into memory.
        data = stream.read(1024*4)
        if not data:
            break
        for char in data:
            if char in printable:
                found_str += char
            elif len(found_str) >= 4:
                yield found_str
                found_str = ""
            else:
                found_str = ""

if __name__ == "__main__":
    for found_str in process(sys.stdin):
        print found_str
This will be much faster because it reads the stream in 4k chunks instead of pulling the whole thing into memory, and because the char in printable set lookup runs in C rather than going through a Python function call for every character.
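To run it on the original file rather than stdin, something like the following should work (a sketch in the same Python 2 style as the answer above; the filename is a placeholder, and under Python 3 you would need to adapt the code to work on bytes):

# Sketch: feed process() a file opened in binary mode instead of stdin.
# "some_executable" is a placeholder filename.
with open("some_executable", "rb") as f:
    for found_str in process(f):
        print found_str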