I'm trying to write a script that extracts strings from an executable binary and saves them to a file. Making this file newline-separated isn't an option, since the strings themselves could contain newlines. For the same reason, the Unix "strings" utility isn't an option either: it prints all the strings newline-separated, so there's no way to tell from its output which strings contain newlines. I was therefore hoping to find a Python function or library that implements the same functionality as "strings" but gives me those strings as variables, so that I can avoid the newline issue.
Thanks!
Here's a generator that yields all the strings of printable characters >= min (4 by default) in length that it finds in filename:
    import string

    def strings(filename, min=4):
        with open(filename, errors="ignore") as f:  # Python 3.x
        # with open(filename, "rb") as f:           # Python 2.x
            result = ""
            for c in f.read():
                if c in string.printable:
                    result += c
                    continue
                if len(result) >= min:
                    yield result
                result = ""
            if len(result) >= min:  # catch result at EOF
                yield result
Which you can iterate over:

    for s in strings("something.bin"):
        print(repr(s))  # do something with s

... or store in a list:

    sl = list(strings("something.bin"))
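Since the question is about saving the strings without relying on newlines as separators, one option is to write the list out as JSON, which escapes any embedded newlines. This is only a sketch: the helper names save_strings and load_strings are made up for illustration, and JSON is an assumption on my part rather than something the question asks for.

```python
import json

def save_strings(found, out_path):
    """Write an iterable of strings to out_path as a JSON array."""
    with open(out_path, "w", encoding="utf-8") as out:
        # json.dump escapes newlines and other control characters,
        # so every string stays unambiguous in the output file.
        json.dump(list(found), out)

def load_strings(in_path):
    """Read back the list written by save_strings."""
    with open(in_path, encoding="utf-8") as src:
        return json.load(src)
```

Any iterable of strings can be passed as the first argument, including the generator above.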
I've tested this very briefly, and it seems to give the same output as the Unix strings command for the arbitrary binary file I chose. However, it's pretty naïve (for a start, it reads the whole file into memory at once, which might be expensive for large files), and is very unlikely to approach the performance of the Unix strings command.
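To address the memory concern, the same idea can be applied to fixed-size chunks read in binary mode. The following is a rough sketch, not a tuned implementation: the name strings_chunked and the chunk_size and min_len parameters are my own, and the printable set is assumed to be the ASCII characters in string.printable.

```python
import string

# Byte values of the characters in string.printable (ASCII only).
PRINTABLE = frozenset(string.printable.encode("ascii"))

def strings_chunked(filename, min_len=4, chunk_size=64 * 1024):
    """Yield runs of printable ASCII at least min_len long, as str."""
    run = bytearray()  # current run; survives across chunk boundaries
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            for b in chunk:  # iterating bytes yields ints in Python 3
                if b in PRINTABLE:
                    run.append(b)
                    continue
                if len(run) >= min_len:
                    yield run.decode("ascii")
                run.clear()
    if len(run) >= min_len:  # catch a run that ends at EOF
        yield run.decode("ascii")
```

Only one chunk is held in memory at a time, plus the run currently being built, so a string that spans two chunks is still returned whole.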
To quote man strings:

    STRINGS(1)              GNU Development Tools              STRINGS(1)

    NAME
           strings - print the strings of printable characters in files.

    [...]

    DESCRIPTION
           For each file given, GNU strings prints the printable character
           sequences that are at least 4 characters long (or the number
           given with the options below) and are followed by an unprintable
           character. By default, it only prints the strings from the
           initialized and loaded sections of object files; for other types
           of files, it prints the strings from the whole file.
You could achieve a similar result by using a regex matching at least 4 printable characters. Something like this:
>>> import re
>>> content = "hello,\x02World\x88!"
>>> re.findall("[^\x00-\x1F\x7F-\xFF]{4,}", content)
['hello,', 'World']
Please note that this solution requires the entire file content to be loaded into memory.
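One way to soften that limitation is to memory-map the file and run a bytes pattern over the mapping, letting the operating system page the data in as needed. This is a sketch under a few assumptions: the function name strings_mmap is hypothetical, the pattern is simply the bytes form of the character class above, and the file must be non-empty because mmap refuses to map an empty file.

```python
import mmap
import re

# Bytes version of the character class used above: at least 4 bytes
# that are neither control characters nor outside 7-bit ASCII.
PATTERN = re.compile(rb"[^\x00-\x1F\x7F-\xFF]{4,}")

def strings_mmap(filename, pattern=PATTERN):
    """Yield every match of pattern in the file, decoded as ASCII."""
    with open(filename, "rb") as f:
        # Length 0 maps the whole file; raises ValueError if it is empty.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for match in pattern.finditer(mm):
                yield match.group().decode("ascii")
```

Because this class excludes newlines and tabs, strings containing them are split at those characters; pass a different compiled bytes pattern if they should be kept together.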