For a project of mine, I'm trying to implement a small part of the BitTorrent protocol, which can be found here. Specifically, I want to use the "Bencoding" part of it, which is a way to safely encode data for transfer over a socket. The format is as follows:
8:a string => "a string"
i1234e => 1234
l1:a1:be => ['a', 'b']
d1:a1:b3:one3:twoe => {'a':'b', 'one':two}
The encoding part was easy enough, but decoding is become quite a hassle. For example, if I have a list of strings, I have no way to separate them into individual strings. I've tried several different solutions, including PyParsing and a custom token parser. I'm currently attempting to use regexes, and it seems to be going fairly well, but I'm still hung up on the string problem. My current regex is:
(?P<length>\d+):(?P<contents>.{\1})
However, i can't seem to use the first group as the length of the second group. Is there any good way to do this? Or am I approaching this all wrong, and the answer is sitting right in front of me?
To check the length of a string, a simple approach is to test against a regular expression that starts at the very beginning with a ^ and includes every character until the end by finishing with a $.
With regex you have a couple of options to match a digit. You can use a number from 0 to 9 to match a single choice. Or you can match a range of digits with a character group e.g. [4-9]. If the character group allows any digit (i.e. [0-9]), it can be replaced with a shorthand (\d).
Put brackets ( [ ] ) in the pattern string, and inside the brackets put the list of characters. Do not separate the characters with commas or any other separator. Any single character in the list makes a successful match.
Any parser you use for this is going to need to be stateful (i.e. remember stuff), and regexes are, by and large, not stateful. They're the wrong tool for this job.
If those are the only data types you have to worry about, I think I'd just write custom parsers for each data type, passing control to the appropriate parser after reading the first character.
I'd actually implement one now, but it's late.
Alright I decided to write an implementation:
from StringIO import StringIO
import string
inputs = ["10:a stringly",
"i1234e" ,
"l1:a1:be",
"d1:a1:b3:one3:twoe"]
# Constants
DICT_TYPE = 'd'
LIST_TYPE = 'l'
INT_TYPE = 'i'
TOKEN_EOF = ''
TOKEN_END = 'e'
COLON = ':'
class BadTypeIndicatorException(Exception):pass
def read_int(stream):
s = ""
while True:
ch = stream.read(1)
if ch not in [TOKEN_EOF, TOKEN_END, COLON]:
s += ch
else:
break
return s
def tokenize(stream):
s = ""
while True:
ch = stream.read(1)
if ch == TOKEN_END or ch == TOKEN_EOF:
return
if ch == COLON:
length = int(s)
yield stream.read(length)
s = ""
else:
s += ch
def parse(stream):
TYPE = stream.read(1)
if TYPE in string.digits:
length = int( TYPE + read_int(stream) )
return stream.read(length)
elif TYPE is INT_TYPE:
return int( read_int(stream) )
elif TYPE is LIST_TYPE:
return list(tokenize(stream))
elif TYPE is DICT_TYPE:
tokens = list(tokenize(stream))
return dict(zip(tokens[0::2], tokens[1::2]))
else:
raise BadTypeIndicatorException
for input in inputs:
stream = StringIO(input)
print parse(stream)
You can do it if you parse the string twice. Apply the first regex to get the length. Concatenate the length in your second regex to form a valid expression.
Not sure how that can be done in python, but a sample in C# would be:
string regex = "^[A-Za-z0-9_]{1," + length + "}$"
To match 1 to length no of chars which can be alpanumeric or _ where length is determined from a previous regex that retrieves only the length.
Hope this helps :)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With