Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to match a string of a certain length with a regex

Tags:

python

regex

For a project of mine, I'm trying to implement a small part of the BitTorrent protocol, which can be found here. Specifically, I want to use the "Bencoding" part of it, which is a way to safely encode data for transfer over a socket. The format is as follows:

8:a string => "a string"
i1234e => 1234
l1:a1:be => ['a', 'b']
d1:a1:b3:one3:twoe => {'a':'b', 'one':two}

The encoding part was easy enough, but decoding is become quite a hassle. For example, if I have a list of strings, I have no way to separate them into individual strings. I've tried several different solutions, including PyParsing and a custom token parser. I'm currently attempting to use regexes, and it seems to be going fairly well, but I'm still hung up on the string problem. My current regex is:

(?P<length>\d+):(?P<contents>.{\1})

However, i can't seem to use the first group as the length of the second group. Is there any good way to do this? Or am I approaching this all wrong, and the answer is sitting right in front of me?

like image 808
Cassie Meharry Avatar asked Jun 02 '09 05:06

Cassie Meharry


People also ask

How do you define the length of a string in regex?

To check the length of a string, a simple approach is to test against a regular expression that starts at the very beginning with a ^ and includes every character until the end by finishing with a $.

How do I match a range in regex?

With regex you have a couple of options to match a digit. You can use a number from 0 to 9 to match a single choice. Or you can match a range of digits with a character group e.g. [4-9]. If the character group allows any digit (i.e. [0-9]), it can be replaced with a shorthand (\d).

How do you match a string to a pattern?

Put brackets ( [ ] ) in the pattern string, and inside the brackets put the list of characters. Do not separate the characters with commas or any other separator. Any single character in the list makes a successful match.


2 Answers

Any parser you use for this is going to need to be stateful (i.e. remember stuff), and regexes are, by and large, not stateful. They're the wrong tool for this job.

If those are the only data types you have to worry about, I think I'd just write custom parsers for each data type, passing control to the appropriate parser after reading the first character.

I'd actually implement one now, but it's late.

Alright I decided to write an implementation:

from StringIO import StringIO
import string

inputs = ["10:a stringly",
         "i1234e" ,
         "l1:a1:be",
         "d1:a1:b3:one3:twoe"]

# Constants
DICT_TYPE = 'd'
LIST_TYPE = 'l'
INT_TYPE  = 'i'
TOKEN_EOF = ''
TOKEN_END = 'e'
COLON     = ':'


class BadTypeIndicatorException(Exception):pass


def read_int(stream):

   s = ""

   while True:
      ch = stream.read(1)
      if ch not in [TOKEN_EOF, TOKEN_END, COLON]:
         s += ch
      else:
         break

   return s


def tokenize(stream):

   s = ""

   while True:

      ch = stream.read(1)

      if ch == TOKEN_END or ch == TOKEN_EOF:
         return 

      if ch == COLON:
         length = int(s)
         yield stream.read(length)
         s = ""

      else:
         s += ch


def parse(stream):

   TYPE = stream.read(1)

   if TYPE in string.digits:
      length = int( TYPE + read_int(stream) )
      return stream.read(length)

   elif TYPE is INT_TYPE: 
      return int( read_int(stream) )

   elif TYPE is LIST_TYPE: 
      return list(tokenize(stream))

   elif TYPE is DICT_TYPE:
      tokens = list(tokenize(stream))
      return dict(zip(tokens[0::2], tokens[1::2]))

   else: 
      raise BadTypeIndicatorException



for input in inputs:
   stream = StringIO(input)
   print parse(stream)
like image 77
Kenan Banks Avatar answered Oct 21 '22 07:10

Kenan Banks


You can do it if you parse the string twice. Apply the first regex to get the length. Concatenate the length in your second regex to form a valid expression.

Not sure how that can be done in python, but a sample in C# would be:

string regex = "^[A-Za-z0-9_]{1," + length + "}$"

To match 1 to length no of chars which can be alpanumeric or _ where length is determined from a previous regex that retrieves only the length.

Hope this helps :)

like image 26
SO User Avatar answered Oct 21 '22 07:10

SO User