How to match a string of a certain length with a regex

Tags:

For a project of mine, I'm trying to implement a small part of the BitTorrent protocol, which can be found here. Specifically, I want to use the "Bencoding" part of it, which is a way to safely encode data for transfer over a socket. The format is as follows:

8:a string => "a string"
i1234e => 1234
l1:a1:be => ['a', 'b']
d1:a1:b3:one3:twoe => {'a':'b', 'one':two}

The encoding part was easy enough, but decoding is become quite a hassle. For example, if I have a list of strings, I have no way to separate them into individual strings. I've tried several different solutions, including PyParsing and a custom token parser. I'm currently attempting to use regexes, and it seems to be going fairly well, but I'm still hung up on the string problem. My current regex is:

(?P<length>\d+):(?P<contents>.{\1})

However, i can't seem to use the first group as the length of the second group. Is there any good way to do this? Or am I approaching this all wrong, and the answer is sitting right in front of me?

808

asked Jun 02 '09 05:06

Cassie Meharry

2 Answers

Any parser you use for this is going to need to be stateful (i.e. remember stuff), and regexes are, by and large, not stateful. They're the wrong tool for this job.

If those are the only data types you have to worry about, I think I'd just write custom parsers for each data type, passing control to the appropriate parser after reading the first character.

~~I'd actually implement one now, but it's late.~~

Alright I decided to write an implementation:

from StringIO import StringIO
import string

inputs = ["10:a stringly",
         "i1234e" ,
         "l1:a1:be",
         "d1:a1:b3:one3:twoe"]

# Constants
DICT_TYPE = 'd'
LIST_TYPE = 'l'
INT_TYPE  = 'i'
TOKEN_EOF = ''
TOKEN_END = 'e'
COLON     = ':'


class BadTypeIndicatorException(Exception):pass


def read_int(stream):

   s = ""

   while True:
      ch = stream.read(1)
      if ch not in [TOKEN_EOF, TOKEN_END, COLON]:
         s += ch
      else:
         break

   return s


def tokenize(stream):

   s = ""

   while True:

      ch = stream.read(1)

      if ch == TOKEN_END or ch == TOKEN_EOF:
         return 

      if ch == COLON:
         length = int(s)
         yield stream.read(length)
         s = ""

      else:
         s += ch


def parse(stream):

   TYPE = stream.read(1)

   if TYPE in string.digits:
      length = int( TYPE + read_int(stream) )
      return stream.read(length)

   elif TYPE is INT_TYPE: 
      return int( read_int(stream) )

   elif TYPE is LIST_TYPE: 
      return list(tokenize(stream))

   elif TYPE is DICT_TYPE:
      tokens = list(tokenize(stream))
      return dict(zip(tokens[0::2], tokens[1::2]))

   else: 
      raise BadTypeIndicatorException



for input in inputs:
   stream = StringIO(input)
   print parse(stream)

answered Oct 21 '22 07:10

Kenan Banks

You can do it if you parse the string twice. Apply the first regex to get the length. Concatenate the length in your second regex to form a valid expression.

Not sure how that can be done in python, but a sample in C# would be:

string regex = "^[A-Za-z0-9_]{1," + length + "}$"

To match 1 to length no of chars which can be alpanumeric or _ where length is determined from a previous regex that retrieves only the length.

Hope this helps :)

answered Oct 21 '22 07:10

SO User

Related questions
                            
                                Django Structlog is not printing or writing log message to console or file
                            
                                How to make pytest cases "runnable" in IntelliJ
                            
                                Plotly Log Scale in Subplot Python
                            
                                Best way to overwrite Azure Blob in Python
                            
                                Using IPython.display.audio to play audio in jupyter notebook not working when used inside a function
                            
                                matplotlib 'bar' vs 'barh' plots. Problem with barh
                            
                                SessionNotCreatedException: This version of ChromeDriver only supports Chrome version 84 using ChromeDriver and Chrome through Selenium and Python
                            
                                Outline text on image in Python
                            
                                Converting a simple Python requests POST to Rust reqwest
                            
                                Xcode update broke gcloud and gsutil
                            
                                Python, Error while installing matplotlib
                            
                                copy-webpack-plugin configuration issue
                            
                                ChatterBot error- OSError: [E941] Can't find model 'en'
                            
                                Removing *NEARLY* Duplicate Observations - Python
                            
                                I am able to read a file, but not able to see the result on Google Colab
                            
                                Why is this Python code faster than its equivalent Clojure code
                            
                                What is the meaning of '(?i)password' in python regular expression?
                            
                                Ignoring XML errors in Python
                            
                                Running unit tests with Nose inside a Python environment such as Autodesk Maya?
                            
                                How to send a session message to an anonymous user in a Django site?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to match a string of a certain length with a regex

Tags:

python

regex

Cassie Meharry

People also ask

2 Answers

Kenan Banks

SO User

Recent Activity

Donate For Us