
How to understand a snippet of regex

I am attempting to understand what this snippet of code does:

passwd1=re.sub(r'^.*? --', ' -- ', line)
password=passwd1[4:]

I understand that the top line uses regex to remove the " -- ", and I think the bottom line removes something as well. I came back to this code after a while and need to improve it, but to do that I need to understand it again. I've been trying to read the regex docs to no avail. What is the r'^.*? at the beginning of the regex?

CobraCabbe asked Oct 21 '25 06:10


2 Answers

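To break r'^.*? --' into pieces:

  • The r prefix makes this a raw string literal: Python leaves backslashes in it untouched instead of treating them as escape sequences. It is not a regex feature as such, but it is the convention for patterns because it avoids doubling every backslash. This particular pattern contains no backslashes, so the r changes nothing here.
  • The ^ anchors the match to the beginning of the string.
  • .*? matches any characters (other than a newline), as few as possible, up to...
  • ' --', a literal space followed by two hyphens.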
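To see what the r prefix does on its own, here is a minimal sketch. The variable names are made up for illustration and are not part of the original code.

```python
# Without the r prefix, "\n" is a single newline character.
plain = "\n"
# With the r prefix the backslash is kept: two characters, a backslash and an n.
raw = r"\n"

plain_length = len(plain)
raw_length = len(raw)

# The pattern from the question contains no backslashes,
# so the raw and non-raw spellings are the same string.
same_pattern = (r'^.*? --' == '^.*? --')

print(plain_length, raw_length, same_pattern)  # 1 2 True
```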

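The sum of this is that it will match everything from the beginning of the line up to and including the first ' --'. Since it is re.sub(), the matched part of the string is replaced with the second argument, ' -- ' (space, two hyphens, space).

This is why something like 'Google -- MyPassword' becomes ' --  MyPassword': the matched 'Google --' is swapped for ' -- ', and the space that originally followed the hyphens stays where it was.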
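The substitution can be checked directly in an interpreter. A minimal sketch using the example line above (the names pattern and matched_text are illustrative only):

```python
import re

line = "Google -- MyPassword"
pattern = r'^.*? --'

# What the pattern matches on its own.
matched_text = re.match(pattern, line).group(0)

# What re.sub() leaves behind after swapping the match for ' -- '.
passwd1 = re.sub(pattern, ' -- ', line)

print(repr(matched_text))  # 'Google --'
print(repr(passwd1))       # ' --  MyPassword' (two spaces before MyPassword)
```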

The second line is a simple string slice, dropping the first four characters of the string, which are exactly the ' -- ' that was just substituted in. That makes the round trip superfluous: you could substitute the match with an empty string instead, like this:

passwd1 = re.sub(r'^.* --', '', line)

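For a line that contains ' --' exactly once, this gives the same result. Note that I've dropped the ?, which changes .* from lazy to greedy; that makes no difference when there is only one ' --' to find. The two versions do differ when the line contains no ' --' at all: the original still drops the first four characters, while this version returns the line unchanged.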
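A quick way to compare the original two-step version with the single substitution, assuming a line that contains ' --' only once (two_step and one_step are illustrative names):

```python
import re

line = "Google -- MyPassword"

# Original two-step version from the question.
passwd1 = re.sub(r'^.*? --', ' -- ', line)
two_step = passwd1[4:]

# Single substitution with an empty replacement.
one_step = re.sub(r'^.* --', '', line)

print(repr(two_step))  # ' MyPassword'
print(repr(one_step))  # ' MyPassword'
```

Both results keep the space that followed the hyphens; calling .strip() on the result would remove it, if that is what you want.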

On its own, ? matches zero or one of the previous item. In .*?, however, it comes directly after another quantifier (the *), and there it has a different job: it makes that quantifier lazy. The * matches zero or more of the previous item, in this case ., which is 'any character except a newline'. .* is what is known as a greedy quantifier, and .*? a lazy quantifier. That is, the greedy quantifier will match as much as possible, and the lazy one as little as possible. The difference between ^.*? -- and ^.* -- is what is matched in this case:

Something something -- mypassword -- yourpassword

In the greedy case, everything up to and including the last ' --' ('Something something -- mypassword --') is matched and deleted. In the lazy case, only 'Something something --' is deleted. Most passwords don't include spaces, never mind ' -- ', so you probably want to use the greedy version.
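The difference is easy to confirm by running both patterns against that example line. A small sketch (the names lazy and greedy are just labels for the two results):

```python
import re

line = "Something something -- mypassword -- yourpassword"

# Lazy: stops at the first ' --'.
lazy = re.sub(r'^.*? --', '', line)
# Greedy: runs on to the last ' --'.
greedy = re.sub(r'^.* --', '', line)

print(repr(lazy))    # ' mypassword -- yourpassword'
print(repr(greedy))  # ' yourpassword'
```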

Nathaniel Ford answered Oct 23 '25 19:10


You can use a site like regex101 to input your regular expression and get some analysis of it. It will tell you whether your regular expression matches some test cases, and also explain what each character in the regular expression means. In this case it matches everything up to and including the first instance of ' --' in your string, and replaces it with the characters ' -- '.

The second line is slicing the string. It takes a substring, skipping over the first four characters and then continuing to the end of the string.

Effectively, given a string which has ' --' somewhere in it, this pair of lines will take everything after the first occurrence of that substring. However, if that substring is not found in line, then you will simply be discarding the first four characters. If line has fewer than four characters you will not get an error: slicing past the end of a string just returns an empty string.
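These three cases can be tried out with a small wrapper. This is only a sketch: extract is a hypothetical helper that bundles the two lines from the question, and the sample inputs are invented.

```python
import re

def extract(line):
    """Hypothetical wrapper around the two lines from the question."""
    passwd1 = re.sub(r'^.*? --', ' -- ', line)
    return passwd1[4:]

# Delimiter present: everything after the first ' --' is kept.
with_delimiter = extract("Google -- MyPassword")
# Delimiter absent: re.sub() changes nothing, the slice drops 'NoDe'.
without_delimiter = extract("NoDelimiterHere")
# Fewer than four characters: the slice is empty, no exception is raised.
too_short = extract("abc")

print(repr(with_delimiter))     # ' MyPassword'
print(repr(without_delimiter))  # 'limiterHere'
print(repr(too_short))          # ''
```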

Dakeyras answered Oct 23 '25 21:10


