I have a text column that looks like:
http://start.blabla.com/landing/fb603?&mkw...
I want to extract "start.blabla.com" which is always between:
http://
and:
/landing/
namely:
start.blabla.com
I do:
df.col.str.extract('http://*?\/landing')
But it doesn't work. What am I doing wrong?
Your regex matches http:/, then 0+ / symbols (as few as possible), and then /landing.
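To see that behaviour concretely, here is a small sketch using only Python's built-in re module on the sample URL from the question. The second test string, "http:///landing", is a made-up input chosen only because it contains nothing but slashes between "http:/" and "/landing".

```python
import re

# Sample URL copied from the question (the trailing "..." is part of the sample).
url = "http://start.blabla.com/landing/fb603?&mkw..."

# The pattern from the question: "http:/", then lazily 0+ "/", then "/landing".
pattern = r"http://*?\/landing"

# No match: after "http://" comes "start.blabla.com", which is not a slash.
broken = re.search(pattern, url)

# Made-up input with only slashes in between: this one does match.
slashes_only = re.search(pattern, "http:///landing")

print(broken)                 # None
print(slashes_only.group(0))  # http:///landing
```

Note also that the pattern has no capture group, which str.extract requires.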
You need to match and capture one or more characters other than / after http:// (the extract method accepts a regular expression with at least one capture group). It can be done with
http://([^/]+)/landing
       ^^^^^^^
where [^/]+ is a negated character class that matches 1+ occurrences of characters other than /.
See the regex demo
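Applied to a DataFrame, this could look like the sketch below. The one-row frame is an assumption built from the sample URL in the question, and expand=False is used here only so that the single capture group comes back as a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical one-row frame; the column name "col" follows the question.
df = pd.DataFrame({"col": ["http://start.blabla.com/landing/fb603?&mkw..."]})

# One capture group: everything after "http://" up to the next "/".
# A raw string avoids any ambiguity about backslash escapes in the pattern.
host = df["col"].str.extract(r"http://([^/]+)/landing", expand=False)

print(host.tolist())  # ['start.blabla.com']
```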
Just to answer a question you didn't ask, if you wanted to extract several portions of the string into separate columns, you'd do it this way:
df.col.str.extract('http://(?P<Site>.*?)/landing/(?P<RestUrl>.*)')
You'd get something along the lines of:
               Site        RestUrl
0  start.blabla.com  fb603?&mkw...
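A self-contained version of the named-group approach might look like this. The second row is an invented URL without /landing/, added only to show that rows that do not match come back as NaN in every extracted column.

```python
import pandas as pd

# First row is the sample from the question; the second row is a made-up
# non-matching URL used to illustrate the NaN result.
df = pd.DataFrame({
    "col": [
        "http://start.blabla.com/landing/fb603?&mkw...",
        "http://other.example.org/home",
    ]
})

# Each named group (?P<name>...) becomes a column with that name.
parts = df["col"].str.extract(r"http://(?P<Site>.*?)/landing/(?P<RestUrl>.*)")

print(parts)
```

With named groups the column names come from the pattern itself, so no separate rename step is needed.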
To understand how this regex (or any other regex, for that matter) is constructed, I suggest you take a look at the excellent site regex101. I constructed a snippet there where you can see the above regex in action.