Extract string between two strings in pandas

Question

I have a text column that looks like:

http://start.blabla.com/landing/fb603?&mkw...

I want to extract "start.blabla.com" which is always between:

http://

and:

/landing/

namely:

start.blabla.com

I do:

df.col.str.extract('http://*?\/landing')

But it doesn't work. What am I doing wrong?

Wiktor Stribiżew · Accepted Answer

Your regex matches http:/, then zero or more / symbols (as few as possible), and then /landing. It never consumes the host name, and it contains no capture group.
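A quick way to check this reading of the pattern is to run it with the standard re module outside pandas. This is only a sketch: the sample URL is the truncated one from the question, and the second test string is made up for illustration.

```python
import re

# The pattern from the question: "http:/", then "/*?" (zero or more
# slashes, lazily), then "\/landing". Nothing here can consume "start.blabla.com".
broken = re.compile(r'http://*?\/landing')

url = "http://start.blabla.com/landing/fb603?&mkw..."  # sample from the question
no_match = broken.search(url)

# Hypothetical input, only to show what the pattern does match:
# "/landing" has to follow the slashes directly.
only_match = broken.search("http://landing")

print(no_match)    # None
print(only_match)  # a match object covering "http://landing"
```

So the pattern can only succeed when /landing comes immediately after the slashes, which is never the case for these URLs.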

You need to match and capture the characters after http:// other than /, one or more times. The capture group is required because the extract method only accepts a regular expression with at least one capture group. It can be done with

http://([^/]+)/landing
       ^^^^^^^

where [^/]+ is a negated character class that matches 1+ occurrences of characters other than /.
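Applied to a DataFrame, it could look like the sketch below. The frame, the second row without a URL, and the new column name site are assumptions made up for the example. expand=False is used so that a single capture group comes back as a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical data: the first row is the sample URL from the question,
# the second row is added to show what happens when nothing matches.
df = pd.DataFrame({
    "col": [
        "http://start.blabla.com/landing/fb603?&mkw...",
        "not a url",
    ]
})

# One capture group: everything after "http://" up to the next "/",
# which must be followed by "landing".
df["site"] = df["col"].str.extract(r"http://([^/]+)/landing", expand=False)

print(df["site"].tolist())
```

Rows that do not match the pattern end up as NaN in the new column.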


Julien Marrec · Answer

Just to answer a question you didn't ask, if you wanted to extract several portions of the string into separate columns, you'd do it this way:

df.col.str.extract('http://(?P<Site>.*?)/landing/(?P<RestUrl>.*)')

You'd get something along the lines of:

               Site        RestUrl
0  start.blabla.com  fb603?&mkw...

To understand how this regex (and any other regex, for that matter) is constructed, I suggest you take a look at the excellent site regex101. I constructed a snippet on regex101 where you can see the above regex in action.
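For completeness, here is a self-contained sketch of the named-group version. The one-row frame is built from the sample URL in the question, and the variable names parts and result are made up. With named groups, extract returns a DataFrame whose columns take the group names, so it can be joined back onto the original frame.

```python
import pandas as pd

# Hypothetical one-row frame using the sample URL from the question.
df = pd.DataFrame({"col": ["http://start.blabla.com/landing/fb603?&mkw..."]})

# Named groups become the column names of the returned DataFrame.
parts = df["col"].str.extract(r"http://(?P<Site>.*?)/landing/(?P<RestUrl>.*)")

# Attach the extracted columns next to the original column.
result = df.join(parts)

print(parts)
print(list(result.columns))
```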

Tags:

python

regex

pandas

chopin_is_the_best
