 

Extract string between two strings in pandas

I have a text column that looks like:

http://start.blabla.com/landing/fb603?&mkw...

I want to extract "start.blabla.com" which is always between:

http://

and:

/landing/

namely:

start.blabla.com

I do:

df.col.str.extract('http://*?\/landing')

But it doesn't work. What am I doing wrong?

chopin_is_the_best asked Dec 23 '22 22:12


2 Answers

Your regex matches http:/, followed by zero or more / characters (as few as possible), followed by /landing, so nothing in it can match the host name.
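As a quick check of that reading, here is a small sketch using only Python's re module. The sample URL is the truncated one from the question, and the second test string is a made-up input chosen only to show what the pattern does accept.

```python
import re

# Truncated sample value copied from the question (the "..." is part of the sample).
url = "http://start.blabla.com/landing/fb603?&mkw..."

# The pattern from the question: "http:/", then lazily zero or more "/",
# then "/landing". There is nothing here that can consume the host name.
original = r"http://*?\/landing"

no_match = re.search(original, url)
print(no_match)  # None

# Hypothetical input: only slashes between "http:" and "landing".
slashes_only = re.search(original, "http:///landing")
print(slashes_only.group(0))  # http:///landing
```

Note also that the pattern has no capture group, and str.extract raises a ValueError for a pattern without one.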

You need to match and capture the characters after http:// that are not /, one or more times (the extract method requires a regular expression with at least one capture group). This can be done with

http://([^/]+)/landing
       ^^^^^^^

where [^/]+ is a negated character class that matches 1+ occurrences of characters other than /.
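A minimal sketch of how this pattern could be applied with pandas, assuming a DataFrame df with a text column named col as in the question. The one-row frame below is built only for illustration from the truncated sample URL.

```python
import pandas as pd

# Illustrative one-row frame; the URL is the truncated sample from the question.
df = pd.DataFrame({"col": ["http://start.blabla.com/landing/fb603?&mkw..."]})

# The single capture group ([^/]+) grabs everything after "http://"
# up to the next "/", which must be followed by "landing".
# With one group, expand=False returns a Series instead of a one-column DataFrame.
host = df["col"].str.extract(r"http://([^/]+)/landing", expand=False)

print(host.tolist())
```

Leaving expand at its default would give a DataFrame with a single column named 0 instead, which is handy if more groups are added later.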

See the regex demo

Wiktor Stribiżew answered Jan 12 '23 04:01


Just to answer a question you didn't ask, if you wanted to extract several portions of the string into separate columns, you'd do it this way:

df.col.str.extract('http://(?P<Site>.*?)/landing/(?P<RestUrl>.*)')

You'd get something along the lines of:

               Site        RestUrl
0  start.blabla.com  fb603?&mkw...
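For completeness, here is a small runnable sketch of the named-group version. The first row is the truncated sample from the question; the second row is a hypothetical URL added only to show what happens when a row does not match.

```python
import pandas as pd

df = pd.DataFrame({
    "col": [
        "http://start.blabla.com/landing/fb603?&mkw...",  # sample from the question
        "https://other.example.org/about",                # hypothetical non-matching row
    ]
})

# Each named group (?P<name>...) becomes a column with that name.
parts = df["col"].str.extract(r"http://(?P<Site>.*?)/landing/(?P<RestUrl>.*)")

print(parts)
```

Rows that do not match the pattern come back as missing values in every extracted column, so it may be worth checking parts.isna() before relying on the result.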

To understand how this regex (or any other regex, for that matter) is constructed, I suggest you take a look at the excellent site regex101. I constructed a snippet where you can see the above regex in action.

Julien Marrec answered Jan 12 '23 02:01