
How to extract specific content in a pandas dataframe with a regex?


Consider the following pandas dataframe:

    In [114]: df['movie_title'].head()
    Out[114]:
    0     Toy Story (1995)
    1     GoldenEye (1995)
    2    Four Rooms (1995)
    3    Get Shorty (1995)
    4       Copycat (1995)
    ...
    Name: movie_title, dtype: object

Update: I would like to extract just the movie titles with a regular expression. I chose the regex \b([^\d\W]+)\b and tried the following:

    df_3['movie_title'] = df_3['movie_title'].str.extract('\b([^\d\W]+)\b')
    df_3['movie_title']

However, I get the following:

    0       NaN
    1       NaN
    2       NaN
    3       NaN
    4       NaN
    5       NaN
    6       NaN
    7       NaN
    8       NaN

Any idea how to extract specific features from text in a pandas dataframe? More specifically, how can I extract just the movie titles into a completely new dataframe? For instance, the desired output would be:

    Out[114]:
    0     Toy Story
    1     GoldenEye
    2    Four Rooms
    3    Get Shorty
    4       Copycat
    ...
    Name: movie_title, dtype: object

tumbleweed asked Mar 16 '16 07:03



1 Answer

You can try str.extract with str.strip, but str.split is the better choice, because movie titles can contain numbers too. Another solution is to remove the parenthesized content with str.replace and a regex, then strip leading and trailing whitespace:

    # convert the column to string
    df['movie_title'] = df['movie_title'].astype(str)

    # titles: matches letters and spaces only, so it drops numbers in movie names too
    df['titles'] = df['movie_title'].str.extract('([a-zA-Z ]+)', expand=False).str.strip()
    # titles1: keep everything before the first '('
    df['titles1'] = df['movie_title'].str.split('(', n=1).str[0].str.strip()
    # titles2: remove the parenthesized part (recent pandas needs regex=True here)
    df['titles2'] = df['movie_title'].str.replace(r'\([^)]*\)', '', regex=True).str.strip()
    print(df)

              movie_title      titles      titles1      titles2
    0  Toy Story 2 (1995)   Toy Story  Toy Story 2  Toy Story 2
    1    GoldenEye (1995)   GoldenEye    GoldenEye    GoldenEye
    2   Four Rooms (1995)  Four Rooms   Four Rooms   Four Rooms
    3   Get Shorty (1995)  Get Shorty   Get Shorty   Get Shorty
    4      Copycat (1995)     Copycat      Copycat      Copycat
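As a side note on why the attempt in the question returned only NaN: in an ordinary Python string literal, '\b' is the backspace character rather than the regex word-boundary token, so the pattern never matches. The sketch below is one possible way to show this and to extract the title with a raw-string pattern. The sample DataFrame and the pattern anchored on a trailing "(YYYY)" group are assumptions made for illustration and are not part of the answer above.

```python
import pandas as pd

# Sample data modeled on the question (an assumption for illustration);
# "Toy Story 2" shows a title that itself contains a digit.
df = pd.DataFrame({
    "movie_title": [
        "Toy Story 2 (1995)",
        "GoldenEye (1995)",
        "Four Rooms (1995)",
        "Get Shorty (1995)",
        "Copycat (1995)",
    ]
})

# In a normal string literal, "\b" is the backspace character (0x08),
# not the regex word boundary, so this pattern matches nothing.
assert "\b" == "\x08"
no_match = df["movie_title"].str.extract("\b([^\\d\\W]+)\b", expand=False)

# A raw string passes the backslashes through to the regex engine.
# Hypothetical pattern: capture everything before a trailing "(YYYY)".
titles = df["movie_title"].str.extract(r"^(.*?)\s*\(\d{4}\)\s*$", expand=False)

print(no_match.isna().all())  # every row is NaN, as in the question
print(titles.tolist())
```

With this pattern, rows that do not end in a four-digit year in parentheses come back as NaN, so it only fits data shaped like the sample. If some titles lack a year, the str.split approach above is more forgiving.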
jezrael answered Sep 27 '22 18:09