Consider the following pandas dataframe:
    In [114]: df['movie_title'].head()
    Out[114]:
    0     Toy Story (1995)
    1     GoldenEye (1995)
    2    Four Rooms (1995)
    3    Get Shorty (1995)
    4       Copycat (1995)
    ...
    Name: movie_title, dtype: object
Update: I would like to use a regular expression to extract just the titles of the movies. So, let's use the following regex: \b([^\d\W]+)\b. I tried the following:
    df_3['movie_title'] = df_3['movie_title'].str.extract('\b([^\d\W]+)\b')
    df_3['movie_title']
However, I get the following:
    0    NaN
    1    NaN
    2    NaN
    3    NaN
    4    NaN
    5    NaN
    6    NaN
    7    NaN
    8    NaN
Any idea how to extract specific features from text in a pandas dataframe? More specifically, how can I extract just the titles of the movies into a completely new dataframe? For instance, the desired output should be:
    Out[114]:
    0     Toy Story
    1     GoldenEye
    2    Four Rooms
    3    Get Shorty
    4       Copycat
    ...
    Name: movie_title, dtype: object
The get_value() function quickly retrieves a single value from a DataFrame, given the row label and the column label. Note that get_value() was deprecated in pandas 0.21 and removed in pandas 1.0; the at and iat attributes replace it.
Using the at and iat attributes: we can also access a single value of a DataFrame with at (by row/column label) and iat (by integer position). Both take two arguments, a row and a column. If we pass only one argument, an error is raised.
Selecting a cell value using df['col_name'].values[]: df['col_name'].values returns the column as a NumPy array, which can then be indexed by position to get a single cell value, for instance on a column such as df["Duration"].
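As a rough illustration of the three access styles just described, here is a minimal sketch. The "Duration" column name comes from the text above; the "Pulse" column, the row labels, and all values are made up for the example.

```python
import pandas as pd

# Hypothetical sample data; only the "Duration" column name is from the text.
df = pd.DataFrame(
    {"Duration": [60, 45, 30], "Pulse": [110, 117, 103]},
    index=["a", "b", "c"],
)

# at: label-based access, takes a row label and a column label
by_label = df.at["b", "Duration"]

# iat: position-based access, takes integer row and column positions
by_position = df.iat[1, 0]

# .values gives the column as a NumPy array, indexed by position
via_array = df["Duration"].values[1]

print(by_label, by_position, via_array)
```

All three expressions read the same cell (second row, "Duration" column), so which one to use is mostly a question of whether you have labels or positions at hand.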
You can try str.extract and strip, but it is better to use str.split, because movie names can contain numbers too. Another solution is to replace the content of the parentheses with a regex and strip the leading and trailing whitespace:
    # convert the column to string
    df['movie_title'] = df['movie_title'].astype(str)

    # note: this also removes numbers from the movie names
    df['titles'] = df['movie_title'].str.extract('([a-zA-Z ]+)', expand=False).str.strip()
    # keep everything before the first opening parenthesis
    df['titles1'] = df['movie_title'].str.split('(', n=1).str[0].str.strip()
    # remove the parenthesised part (regex=True is required in pandas 2.0+)
    df['titles2'] = df['movie_title'].str.replace(r'\([^)]*\)', '', regex=True).str.strip()
    print(df)

              movie_title      titles      titles1      titles2
    0  Toy Story 2 (1995)   Toy Story  Toy Story 2  Toy Story 2
    1    GoldenEye (1995)   GoldenEye    GoldenEye    GoldenEye
    2   Four Rooms (1995)  Four Rooms   Four Rooms   Four Rooms
    3   Get Shorty (1995)  Get Shorty   Get Shorty   Get Shorty
    4      Copycat (1995)     Copycat      Copycat      Copycat
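As for why the attempt in the question returned only NaN: in a normal (non-raw) Python string, "\b" is the backspace character rather than a regex word boundary, so the pattern never matches. The sketch below shows this and, as one possible alternative, a single extract call with named groups that keeps digits in titles. The sample titles, the `pattern` variable, and the group names `title` and `year` are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data modelled on the titles in the question.
df = pd.DataFrame({"movie_title": [
    "Toy Story (1995)", "GoldenEye (1995)", "Four Rooms (1995)", "Toy Story 2 (1999)",
]})

# Without the r prefix, "\b" is a backspace (\x08), not a word boundary.
assert "\b" == "\x08"

# Equivalent to the pattern string used in the question: nothing matches.
broken = df["movie_title"].str.extract("\b([^\\d\\W]+)\b", expand=False)

# With a raw string the pattern works, but it captures only the first word.
first_word = df["movie_title"].str.extract(r"\b([^\d\W]+)\b", expand=False)

# Assumed alternative: capture the title lazily, then a 4-digit year
# in parentheses at the end of the string.
pattern = r"^(?P<title>.+?)\s*\((?P<year>\d{4})\)\s*$"
parts = df["movie_title"].str.extract(pattern)
titles = parts["title"]

print(parts)
```

With named groups, str.extract returns a DataFrame whose columns are the group names, so the title and the year end up in a new dataframe in one step. Rows that do not end in a parenthesised four-digit year would come back as NaN, which may or may not be what you want.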