I have a list of strings like this:
stringlist = ["JAN", "jan", "FEB", "feb", "mar"]
And I have a dataframe that looks like this:
**date** **value**
01MAR16 1
05FEB16 12
10jan17 5
10mar15 9
03jan05 7
04APR12 3
I only want to keep the dates that contain one of the strings from stringlist; every other date should become NA. The result should look like this:
**date** **value**
NA 1
05FEB16 12
10jan17 5
10mar15 9
03jan05 7
NA 3
I'm new to regular expressions and am having trouble wrapping my head around them, so I would appreciate some help.
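import numpy as np

stringlist = ["JAN", "jan", "FEB", "feb", "mar"]
# df is the DataFrame from the question.
# Boolean mask: True where the date contains any entry of stringlist (case-sensitive)
m = df["date"].str.contains("|".join(stringlist))
# Replace the dates that do not match with NaN
df.loc[~m, "date"] = np.nan
print(df)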
Prints:
date value
0 NaN 1
1 05FEB16 12
2 10jan17 5
3 10mar15 9
4 03jan05 7
5 NaN 3
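The joined list is used as a regular expression, so an entry containing a character such as "." or "+" would be interpreted as a pattern rather than literally. That is not a problem for the month abbreviations here, but as a minimal self-contained sketch, assuming the list could one day hold such characters, each entry can be escaped with re.escape before joining. The names pattern and mask are illustrative only.

```python
import re

import numpy as np
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "date": ["01MAR16", "05FEB16", "10jan17", "10mar15", "03jan05", "04APR12"],
    "value": [1, 12, 5, 9, 7, 3],
})
stringlist = ["JAN", "jan", "FEB", "feb", "mar"]

# Escape each entry so it is matched literally, then join into one alternation
pattern = "|".join(re.escape(s) for s in stringlist)

# Case-sensitive match: "01MAR16" does not match because "MAR" is not in the list
mask = df["date"].str.contains(pattern)
df.loc[~mask, "date"] = np.nan
print(df)
```

For this particular list the escaped pattern is identical to the plain join, so the result is the same as above.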
You can use the Series.str.contains method, as demonstrated in "Select by partial string from a pandas DataFrame":
import pandas as pd

# Same data as in the question
df = pd.DataFrame({'date': ['01MAR16', '05FEB16', '10jan17', '10mar15', '03jan05', '04APR12'],
                   'value': [1, 12, 5, 9, 7, 3]})
stringlist = ['JAN', 'jan', 'FEB', 'feb', 'mar']
# Keep only the rows whose date contains one of the strings (case-sensitive)
print(df[df['date'].str.contains('|'.join(stringlist))])
Output:
date value
1 05FEB16 12
2 10jan17 5
3 10mar15 9
4 03jan05 7
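Note that boolean indexing like this drops the non-matching rows, whereas the question keeps all six rows and blanks out the date. A minimal sketch of one way to get that layout with the same mask is Series.where, which keeps values where the mask is True and inserts NaN elsewhere. The names mask and result are illustrative only.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "date": ["01MAR16", "05FEB16", "10jan17", "10mar15", "03jan05", "04APR12"],
    "value": [1, 12, 5, 9, 7, 3],
})
stringlist = ["JAN", "jan", "FEB", "feb", "mar"]

# True where the date contains any entry of stringlist (case-sensitive)
mask = df["date"].str.contains("|".join(stringlist))

# Keep every row; dates where the mask is False become NaN.
# assign returns a new DataFrame, so df itself is left unchanged.
result = df.assign(date=df["date"].where(mask))
print(result)
```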
Another play on regular expressions is to extract the letters (the assumption here is that the month will always be sandwiched between the day and the year), then check whether each extract can be found in stringlist:
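(df.assign(months = df.date.str.extract(r'([a-zA-Z]+)', expand=False),
           # keep the date only where the extracted month is in stringlist
           date = lambda d: d.date.where(d.months.isin(stringlist))
          )
   .iloc[:, :-1]  # drop the helper 'months' column
)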
date value
0 NaN 1
1 05FEB16 12
2 10jan17 5
3 10mar15 9
4 03jan05 7
5 NaN 3
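To make the intermediate step visible, here is a small sketch of what the extraction returns for the question's dates and how isin then does an exact, case-sensitive comparison against the list. The names dates, months, and keep are illustrative only.

```python
import pandas as pd

# Dates copied from the question
dates = pd.Series(["01MAR16", "05FEB16", "10jan17", "10mar15", "03jan05", "04APR12"])
stringlist = ["JAN", "jan", "FEB", "feb", "mar"]

# First run of letters in each string; expand=False returns a Series
months = dates.str.extract(r"([a-zA-Z]+)", expand=False)
print(months.tolist())  # ['MAR', 'FEB', 'jan', 'mar', 'jan', 'APR']

# Exact membership test: 'MAR' and 'APR' are not in stringlist
keep = months.isin(stringlist)
print(keep.tolist())  # [False, True, True, True, True, False]
```

Unlike str.contains, isin compares whole values, so this variant only matches when the extracted letters equal a list entry exactly.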