These are three questions that I just can't figure out; I hope someone can help me out.
import pandas as pd
data = {'Col1': ['ONE, ONE, NULL', 'ONE, TWO, THREE', 'TWO, NULL, TEN']}
index = pd.Index(['d1','d2','d3'])
data = pd.DataFrame(data,index=index)
pattern = 'ONE|TWO'  # <---- QUESTION1
data['Col1'].str.findall(pattern)  # <---- QUESTION2
Question1: How can I change this regex so that 'ONE' is only found once in d1? As it is now, every instance of ONE that is found is returned, as shown below.
d1 [ONE, ONE]
d2 [ONE, TWO]
d3 [TWO]
I want this:
d1 [ONE]
d2 [ONE, TWO]
d3 [TWO]
Question2:
I want to take the lists d1, d2 and d3 and combine them into one list containing only unique values. Something like this:
set(d1 + d2 + d3) ---> ['ONE', 'TWO']
Question3:
Suppose I had done something like this:
data['Col2'] = data['Col1'].str.findall(pattern)
How could I iterate over every row in Col2 to get the same result as I asked for in Question2?
You can use reduce (over set.union); in Python 3 it has to be imported from functools first:
In [10]: from functools import reduce
In [11]: reduce(set.union, data['Col1'].str.findall(pattern), set())
Out[11]: {'ONE', 'TWO'}
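The same idea should carry over to Question3, where the matches are first stored in Col2. Here is a minimal sketch, assuming the DataFrame and pattern from the question; `set().union(*...)` is an alternative spelling swapped in for the reduce call, and the names `unique_values` and `unique_loop` are only illustrative:

```python
import pandas as pd

# Same data and pattern as in the question.
data = pd.DataFrame(
    {'Col1': ['ONE, ONE, NULL', 'ONE, TWO, THREE', 'TWO, NULL, TEN']},
    index=pd.Index(['d1', 'd2', 'd3']),
)
pattern = 'ONE|TWO'
data['Col2'] = data['Col1'].str.findall(pattern)

# Each cell of Col2 holds a list of matches; union them all into one set.
unique_values = set().union(*data['Col2'])

# Equivalent explicit loop over the rows of Col2.
unique_loop = set()
for matches in data['Col2']:
    unique_loop.update(matches)
```

Both variants produce a set, so wrap the result in `sorted(...)` or `list(...)` if a list is needed.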
Another option is to use a list comprehension:
In [12]: [w for w in ['ONE', 'TWO'] if data['Col1'].str.contains(w).any()]
Out[12]: ['ONE', 'TWO']
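Neither snippet above touches Question1. One possible approach, offered as a sketch rather than a definitive answer, is to leave the regex alone and drop repeated matches within each row afterwards. It assumes the data and pattern from the question; the name `deduped` is only illustrative:

```python
import pandas as pd

# Same data and pattern as in the question.
data = pd.DataFrame(
    {'Col1': ['ONE, ONE, NULL', 'ONE, TWO, THREE', 'TWO, NULL, TEN']},
    index=pd.Index(['d1', 'd2', 'd3']),
)
pattern = 'ONE|TWO'

# dict.fromkeys keeps only the first occurrence of each match and
# preserves the order in which the matches appeared in the row.
deduped = data['Col1'].str.findall(pattern).apply(
    lambda matches: list(dict.fromkeys(matches))
)
```

With the sample data this gives [ONE] for d1, [ONE, TWO] for d2 and [TWO] for d3.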