
Iterate over values in pandas column containing lists and retrieve only unique values

I have three questions that I just can't figure out; I hope someone can help me out.

import pandas as pd
data = {'Col1': ['ONE, ONE, NULL', 'ONE, TWO, THREE', 'TWO, NULL, TEN']}
index = pd.Index(['d1','d2','d3'])
data = pd.DataFrame(data,index=index)
pattern = 'ONE|TWO'                 # <---- QUESTION1
data['Col1'].str.findall(pattern)   # <---- QUESTION2

Question1: How can I change this regex so that 'ONE' is found only once in d1? As it stands, every instance of ONE is returned, as shown below.

d1    [ONE, ONE]
d2    [ONE, TWO]
d3         [TWO]

I want this:

d1         [ONE]
d2    [ONE, TWO]
d3         [TWO]

Question2:
I want to combine the lists for d1, d2 and d3 into one list containing only unique values. That is, something like this:

set(d1 + d2 + d3) ---> ['ONE', 'TWO']


Question3:
If I had done something like this:

data['Col2'] = data['Col1'].str.findall(pattern)

How could I iterate over every row in Col2 to get the same result as I asked for in Question2?

user3139545 asked Jan 21 '14 18:01


1 Answer

You can use reduce (over set.union); in Python 3, reduce must first be imported from functools:

In [11]: reduce(set.union, data['Col1'].str.findall(pattern), set())
Out[11]: {'ONE', 'TWO'}
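
As a self-contained script, the reduce approach might look like the sketch below. It assumes Python 3 (hence the functools import) and also stores the matches in Col2 first, as in Question3; the name unique_matches is only illustrative.

```python
from functools import reduce  # reduce is not a builtin in Python 3

import pandas as pd

data = pd.DataFrame(
    {'Col1': ['ONE, ONE, NULL', 'ONE, TWO, THREE', 'TWO, NULL, TEN']},
    index=pd.Index(['d1', 'd2', 'd3']),
)
pattern = 'ONE|TWO'

# Question3: keep the per-row match lists in their own column.
data['Col2'] = data['Col1'].str.findall(pattern)

# Fold every row's list into one set; set.union accepts any iterable,
# and the empty set() is the starting value.
unique_matches = reduce(set.union, data['Col2'], set())

print(sorted(unique_matches))  # ['ONE', 'TWO']
```

Iterating over a Series yields its values, so the same call works whether the lists come straight from findall or from the stored Col2 column.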

Another option is to use a list comprehension:

In [12]: [w for w in ['ONE', 'TWO'] if data['Col1'].str.contains(w).any()]
Out[12]: ['ONE', 'TWO']
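
Neither option above changes the per-row output asked about in Question1. One possible way to get it, sketched here as an assumption rather than something confirmed in this thread, is to leave the regex alone and deduplicate each row's matches afterwards. The helper name unique_in_order is hypothetical.

```python
import pandas as pd

data = pd.DataFrame(
    {'Col1': ['ONE, ONE, NULL', 'ONE, TWO, THREE', 'TWO, NULL, TEN']},
    index=pd.Index(['d1', 'd2', 'd3']),
)
pattern = 'ONE|TWO'


def unique_in_order(matches):
    """Drop repeated matches while keeping first-seen order (hypothetical helper)."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(matches))


# Find all matches per row, then deduplicate each row's list.
per_row = data['Col1'].str.findall(pattern).apply(unique_in_order)

print(per_row.tolist())  # [['ONE'], ['ONE', 'TWO'], ['TWO']]
```

Using dict.fromkeys instead of set keeps the matches in the order they appear in each row, which matches the desired output shown in the question.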

Andy Hayden answered Nov 09 '22 09:11