Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas DataFrame - Combining one column's values with same index into list

Tags:

python

pandas

I've been at this issue for awhile to no avail. This is almost a duplicate of at least one other question on here, but I can't quite figure out how to do exactly what I'm looking for from related answers online.

I have a Pandas DataFrame (we'll call it df) that looks something like:

Name    Value        Value2
'A'     '8.8.8.8'    'x'
'B'     '6.6.6.6'    'y'
'A'     '6.6.6.6'    'x'
'A'     '8.8.8.8'    'x'

Where Name is the index. I want to convert this to something like that looks like:

Name    Value                     Value2
'A'     ['8.8.8.8', '6.6.6.6']    'x'
'B'     ['6.6.6.6']               'y'

So, basically, every Value that corresponds to the same index should be combined into a list (or a set, or a tuple) and that list made to be the Value for the corresponding index. And, as shown, Value2 is the same between like-indexed rows, so it should just stay the same in the end.

All I've done (successfully) is figure out how to make each element in the Value column into a list with:

df['Value'] = pd.Series([[val] for val in df['Value']])

In the question I linked at the start of this post, the recommended way to combine columns with duplicate indices offers a solution using df.groupby(df.index).sum(). I know I need something besides df.index as an argument to groupby since the Value column is treated as special, and I'm not sure what to put in place of sum() since that's not quite what I'm looking for.

Hopefully it's clear what I'm looking for, let me know if there's anything I can elaborate on. I've also tried simply looping through the DataFrame myself, finding rows with the same index, combining the Values into a list and updating df accordingly. After trying to get this method to work for a bit I thought I'd look for a more Pandas-esque way of handling this problem.


Edit: As a follow up to dermen's answer, that solution kind of worked. The Values did seem to concatenate correctly into a list. One thing I realized was that the unique function returns a Series, as opposed to a DataFrame. Also, I do have more columns in the actual setup than just Name, Value, and Value2. But I think I was able to get around both of the issues successfully with the following:

gb = df.groupby(tuple(df.columns.difference(['Value'])))
result = pd.DataFrame(gb['Value'].unique(), columns=df.columns)

Where the first line gives an argument to groupby of the list of columns minus the Value column, and the second line converts the Series returned by unique into a DataFrame with the same columns as df.

But I think with all of that in place (unless anyone sees an issue with this), almost everything works as intended. There does seem to be something that's a bit off here, though. When I try to output this to a file with to_csv, there are duplicate headers across the top (but only certain headers are duplicated, and there's no real pattern as to which as far as I can tell). Also, the Value lists are truncated, which is probably a simpler issue to fix. The csv output currenlty looks like:

Name    Value                   Value2    Name    Value2
'A'     ['8.8.8.8' '7.7.7.7'    'x'                     
'B'     ['6.6.6.6']             'y'

The above looks weird, but that is exactly how it looks in the output. Note that, contrary to the example presented at the start of this post, there are assumed to be more than 2 Values for A (so that I can illustrate this point). When I do this with the actual data, the Value lists get cut off after the first 4 elements.

like image 775
grish Avatar asked Aug 03 '15 21:08

grish


1 Answers

I think you are looking to use pandas.Series.unique. First, make the 'Name' index a column

df
#     Value2  Value
#Name              
#A         x    8.8
#B         y    6.6
#A         x    6.6
#A         x    8.8

df.reset_index(inplace=True)
#  Name Value2  Value
#0    A      x    8.8
#1    B      y    6.6
#2    A      x    6.6
#3    A      x    8.8

Next call groupby and call the unique function on the 'Value' series

gb = df.groupby(['Name','Value2'])
result = gb['Value'].unique()
result.reset_index(inplace=True) #lastly, reset the index
#  Name Value2       Value
#0    A      x  [8.8, 6.6]
#1    B      y       [6.6]

Finally, if you want 'Name' as the index again, just do

result.set_index( 'Name', inplace=True)
#     Value2       Value
#Name                   
#A         x  [8.8, 6.6]
#B         y       [6.6]

UPDATE

As a follow up, make sure you re-assign result after resetting the index

result = gb['Value'].unique()
type(result)
#pandas.core.series.Series

result = result.reset_index()
type(result)
#pandas.core.frame.DataFrame

saving as CSV (rather TSV)

You don't want to use CSV here because there are commas in the Value column entries. Rather, save as TSV, you still use the same method to_csv, just change the sep arg:

result.to_csv( 'result.txt', sep='\t')

If I load result.txt in EXCEL as a TSV, I get

enter image description here

like image 166
dermen Avatar answered Oct 04 '22 16:10

dermen