I've been at this issue for awhile to no avail. This is almost a duplicate of at least one other question on here, but I can't quite figure out how to do exactly what I'm looking for from related answers online.
I have a Pandas DataFrame (we'll call it df
) that looks something like:
Name Value Value2
'A' '8.8.8.8' 'x'
'B' '6.6.6.6' 'y'
'A' '6.6.6.6' 'x'
'A' '8.8.8.8' 'x'
Where Name
is the index. I want to convert this to something like that looks like:
Name Value Value2
'A' ['8.8.8.8', '6.6.6.6'] 'x'
'B' ['6.6.6.6'] 'y'
So, basically, every Value
that corresponds to the same index should be combined into a list (or a set, or a tuple) and that list made to be the Value
for the corresponding index. And, as shown, Value2
is the same between like-indexed rows, so it should just stay the same in the end.
All I've done (successfully) is figure out how to make each element in the Value
column into a list with:
df['Value'] = pd.Series([[val] for val in df['Value']])
In the question I linked at the start of this post, the recommended way to combine columns with duplicate indices offers a solution using df.groupby(df.index).sum()
. I know I need something besides df.index
as an argument to groupby
since the Value
column is treated as special, and I'm not sure what to put in place of sum()
since that's not quite what I'm looking for.
Hopefully it's clear what I'm looking for, let me know if there's anything I can elaborate on. I've also tried simply looping through the DataFrame myself, finding rows with the same index, combining the Values
into a list and updating df
accordingly. After trying to get this method to work for a bit I thought I'd look for a more Pandas-esque way of handling this problem.
Edit: As a follow up to dermen's answer, that solution kind of worked. The Values
did seem to concatenate correctly into a list. One thing I realized was that the unique
function returns a Series
, as opposed to a DataFrame
. Also, I do have more columns in the actual setup than just Name
, Value
, and Value2
. But I think I was able to get around both of the issues successfully with the following:
gb = df.groupby(tuple(df.columns.difference(['Value'])))
result = pd.DataFrame(gb['Value'].unique(), columns=df.columns)
Where the first line gives an argument to groupby
of the list of columns minus the Value
column, and the second line converts the Series
returned by unique
into a DataFrame
with the same columns as df
.
But I think with all of that in place (unless anyone sees an issue with this), almost everything works as intended. There does seem to be something that's a bit off here, though. When I try to output this to a file with to_csv
, there are duplicate headers across the top (but only certain headers are duplicated, and there's no real pattern as to which as far as I can tell). Also, the Value
lists are truncated, which is probably a simpler issue to fix. The csv
output currenlty looks like:
Name Value Value2 Name Value2
'A' ['8.8.8.8' '7.7.7.7' 'x'
'B' ['6.6.6.6'] 'y'
The above looks weird, but that is exactly how it looks in the output. Note that, contrary to the example presented at the start of this post, there are assumed to be more than 2 Values
for A
(so that I can illustrate this point). When I do this with the actual data, the Value
lists get cut off after the first 4 elements.
I think you are looking to use pandas.Series.unique
. First, make the 'Name'
index a column
df
# Value2 Value
#Name
#A x 8.8
#B y 6.6
#A x 6.6
#A x 8.8
df.reset_index(inplace=True)
# Name Value2 Value
#0 A x 8.8
#1 B y 6.6
#2 A x 6.6
#3 A x 8.8
Next call groupby
and call the unique
function on the 'Value'
series
gb = df.groupby(['Name','Value2'])
result = gb['Value'].unique()
result.reset_index(inplace=True) #lastly, reset the index
# Name Value2 Value
#0 A x [8.8, 6.6]
#1 B y [6.6]
Finally, if you want 'Name'
as the index again, just do
result.set_index( 'Name', inplace=True)
# Value2 Value
#Name
#A x [8.8, 6.6]
#B y [6.6]
As a follow up, make sure you re-assign result after resetting the index
result = gb['Value'].unique()
type(result)
#pandas.core.series.Series
result = result.reset_index()
type(result)
#pandas.core.frame.DataFrame
You don't want to use CSV here because there are commas in the Value
column entries. Rather, save as TSV, you still use the same method to_csv
, just change the sep
arg:
result.to_csv( 'result.txt', sep='\t')
If I load result.txt in EXCEL as a TSV, I get
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With