
Python: Pandas: two columns with same values, alphabetically sorted and stored

Tags:

python

pandas

Problem
The df has two columns that sometimes hold the same pair of values, just in a different order. We need to store them in two new columns, sorted alphabetically within each row.

Context
We have a pandas df like this:

df = pd.DataFrame([{"name_A": "john", "name_B": "mac"}, {"name_A": "mac", "name_B": "john"}])

Shown as a table (with one extra row that is not in the snippet above):

name_A | name_B
john   | mac
mac    | john
trump  | clinton


Desired Output

name_A | name_B   | restated_A  | restated_B
john   |  mac     |  john       |  mac
mac    |  john    |  john       |  mac
trump  |  clinton |  clinton    | trump   

In words, we want the values of name_A and name_B sorted alphabetically within each row and stored in restated_A and restated_B.

Tried so far
A bunch of lambdas, but couldn't get them to work.

Specifications
Python: 3.5.2
Pandas: 0.18.1

John asked Oct 22 '16 02:10



2 Answers

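As a vectorized solution, you can use numpy.minimum() and numpy.maximum(), which compare the two columns element by element:

import numpy as np
# the smaller string of each row goes to restated_A, the larger to restated_B
df['restated_A'] = np.minimum(df['name_A'], df['name_B'])
df['restated_B'] = np.maximum(df['name_A'], df['name_B'])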
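To check this against the desired output from the question, here is a minimal self-contained sketch. The three-row frame (lower-cased, as in the desired output) and the explicit dtype=object are assumptions made for this example; behaviour on pandas' dedicated string dtypes is not verified here.

```python
import numpy as np
import pandas as pd

# Three-row version of the question's data, lower-cased as in the desired output.
# dtype=object is an assumption of this sketch, so that the element-wise
# comparison uses ordinary Python string comparison.
df = pd.DataFrame(
    {"name_A": ["john", "mac", "trump"], "name_B": ["mac", "john", "clinton"]},
    dtype=object,
)

# Element-wise minimum/maximum of the two columns gives alphabetical order per row.
df["restated_A"] = np.minimum(df["name_A"], df["name_B"])
df["restated_B"] = np.maximum(df["name_A"], df["name_B"])

print(df)
```

Note that the comparison is case-sensitive, so mixed-case names would need to be normalised first if that matters for your data.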


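Or use the apply method on the two name columns:

# result_type='expand' (pandas 0.23 or later) spreads each sorted list over two columns
df[['restated_A', 'restated_B']] = df[['name_A', 'name_B']].apply(lambda r: sorted(r), axis=1, result_type='expand')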


Psidom answered Nov 03 '22 05:11


Just convert df.values to a list of rows and sort each row. Then assign the first and second element of each sorted pair to the new columns.

>>> df = pd.DataFrame([{"name_A": "john", "name_B": "mac"}, {"name_A": "mac", "name_B": "john"}])
>>> restated_values = [sorted(pair) for pair in df.values.tolist()]
>>> restated_values
[['john', 'mac'], ['john', 'mac']]
>>> df['restated_A'] = [pair[0] for pair in restated_values]
>>> df
  name_A name_B restated_A
0   john    mac       john
1    mac   john       john
>>> df['restated_B'] = [pair[1] for pair in restated_values]
>>> df
  name_A name_B restated_A restated_B
0   john    mac       john        mac
1    mac   john       john        mac
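The two list comprehensions above can also be folded into a single unpacking step. This is only a sketch of the same idea, not a different method; selecting the two name columns explicitly is an addition here, so that any other columns already in the frame are ignored.

```python
import pandas as pd

df = pd.DataFrame([{"name_A": "john", "name_B": "mac"},
                   {"name_A": "mac", "name_B": "john"}])

# Sort each (name_A, name_B) pair; only these two columns are read.
restated_values = [sorted(pair) for pair in df[["name_A", "name_B"]].values.tolist()]

# zip(*...) transposes the list of pairs into two tuples, one per new column.
df["restated_A"], df["restated_B"] = zip(*restated_values)

print(df)
```

The unpacking needs at least one row; on an empty frame zip(*...) yields nothing and the assignment raises a ValueError.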

Or, you could do this, using a dict and a new pandas.DataFrame object:

>>> df = pd.DataFrame([{"name_A": "john", "name_B": "mac"}, {"name_A": "mac", "name_B": "john"}])
>>> restated_values = [sorted(pair) for pair in df.values.tolist()]
>>> restated_values
[['john', 'mac'], ['john', 'mac']]
>>> new_col_rows = {'restated_A': [pair[0] for pair in restated_values], 'restated_B': [pair[1] for pair in restated_values]}
>>> new_col_rows
{'restated_A': ['john', 'john'], 'restated_B': ['mac', 'mac']}
>>> new_df = pd.DataFrame(new_col_rows)
>>> new_df
  restated_A restated_B
0       john        mac
1       john        mac
>>> df = df.join(new_df)
>>> df
  name_A name_B restated_A restated_B
0   john    mac       john        mac
1    mac   john       john        mac
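One thing to watch with the join variant: join() aligns rows on the index, so a new_df built with a default 0..n-1 index only lines up when df has that same index. A hedged sketch of one way around this is to build new_df straight from the list of pairs and reuse df's index. The index values 10 and 20 below are hypothetical, chosen only to make the alignment visible.

```python
import pandas as pd

# Hypothetical non-default index, to show why index alignment matters for join().
df = pd.DataFrame(
    {"name_A": ["john", "mac"], "name_B": ["mac", "john"]},
    index=[10, 20],
)

restated_values = [sorted(pair) for pair in df[["name_A", "name_B"]].values.tolist()]

# Build the new frame directly from the list of pairs, reusing df's index
# so that join() matches each sorted pair to its original row.
new_df = pd.DataFrame(
    restated_values,
    columns=["restated_A", "restated_B"],
    index=df.index,
)
df = df.join(new_df)

print(df)
```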
blacksite answered Nov 03 '22 07:11