Create a column which increments based on another column in Python

Tags: python, pandas, r, data.table

I've recently switched my focus from R to Python. I work with data.table in R a lot, and I sometimes find it quite difficult to find Python equivalents for some of its functions.

I have a pandas data frame that looks like this:

import pandas as pd

df = pd.DataFrame({'A': ['abc', 'def', 'def', 'abc', 'def', 'def', 'abc'],
                   'B': [13123, 45, 1231, 463, 142131, 4839, 4341]})

     A       B
0  abc   13123
1  def      45
2  def    1231
3  abc     463
4  def  142131
5  def    4839
6  abc    4341

I need to create a column that counts up from 1 within each group of A, indicating the increasing order of B. So I first sort the data frame, and the column I want to create is C, as below:

    A       B   C
3  abc     463  1
6  abc    4341  2
0  abc   13123  3
1  def      45  1
2  def    1231  2
5  def    4839  3
4  def  142131  4

In R, with library(data.table), this can be done in one line, which creates the column within the original data table:

df[, C := 1:.N, by=A]

I've looked around and I think I might be able to make use of something like this:

df.groupby('A').size()
or
df['B'].argsort()

but I'm not sure how to proceed from here, or how to join the new column back to the original data frame. Any pointers would be very helpful.

Many thanks!

asked Oct 23 '12 13:10

S.zhen

2 Answers

In [61]: df
Out[61]:
     A       B
3  abc     463
6  abc    4341
0  abc   13123
1  def      45
2  def    1231
5  def    4839
4  def  142131

In [62]: df['C'] = df.groupby('A')['A'].transform(lambda x: pd.Series(range(1, len(x) + 1), index=x.index))

In [63]: df
Out[63]:
     A       B  C
3  abc     463  1
6  abc    4341  2
0  abc   13123  3
1  def      45  1
2  def    1231  2
5  def    4839  3
4  def  142131  4
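
An alternative sketch, not taken from the answer above: sorting and numbering can be done with `sort_values` and `GroupBy.cumcount`, which numbers the rows of each group from 0 in their current order. The name `sorted_df` is only illustrative, and the frame is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['abc', 'def', 'def', 'abc', 'def', 'def', 'abc'],
    'B': [13123, 45, 1231, 463, 142131, 4839, 4341],
})

# Sort by group, then by value, so that row order within each group
# matches the increasing order of B.
sorted_df = df.sort_values(['A', 'B'])

# cumcount() numbers rows within each group starting at 0; add 1 to start at 1.
sorted_df['C'] = sorted_df.groupby('A').cumcount() + 1

print(sorted_df)
```

Because the result of `cumcount` keeps the original index labels, no explicit join is needed: pandas aligns the new column on the index when it is assigned.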

answered Oct 22 '22 12:10

Wouter Overmeire

And for comparison, the correct data.table syntax is:

df[, C := 1:.N, by=A]

This adds a new column C by reference to df. The := operator is part of the data.table package for R. It lets you add and remove columns, and assign to subsets of a data.table by group, all by reference with no copy at all.
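
For readers coming from data.table, a rough pandas counterpart might look like the sketch below; it is an illustration, not part of the original answer. Assigning to `df['C']` adds the column to the existing frame, and a per-group rank of B avoids having to sort first. It assumes B has no missing values, so the float ranks can be cast to integers.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['abc', 'def', 'def', 'abc', 'def', 'def', 'abc'],
    'B': [13123, 45, 1231, 463, 142131, 4839, 4341],
})

# Rank B within each group of A. method='first' breaks ties by row order,
# so every row in a group gets a distinct rank from 1 to the group size.
df['C'] = df.groupby('A')['B'].rank(method='first').astype(int)

# Sorting is only needed for display; C is already attached to the original rows.
print(df.sort_values(['A', 'B']))
```

Unlike `:=`, this does not guarantee that no copy is made internally; it only mirrors the effect of ending up with the new column on the original frame.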

answered Oct 22 '22 12:10

Matt Dowle
