Create a column which increments based on another column in Python

I've recently switched my focus from R to Python. I work with data.table in R a lot, and I sometimes find it quite difficult to find Python equivalents for some of its functions.

I have a pandas data frame that looks like this:

import pandas as pd

df = pd.DataFrame({'A': ['abc', 'def', 'def', 'abc', 'def', 'def', 'abc'], 'B': [13123, 45, 1231, 463, 142131, 4839, 4341]})

     A       B  
0  abc   13123    
1  def      45  
2  def    1231  
3  abc     463  
4  def  142131  
5  def    4839
6  abc    4341

I need to create a column that counts up from 1 within each group of A, following the increasing order of B. So I first sort the data frame by A and then by B, and the column I'm interested in creating is C, as below:

     A       B  C
3  abc     463  1
6  abc    4341  2
0  abc   13123  3
1  def      45  1
2  def    1231  2
5  def    4839  3
4  def  142131  4

In R, with the data.table package, this can easily be done in one line, which creates the column within the original data table:

df[, C := 1:.N, by=A]

I've looked around and I think I might be able to make use of something like this:

df.groupby('A').size()
or
df['B'].argsort()

but I'm not sure how to proceed from here, or how to join the new column back to the original data frame. Any pointers would be very helpful.

Many thanks!

S.zhen asked Oct 23 '12 13:10


2 Answers

In [61]: df
Out[61]:
     A       B
1  abc     463
6  abc    4341
0  abc   13123
3  def      45
2  def    1231
5  def    4839
4  def  142131

In [62]: df['C'] = df.groupby('A')['A'].transform(lambda x: pd.Series(range(1, len(x)+1), index=x.index))

In [63]: df
Out[63]:
     A       B  C
1  abc     463  1
6  abc    4341  2
0  abc   13123  3
3  def      45  1
2  def    1231  2
5  def    4839  3
4  def  142131  4
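
A shorter way to get the same column is sketched below. It is not part of the session above. It assumes a pandas version that provides `sort_values` and `GroupBy.cumcount`, and it rebuilds the question's frame so that it runs on its own.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'A': ['abc', 'def', 'def', 'abc', 'def', 'def', 'abc'],
    'B': [13123, 45, 1231, 463, 142131, 4839, 4341],
})

# Sort by group, then by value, so that row order within each group follows B.
df = df.sort_values(['A', 'B'])

# cumcount() numbers the rows of each group from 0 in their current order;
# adding 1 makes the counter start at 1.
df['C'] = df.groupby('A').cumcount() + 1

print(df)
```

Because `cumcount` returns a Series aligned with the frame's index, the result can be assigned straight back as a column without a separate join.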

Wouter Overmeire answered Oct 22 '22 12:10


And for comparison, the correct data.table syntax is:

df[, C := 1:.N, by=A]

This adds a new column C to df by reference. The := operator is part of the data.table package for R. It lets you add and remove columns and assign to subsets of a data.table, optionally by group, all by reference with no copy at all.
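
On the pandas side, one way to attach C to the existing frame without sorting it first is a grouped rank. The following is only a sketch and does not come from either answer. It assumes B has no missing values, so the ranks can be cast to integers, and it uses `method='first'` so that ties within a group are numbered in order of appearance.

```python
import pandas as pd

# Same data as in the question, left in its original row order.
df = pd.DataFrame({
    'A': ['abc', 'def', 'def', 'abc', 'def', 'def', 'abc'],
    'B': [13123, 45, 1231, 463, 142131, 4839, 4341],
})

# Rank B within each group of A. rank() returns floats, so cast to int.
df['C'] = df.groupby('A')['B'].rank(method='first').astype(int)

print(df)
```

Unlike `1:.N`, which numbers rows in their current order, the rank depends only on the values of B, so the frame keeps its original row order and C still reflects the increasing order of B within each group.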

Matt Dowle answered Oct 22 '22 12:10