I've currently switched my focus from R to Python. I work with data.table in R a lot, and I find it sometimes quite difficult to find an equivalent for some functions in Python.
I have a pandas data frame that looks like this:
df = pd.DataFrame({'A':['abc','def', 'def', 'abc', 'def', 'def','abc'],'B':[13123,45,1231,463,142131,4839, 4341]})
A B 0 abc 13123 1 def 45 2 def 1231 3 abc 463 4 def 142131 5 def 4839 6 abc 4341
I need to create a column that increments from 1 based on A and B, so that it indicates the increasing order of B. So I first create the sorted data frame, and the column I'm interested in creating is C as below:
A B C 1 abc 463 1 6 abc 4341 2 0 abc 13123 3 3 def 45 1 2 def 1231 2 5 def 4839 3 4 def 142131 4
In R, using the library(data.table), this can be easily done in one line and creates a column within the original data table:
df[, C := 1:.N, by=A]
I've looked around and I think I might be able to make use of something like this:
df.groupby('A').size()
or
df['B'].argsort()
but not sure how to proceed from here, and how to join the new column back to the original data frame. It would be very helpful if anyone could give me any pointer.
Many thanks!
You can use the assign() function to add a new column to the end of a pandas DataFrame: df = df. assign(col_name=[value1, value2, value3, ...])
In pandas you can add/append a new column to the existing DataFrame using DataFrame. insert() method, this method updates the existing DataFrame with a new column. DataFrame. assign() is also used to insert a new column however, this method returns a new Dataframe after adding a new column.
The syntax for the assign method is fairly simple. You type the name of your dataframe, then a “dot”, and then type assign() . Remember, the assign method is a Python method that's associated with dataframe objects, so we can use so-called “dot syntax” to call the method.
Pandas str. repeat() method is used to repeat string values in the same position of passed series itself. An array can also be passed in case to define the number of times each element should be repeated in series. For that case, length of array must be same as length of Series.
In [61]: df
Out[61]:
A B
1 abc 463
6 abc 4341
0 abc 13123
3 def 45
2 def 1231
5 def 4839
4 def 142131
In [62]: df['C'] = df.groupby('A')['A'].transform(lambda x: pd.Series(range(1, len(x)+1), index=x.index))
In [63]: df
Out[63]:
A B C
1 abc 463 1
6 abc 4341 2
0 abc 13123 3
3 def 45 1
2 def 1231 2
5 def 4839 3
4 def 142131 4
And for comparison, the correct data.table
syntax is :
df[, C := 1:.N, by=A]
This adds a new column C by reference to df
. The :=
operator is part of the data.table
package for R. It allows you to add and remove columns and assign to subsets of data.table
, by group, by reference with no copy at all.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With