
Python - Counter in 2 million row table

As an example, I have the following dataframe:

Date                     Balance
2013-04-01 03:50:00         A
2013-04-01 04:00:00         A
2013-04-01 04:15:00         B
2013-04-01 04:15:00         B
2013-04-01 04:25:00         A
2013-04-01 04:25:00         A
2013-04-01 04:35:00         B
2013-04-01 04:40:00         B
2013-04-02 04:55:00         B
2013-04-02 04:56:00         A
2013-04-02 04:57:00         A
2013-04-03 10:30:00         A
2013-04-03 16:35:00         A
2013-04-03 20:40:00         A

My goal is to add a 'Counter' column that tracks the running balance of A's and B's: every time an A appears, the counter increases by one, and every time a B appears, it decreases by one. If two A's appear at the same time (same Date) in two consecutive rows, the balance should increase by two on both rows. The same reasoning applies to consecutive B's, or to A's and B's at the same time. The dataframe should therefore end up looking like this:

 Date                     Balance        Counter
2013-04-01 03:50:00         A               1
2013-04-01 04:00:00         A               2
2013-04-01 04:15:00         B               0
2013-04-01 04:15:00         B               0
2013-04-01 04:25:00         A               2
2013-04-01 04:25:00         A               2
2013-04-01 04:35:00         B               1
2013-04-01 04:40:00         B               0
2013-04-02 04:55:00         B              -1
2013-04-02 04:56:00         A               0
2013-04-02 04:57:00         A               1
2013-04-03 10:30:00         A               2
2013-04-03 16:35:00         A               3
2013-04-03 20:40:00         A               4
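
For reference, the rule above can be written as a plain loop. This is only a minimal sketch to pin down the intended semantics, assuming the rows are already sorted by Date; the function name `counter_reference` is hypothetical, and it is not meant to be fast on 2 million rows.

```python
from itertools import groupby

def counter_reference(dates, balances):
    """Row-by-row reference: +1 per 'A', -1 per 'B', applied once per timestamp."""
    out = []
    total = 0
    rows = list(zip(dates, balances))
    # itertools.groupby only merges *consecutive* equal keys,
    # so the rows are assumed to be sorted by date.
    for _, group in groupby(rows, key=lambda row: row[0]):
        group = list(group)
        # Apply the net effect of every row that shares this timestamp.
        total += sum(1 if bal == "A" else -1 for _, bal in group)
        # Each of those rows shows the same resulting balance.
        out.extend([total] * len(group))
    return out

dates = [
    "2013-04-01 03:50:00", "2013-04-01 04:00:00", "2013-04-01 04:15:00",
    "2013-04-01 04:15:00", "2013-04-01 04:25:00", "2013-04-01 04:25:00",
    "2013-04-01 04:35:00", "2013-04-01 04:40:00", "2013-04-02 04:55:00",
    "2013-04-02 04:56:00", "2013-04-02 04:57:00", "2013-04-03 10:30:00",
    "2013-04-03 16:35:00", "2013-04-03 20:40:00",
]
balances = ["A", "A", "B", "B", "A", "A", "B", "B", "B", "A", "A", "A", "A", "A"]

result = counter_reference(dates, balances)
print(result)
```

On the sample data this reproduces the Counter column shown in the table.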

The main problem is that the dataframe has more than 2 million rows, so looping over it is very time consuming. Is there a way to implement a vectorized approach to this problem?

Edit: I was able to put together a solution that works well as long as consecutive rows do not share the same date. Could anyone help me figure out the rest?

import numpy as np
import pandas as pd

d = {
    'Date': [
        '2013-04-01 03:50:00', '2013-04-01 04:00:00', '2013-04-01 04:15:00',
        '2013-04-01 04:15:00', '2013-04-01 04:25:00', '2013-04-01 04:25:00',
        '2013-04-01 04:35:00', '2013-04-01 04:40:00', '2013-04-02 04:55:00',
        '2013-04-02 04:56:00', '2013-04-02 04:57:00', '2013-04-03 10:30:00',
        '2013-04-03 16:35:00', '2013-04-03 20:40:00',
    ],
    'Balance': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'B',
                'A', 'A', 'A', 'A', 'A'],
}

df = pd.DataFrame(data=d)

# +1 for every 'A', -1 for every 'B'
df['plus_minus'] = np.where(df.Balance == 'A', 1, -1)
# Running total per row; rows that share a Date are not merged yet
df['Counter'] = df['plus_minus'].cumsum()
Miguel Lambelho asked Jul 20 '18 12:07

1 Answer

One approach is to group by Date and sum the values. The cumulative sum of those per-date sums gives the net balance at the end of each datetime, and reindexing by Date then broadcasts the result back to every row of the main frame:

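# +1 for every 'A', -1 for every 'B'
df['plus_minus'] = np.where(df.Balance == 'A', 1, -1)
# Net change per Date, then the running total across dates
by_dt = df["plus_minus"].groupby(df["Date"]).sum().cumsum()
# Look up each row's Date in the per-date totals
df["Counter2"] = by_dt.reindex(df.Date).values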

gives me

                   Date Balance  Counter  plus_minus  Counter2
0   2013-04-01 03:50:00       A        1           1         1
1   2013-04-01 04:00:00       A        2           1         2
2   2013-04-01 04:15:00       B        0          -1         0
3   2013-04-01 04:15:00       B        0          -1         0
4   2013-04-01 04:25:00       A        2           1         2
5   2013-04-01 04:25:00       A        2           1         2
6   2013-04-01 04:35:00       B        1          -1         1
7   2013-04-01 04:40:00       B        0          -1         0
8   2013-04-02 04:55:00       B       -1          -1        -1
9   2013-04-02 04:56:00       A        0           1         0
10  2013-04-02 04:57:00       A        1           1         1
11  2013-04-03 10:30:00       A        2           1         2
12  2013-04-03 16:35:00       A        3           1         3
13  2013-04-03 20:40:00       A        4           1         4
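
As a possible variation on the same idea, the sketch below takes the plain running total first and then gives every row the value reached at the last row sharing its Date. It assumes the frame is already sorted by Date, so that rows with the same timestamp are adjacent; the helper name `add_counter` is hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

def add_counter(df):
    """Return a copy of df with a 'Counter' column (hypothetical helper)."""
    out = df.copy()
    # +1 for every 'A', -1 for every 'B'
    step = pd.Series(np.where(out["Balance"] == "A", 1, -1), index=out.index)
    # Plain row-by-row running total.
    running = step.cumsum()
    # Every row takes the running total at the last row that shares its Date.
    # Assumption: rows are sorted by Date, so same-Date rows are adjacent.
    out["Counter"] = running.groupby(out["Date"]).transform("last")
    return out

df = pd.DataFrame({
    "Date": [
        "2013-04-01 03:50:00", "2013-04-01 04:00:00", "2013-04-01 04:15:00",
        "2013-04-01 04:15:00", "2013-04-01 04:25:00", "2013-04-01 04:25:00",
        "2013-04-01 04:35:00", "2013-04-01 04:40:00", "2013-04-02 04:55:00",
        "2013-04-02 04:56:00", "2013-04-02 04:57:00", "2013-04-03 10:30:00",
        "2013-04-03 16:35:00", "2013-04-03 20:40:00",
    ],
    "Balance": ["A", "A", "B", "B", "A", "A", "B", "B", "B",
                "A", "A", "A", "A", "A"],
})

result = add_counter(df)
print(result)
```

If the frame is not sorted by Date, sorting it first (for example with `df.sort_values("Date")`) would be needed for this variation to agree with the group-and-sum approach above.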
DSM answered Sep 25 '22 10:09
