Python & Pandas - Group by day and count for each day

I am new to pandas and, for now, I don't see how to arrange my time series. Take a look at it:

    date & time of connection
    19/06/2017 12:39
    19/06/2017 12:40
    19/06/2017 13:11
    20/06/2017 12:02
    20/06/2017 12:04
    21/06/2017 09:32
    21/06/2017 18:23
    21/06/2017 18:51
    21/06/2017 19:08
    21/06/2017 19:50
    22/06/2017 13:22
    22/06/2017 13:41
    22/06/2017 18:01
    23/06/2017 16:18
    23/06/2017 17:00
    23/06/2017 19:25
    23/06/2017 20:58
    23/06/2017 21:03
    23/06/2017 21:05

This is a sample from a dataset of 130k rows. I tried:

    df.groupby('date & time of connection')['date & time of connection'].apply(list)

That is not enough, I guess.

I think I should:

  • Create a dictionary indexed from dd/mm/yyyy to dd/mm/yyyy
  • Convert "date & time of connection" from datetime to date
  • Group by and count the dates of "date & time of connection"
  • Put the counts into the dictionary?

What do you think of my logic? Do you know of any tutorials? Thank you very much.

Erwan Pesle asked Feb 24 '18 10:02


2 Answers

You can use dt.floor to convert the timestamps to dates and then use value_counts, or groupby with size:

    df = (pd.to_datetime(df['date & time of connection'])
           .dt.floor('d')
           .value_counts()
           .rename_axis('date')
           .reset_index(name='count'))
    print (df)
            date  count
    0 2017-06-23      6
    1 2017-06-21      5
    2 2017-06-19      3
    3 2017-06-22      3
    4 2017-06-20      2

Or:

    s = pd.to_datetime(df['date & time of connection'])
    df = s.groupby(s.dt.floor('d')).size().reset_index(name='count')
    print (df)
      date & time of connection  count
    0                2017-06-19      3
    1                2017-06-20      2
    2                2017-06-21      5
    3                2017-06-22      3
    4                2017-06-23      6
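
The timestamps in the question are day-first (dd/mm/yyyy). When a date such as 01/02/2017 appears, parsing without a hint may read it as month-first. A minimal sketch, assuming every row follows `%d/%m/%Y %H:%M` as in the sample, passes the format explicitly; the three rows and the names `raw` and `counts` are hypothetical.

```python
import pandas as pd

# Hypothetical rows: the first two are ambiguous without a format hint.
raw = pd.Series(
    ["01/02/2017 09:00", "01/02/2017 10:30", "13/02/2017 08:15"],
    name="date & time of connection",
)

# Explicit day-first format (assumed from the sample in the question).
s = pd.to_datetime(raw, format="%d/%m/%Y %H:%M")

# Same groupby-with-size pattern as above, applied to the parsed series.
counts = s.groupby(s.dt.floor("D")).size().reset_index(name="count")
print(counts)
```

Here the first two rows are counted under 1 February 2017, not 2 January.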

Timings:

    np.random.seed(1542)

    N = 220000
    a = np.unique(np.random.randint(N, size=int(N/2)))
    df = pd.DataFrame(pd.date_range('2000-01-01', freq='37T', periods=N)).drop(a)
    df.columns = ['date & time of connection']
    df['date & time of connection'] = df['date & time of connection'].dt.strftime('%d/%m/%Y %H:%M:%S')
    print (df.head())

    In [193]: %%timeit
         ...: df['date & time of connection']=pd.to_datetime(df['date & time of connection'])
         ...: df1 = df.groupby(by=df['date & time of connection'].dt.date).count()
         ...:
    539 ms ± 45.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [194]: %%timeit
         ...: df1 = (pd.to_datetime(df['date & time of connection'])
         ...:        .dt.floor('d')
         ...:        .value_counts()
         ...:        .rename_axis('date')
         ...:        .reset_index(name='count'))
         ...:
    12.4 ms ± 350 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [195]: %%timeit
         ...: s = pd.to_datetime(df['date & time of connection'])
         ...: df2 = s.groupby(s.dt.floor('d')).size().reset_index(name='count')
         ...:
    17.7 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
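
One plausible contributor to the gap, offered here as an assumption and not something the timings themselves establish, is the type of the grouping key: `.dt.date` yields Python `date` objects in an object-dtype series, while `.dt.floor('d')` keeps a datetime64 series. A small sketch on three made-up timestamps shows the difference; the variable names are hypothetical.

```python
import pandas as pd

s = pd.Series(pd.to_datetime(
    ["2017-06-19 12:39", "2017-06-19 12:40", "2017-06-20 12:02"]
))

as_date = s.dt.date          # Python datetime.date objects, object dtype
as_floor = s.dt.floor("D")   # still a datetime64 series

print(as_date.dtype)
print(as_floor.dtype)

# Both keys give the same per-day counts; only the key type differs.
by_date = s.groupby(as_date).size()
by_floor = s.groupby(as_floor).size()
print(by_date.tolist(), by_floor.tolist())
```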
jezrael answered Oct 09 '22 05:10


First, make sure the column is in datetime format:

    df['date & time of connection'] = pd.to_datetime(df['date & time of connection'])

Then you can group the data by date and do a count:

    df.groupby(by=df['date & time of connection'].dt.date).count()
    Out[10]:
                               date & time of connection
    date & time of connection
    2017-06-19                                         3
    2017-06-20                                         2
    2017-06-21                                         5
    2017-06-22                                         3
    2017-06-23                                         6
Allen answered Oct 09 '22 06:10