
Pandas how to split dataframe by column by interval

I have a very large DataFrame with a datetime column called dt, and the DataFrame is already sorted by dt. I want to split it into several DataFrames based on dt, so that each one contains only the rows that fall within a one-hour range.

Split

   dt                    text
0  20160811 11:05        a
1  20160811 11:35        b
2  20160811 12:03        c
3  20160811 12:36        d
4  20160811 12:52        e
5  20160811 14:32        f

into

   dt                    text
0  20160811 11:05        a
1  20160811 11:35        b
2  20160811 12:03        c

   dt                    text
0  20160811 12:36        d
1  20160811 12:52        e

   dt                    text 
0  20160811 14:32        f
9blue asked Sep 21 '16 06:09


1 Answer

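Group by the number of whole hours elapsed since the first value of column dt. Subtract the first timestamp from every row and floor-divide the difference by one hour:

import pandas as pd

S = pd.to_datetime(df['dt'])
# Floor division gives whole elapsed hours; recent pandas versions
# reject astype('timedelta64[h]'), so it is avoided here.
for i, g in df.groupby((S - S.iloc[0]) // pd.Timedelta(hours=1)):
    print(g.reset_index(drop=True))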

               dt text
0  20160811 11:05    a
1  20160811 11:35    b
2  20160811 12:03    c
               dt text
0  20160811 12:36    d
1  20160811 12:52    e
               dt text
0  20160811 14:32    f

List comprehension solution:

S = pd.to_datetime(df['dt'])

# whole hours elapsed since the first timestamp
print((S - S.iloc[0]) // pd.Timedelta(hours=1))
0    0
1    0
2    0
3    1
4    1
5    3
Name: dt, dtype: int64

L = [g.reset_index(drop=True) for i, g in df.groupby((S - S.iloc[0]) // pd.Timedelta(hours=1))]

print (L[0])
               dt text
0  20160811 11:05    a
1  20160811 11:35    b
2  20160811 12:03    c

print (L[1])
               dt text
0  20160811 12:36    d
1  20160811 12:52    e

print (L[2])
               dt text
0  20160811 14:32    f
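
For reference, the elapsed-hour split above can be put together as a self-contained script. This is only a sketch: the helper name split_by_elapsed and its width parameter are hypothetical (they are not part of pandas), and the explicit format string assumes the dt values look exactly like the sample data.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "dt": ["20160811 11:05", "20160811 11:35", "20160811 12:03",
           "20160811 12:36", "20160811 12:52", "20160811 14:32"],
    "text": list("abcdef"),
})


def split_by_elapsed(frame, column, width=pd.Timedelta(hours=1)):
    """Hypothetical helper: split `frame` into buckets of `width`,
    measured from the first timestamp in `column`."""
    stamps = pd.to_datetime(frame[column], format="%Y%m%d %H:%M")
    # Integer bucket number: 0 for the first hour, 1 for the second, ...
    bucket = (stamps - stamps.iloc[0]) // width
    return [g.reset_index(drop=True) for _, g in frame.groupby(bucket)]


parts = split_by_elapsed(df, "dt")
for part in parts:
    print(part)
```

Note that the buckets are anchored at the first timestamp (11:05 here), so the second bucket covers 12:05 to 13:05 rather than the clock hour 12:00 to 13:00. Empty buckets, such as the one starting at 13:05, simply produce no DataFrame.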

Old solution, which splits by clock hour:

You can group by dt.hour, but first you need to convert dt with to_datetime:

for i, g in df.groupby([pd.to_datetime(df.dt).dt.hour]):
    print (g.reset_index(drop=True))

               dt text
0  20160811 11:05    a
1  20160811 11:35    b
               dt text
0  20160811 12:03    c
1  20160811 12:36    d
2  20160811 12:52    e
               dt text
0  20160811 14:32    f

List comprehension solution:

L = [g.reset_index(drop=True) for i, g in df.groupby([pd.to_datetime(df.dt).dt.hour])]

print (L[0])
               dt text
0  20160811 11:05    a
1  20160811 11:35    b

print (L[1])
               dt text
0  20160811 12:03    c
1  20160811 12:36    d
2  20160811 12:52    e

print (L[2])
               dt text
0  20160811 14:32    f

Or convert column dt to datetime in place first and then use the list comprehension:

df['dt'] = pd.to_datetime(df['dt'])
L = [g.reset_index(drop=True) for i, g in df.groupby([df['dt'].dt.hour])]

print (L[1])
                   dt text
0 2016-08-11 12:03:00    c
1 2016-08-11 12:36:00    d
2 2016-08-11 12:52:00    e

print (L[2])
                   dt text
0 2016-08-11 14:32:00    f
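
If it helps to look up a piece by its hour instead of by list position, the same clock-hour grouping can be collected into a dict. This is a minimal sketch on the question's sample data; the name by_hour is made up for illustration, and the format string assumes the dt values match the sample.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "dt": ["20160811 11:05", "20160811 11:35", "20160811 12:03",
           "20160811 12:36", "20160811 12:52", "20160811 14:32"],
    "text": list("abcdef"),
})

S = pd.to_datetime(df["dt"], format="%Y%m%d %H:%M")

# Map each hour of the day to the rows that fall in it.
by_hour = {int(hour): g.reset_index(drop=True)
           for hour, g in df.groupby(S.dt.hour)}

print(sorted(by_hour))  # [11, 12, 14]
print(by_hour[12])
```

Keep in mind that the hour of the day alone ignores the date, so rows from different days that share an hour end up in the same group.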

If you need to split by both date and hour:

#changed dataframe for testing
print (df)
               dt text
0  20160811 11:05    a
1  20160812 11:35    b
2  20160813 12:03    c
3  20160811 12:36    d
4  20160811 12:52    e
5  20160811 14:32    f

serie = pd.to_datetime(df.dt)
for i, g in df.groupby([serie.dt.date, serie.dt.hour]):
    print (g.reset_index(drop=True))
               dt text
0  20160811 11:05    a
               dt text
0  20160811 12:36    d
1  20160811 12:52    e
               dt text
0  20160811 14:32    f
               dt text
0  20160812 11:35    b
               dt text
0  20160813 12:03    c    
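
As an alternative to grouping by two keys, the date and the hour can be combined into a single key by flooring each timestamp to the start of its hour with Series.dt.floor. This is a different technique from the one shown above, sketched here on the changed test DataFrame; the format string assumes the dt values match the sample.

```python
import pandas as pd

# The changed test data from the answer (rows span three dates).
df = pd.DataFrame({
    "dt": ["20160811 11:05", "20160812 11:35", "20160813 12:03",
           "20160811 12:36", "20160811 12:52", "20160811 14:32"],
    "text": list("abcdef"),
})

serie = pd.to_datetime(df["dt"], format="%Y%m%d %H:%M")

# Each key is a Timestamp such as 2016-08-11 12:00:00, so rows from
# different days never share a group even when the hour matches.
groups = [(key, g.reset_index(drop=True))
          for key, g in df.groupby(serie.dt.floor("h"))]

for key, g in groups:
    print(key)
    print(g)
```

The groups come out sorted by the floored timestamp, which gives the same order as the date and hour grouping above.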
jezrael answered Oct 26 '22 23:10