Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Modular arithmetic in python to iterate a pandas dataframe

Ok, I have a big dataframe such as:

      hour    value
  0      0      1
  1      6      2
  2     12      3
  3     18      4
  4      0      5
  5      6      6
  6     12      7
  7     18      8
  8      6      9
  9     12     10
 10     18     11
 11     12     12
 12     18     13
 13      0     14

Let's don't get lost here. The column hour represents the hours of the day, from 6 to 6 hours. Column values is well, exactly that, here the values are as an example, not the actual ones.

If you look closely to the hour column, you can see that there are hours missing. For instance, there is a gap between rows 7 and 8 (the value of hour 0 is missing). There are also bigger gaps, such as in between rows 10 and 11 (hours 00 and 06).

What do I need? I would like to check when an hour (and of course) a value is missing, and complete the dataframe inserting a row there with the corresponding hour and a np.nan as value.

What have I thought? I think this would be easily solved using modular arithmetic, in this case with mod 24, such as when 18 + 6 = 24 = 0 mod 24. So initializing the counter to zero and adding 6 with the caveat that the counter is defined in modular arithmetic mod 24 you can verify if each hour is the corresponding hour, and if not, insert a new row with the corresponding hour and with np.nan as value.

I don't know how to do the implementation of modular arithmetic in python to iterate a dataframe column.

Thank you very much.

like image 372
David Avatar asked May 19 '16 15:05

David


People also ask

How do I iterate through a panda in Python?

In order to iterate over rows, we apply a function itertuples() this function return a tuple for each row in the DataFrame. The first element of the tuple will be the row's corresponding index value, while the remaining values are the row values.

How do you iterate a panda in DF?

DataFrame. iterrows() method is used to iterate over DataFrame rows as (index, Series) pairs. Note that this method does not preserve the dtypes across rows due to the fact that this method will convert each row into a Series .

What is the fastest way to iterate over pandas DataFrame?

Vectorization is always the first and best choice. You can convert the data frame to NumPy array or into dictionary format to speed up the iteration workflow. Iterating through the key-value pair of dictionaries comes out to be the fastest way with around 280x times speed up for 20 million records.

How do you use Panda modulus?

Pandas DataFrame mod() MethodThe mod() method finds the modulo (remainder) of each value in the DataFrame when you divide them by a a specified value. The specified value must be an object that can be divided from the values in the DataFrame.


1 Answers

Solution

group_hours = (df.hour <= df.hour.shift()).cumsum()

def insert_missing_hours(df):
    return df.set_index('hour').reindex([0, 6, 12, 18]).reset_index()

df.groupby(group_hours).apply(insert_missing_hours).reset_index(drop=1)

Looks like:

    hour  value
0      0    1.0
1      6    2.0
2     12    3.0
3     18    4.0
4      0    5.0
5      6    6.0
6     12    7.0
7     18    8.0
8      0    NaN
9      6    9.0
10    12   10.0
11    18   11.0
12     0    NaN
13     6    NaN
14    12   12.0
15    18   13.0
16     0   14.0
17     6    NaN
18    12    NaN
19    18    NaN

Explanation

In order to apply reindex I needed to determine which rows to group. I checked to see if row's hour was less or equal than prior row's hour. If so, that flags a new group.

insert_missing_hours is precisely the reindex of subgroups with [0, 6, 12, 18].

like image 159
piRSquared Avatar answered Nov 07 '22 03:11

piRSquared