
Pandas: How do I return a row value once a column reaches a certain value of another column?

Here is a sample of the data (see the copyable CSV further below):


Goal:
Create a new timestamp column for when running_bid_max is greater than or equal to the value in ask_price_target_good. Then create a separate timestamp column for when running_bid_min is less than or equal to ask_price_target_bad.

Note: This will be performed on a large amount of data and needs to be calculated as fast as possible. I'm hoping I don't have to iterate through all rows via iterrows()

running_bid_min and running_bid_max are calculated as a running min/max over a window of future data (this example uses a 5 minute window, so each value is the running min/max over the 5 minutes following the current row's time)
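For context (this is not from the question), such a forward-looking 5-minute rolling min/max might be built with a reversed-time trick; the sketch below assumes time is sorted and parsed as datetimes and uses the column names from the sample data:

import pandas as pd

df['time'] = pd.to_datetime(df['time'])
df = df.sort_values('time').reset_index(drop=True)

# measure time backwards from the last row and reverse the series, so that a
# trailing 5-minute window in "reversed time" is a forward window in real time
rev = df['bid_price'].iloc[::-1]
rev.index = pd.TimedeltaIndex((df['time'].max() - df['time']).iloc[::-1])

df['running_bid_max'] = rev.rolling('5min').max().to_numpy()[::-1]
df['running_bid_min'] = rev.rolling('5min').min().to_numpy()[::-1]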

Copy the data below and then use df = pd.read_clipboard(sep=',') to load it (an alternative that does not need the clipboard is sketched after the data)

time,bid_price,ask_price,running_bid_max,running_bid_min,ask_price_target_good,ask_price_target_bad
2019-07-24 07:59:44.432034,291.06,291.26,291.4,291.09,291.46,291.06
2019-07-24 07:59:46.393418,291.1,291.33,291.4,291.09,291.53,291.13
2019-07-24 07:59:48.425615,291.1,291.33,291.4,291.09,291.53,291.13
2019-07-24 07:59:50.084206,291.12,291.33,291.4,291.09,291.53,291.13
2019-07-24 07:59:52.326455,291.12,291.33,291.4,291.09,291.53,291.13
2019-07-24 07:59:54.428181,291.12,291.33,291.4,291.09,291.53,291.13
2019-07-24 07:59:58.550378,291.14,291.35,291.4,291.2,291.55,291.15
2019-07-24 08:00:00.837238,291.2,291.35,291.4,291.2,291.55,291.15
2019-07-24 08:00:57.338769,291.4,291.46,291.51,291.4,291.66,291.26
2019-07-24 08:00:59.058198,291.4,291.46,291.96,291.4,291.66,291.26
2019-07-24 08:01:00.802679,291.4,291.46,291.96,291.4,291.66,291.26
2019-07-24 08:01:02.781289,291.4,291.46,291.96,291.45,291.66,291.26
2019-07-24 08:01:04.645144,291.45,291.46,291.96,291.45,291.66,291.26
2019-07-24 08:01:06.491997,291.45,291.46,292.07,291.45,291.66,291.26
2019-07-24 08:01:08.586688,291.45,291.46,292.1,291.45,291.66,291.26
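If the clipboard route is inconvenient, the same sample can also be loaded from a plain string (a small sketch, not part of the question; paste the remaining rows into the string):

import io
import pandas as pd

csv_text = """time,bid_price,ask_price,running_bid_max,running_bid_min,ask_price_target_good,ask_price_target_bad
2019-07-24 07:59:44.432034,291.06,291.26,291.4,291.09,291.46,291.06
2019-07-24 07:59:46.393418,291.1,291.33,291.4,291.09,291.53,291.13"""  # ... remaining rows go here

df = pd.read_csv(io.StringIO(csv_text), parse_dates=['time'])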
asked Jul 24 '19 by bbennett36



2 Answers

From your question:

creating a new timestamp column for when running_bid_max is greater than or equal to the value in ask_price_target_good. Then create a separate timestamp column for when running_bid_min is less than or equal to ask_price_target_bad

the problem seems trivial:

import numpy as np
import pandas as pd

# take the row's timestamp where the condition holds, NaT otherwise
df['g'] = np.where(df.running_bid_max.ge(df.ask_price_target_good), df['time'], pd.NaT)
df['l'] = np.where(df.running_bid_min.le(df.ask_price_target_bad), df['time'], pd.NaT)

Or am I missing something?


Update: you might want to ffill and bfill after the above commands:

df['g'] = df['g'].bfill()
df['l'] = df['l'].ffill()

Output, for example df['g']:

0    2019-07-24 08:00:59.058198
1    2019-07-24 08:00:59.058198
2    2019-07-24 08:00:59.058198
3    2019-07-24 08:00:59.058198
4    2019-07-24 08:00:59.058198
5    2019-07-24 08:00:59.058198
6    2019-07-24 08:00:59.058198
7    2019-07-24 08:00:59.058198
8    2019-07-24 08:00:59.058198
9    2019-07-24 08:00:59.058198
10   2019-07-24 08:01:00.802679
11   2019-07-24 08:01:02.781289
12   2019-07-24 08:01:04.645144
13   2019-07-24 08:01:06.491997
14   2019-07-24 08:01:08.586688
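
A small dtype note (not part of the original answer): because np.where mixes Timestamps with pd.NaT, g and l may come back as object-dtype columns; if proper datetime64 columns are needed, they can be converted afterwards, for example:

# assumes df already has the 'g' and 'l' columns created above
df['g'] = pd.to_datetime(df['g'])
df['l'] = pd.to_datetime(df['l'])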
answered Sep 23 '22 by Quang Hoang


It would be very nice if you could post the desired output; otherwise I may be missing the logic.

If you are working on a large amount of data, it makes sense to apply streaming analytics*. (This is quite memory efficient, and if you use cytoolz it can even be 2-4 times faster.)

So basically you would like to partition your data based on either one or the other condition:

import toolz  # or cytoolz for the speed-up mentioned above

# lazily split the stream into runs of consecutive rows that share the same truth value
partitions = toolz.partitionby(lambda x: (x['running_bid_max'] >= x['ask_price_target_good']) or
                                         (x['running_bid_min'] <= x['ask_price_target_bad']), data_stream())
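
For intuition (a tiny example, not from the answer): partitionby groups consecutive items for which the key function returns the same value:

import toolz

list(toolz.partitionby(lambda x: x >= 3, [1, 2, 3, 4, 1, 5]))
# -> [(1, 2), (3, 4), (1,), (5,)]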

Whatever you do with the individual partitions is up to you (you can create additional fields or columns, etc.).

# partitionby returns a lazy, one-shot iterator; materialize it first
# (e.g. partitions = list(partitions)) if you also want to reuse it below
print([(part[0]['time'], part[-1]['time'], 
        part[0]['running_bid_max'] > part[0]['ask_price_target_good'],
        part[0]['running_bid_min'] > part[0]['ask_price_target_bad']) 
       for part in partitions])
[('2019-07-24T07:59:46.393418', '2019-07-24T07:59:46.393418', False, False), 
 ('2019-07-24T07:59:44.432034', '2019-07-24T07:59:44.432034', False,  True), 
 ('2019-07-24T07:59:48.425615', '2019-07-24T07:59:54.428181', False, False), 
 ('2019-07-24T07:59:58.550378', '2019-07-24T08:00:57.338769', False,  True), 
 ('2019-07-24T08:00:59.058198', '2019-07-24T08:01:08.586688',  True,  True)]

Also note that it is easy to create individual DataFrames from the partitions:

import pandas

info_cols = ['running_bid_max', 'ask_price_target_good', 'running_bid_min', 'ask_price_target_bad', 'time'] 
data_frames = [pandas.DataFrame(_)[info_cols] for _ in partitions]
data_frames
   running_bid_max  ask_price_target_good  running_bid_min  ask_price_target_bad                        time
0            291.4                 291.53           291.09                291.13  2019-07-24T07:59:46.393418

   running_bid_max  ask_price_target_good  running_bid_min  ask_price_target_bad                        time
0            291.4                 291.46           291.09                291.06  2019-07-24T07:59:44.432034

   running_bid_max  ask_price_target_good  running_bid_min  ask_price_target_bad                        time
0            291.4                 291.53           291.09                291.13  2019-07-24T07:59:48.425615
1            291.4                 291.53           291.09                291.13  2019-07-24T07:59:50.084206
2            291.4                 291.53           291.09                291.13  2019-07-24T07:59:52.326455
3            291.4                 291.53           291.09                291.13  2019-07-24T07:59:54.428181

   running_bid_max  ask_price_target_good  running_bid_min  ask_price_target_bad                        time
0           291.40                 291.55            291.2                291.15  2019-07-24T07:59:58.550378
1           291.40                 291.55            291.2                291.15  2019-07-24T08:00:00.837238
2           291.51                 291.66            291.4                291.26  2019-07-24T08:00:57.338769

   running_bid_max  ask_price_target_good  running_bid_min  ask_price_target_bad                        time
0           291.96                 291.66           291.40                291.26  2019-07-24T08:00:59.058198
1           291.96                 291.66           291.40                291.26  2019-07-24T08:01:00.802679
2           291.96                 291.66           291.45                291.26  2019-07-24T08:01:02.781289
3           291.96                 291.66           291.45                291.26  2019-07-24T08:01:04.645144
4           292.07                 291.66           291.45                291.26  2019-07-24T08:01:06.491997
5           292.10                 291.66           291.45                291.26  2019-07-24T08:01:08.586688

Unfortunately I couldn't find a one-liner partitionby for DataFrames; it is surely hidden somewhere. (But again, pandas usually loads all data into memory - if you want to aggregate during I/O, then streaming could be a way to go.)
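
That said, a rough pandas-only equivalent of the partitioning can be sketched with groupby on the change points of the condition (an assumption-laden sketch, not from the answer; it uses the column names from the sample data):

# True whenever either condition from the question holds
cond = (df['running_bid_max'].ge(df['ask_price_target_good'])
        | df['running_bid_min'].le(df['ask_price_target_bad']))
block_id = cond.ne(cond.shift()).cumsum()   # increments each time cond flips
data_frames = [g for _, g in df.groupby(block_id, sort=False)]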


*Streaming Example

For example, let us create a simple CSV stream:

import csv

def data_stream():
    with open('blubb.csv') as tsfile:
        # adjust the delimiter to match your file (the sample above is comma-separated)
        reader = csv.DictReader(tsfile, delimiter='\t')
        number_keys = [_ for _ in reader.fieldnames if _ != 'time']

        def update_values(data_item):
            for k in number_keys:
                data_item[k] = float(data_item[k])
            return data_item
        for row in reader:
            yield update_values(dict(row))

that yields one processed row at a time:

next(data_stream())

{'time': '2019-07-24T07:59:46.393418',
 'bid_price': 291.1,
 'ask_price': 291.33,
 'running_bid_max': 291.4,
 'running_bid_min': 291.09,
 'ask_price_target_good': 291.53,
 'ask_price_target_bad': 291.13}
answered Sep 22 '22 by Drey