After processing, my data is one table with several columns that are features and one column which is a label. I would like to use featuretools.dfs to help me predict the label. Is it possible to do it directly, or do I need to split my single table into multiple?
It is possible to run DFS on a single table. As an example, if you have a pandas dataframe df with index 'index', you would write:
import featuretools as ft

# df is the single table described in the question
es = ft.EntitySet('Transactions')
es.entity_from_dataframe(dataframe=df,
                         entity_id='log',
                         index='index')

fm, features = ft.dfs(entityset=es,
                      target_entity='log',
                      trans_primitives=['day', 'weekday', 'month'])
The generated feature matrix will look like
In [1]: fm
Out[1]:
             location  pies sold  WEEKDAY(date)  MONTH(date)  DAY(date)
index
1         main street          3              4           12         29
2         main street          4              5           12         30
3         main street          5              6           12         31
4      arlington ave.         18              0            1          1
5      arlington ave.          1              1            1          2
This will apply “transform” primitives to your data. To use aggregation primitives as well, you usually want to give ft.dfs more entities; you can read about the difference between the two kinds of primitives in our documentation.
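If you are not sure which built-in primitives are transform primitives and which are aggregation primitives, you can ask featuretools to list them. A minimal sketch (the exact columns of the returned dataframe may vary between versions):

import featuretools as ft

# list_primitives() returns a dataframe describing the built-in primitives;
# its 'type' column distinguishes aggregation from transform primitives.
primitives = ft.list_primitives()
print(primitives[primitives['type'] == 'aggregation'].head())
print(primitives[primitives['type'] == 'transform'].head())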
A standard workflow is to normalize your single entity by an interesting categorical. If your df was the single table
| index | location       | pies sold | date       |
|-------+----------------+-----------+------------|
|     1 | main street    |         3 | 2017-12-29 |
|     2 | main street    |         4 | 2017-12-30 |
|     3 | main street    |         5 | 2017-12-31 |
|     4 | arlington ave. |        18 | 2018-01-01 |
|     5 | arlington ave. |         1 | 2018-01-02 |
you would probably be interested in normalizing by location:
es.normalize_entity(base_entity_id='log',
                    new_entity_id='locations',
                    index='location')
Your new entity locations would have the table
| location       | first_log_time |
|----------------+----------------|
| main street    | 2017-12-29     |
| arlington ave. | 2018-01-01     |
which enables features like locations.SUM(log.pies sold) or locations.MEAN(log.pies sold) that sum or average all of the values for each location. You can see these features created in the example below:
In [1]: import pandas as pd
   ...: import featuretools as ft
   ...: df = pd.DataFrame({'index': [1, 2, 3, 4, 5],
   ...:                    'location': ['main street',
   ...:                                 'main street',
   ...:                                 'main street',
   ...:                                 'arlington ave.',
   ...:                                 'arlington ave.'],
   ...:                    'pies sold': [3, 4, 5, 18, 1]})
   ...: df['date'] = pd.date_range('12/29/2017', periods=5, freq='D')
   ...: df
   ...:
Out[1]:
   index        location  pies sold       date
0      1     main street          3 2017-12-29
1      2     main street          4 2017-12-30
2      3     main street          5 2017-12-31
3      4  arlington ave.         18 2018-01-01
4      5  arlington ave.          1 2018-01-02
In [2]: es = ft.EntitySet('Transactions')
   ...: es.entity_from_dataframe(dataframe=df, entity_id='log', index='index',
   ...:                          time_index='date')
   ...: es.normalize_entity(base_entity_id='log', new_entity_id='locations',
   ...:                     index='location')
   ...:
Out[2]:
Entityset: Transactions
  Entities:
    log [Rows: 5, Columns: 4]
    locations [Rows: 2, Columns: 2]
  Relationships:
    log.location -> locations.location
In [3]: fm, features = ft.dfs(entityset=es,
   ...:                       target_entity='log',
   ...:                       agg_primitives=['sum', 'mean'],
   ...:                       trans_primitives=['day'])
   ...: fm
   ...:
Out[3]:
             location  pies sold  DAY(date)  locations.DAY(first_log_time)  locations.MEAN(log.pies sold)  locations.SUM(log.pies sold)
index
1         main street          3         29                             29                            4.0                            12
2         main street          4         30                             29                            4.0                            12
3         main street          5         31                             29                            4.0                            12
4      arlington ave.         18          1                              1                            9.5                            19
5      arlington ave.          1          2                              1                            9.5                            19
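Back to the original question about the label: ft.dfs only builds the feature matrix; the prediction step uses whatever model you like on top of it. A minimal sketch, assuming scikit-learn is installed and assuming the label lives in a hypothetical pandas Series y that shares the same 'index' as fm:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# fm is the feature matrix produced by ft.dfs above; y is a hypothetical
# pandas Series holding the label, indexed the same way as fm.
X = pd.get_dummies(fm)              # one-hot encode categoricals such as 'location'
y = y.reindex(X.index)              # make sure labels line up with the feature rows

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)
predictions = model.predict(X)

If the label is a column of your original df, leave it out of the entityset (or drop it from fm) so that it does not leak into the generated features.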