After processing, my data is one table with several columns that are features and one column that is a label. I would like to use featuretools.dfs to help me predict the label. Is it possible to do this directly, or do I need to split my single table into multiple tables?
It is possible to run DFS on a single table. For example, if you have a pandas DataFrame df with index 'index', you would write:
import featuretools as ft

es = ft.EntitySet('Transactions')
es.entity_from_dataframe(dataframe=df,
                         entity_id='log',
                         index='index')
fm, features = ft.dfs(entityset=es,
                      target_entity='log',
                      trans_primitives=['day', 'weekday', 'month'])
The generated feature matrix will look like
In [1]: fm
Out[1]:
              location  pies sold  WEEKDAY(date)  MONTH(date)  DAY(date)
index
1          main street          3              4           12         29
2          main street          4              5           12         30
3          main street          5              6           12         31
4       arlington ave.         18              0            1          1
5       arlington ave.          1              1            1          2
This will apply "transform" primitives to your data. Usually you will want to give ft.dfs additional entities so that it can also apply aggregation primitives. You can read about the difference between the two kinds of primitives in our documentation.
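As a quick way to see what is available, you can list the built-in primitives from the library itself. This is only a minimal sketch, assuming a featuretools release whose ft.list_primitives() returns a table with a 'type' column:

import featuretools as ft

# List built-in primitives; the 'type' column distinguishes
# 'aggregation' primitives from 'transform' primitives.
primitives = ft.list_primitives()
print(primitives[primitives['type'] == 'aggregation'].head())
print(primitives[primitives['type'] == 'transform'].head())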
A standard workflow is to normalize your single entity by an interesting categorical column. If your df were the single table
| index | location       | pies sold | date       |
|-------+----------------+-----------+------------|
| 1     | main street    | 3         | 2017-12-29 |
| 2     | main street    | 4         | 2017-12-30 |
| 3     | main street    | 5         | 2017-12-31 |
| 4     | arlington ave. | 18        | 2018-01-01 |
| 5     | arlington ave. | 1         | 2018-01-02 |
you would probably be interested in normalizing by location:
es.normalize_entity(base_entity_id='log',
                    new_entity_id='locations',
                    index='location')
Your new entity locations would have the table
| location       | first_log_time |
|----------------+----------------|
| main street    | 2017-12-29     |
| arlington ave. | 2018-01-01     |
which makes possible features like locations.SUM(log.pies sold) or locations.MEAN(log.pies sold), which sum or average all pie sales by location. You can see these features created in the example below:
In [1]: import pandas as pd
   ...: import featuretools as ft
   ...: df = pd.DataFrame({'index': [1, 2, 3, 4, 5],
   ...:                    'location': ['main street',
   ...:                                 'main street',
   ...:                                 'main street',
   ...:                                 'arlington ave.',
   ...:                                 'arlington ave.'],
   ...:                    'pies sold': [3, 4, 5, 18, 1]})
   ...: df['date'] = pd.date_range('12/29/2017', periods=5, freq='D')
   ...: df
   ...:
Out[1]:
   index        location  pies sold       date
0      1     main street          3 2017-12-29
1      2     main street          4 2017-12-30
2      3     main street          5 2017-12-31
3      4  arlington ave.         18 2018-01-01
4      5  arlington ave.          1 2018-01-02
In [2]: es = ft.EntitySet('Transactions')
   ...: es.entity_from_dataframe(dataframe=df, entity_id='log', index='index',
   ...:                          time_index='date')
   ...: es.normalize_entity(base_entity_id='log', new_entity_id='locations',
   ...:                     index='location')
   ...:
Out[2]:
Entityset: Transactions
Entities:
log [Rows: 5, Columns: 4]
locations [Rows: 2, Columns: 2]
Relationships:
log.location -> locations.location
In [3]: fm, features = ft.dfs(entityset=es,
   ...:                       target_entity='log',
   ...:                       agg_primitives=['sum', 'mean'],
   ...:                       trans_primitives=['day'])
   ...: fm
   ...:
Out[3]:
              location  pies sold  DAY(date)  locations.DAY(first_log_time)  locations.MEAN(log.pies sold)  locations.SUM(log.pies sold)
index
1          main street          3         29                             29                            4.0                            12
2          main street          4         30                             29                            4.0                            12
3          main street          5         31                             29                            4.0                            12
4       arlington ave.         18          1                              1                            9.5                            19
5       arlington ave.          1          2                              1                            9.5                            19
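From here you can join your label back onto the feature matrix and fit a model. The following is only a minimal sketch: the 'label' column name is hypothetical, and it assumes scikit-learn is installed; ft.encode_features is used to one-hot encode categorical features such as location.

from sklearn.ensemble import RandomForestClassifier

# One-hot encode categorical features such as 'location'.
fm_encoded, features_encoded = ft.encode_features(fm, features)

# Align the (hypothetical) 'label' column from the original table
# with the feature matrix index.
y = df.set_index('index').loc[fm_encoded.index, 'label']

clf = RandomForestClassifier()
clf.fit(fm_encoded, y)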