I know Featuretools has the ft.calculate_feature_matrix method, but it calculates the features using the test data too. I need to compute the features on the training data only, and then join them onto the test data, rather than recomputing the same features from the test data. For example, given this train data:
id sex score
1 f 100
2 f 200
3 m 10
4 m 20
After dfs, I get:
id sex score sex.mean(score)
1 f 100 150
2 f 200 150
3 m 10 15
4 m 20 15
On the test set, I want to get this:
id sex score sex.mean(score)
5 f 30 150
6 f 40 150
7 m 50 15
8 m 60 15
not this:
id sex score sex.mean(score)
5 f 30 35
6 f 40 35
7 m 50 55
8 m 60 55
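For clarity, in plain pandas the result I want could be produced by something like this sketch (train_df and test_df are just illustrative names for the two splits, not code I am actually using):

import pandas as pd

# illustrative train/test split matching the tables above
train_df = pd.DataFrame({"id": [1, 2, 3, 4],
                         "sex": ["f", "f", "m", "m"],
                         "score": [100, 200, 10, 20]})
test_df = pd.DataFrame({"id": [5, 6, 7, 8],
                        "sex": ["f", "f", "m", "m"],
                        "score": [30, 40, 50, 60]})

# compute the aggregate on the training data only...
train_means = train_df.groupby("sex")["score"].mean().rename("sex.mean(score)")

# ...then join it onto the test data instead of recomputing it there
test_with_feature = test_df.join(train_means, on="sex")
print(test_with_feature)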
How can I achieve this in Featuretools? Thank you.
Featuretools works best with data that has been annotated directly with time information to handle cases like this. Then, when calculating your features, you specify a "cutoff time", and any data that occurs after it is filtered out of the calculation. If we restructure your data and add in some time information, Featuretools can accomplish what you want.
First, let me create a DataFrame of people
import pandas as pd
people = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "sex": ['f', 'f', 'm', 'm', 'f', 'f', 'm', 'm']})
which looks like this
id sex
0 1 f
1 2 f
2 3 m
3 4 m
4 5 f
5 6 f
6 7 m
7 8 m
Then, let's create a separate DataFrame of scores where we annotate each score with the time at which it occurred. This can be either a datetime or an integer. For simplicity in this example, I'll use time 0 for the training data and time 1 for the test data.
scores = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "person_id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "time": [0, 0, 0, 0, 1, 1, 1, 1],
                       "score": [100, 200, 10, 20, 30, 40, 50, 60]})
which looks like this
id person_id score time
0 1 1 100 0
1 2 2 200 0
2 3 3 10 0
3 4 4 20 0
4 5 5 30 1
5 6 6 40 1
6 7 7 50 1
7 8 8 60 1
Now, let's create an EntitySet in Featuretools specifying the "time index" in the scores entity
import featuretools as ft

es = ft.EntitySet('example')

es.entity_from_dataframe(dataframe=people,
                         entity_id='people',
                         index='id')

es.entity_from_dataframe(dataframe=scores,
                         entity_id='scores',
                         index='id',
                         time_index="time")

# create a sexes entity
es.normalize_entity(base_entity_id="people", new_entity_id="sexes", index="sex")

# add relationship for scores to person
scores_relationship = ft.Relationship(es["people"]["id"],
                                      es["scores"]["person_id"])
es = es.add_relationship(scores_relationship)

es
Here is our entity set
Entityset: example
  Entities:
    scores [Rows: 8, Columns: 4]
    sexes [Rows: 2, Columns: 1]
    people [Rows: 8, Columns: 2]
  Relationships:
    scores.person_id -> people.id
    people.sex -> sexes.sex
Next, let's calculate the feature of interest. Notice how we use the cutoff_time argument to specify the last time data is allowed to be used for the calculation. This ensures none of our test data is made available during the calculation.
from featuretools.primitives import Mean
mean_by_sex = ft.Feature(Mean(es["scores"]["score"], es["sexes"]), es["people"])
ft.calculate_feature_matrix(entityset=es, features=[mean_by_sex], cutoff_time=0)
The output is now
sexes.MEAN(scores.score)
id
1 150
2 150
3 15
4 15
5 150
6 150
7 15
8 15
This functionality is powerful because it lets us handle time in a more fine-grained manner than a single train/test split.
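For example, instead of a single global cutoff, you can give every row its own cutoff time by passing a DataFrame to cutoff_time. Here is a rough sketch along those lines, reusing the es and mean_by_sex defined above (the instance_id/time column names follow the cutoff_time DataFrame convention; exact behavior at the cutoff may vary by Featuretools version):

import pandas as pd

# Each person gets their own cutoff: people 1-4 may only use data at time 0,
# while people 5-8 may also use data at time 1. With a DataFrame like this
# you control, per row, how much history is visible to the calculation.
cutoff_times = pd.DataFrame({"instance_id": [1, 2, 3, 4, 5, 6, 7, 8],
                             "time": [0, 0, 0, 0, 1, 1, 1, 1]})

fm = ft.calculate_feature_matrix(entityset=es,
                                 features=[mean_by_sex],
                                 cutoff_time=cutoff_times)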
For information on how time indexes work in Featuretools, read the Handling Time page in the documentation.
EDIT
If you want to automatically define many features, you can use Deep Feature Synthesis by calling ft.dfs
feature_list = ft.dfs(target_entity="people",
                      entityset=es,
                      agg_primitives=["count", "std", "max"],
                      features_only=True)
feature_list
This returns feature definitions that you can pass to ft.calculate_feature_matrix (see the sketch after the list)
[<Feature: sex>,
<Feature: MAX(scores.score)>,
<Feature: STD(scores.time)>,
<Feature: STD(scores.score)>,
<Feature: COUNT(scores)>,
<Feature: MAX(scores.time)>,
<Feature: sexes.STD(scores.score)>,
<Feature: sexes.COUNT(people)>,
<Feature: sexes.STD(scores.time)>,
<Feature: sexes.MAX(scores.score)>,
<Feature: sexes.MAX(scores.time)>,
<Feature: sexes.COUNT(scores)>]
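As a rough sketch, you could then calculate all of these features with the same training cutoff used earlier (reusing the es and feature_list defined above):

# Compute every DFS-defined feature using only data at or before time 0,
# i.e. only the training scores, for all eight people.
feature_matrix = ft.calculate_feature_matrix(entityset=es,
                                             features=feature_list,
                                             cutoff_time=0)
print(feature_matrix)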
Read more about DFS in this write-up