Here is the code I tried:
# missing imports for the snippet below
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import layers

# normalizing the train data (min-max scaling each column to [0, 1])
cols_to_norm = ['WORK_EDUCATION', 'SHOP', 'OTHER', 'AM', 'PM', 'MIDDAY', 'NIGHT', 'AVG_VEH_CNT',
                'work_traveltime', 'shop_traveltime', 'work_tripmile', 'shop_tripmile', 'TRPMILES_sum',
                'TRVL_MIN_sum', 'TRPMILES_mean', 'HBO', 'HBSHOP', 'HBW', 'NHB', 'DWELTIME_mean',
                'TRVL_MIN_mean', 'work_dweltime', 'shop_dweltime', 'firsttrip_time', 'lasttrip_time']
dataframe[cols_to_norm] = dataframe[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
# labels
y = dataframe.R_SEX.values
# splitting the feature matrix X (built from dataframe) into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = Sequential()
model.add(Dense(256, input_shape=(X_train.shape[1],), activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(layers.Dropout(0.3))
model.add(Dense(256, activation='relu'))
model.add(layers.Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam' , metrics=['acc'])
print(model.summary())
model.fit(X_train, y_train , batch_size=128, epochs=30, validation_split=0.2)
Epoch 23/30
1014/1014 [==============================] - 4s 4ms/step - loss: 0.6623 - acc: 0.5985 - val_loss: 0.6677 - val_acc: 0.5918
Epoch 24/30
1014/1014 [==============================] - 4s 4ms/step - loss: 0.6618 - acc: 0.5993 - val_loss: 0.6671 - val_acc: 0.5925
Epoch 25/30
1014/1014 [==============================] - 4s 4ms/step - loss: 0.6618 - acc: 0.5997 - val_loss: 0.6674 - val_acc: 0.5904
Epoch 26/30
1014/1014 [==============================] - 4s 4ms/step - loss: 0.6614 - acc: 0.6001 - val_loss: 0.6669 - val_acc: 0.5911
Epoch 27/30
1014/1014 [==============================] - 4s 4ms/step - loss: 0.6608 - acc: 0.6004 - val_loss: 0.6668 - val_acc: 0.5920
Epoch 28/30
1014/1014 [==============================] - 4s 4ms/step - loss: 0.6605 - acc: 0.6002 - val_loss: 0.6679 - val_acc: 0.5895
Epoch 29/30
1014/1014 [==============================] - 4s 4ms/step - loss: 0.6602 - acc: 0.6009 - val_loss: 0.6663 - val_acc: 0.5932
Epoch 30/30
1014/1014 [==============================] - 4s 4ms/step - loss: 0.6597 - acc: 0.6027 - val_loss: 0.6674 - val_acc: 0.5910
<tensorflow.python.keras.callbacks.History at 0x7fdd8143a278>
I have tried modifying the neural network and double-checking the data.
Is there anything I can do to improve the outcome? Is the model not deep enough? Are there alternative models better suited to my data? Or does this mean these features have no predictive value? I'm kind of confused about what to do next.
Thank you
Update:
I tried adding a new column to my dataframe, which is the output of a KNN model for sex classification. Here is what I did:
# import the k-nearest neighbors classifier
from sklearn.neighbors import KNeighborsClassifier
# create the KNN classifier
knn = KNeighborsClassifier(n_neighbors=41)
# train the model on the training set
knn.fit(X, y)
# predict sex for the train set so that it can be fed to the neural net
y_pred = knn.predict(X)
# add the KNN output to the train set as a new feature
X = X.assign(KNN_result=y_pred)
It improved the training and validation accuracy to about 61 percent.
Epoch 26/30
1294/1294 [==============================] - 8s 6ms/step - loss: 0.6525 - acc: 0.6166 - val_loss: 0.6604 - val_acc: 0.6095
Epoch 27/30
1294/1294 [==============================] - 8s 6ms/step - loss: 0.6523 - acc: 0.6173 - val_loss: 0.6596 - val_acc: 0.6111
Epoch 28/30
1294/1294 [==============================] - 8s 6ms/step - loss: 0.6519 - acc: 0.6177 - val_loss: 0.6614 - val_acc: 0.6101
Epoch 29/30
1294/1294 [==============================] - 8s 6ms/step - loss: 0.6512 - acc: 0.6178 - val_loss: 0.6594 - val_acc: 0.6131
Epoch 30/30
1294/1294 [==============================] - 8s 6ms/step - loss: 0.6510 - acc: 0.6183 - val_loss: 0.6603 - val_acc: 0.6103
<tensorflow.python.keras.callbacks.History at 0x7fe981bbe438>
Thank you
In short: NNs are rarely the best models for classifying either small amounts of data or data that is already compactly represented by a few non-heterogeneous columns. Often enough, boosted methods or a GLM will produce better results for a similar amount of effort.
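As a quick reality check, you can compare such baselines against the network in a few lines. This is a minimal sketch (default hyperparameters, assuming the X_train/X_test split from your question):
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# gradient boosting as a tree-based baseline
gbm = GradientBoostingClassifier(random_state=42)
gbm.fit(X_train, y_train)
print("GBM test accuracy:", gbm.score(X_test, y_test))

# logistic regression as the GLM baseline
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
print("LogReg test accuracy:", logreg.score(X_test, y_test))
If neither beats ~60% by a clear margin, that is a hint the features themselves carry limited signal for this target.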
What can you do with your model? Counterintuitively, limiting the network capacity can sometimes be beneficial, especially when the number of network parameters exceeds the number of training points. You can reduce the number of neurons (in your case, setting layer sizes to 16 or so while also removing layers); introduce regularization (label smoothing, weight decay, etc.); or generate more data by adding derived columns on different (log, binary) scales.
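To make that concrete, here is a minimal sketch of a deliberately smaller network with weight decay and label smoothing; the layer sizes and regularization strengths are illustrative, not tuned:
from tensorflow.keras import Sequential, layers, losses, regularizers

# a small network: 16-unit layers, L2 weight decay, label smoothing in the loss
small_model = Sequential([
    layers.Dense(16, activation='relu', kernel_regularizer=regularizers.l2(1e-4),
                 input_shape=(X_train.shape[1],)),
    layers.Dropout(0.3),
    layers.Dense(16, activation='relu', kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dense(1, activation='sigmoid'),
])
small_model.compile(optimizer='adam',
                    loss=losses.BinaryCrossentropy(label_smoothing=0.1),
                    metrics=['acc'])
small_model.fit(X_train, y_train, batch_size=128, epochs=30, validation_split=0.2)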
Another approach would be to look for NN architectures designed for your type of data, for example Self-Normalizing Neural Networks or Wide & Deep Learning for Recommender Systems.
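For example, a self-normalizing setup in Keras uses 'selu' activations with 'lecun_normal' initialization and AlphaDropout. A sketch with arbitrary layer sizes follows; note that self-normalization assumes standardized (zero-mean, unit-variance) inputs rather than min-max scaling:
from tensorflow.keras import Sequential, layers

# self-normalizing network: SELU activations, lecun_normal init, AlphaDropout
snn = Sequential([
    layers.Dense(64, activation='selu', kernel_initializer='lecun_normal',
                 input_shape=(X_train.shape[1],)),
    layers.AlphaDropout(0.1),
    layers.Dense(64, activation='selu', kernel_initializer='lecun_normal'),
    layers.AlphaDropout(0.1),
    layers.Dense(1, activation='sigmoid'),
])
snn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])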
If you get to try only one thing, I would recommend doing a grid search over the learning rate or trying a few different optimizers.
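A minimal sketch of such a sweep is below; the epoch count and learning-rate grid are arbitrary, and the helper simply rebuilds the network from your question with fresh weights for each run:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam, SGD, RMSprop

def build_model():
    # rebuilds the architecture from the question (fresh weights each call)
    return Sequential([
        Dense(256, activation='relu', input_shape=(X_train.shape[1],)),
        Dense(256, activation='relu'),
        Dropout(0.3),
        Dense(256, activation='relu'),
        Dropout(0.3),
        Dense(1, activation='sigmoid'),
    ])

# a simple manual sweep over optimizers and learning rates
results = {}
for opt_cls in (Adam, SGD, RMSprop):
    for lr in (1e-2, 1e-3, 1e-4):
        m = build_model()
        m.compile(optimizer=opt_cls(learning_rate=lr),
                  loss='binary_crossentropy', metrics=['acc'])
        hist = m.fit(X_train, y_train, batch_size=128, epochs=10,
                     validation_split=0.2, verbose=0)
        results[(opt_cls.__name__, lr)] = max(hist.history['val_acc'])

# print configurations from best to worst validation accuracy
print(sorted(results.items(), key=lambda kv: -kv[1]))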
How do you make a better decision about which model to use? Look through finished kaggle.com competitions, find datasets similar to the one at hand, and check out the techniques used by the top places.
It seems to me that your data is not varied enough for a neural network. You have a lot of similar values in your dataset, and that might be one reason for the low accuracy. Try a simpler model instead of a neural network.
If you want to use a neural network anyway, check the basics first: 'relu' or 'linear' activations on the last layer are generally meant for regression, while a sigmoid output (which you already have) is the standard choice for a binary classification target, with 'relu' in the hidden layers.
If that does not change anything, also try other strategies, such as whitening the data:
from sklearn.decomposition import PCA

# make the train/test split first, then fit the whitening transform on the training data only
pca = PCA(whiten=True)
pca.fit(X_train)
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)  # reuse the same fitted PCA for the test set
You have a lot of zeros in your dataset. Here is the fraction of zero values per column (between 0 and 1); a short snippet for reproducing these numbers follows the list:
0.6611697598907094 WORK_EDUCATION
0.5906196483663051 SHOP
0.15968546556987515 OTHER
0.4517919980835284 AM
0.3695455825652879 PM
0.449195697003247 MIDDAY
0.8160996565242585 NIGHT
0.03156998520561604 AVG_VEH_CNT
1.618641571247746e-05 work_traveltime
2.2660981997468445e-05 shop_traveltime
0.6930343378622924 work_tripmile
0.605410795044367 shop_tripmile
0.185622578107549 TRPMILES_sum
3.237283142495492e-06 TRVL_MIN_sum
0.185622578107549 TRPMILES_mean
0.469645614614391 HBO
0.5744850291841075 HBSHOP
0.8137429143965219 HBW
0.5307266729469959 NHB
0.2017960446874565 DWELTIME_mean
1.618641571247746e-05 TRVL_MIN_mean
0.6959996892208183 work_dweltime
0.6099365168775757 shop_dweltime
0.0009258629787537107 firsttrip_time
0.002949164942813393 lasttrip_time
0.7442934791405661 age_2.0
0.7541995655566023 age_3.0
0.7081200773063214 age_4.0
0.9401296855626884 age_5.0
0.3490503429901489 KNN_result
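These fractions can be reproduced with a one-liner over the feature frame (a sketch, assuming X is the pandas DataFrame holding the columns listed above):
# fraction of exact zeros in each column, sorted from most to least sparse
zero_fractions = (X == 0).mean().sort_values(ascending=False)
print(zero_fractions)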