I am trying to shuffle and split a data file into a training set and test set using pandas and numpy, so I did the following:
import pandas as pd
import numpy as np
data_path = "/path_to_data_file/"
train = pd.read_csv(data_path+"product.txt", header=0, delimiter="|")
ts = train.shape
#print "data dimension", ts
#print "product attributes \n", train.columns.values
#shuffle data set, and split to train and test set.
df = pd.DataFrame(train)
new_train = df.reindex(np.random.permutation(df.index))
indice_90_percent = int((ts[0]/100.0)* 90)
print "90% indice", indice_90_percent
#write train products to csv
#new_train.to_csv(sep="|")
with open('train_products.txt', 'w') as f:
    for i in new_train[:indice_90_percent]:
        f.write(i+'\n')
with open('test_products.txt', 'w') as f:
    for i in new_train[indice_90_percent:]:
        f.write(i+'\n')
But instead of getting the training and test files with data rows, I get two files containing the names of the columns. What did I miss?
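Iterating over a DataFrame yields its column labels, not its rows, which is why both files contain only the column names. Use to_csv to write the rows instead; if you don't want the column names, pass header=False. Note that to_csv separates fields with commas and also writes the row index by default, so add sep="|" and index=False if the output should match the input format.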
new_train[indice_90_percent:].to_csv('test_products.txt',header=False)
new_train[:indice_90_percent].to_csv('train_products.txt',header=False)
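For reference, here is a minimal self-contained sketch of the whole shuffle-and-split step in Python 3. The small DataFrame is a made-up stand-in for product.txt, the fixed seed is there only to make the run repeatable, and the files go to a temporary directory; none of these come from the question. It also shows what a plain for loop over a DataFrame actually returns.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for product.txt: ten rows, two columns.
df = pd.DataFrame({"id": range(10), "name": [f"item{i}" for i in range(10)]})

# Iterating over a DataFrame yields its column labels, not its rows,
# which is why the loops in the question wrote only the header names.
labels = [label for label in df]

# Shuffle the rows, then cut at the 90% mark.
rng = np.random.default_rng(0)  # fixed seed, assumed here for repeatability
shuffled = df.iloc[rng.permutation(len(df))]
cut = len(shuffled) * 9 // 10
train_part, test_part = shuffled.iloc[:cut], shuffled.iloc[cut:]

# Write both parts with the input's "|" delimiter. header=False drops the
# column names and index=False drops the row labels.
out_dir = tempfile.mkdtemp()
train_file = os.path.join(out_dir, "train_products.txt")
test_file = os.path.join(out_dir, "test_products.txt")
train_part.to_csv(train_file, sep="|", header=False, index=False)
test_part.to_csv(test_file, sep="|", header=False, index=False)
```

Slicing with iloc makes the positional intent explicit, and the integer arithmetic for the cut point avoids any float rounding. If the files should be readable later with header=0, as in the question, leave header at its default instead.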