
How to import csv data file into scikit-learn?

As I understand it, scikit-learn accepts data as a 2D array of shape (n_samples, n_features). Suppose I have data in the form:

Stock prices    indicator1    indicator2
2.0             123           1252
1.0             ..            ..
..              .             .
.

How do I import this?

user1234440 asked Jun 13 '12 21:06



1 Answer

A very good alternative to NumPy's loadtxt is read_csv from pandas. The data is loaded into a pandas DataFrame, which has the big advantage that it can handle mixed data types, for example a file where some columns contain text and others contain numbers. You can then select only the numeric columns and convert them to a NumPy array with to_numpy (called as_matrix in older pandas; as_matrix was removed in pandas 1.0). pandas will also read and write Excel files and a number of other formats.

If we have a csv file named "mydata.csv":

point_latitude,point_longitude,line,construction,point_granularity
30.102261, -81.711777, Residential, Masonry, 1
30.063936, -81.707664, Residential, Masonry, 3
30.089579, -81.700455, Residential, Wood   , 1
30.063236, -81.707703, Residential, Wood   , 3
30.060614, -81.702675, Residential, Wood   , 1

The following reads the CSV, converts the numeric columns to a NumPy array for scikit-learn, then reverses the column order and writes the result to an Excel spreadsheet:

import pandas as pd

input_file = "mydata.csv"

# comma delimited is the default
df = pd.read_csv(input_file, header=0)

# for space delimited use:
# df = pd.read_csv(input_file, header=0, delimiter=" ")

# for tab delimited use:
# df = pd.read_csv(input_file, header=0, delimiter="\t")

# put the original column names in a Python list
original_headers = list(df.columns.values)

# remove the non-numeric columns
# (select_dtypes is the public counterpart of the private _get_numeric_data)
df = df.select_dtypes(include="number")

# put the numeric column names in a Python list
numeric_headers = list(df.columns.values)

# create a NumPy array with the numeric values for input into scikit-learn
# (to_numpy replaces as_matrix, which was removed in pandas 1.0)
numpy_array = df.to_numpy()

# reverse the order of the columns
numeric_headers.reverse()
reverse_df = df[numeric_headers]

# write reverse_df to an Excel spreadsheet
# (.xlsx needs the openpyxl package; recent pandas no longer writes .xls)
reverse_df.to_excel("path_to_file.xlsx")
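To connect this back to the (n_samples, n_features) layout the question asks about, here is a minimal sketch that splits the numeric columns into a feature matrix and a target vector. It inlines the same rows as "mydata.csv" so it runs on its own. Treating point_granularity as the target is purely an assumption for illustration, as are the names CSV_TEXT, numeric_df, target_column, X and y; skipinitialspace=True is used because this particular file has a blank after each comma.

```python
import io

import pandas as pd

# Same rows as the "mydata.csv" example, inlined so the sketch is self-contained.
CSV_TEXT = """point_latitude,point_longitude,line,construction,point_granularity
30.102261, -81.711777, Residential, Masonry, 1
30.063936, -81.707664, Residential, Masonry, 3
30.089579, -81.700455, Residential, Wood   , 1
30.063236, -81.707703, Residential, Wood   , 3
30.060614, -81.702675, Residential, Wood   , 1
"""

# skipinitialspace=True drops the blank that follows each comma in this file.
df = pd.read_csv(io.StringIO(CSV_TEXT), skipinitialspace=True)

# Keep only the numeric columns; the text columns are left out.
numeric_df = df.select_dtypes(include="number")

# Assumption (illustration only): point_granularity is the target,
# and the remaining numeric columns are the features.
target_column = "point_granularity"
X = numeric_df.drop(columns=[target_column]).to_numpy()  # 2D: (n_samples, n_features)
y = numeric_df[target_column].to_numpy()                 # 1D: (n_samples,)

print(X.shape, y.shape)
```

Arrays shaped like X and y are what scikit-learn estimators take in fit(X, y); which column serves as the target depends on your own data.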
3 revs, 2 users 94% answered Sep 18 '22 11:09
