
pandas data frame headers are shifted over when performing csv read

Tags:

python

pandas

csv

I'm trying to read data from a CSV file into a pandas data frame, but the headers shift over by two columns when the file is read in.

I think it has to do with there being two blank rows after the header, but I'm not sure. It seems to be reading in the first two columns as row titles/indexes.

CSV Format:

VendorID,lpep_pickup_datetime,Lpep_dropoff_datetime,Store_and_fwd_flag,RateCodeID,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude,Passenger_count,Trip_distance,Fare_amount,Extra,MTA_tax,Tip_amount,Tolls_amount,Ehail_fee,Total_amount,Payment_type,Trip_type 


2,2014-04-01 00:00:00,2014-04-01 14:24:20,N,1,0,0,0,0,1,7.45,23,0,0.5,0,0,,23.5,2,1,,
2,2014-04-01 00:00:00,2014-04-01 17:21:33,N,1,0,0,-73.987663269042969,40.780872344970703,1,8.95,31,1,0.5,0,0,,32.5,2,1,,

Data Frame Format:

                                   VendorID lpep_pickup_datetime  \
2 2014-04-01 00:00:00  2014-04-01 14:24:20                    N   
  2014-04-01 00:00:00  2014-04-01 17:21:33                    N   
  2014-04-01 00:00:00  2014-04-01 15:06:18                    N   
  2014-04-01 00:00:00  2014-04-01 08:09:27                    N   
  2014-04-01 00:00:00  2014-04-01 16:15:13                    N   

                       Lpep_dropoff_datetime  Store_and_fwd_flag  RateCodeID  \
2 2014-04-01 00:00:00                      1                   0           0   
  2014-04-01 00:00:00                      1                   0           0   
  2014-04-01 00:00:00                      1                   0           0   
  2014-04-01 00:00:00                      1                   0           0   
  2014-04-01 00:00:00                      1                   0           0  

Code Below:

import pandas as pd

file = 'green_tripdata_2014-04.csv'
df4 = pd.read_csv(file)
print(df4.head(5))

I just need it to read into the data frame with the headers in the correct location.

Ben Price asked Nov 17 '15 18:11



1 Answer

Your csv data does look strange - you have 20 column headers, but 22 entries in the first line with data.
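A quick way to confirm the mismatch is to count the fields on each line. The sketch below uses a shortened, made-up stand-in for the posted file (4 header names instead of 20, and hypothetical column names), so treat it as an illustration rather than your actual data:

```python
import csv
import io

# Hypothetical, shortened stand-in for the posted file: 4 header names,
# two blank lines, and data rows that end with two extra commas.
sample = (
    "VendorID,Pickup,Dropoff,Flag\n"
    "\n"
    "\n"
    "2,2014-04-01 00:00:00,2014-04-01 14:24:20,N,,\n"
    "2,2014-04-01 00:00:00,2014-04-01 17:21:33,N,,\n"
)


def field_counts(text):
    """Return the number of fields on each non-empty line."""
    # csv.reader yields an empty list for a blank line, so `if row` drops those.
    return [len(row) for row in csv.reader(io.StringIO(text)) if row]


counts = field_counts(sample)
print(counts)  # header count first, then one count per data row
```

If the first number is smaller than the rest, the data rows carry more fields than the header has names.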

Assuming this is only a copy-paste error*, you can try the following:

df = pd.read_csv(file, skiprows=[1,2], index_col=False)

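skiprows=[1,2] skips the two empty rows (file lines 1 and 2, counting from 0), and index_col=False tells pandas not to use the leading columns as the index, which should stop the data from being interpreted as index columns.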
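Here is a minimal sketch of that call on an in-memory sample. The sample is a shortened, hypothetical stand-in for your file (4 columns instead of 20, invented column names). Recent pandas versions may emit a ParserWarning about the header and data lengths not matching, since the surplus fields are dropped:

```python
import io

import pandas as pd

# Hypothetical, shortened stand-in for the posted file: 4 header names,
# two blank lines, and data rows that end with two extra commas.
sample = (
    "VendorID,Pickup,Dropoff,Flag\n"
    "\n"
    "\n"
    "2,2014-04-01 00:00:00,2014-04-01 14:24:20,N,,\n"
    "2,2014-04-01 00:00:00,2014-04-01 17:21:33,N,,\n"
)

# skiprows drops file lines 1 and 2 (0-indexed); index_col=False keeps pandas
# from turning the leading columns into an index.
fixed = pd.read_csv(io.StringIO(sample), skiprows=[1, 2], index_col=False)

print(fixed.columns.tolist())
print(fixed["VendorID"].tolist())
```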

See http://pandas.pydata.org/pandas-docs/version/0.16.2/generated/pandas.read_csv.html for all options to the csv parser.

Edit:

*: If your data look exactly as you posted, then your csv is malformed. You have two more data columns (see the last two commas ,,).

When you delete both commas, the parser works fine.
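If editing the file by hand is impractical, the commas could also be stripped in memory before parsing. This is only a sketch: strip_trailing_commas is a hypothetical helper, the sample is a made-up, shortened version of your data, and it assumes every data row ends with exactly the same two surplus commas:

```python
import io

import pandas as pd

# Hypothetical, shortened stand-in for the posted file.
sample = (
    "VendorID,Pickup,Dropoff,Flag\n"
    "\n"
    "\n"
    "2,2014-04-01 00:00:00,2014-04-01 14:24:20,N,,\n"
    "2,2014-04-01 00:00:00,2014-04-01 17:21:33,N,,\n"
)


def strip_trailing_commas(text, extra=2):
    """Remove `extra` surplus commas from the end of each line that has them."""
    cleaned = []
    for line in text.splitlines():
        if line.endswith("," * extra):
            line = line[:-extra]
        cleaned.append(line)
    return "\n".join(cleaned) + "\n"


cleaned = strip_trailing_commas(sample)

# Blank lines are skipped by read_csv's default skip_blank_lines=True.
df = pd.read_csv(io.StringIO(cleaned))
print(df.columns.tolist())
```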

Another option is to specify the columns to be used:

pd.read_csv("file.csv", skiprows=[1,2], usecols=np.arange(20))

Here, np.arange(20) (which needs import numpy as np) tells the parser to only parse the first 20 columns, positions 0 through 19 since usecols counts from 0. Those are the columns that have a valid header in your first line.
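The same idea on a small in-memory sample, as a sketch only. The sample is a hypothetical 4-column stand-in for your 20-column file, so the positions run from 0 to 3, and a plain list of positions is used in place of np.arange to avoid the NumPy import:

```python
import io

import pandas as pd

# Hypothetical, shortened stand-in for the posted file: 4 header names,
# two blank lines, and data rows that end with two extra commas.
sample = (
    "VendorID,Pickup,Dropoff,Flag\n"
    "\n"
    "\n"
    "2,2014-04-01 00:00:00,2014-04-01 14:24:20,N,,\n"
    "2,2014-04-01 00:00:00,2014-04-01 17:21:33,N,,\n"
)

# Keep only the columns that have a header name: positions 0..3 here,
# which would be list(range(20)) for the real file.
wanted = list(range(4))
df = pd.read_csv(io.StringIO(sample), skiprows=[1, 2], usecols=wanted)

print(df.shape)
print(df.columns.tolist())
```

The two header-less trailing fields are never parsed, so nothing is left over to be treated as an index.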

chris-sc answered Oct 16 '22 14:10
