pandas data frame headers are shifted over when performing csv read

Question

I'm trying to read data from a CSV file into a pandas DataFrame, but the headers are shifted over by two columns once the data is loaded.

I think it has to do with the two blank rows after the header, but I'm not sure. pandas seems to be reading the first two columns as row labels (the index).

CSV Format:

VendorID,lpep_pickup_datetime,Lpep_dropoff_datetime,Store_and_fwd_flag,RateCodeID,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude,Passenger_count,Trip_distance,Fare_amount,Extra,MTA_tax,Tip_amount,Tolls_amount,Ehail_fee,Total_amount,Payment_type,Trip_type 


2,2014-04-01 00:00:00,2014-04-01 14:24:20,N,1,0,0,0,0,1,7.45,23,0,0.5,0,0,,23.5,2,1,,
2,2014-04-01 00:00:00,2014-04-01 17:21:33,N,1,0,0,-73.987663269042969,40.780872344970703,1,8.95,31,1,0.5,0,0,,32.5,2,1,,

Data Frame Format:

                                   VendorID lpep_pickup_datetime  \
2 2014-04-01 00:00:00  2014-04-01 14:24:20                    N   
  2014-04-01 00:00:00  2014-04-01 17:21:33                    N   
  2014-04-01 00:00:00  2014-04-01 15:06:18                    N   
  2014-04-01 00:00:00  2014-04-01 08:09:27                    N   
  2014-04-01 00:00:00  2014-04-01 16:15:13                    N   

                       Lpep_dropoff_datetime  Store_and_fwd_flag  RateCodeID  \
2 2014-04-01 00:00:00                      1                   0           0   
  2014-04-01 00:00:00                      1                   0           0   
  2014-04-01 00:00:00                      1                   0           0   
  2014-04-01 00:00:00                      1                   0           0   
  2014-04-01 00:00:00                      1                   0           0

Code Below:

import pandas as pd

file = 'green_tripdata_2014-04.csv'
df4 = pd.read_csv(file)
print(df4.head(5))

I just need it to read into the data frame with the headers in the correct location.

chris-sc · Accepted Answer

Your CSV data does look strange: you have 20 column headers, but 22 fields in the first data row.

Assuming this is only a copy-paste error*, you can try the following:

df = pd.read_csv(file, skiprows=[1,2], index_col=False)

skiprows=[1,2] skips the two empty rows after the header, and index_col=False tells pandas not to use the leading data columns as the index.

See http://pandas.pydata.org/pandas-docs/version/0.16.2/generated/pandas.read_csv.html for all options to the csv parser.

Edit:

*: If your data looks exactly as you posted it, then your CSV is malformed. Each data row has two more fields than the header (see the last two commas ,,).

If you delete both trailing commas from each data row, the parser works fine.
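A quick way to confirm this kind of mismatch before changing anything is to count the fields in the header and in the first data row. The following is only a minimal sketch using the standard csv module. The sample text is a hypothetical three-column miniature of the file, not the real data.

```python
import csv
import io

# Hypothetical miniature of the file: 3 header names, two blank rows,
# and a data row that ends in two extra commas.
sample = (
    "VendorID,Trip_distance,Fare_amount\n"
    "\n"
    "\n"
    "2,7.45,23,,\n"
)

# csv.reader yields an empty list for a blank line, so filter those out.
rows = [row for row in csv.reader(io.StringIO(sample)) if row]

header_fields = len(rows[0])
data_fields = len(rows[1])
surplus = data_fields - header_fields

print(header_fields, data_fields, surplus)
```

A non-zero surplus means the data rows carry more fields than there are header names, which is the situation described above.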

Another option is to specify the columns to be used:

pd.read_csv("file.csv", skiprows=[1,2], usecols=np.arange(20))

Here, np.arange(20) yields the column positions 0 through 19, so the parser only reads the first 20 columns, that is, the columns that have a valid header (in your first line). This requires import numpy as np.
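To see both options side by side, here is a small self-contained sketch. It assumes a hypothetical three-column miniature of the file (the real one has 20 headers) and reads it from a string rather than from disk. The blank rows are left to the default skip_blank_lines=True instead of skiprows, and list(range(...)) stands in for np.arange. Recent pandas versions may emit a ParserWarning for the index_col=False call, because the surplus fields are dropped.

```python
import io

import pandas as pd

# Hypothetical miniature of the file: 3 header names, two blank rows,
# and data rows that end in two extra commas (5 fields each).
csv_text = (
    "VendorID,Trip_distance,Fare_amount\n"
    "\n"
    "\n"
    "2,7.45,23,,\n"
    "2,8.95,31,,\n"
)

# Default read: the two surplus fields make pandas treat the first
# two data columns as an index, so the headers appear shifted.
shifted = pd.read_csv(io.StringIO(csv_text))

# Option 1: tell pandas not to build an index from the data columns.
fixed = pd.read_csv(io.StringIO(csv_text), index_col=False)

# Option 2: keep only the column positions that have a header name.
n_headers = len(csv_text.splitlines()[0].split(","))
subset = pd.read_csv(io.StringIO(csv_text), usecols=list(range(n_headers)))

print(shifted.index.nlevels)
print(fixed.columns.tolist())
print(subset.head())
```

In this sketch the surplus fields are empty, so dropping them loses nothing. If they held real values, fixing the header or the file itself would be the safer route.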

Tags: python, pandas, csv

Ben Price
