I'm trying to read data from a CSV file into a pandas DataFrame, but the headers are shifted over by two columns once the data is read in.
I think it has to do with there being two blank rows after the header, but I'm not sure. It seems to be reading the first two columns in as row labels/indexes.
CSV Format:
VendorID,lpep_pickup_datetime,Lpep_dropoff_datetime,Store_and_fwd_flag,RateCodeID,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude,Passenger_count,Trip_distance,Fare_amount,Extra,MTA_tax,Tip_amount,Tolls_amount,Ehail_fee,Total_amount,Payment_type,Trip_type
2,2014-04-01 00:00:00,2014-04-01 14:24:20,N,1,0,0,0,0,1,7.45,23,0,0.5,0,0,,23.5,2,1,,
2,2014-04-01 00:00:00,2014-04-01 17:21:33,N,1,0,0,-73.987663269042969,40.780872344970703,1,8.95,31,1,0.5,0,0,,32.5,2,1,,
Data Frame Format:
VendorID lpep_pickup_datetime \
2 2014-04-01 00:00:00 2014-04-01 14:24:20 N
2014-04-01 00:00:00 2014-04-01 17:21:33 N
2014-04-01 00:00:00 2014-04-01 15:06:18 N
2014-04-01 00:00:00 2014-04-01 08:09:27 N
2014-04-01 00:00:00 2014-04-01 16:15:13 N
Lpep_dropoff_datetime Store_and_fwd_flag RateCodeID \
2 2014-04-01 00:00:00 1 0 0
2014-04-01 00:00:00 1 0 0
2014-04-01 00:00:00 1 0 0
2014-04-01 00:00:00 1 0 0
2014-04-01 00:00:00 1 0 0
Code Below:
import pandas as pd

file = 'green_tripdata_2014-04.csv'
df4 = pd.read_csv(file)
print(df4.head(5))
I just need it to read into the data frame with the headers in the correct location.
Your csv data does look strange - you have 20 column headers, but 22 entries in the first line with data.
Assuming this is only a copy-paste error*, you can try the following:
df = pd.read_csv(file, skiprows=[1,2], index_col=False)
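skiprows=[1,2] skips the two lines directly after the header (the empty rows you mention), and index_col=False tells pandas not to use the leading columns as an index, which should stop the first two data fields from being interpreted as index columns.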
See http://pandas.pydata.org/pandas-docs/version/0.16.2/generated/pandas.read_csv.html for all options to the csv parser.
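To illustrate the effect of index_col=False, here is a minimal sketch on a made-up three-column sample (the names a, b, c and the values are hypothetical, not taken from your file). Each data row carries two trailing commas, just like the rows you posted:

```python
import io
import pandas as pd

# Hypothetical miniature of the file: 3 header names, 5 fields per data row.
csv_text = "a,b,c\n1,2,3,,\n4,5,6,,\n"

# Default parsing: the surplus fields make pandas treat the leading
# columns as an index, so the headers end up over the wrong data.
shifted = pd.read_csv(io.StringIO(csv_text))
print(shifted)

# index_col=False keeps the first field under the first header name.
# pandas may emit a ParserWarning because the trailing fields are dropped.
fixed = pd.read_csv(io.StringIO(csv_text), index_col=False)
print(fixed)
```

With index_col=False the values 1 and 4 should land under a again instead of in the index.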
*: If your data looks exactly as you posted, then your csv is malformed: every data row has two more fields than the header (see the two trailing commas ",," at the end of each row).
When you delete both commas, the parser works fine.
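If you would rather repair the file than work around it in the parser, the trailing fields can be cut off before pandas sees the data. The following is only a sketch using the standard csv module; the helper name trim_rows and the sample text are made up for illustration, and it assumes the surplus fields are always at the end of the row:

```python
import csv
import io

def trim_rows(text):
    """Cut every data row down to the number of header fields."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for row in reader:
        # Skip blank lines and rows that consist only of commas.
        if not any(row):
            continue
        writer.writerow(row[:len(header)])
    return out.getvalue()

# Hypothetical sample: 2 headers, rows with 2 trailing commas, 2 empty rows.
sample = "a,b\n1,2,,\n\n,,\n3,4,,\n"
cleaned = trim_rows(sample)
print(cleaned)
```

The cleaned text can then be passed to pd.read_csv through io.StringIO(cleaned), or written back to a new file first.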
Another option is to specify the columns to be used:
pd.read_csv("file.csv", skiprows=[1,2], usecols=np.arange(20))
Here, np.arange(20) (which needs import numpy as np) tells the parser to only parse the first 20 columns (positions 0 to 19), that is, the columns that have a valid header in your first line.
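As a rough sketch of the usecols approach without hard-coding the number 20, the column count can be taken from the header line itself. The sample text and the variable names below are hypothetical, and the naive split on commas assumes the header contains no quoted commas:

```python
import io
import pandas as pd

# Hypothetical miniature of the file: 3 header names, 5 fields per data row.
csv_text = "a,b,c\n1,2,3,,\n4,5,6,,\n"

# Count the header names, then ask pandas for exactly that many columns.
n_headers = len(csv_text.splitlines()[0].split(","))
trimmed = pd.read_csv(io.StringIO(csv_text), usecols=list(range(n_headers)))
print(trimmed)
```

For a file on disk, the header line could be read with open(file).readline() in the same way before calling read_csv.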