I am using the ALL.zip file located here. My goal is to create a pandas DataFrame with it. However, if I run
data = pd.read_csv('foo.csv')
the column names do not match up. The first column has no name, the second column is labeled with the first column's name, and so on, so the last column is a Series of NaN. So I tried
colnames=[list of colnames]
data = pd.read_csv('foo.csv', names=colnames, header=False)
which gave me the exact same thing, so I ran
data = pd.read_csv('foo.csv', names=colnames)
which lined the column names up perfectly, but the csv's own column names (the first line of the file) then showed up as the first row of data. So I ran
data=data[1:]
which did the trick.
So I found a workaround without solving the actual problem. I looked at the read_csv documentation and found it a bit overwhelming, and could not figure out a way to fix this using only pd.read_csv.
What was the fundamental problem (I am assuming it is either user error or a problem with the file)? Is there a way to fix it with one of the read_csv parameters?
Here are the first two rows of the csv file:
cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
C00458844,"P60006723","Rubio, Marco","HEFFERNAN, MICHAEL","APO","AE","090960009","INFORMATION REQUESTED PER BEST EFFORTS","INFORMATION REQUESTED PER BEST EFFORTS",210,27-JUN-15,"","","","SA17A","1015697","SA17.796904","P2016",
It's not the columns that you're having a problem with, it's the index.
import pandas as pd
df = pd.read_csv('P00000001-ALL.csv', index_col=False, low_memory=False)
print(df.head(1))
     cmte_id    cand_id       cand_nm           contbr_nm  contbr_city  \
0  C00458844  P60006723  Rubio, Marco  HEFFERNAN, MICHAEL          APO

   contbr_st  contbr_zip                         contbr_employer  \
0         AE   090960009  INFORMATION REQUESTED PER BEST EFFORTS

                        contbr_occupation  contb_receipt_amt  contb_receipt_dt  \
0  INFORMATION REQUESTED PER BEST EFFORTS                210         27-JUN-15

   receipt_desc  memo_cd  memo_text  form_tp  file_num      tran_id  election_tp
0           NaN      NaN        NaN    SA17A   1015697  SA17.796904        P2016
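The low_memory=False argument is there because column 6 has mixed data types; with it, pandas reads the whole file in one pass instead of in chunks before inferring each column's type.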
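As an alternative to low_memory=False, the type of the mixed column can be stated explicitly with the dtype parameter. The sketch below assumes that "column 6" is counted from zero and therefore refers to contbr_zip; the two-column CSV text is a hypothetical miniature, not the real file.

```python
import io

import pandas as pd

# Hypothetical miniature of the file: two header fields, and a data row
# that ends with a trailing comma like the rows in the real file.
csv_text = (
    "cand_nm,contbr_zip\n"
    '"Rubio, Marco","090960009",\n'
)

# Assumption: the mixed-type column is contbr_zip. Reading it as str keeps
# the values as text, so a leading zero is not lost to numeric inference.
df = pd.read_csv(
    io.StringIO(csv_text),
    index_col=False,
    dtype={"contbr_zip": str},
)

print(df)
```

Declaring the type up front means pandas does not have to guess it, so the choice of low_memory no longer matters for that column.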
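The problem comes from every line in the file except the first ending in a comma (the separator character). Each data row therefore has one more field than the header: 19 fields against 18 column names. When that happens, pandas assumes the first field is an unnamed index column, which shifts every column name one position over and leaves the empty trailing field under the last name as a column of NaN. Passing index_col=False tells pandas not to use the first column as the index, and the empty trailing field is dropped instead.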
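The behaviour can be reproduced without the full file. This is a minimal sketch using a hypothetical three-column excerpt built from the sample row in the question; only the trailing comma matters.

```python
import io

import pandas as pd

# Hypothetical excerpt: 3 header fields, but the data row ends with a
# trailing comma and therefore has 4 fields.
csv_text = (
    "cmte_id,cand_id,cand_nm\n"
    'C00458844,"P60006723","Rubio, Marco",\n'
)

# Default: the extra field makes pandas treat the first field as the index,
# so every value sits one column to the left of its real name.
shifted = pd.read_csv(io.StringIO(csv_text))

# index_col=False: no implicit index, and the empty trailing field is dropped.
fixed = pd.read_csv(io.StringIO(csv_text), index_col=False)

print(shifted)
print(fixed)
```

In the first frame the committee id ends up in the index and cand_nm is NaN, which matches the symptoms described in the question; in the second each value sits under its own name.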
Try
data = pd.read_csv('P00000001-ALL.csv', index_col=False)