read_csv with missing/incomplete header or irregular number of columns

Tags:

I have a file.csv with ~15k rows that looks like this

SAMPLE_TIME,          POS,        OFF,  HISTOGRAM
2015-07-15 16:41:56,  0-0-0-0-3,   1,    2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,
2015-07-15 16:42:55,  0-0-0-0-3,   1,    0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,
2015-07-15 16:43:55,  0-0-0-0-3,   1,    0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,
2015-07-15 16:44:56,  0-0-0-0-3,   1,    2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0

I wanted it to be imported to pandas.DataFrame with any random value given to the column that don't have a header, something like this:

SAMPLE_TIME,          POS,        OFF,  HISTOGRAM   1  2  3   4  5    6  
2015-07-15 16:41:56,  0-0-0-0-3,   1,    2,         0, 5, 59, 4, 0,   0, 
2015-07-15 16:42:55,  0-0-0-0-3,   1,    0,         0, 5,  0, 6, 0,   nan
2015-07-15 16:43:55,  0-0-0-0-3,   1,    0,         0, 5,  0, 7, nan  nan
2015-07-15 16:44:56,  0-0-0-0-3,   1,    2,         0, 5,  0, 0, 2,   nan

This has been impossible to import, as i tried different solution, such as giving a specific a header, But still no joy, the only way i was able to make it work is to add a header manually in the .csv file. which kinda defeat the purpose of automation!

Then i tried this solution: Doing this

lines=list(csv.reader(open('file.csv')))    
header, values = lines[0], lines[1:]

it correctly reads the files giving me a list of ~15k element values, each element is a list of string, where each string is correctly parsed data field from the file, but when i try to do this:

data = {h:v for h,v in zip (header, zip(*values))}
df = pd.DataFrame.from_dict(data)

or this:

data2 = {h:v for h,v in zip (str(xrange(16)), zip(*values))}
df2 = pd.DataFrame.from_dict(data)

Then the non headered columns disappear and the order of columns is completely mixed. any idea of a possible solution ?

888

asked Dec 18 '15 14:12

InsaneBot

2 Answers

You can create columns based on the length of the first actual row:

from tempfile import TemporaryFile
with open("out.txt") as f, TemporaryFile("w+") as t:
    h, ln = next(f), len(next(f).split(","))
    header = h.strip().split(",")
    f.seek(0), next(f)
    header += range(ln)
    print(pd.read_csv(f, names=header))

Which will give you:

          SAMPLE_TIME           POS          OFF    HISTOGRAM  0  1   2  3  \
0  2015-07-15 16:41:56     0-0-0-0-3            1            2  0  5  59  0   
1  2015-07-15 16:42:55     0-0-0-0-3            1            0  0  5   9  0   
2  2015-07-15 16:43:55     0-0-0-0-3            1            0  0  5   5  0   
3  2015-07-15 16:44:56     0-0-0-0-3            1            2  0  5   0  0   

   4  5 ...  13  14  15  16  17  18  19  20  21  22  
0  0  0 ...   0   0   0   0   0 NaN NaN NaN NaN NaN  
1  0  0 ...   0 NaN NaN NaN NaN NaN NaN NaN NaN NaN  
2  0  0 ...   4   0   0   0 NaN NaN NaN NaN NaN NaN  
3  0  0 ...   0   0   0   0 NaN NaN NaN NaN NaN NaN  

[4 rows x 27 columns]

Or you could clean the file before passing to pandas:

import pandas as pd

from tempfile import TemporaryFile
with open("in.csv") as f, TemporaryFile("w+") as t:
    for line in f:
        t.write(line.replace(" ", ""))
    t.seek(0)
    ln = len(line.strip().split(","))
    header = t.readline().strip().split(",")
    header += range(ln)
    print(pd.read_csv(t,names=header))

Which gives you:

          SAMPLE_TIME        POS  OFF  HISTOGRAM  0  1   2  3  4  5 ...  11  \
0  2015-07-1516:41:56  0-0-0-0-3    1          2  0  5  59  0  0  0 ...   0   
1  2015-07-1516:42:55  0-0-0-0-3    1          0  0  5   9  0  0  0 ...   0   
2  2015-07-1516:43:55  0-0-0-0-3    1          0  0  5   5  0  0  0 ...   0   
3  2015-07-1516:44:56  0-0-0-0-3    1          2  0  5   0  0  0  0 ...   0   

   12  13  14  15  16  17  18  19  20  
0   0   0   0   0   0   0 NaN NaN NaN  
1  50   0 NaN NaN NaN NaN NaN NaN NaN  
2   0   4   0   0   0 NaN NaN NaN NaN  
3   6   0   0   0   0 NaN NaN NaN NaN  

[4 rows x 25 columns]

or to drop the columns will all nana:

print(pd.read_csv(f, names=header).dropna(axis=1,how="all"))

Gives you:

           SAMPLE_TIME           POS          OFF    HISTOGRAM  0  1   2  3  \
0  2015-07-15 16:41:56     0-0-0-0-3            1            2  0  5  59  0   
1  2015-07-15 16:42:55     0-0-0-0-3            1            0  0  5   9  0   
2  2015-07-15 16:43:55     0-0-0-0-3            1            0  0  5   5  0   
3  2015-07-15 16:44:56     0-0-0-0-3            1            2  0  5   0  0   

   4  5 ...  8  9  10  11  12  13  14  15  16  17  
0  0  0 ...  2  0   0   0   0   0   0   0   0   0  
1  0  0 ...  2  0   0   0  50   0 NaN NaN NaN NaN  
2  0  0 ...  2  0   0   0   0   4   0   0   0 NaN  
3  0  0 ...  2  0   0   0   6   0   0   0   0 NaN  

[4 rows x 22 columns]

179

answered Nov 03 '22 00:11

Padraic Cunningham

You can split column HISTOGRAM to new DataFrame and concat it to original.

print df
         SAMPLE_TIME,        POS, OFF,  \
0 2015-07-15 16:41:56  0-0-0-0-3,   1,   
1 2015-07-15 16:42:55  0-0-0-0-3,   1,   
2 2015-07-15 16:43:55  0-0-0-0-3,   1,   
3 2015-07-15 16:44:56  0-0-0-0-3,   1,   

                                 HISTOGRAM  
0  2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,  
1          0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,  
2     0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,  
3      2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0

#create new dataframe from column HISTOGRAM
h = pd.DataFrame([ x.split(',') for x in df['HISTOGRAM'].tolist()])
print h
  0  1  2   3  4  5  6  7  8  9  10 11 12  13 14 15    16    17    18    19
0  2  0  5  59  0  0  0  0  0  2  0  0  0   0  0  0     0     0     0      
1  0  0  5   9  0  0  0  0  0  2  0  0  0  50  0     None  None  None  None
2  0  0  5   5  0  0  0  0  0  2  0  0  0   0  4  0     0     0        None
3  2  0  5   0  0  0  0  0  0  2  0  0  0   6  0  0     0     0  None  None

#append to original, rename 0 column
df = pd.concat([df, h], axis=1).rename(columns={0:'HISTOGRAM'})
print df
                                 HISTOGRAM HISTOGRAM  1  2   3  4  5  ...  10  \
0  2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,         2  0  5  59  0  0  ...   0   
1          0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,         0  0  5   9  0  0  ...   0   
2     0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,         0  0  5   5  0  0  ...   0   
3      2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0         2  0  5   0  0  0  ...   0   

  11 12  13 14 15    16    17    18    19  
0  0  0   0  0  0     0     0     0        
1  0  0  50  0     None  None  None  None  
2  0  0   0  4  0     0     0        None  
3  0  0   6  0  0     0     0  None  None  

[4 rows x 24 columns]

answered Nov 03 '22 02:11

jezrael

Related questions
                            
                                BLE subscribe to notification using gatttool or bluepy
                            
                                How to use Decorators with Flask-RESTPlus for each function call?
                            
                                convert timedelta object to time object
                            
                                How does app engine (python) manage memory across requests (Exceeded soft private memory limit)
                            
                                Advanced: How to use href in Jinja2
                            
                                python pandas insert column
                            
                                No module named _testcapi
                            
                                Mail Client in Python using sockets only(no smtplib)
                            
                                How to use surf and sift detector in OpenCV for Python
                            
                                How can I run a script as part of a Travis CI build?
                            
                                matrix determinant differentiation in tensorflow
                            
                                Flask-Login shows 401 instead of redirecting to login view
                            
                                How does Python's OrderedDict remember elements inserted?
                            
                                Python libvirt API - create a Virtual Machine
                            
                                AMQP 1.0 Library for Python
                            
                                How to auto-generate full HTML from a WTForms form
                            
                                How to make MxN piechart plots with one legend and removed y-axis titles in Matplotlib
                            
                                Bin size seaborn jointplot
                            
                                Distribute pre-compiled Cython code to Windows
                            
                                change index number in jupyter notebook

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

read_csv with missing/incomplete header or irregular number of columns

Tags:

python

pandas

dataframe

csv

python-2.7

InsaneBot

People also ask

2 Answers

Padraic Cunningham

jezrael

Recent Activity

Donate For Us