Skipping an unknown number of lines to read the header in Python pandas

Tags:

python

pandas

I have some Excel data that I read in with Python pandas:

import pandas as pd
data = pd.read_csv('..../file.txt', sep='\t' )

The mock data looks like this:

unwantedjunkline1
unwantedjunkline2
unwantedjunkline3
 ID     ColumnA     ColumnB     ColumnC
 1         A          B            C
 2         A          B            C
 3         A          B            C
...

The data in this case contains 3 junk lines (lines I don't want to read in) before the header, and sometimes it contains 4 or more such junk lines. So in this case I read in the data with:

data = pd.read_csv('..../file.txt', sep='\t', skiprows = 3 )

The data then looks like:

 ID     ColumnA     ColumnB     ColumnC
 1         A          B            C
 2         A          B            C
 3         A          B            C
...

But the number of unwanted lines is different each time. Is there a way to read in a table file using pandas without 'skiprows=', using instead some option that matches the header so it knows to start reading from there? That way I wouldn't have to open the file, count how many unwanted lines it contains, and manually change the 'skiprows=' option every time.

573

asked Dec 01 '15 19:12

Jessica

2 Answers

If you know what the header starts with:

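import os
import pandas as pd

def skip_to(fle, line, **kwargs):
    # Refuse empty files up front.
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle) as f:
        pos = 0
        cur_line = f.readline()
        # Track where each line starts until the header is found.
        while not cur_line.startswith(line):
            if not cur_line:  # end of file reached without seeing the header
                raise ValueError("Header not found")
            pos = f.tell()
            cur_line = f.readline()
        # Rewind to the start of the header line and let pandas parse from there.
        f.seek(pos)
        return pd.read_csv(f, **kwargs)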

Demo:

In [18]: cat test.txt
1,2
3,4
The,header
foo,bar
foobar,foo

In [19]: df = skip_to("test.txt", "The,header", sep=",")

In [20]: df
Out[20]: 
      The header
0     foo    bar
1  foobar    foo

By calling f.tell() before each f.readline(), we record where the line we are about to read begins. When that line turns out to be the header, we seek back to its start and pass the file object straight to pandas.
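The tell/seek bookkeeping can be checked without pandas. The following is a minimal sketch, not part of the original answer: `header_offset` is a hypothetical helper name, and the temporary file just reproduces the demo content. It uses `readline()` rather than `for line in f` because Python disables `tell()` on a text file while it is being iterated.

```python
import os
import tempfile


def header_offset(path, prefix):
    """Return the file position where the first line starting with `prefix` begins."""
    with open(path) as f:
        pos = 0
        line = f.readline()
        while line and not line.startswith(prefix):
            pos = f.tell()       # position where the next line starts
            line = f.readline()
        if not line:             # readline() returns "" at end of file
            raise ValueError("no line starts with {!r}".format(prefix))
        return pos


# Throwaway file with the same content as the demo above.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, newline="") as tmp:
    tmp.write("1,2\n3,4\nThe,header\nfoo,bar\nfoobar,foo\n")
    demo_path = tmp.name

with open(demo_path) as f:
    f.seek(header_offset(demo_path, "The,header"))
    remainder = f.read()         # the text a parser would see after the seek

os.remove(demo_path)
print(remainder)
```

Everything before the header is gone from `remainder`, which is why handing the repositioned file object to `pd.read_csv` works.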

Or, if the junk lines all start with something in common, skip those instead:

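import os
import pandas as pd

def skip_to(fle, junk, **kwargs):
    # Refuse empty files up front.
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle) as f:
        pos = 0
        cur_line = f.readline()
        # Keep advancing while the current line is junk.
        while cur_line.startswith(junk):
            pos = f.tell()
            cur_line = f.readline()
        # Rewind to the start of the first non-junk line (the header).
        f.seek(pos)
        return pd.read_csv(f, **kwargs)

df = skip_to("test.txt", "junk", sep="\t")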
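The same junk-prefix idea can also be written with `itertools.dropwhile`, which avoids the manual position tracking. This is only a sketch under assumptions: `strip_leading_junk` is a hypothetical name, the sample file content is invented for illustration, and the whole remainder of the file is held in memory, which may not suit very large files.

```python
import io
import itertools
import os
import tempfile


def strip_leading_junk(path, junk):
    """Return a StringIO with the file's leading lines that start with `junk` removed."""
    with open(path) as f:
        # dropwhile discards lines only until the first one that is not junk.
        kept = itertools.dropwhile(lambda line: line.startswith(junk), f)
        return io.StringIO("".join(kept))


# Illustrative tab-separated file with two junk lines before the header.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, newline="") as tmp:
    tmp.write("junk 1\njunk 2\nID\tColumnA\n1\tA\n")
    sample_path = tmp.name

buffer = strip_leading_junk(sample_path, "junk")
os.remove(sample_path)
print(buffer.getvalue())
```

The returned buffer is file-like, so it can be passed to `pd.read_csv(buffer, sep="\t")` in place of a path.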

110

answered Sep 28 '22 11:09

Padraic Cunningham

Another simple way to get a dynamic skiprows value is something like this, which worked for me:

import pandas as pd

# Open the file and read all of its lines
with open('test.txt', encoding='utf-8') as readfile:
    ls_readfile = readfile.readlines()

# Find the skiprows number: the index of the first line that starts with 'ID'
skip = next(i for i, line in enumerate(ls_readfile) if line.startswith('ID'))
print(skip)

# Import the file with the separator \t
df = pd.read_csv('test.txt', skiprows=skip, sep='\t')
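A variation on this approach stops reading as soon as the header is found instead of loading every line first. The sketch below makes a few assumptions: `find_header_row` is a hypothetical helper, the sample file mimics the mock data from the question, and leading whitespace is stripped before matching because the header in the mock data appears indented.

```python
import os
import tempfile


def find_header_row(path, prefix, encoding="utf-8"):
    """Return the zero-based index of the first line that starts with `prefix`."""
    with open(path, encoding=encoding) as f:
        for row, line in enumerate(f):
            # lstrip() tolerates a header that is indented, as in the mock data.
            if line.lstrip().startswith(prefix):
                return row
    raise ValueError("no line starts with {!r}".format(prefix))


# Illustrative file shaped like the question's mock data.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 newline="", encoding="utf-8") as tmp:
    tmp.write("unwantedjunkline1\nunwantedjunkline2\nunwantedjunkline3\n"
              " ID\tColumnA\tColumnB\tColumnC\n 1\tA\tB\tC\n")
    mock_path = tmp.name

skip = find_header_row(mock_path, "ID")
os.remove(mock_path)
print(skip)
```

The result can be passed as `skiprows=skip` to `pd.read_csv`. Note that a junk line which itself begins with "ID" would be matched first, so a longer prefix is safer when one is known.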

answered Sep 28 '22 09:09

The AG
