Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

skipping unknown number of lines to read the header python pandas

Tags:

python

pandas

i have an excel data that i read in with python pandas:

import pandas as pd
data = pd.read_csv('..../file.txt', sep='\t' )

the mock data looks like this:

unwantedjunkline1
unwantedjunkline2
unwantedjunkline3
 ID     ColumnA     ColumnB     ColumnC
 1         A          B            C
 2         A          B            C
 3         A          B            C
...

the data in this case contains 3 junk lines(lines i don't want to read in) before hitting the header and sometimes it contains 4 or more suck junk lines. so in this case i read in the data :

data = pd.read_csv('..../file.txt', sep='\t', skiprows = 3 )

data looks like:

 ID     ColumnA     ColumnB     ColumnC
 1         A          B            C
 2         A          B            C
 3         A          B            C
...

But each time the number of unwanted lines is different, is there a way to read in a table file using pandas without using 'skiprows=' but instead using some command that matches the header so it knows to start reading from the header? so I don't have to click open the file to count how many unwanted lines the file contains each time and then manually change the 'skiprows=' option.

like image 573
Jessica Avatar asked Dec 01 '15 19:12

Jessica


People also ask

How do you skip blank lines in pandas?

If you're looking to drop rows (or columns) containing empty data, you're in luck: Pandas' dropna() method is specifically for this. Technically you could run df. dropna() without any parameters, and this would default to dropping all rows where are completely empty.

How do I skip a header in a data frame?

Example: Skip Header when Reading CSV File as pandas DataFrame. In this example, I'll explain how to remove the header when importing a CSV file as a pandas DataFrame. For this task, we can apply the read_csv function as shown below. Within the read_csv function, we have to set the skiprows argument to be equal to 1.

What does the pandas head () method do?

The head() method returns a specified number of rows, string from the top. The head() method returns the first 5 rows if a number is not specified. Note: The column names will also be returned, in addition to the specified rows.

What is Notnull in pandas?

notnull is a pandas function that will examine one or multiple values to validate that they are not null. In Python, null values are reflected as NaN (not a number) or None to signify no data present. . notnull will return False if either NaN or None is detected. If these values are not present, it will return True.


2 Answers

If you know what the header startswith:

def skip_to(fle, line,**kwargs):
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle) as f:
        pos = 0
        cur_line = f.readline()
        while not cur_line.startswith(line):
            pos = f.tell()
            cur_line = f.readline()
        f.seek(pos)
        return pd.read_csv(f, **kwargs)

Demo:

In [18]: cat test.txt
1,2
3,4
The,header
foo,bar
foobar,foo
In [19]: df = skip_to("test.txt","The,header", sep=",")

In [20]: df
Out[20]: 
      The header
0     foo    bar
1  foobar    foo

By calling .tell we keep track of where the pointer is for the previous line so when we hit the header we seek back to that line and just pass the file object to pandas.

Or using the junk if they all started with something in common:

def skip_to(fle, junk,**kwargs):
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle) as f:
        pos = 0
        cur_line = f.readline()
        while cur_line.startswith(junk):
            pos = f.tell()
            cur_line = f.readline()
        f.seek(pos)
        return pd.read_csv(f, **kwargs)

 df = skip_to("test.txt", "junk",sep="\t")
like image 110
Padraic Cunningham Avatar answered Sep 28 '22 11:09

Padraic Cunningham


Another simple way to achieve a dynamic skiprows would something like this which worked for me:

# Open the file
with open('test.csv', encoding='utf-8') as readfile:
        ls_readfile = readfile.readlines()
        
        #Find the skiprows number with ID as the startswith
        skip = next(filter(lambda x: x[1].startswith('ID'), enumerate(ls_readfile)))[0]
        print(skip)

#import the file with the separator \t
df = pd.read_csv(r'test.txt', skiprows=skip, sep ='\t')
like image 38
The AG Avatar answered Sep 28 '22 09:09

The AG