I have some Excel data, saved as a tab-separated text file, that I read in with Python pandas:
import pandas as pd
data = pd.read_csv('..../file.txt', sep='\t')
The mock data looks like this:
unwantedjunkline1
unwantedjunkline2
unwantedjunkline3
ID ColumnA ColumnB ColumnC
1 A B C
2 A B C
3 A B C
...
In this case the data contains 3 junk lines (lines I don't want to read in) before the header, and sometimes it contains 4 or more such junk lines. So in this case I read in the data with:
data = pd.read_csv('..../file.txt', sep='\t', skiprows=3)
The data then looks like:
ID ColumnA ColumnB ColumnC
1 A B C
2 A B C
3 A B C
...
But the number of unwanted lines is different each time. Is there a way to read a table file with pandas without hard-coding 'skiprows=', using instead some option that matches the header so that reading starts from the header line? That way I wouldn't have to open the file, count the unwanted lines, and manually change the 'skiprows=' value each time.
If you want to drop rows (or columns) that contain missing data, pandas' dropna() method is designed for exactly that. Called without any parameters, df.dropna() drops every row that contains at least one missing value; pass how='all' to drop only the rows that are completely empty.
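As a small illustration of the difference between the default and how='all', here is a sketch on a made-up frame (the column names and values are only for demonstration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one complete row, one partly empty row, one fully empty row
df = pd.DataFrame({
    "ID": [1, 2, np.nan],
    "ColumnA": ["A", np.nan, np.nan],
})

# Default (how="any"): a row is dropped if any of its values is missing
any_missing_dropped = df.dropna()

# how="all": a row is dropped only if every value in it is missing
all_missing_dropped = df.dropna(how="all")

print(len(any_missing_dropped), len(all_missing_dropped))
```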
Example: Skip Header when Reading CSV File as pandas DataFrame. This example shows how to remove the header when importing a CSV file as a pandas DataFrame. To do that, set the skiprows argument of read_csv to 1 so that the first line is skipped. Note that with the default header setting the next line is then used as the column names; pass header=None as well if the columns should simply be numbered.
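A minimal sketch of what skiprows=1 does with and without header=None, using an in-memory string instead of a real file (the contents are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical tab-separated contents: a header line followed by two rows
raw = "ID\tColumnA\n1\tA\n2\tB\n"

# skiprows=1 on its own: the header line is skipped, and the first data
# line is then used as the column names
promoted = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=1)

# skiprows=1 with header=None: the header line is skipped and the columns
# are numbered 0, 1, ... so no data row is consumed as a header
no_header = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=1, header=None)

print(promoted.shape, no_header.shape)
```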
The head() method returns a specified number of rows, starting from the top. If no number is given, it returns the first 5 rows. Note: the column names are also returned, in addition to the specified rows.
notnull is a pandas function that checks one or more values and reports whether each is non-null. In pandas, missing values are represented as NaN (not a number) or None. notnull returns False where NaN or None is detected and True otherwise.
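If you know what the header starts with:
import os
import pandas as pd

def skip_to(fle, line, **kwargs):
    """Read fle with pandas, starting at the first line that begins with `line`."""
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle) as f:
        pos = 0
        cur_line = f.readline()
        while not cur_line.startswith(line):
            if not cur_line:  # reached end of file without finding the header
                raise ValueError("Header not found")
            pos = f.tell()  # remember where the next line begins
            cur_line = f.readline()
        f.seek(pos)  # rewind to the start of the header line
        return pd.read_csv(f, **kwargs)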
Demo:
In [18]: cat test.txt
1,2
3,4
The,header
foo,bar
foobar,foo
In [19]: df = skip_to("test.txt","The,header", sep=",")
In [20]: df
Out[20]:
The header
0 foo bar
1 foobar foo
By calling .tell() before each readline(), we keep track of where the previous line started, so when we hit the header we seek back to the start of that line and just pass the file object to pandas.
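To see the pointer bookkeeping in isolation, here is a small sketch on a temporary file with made-up contents. It also shows why readline() is used rather than a plain for loop: in Python 3, tell() on a text file raises OSError while the file is being iterated directly.

```python
import os
import tempfile

# Hypothetical contents: two junk lines, then a header and one data row
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("junk1\njunk2\nID\tColumnA\n1\tA\n")
    path = tmp.name

with open(path) as f:
    pos = 0
    cur_line = f.readline()
    while cur_line and not cur_line.startswith("ID"):
        pos = f.tell()           # start of the line we are about to read
        cur_line = f.readline()
    f.seek(pos)                  # back to the start of the header line
    rest = f.read()              # everything from the header onwards

# tell() cannot be used while iterating over the file object itself
with open(path) as f:
    try:
        for _ in f:
            f.tell()
        tell_failed = False
    except OSError:
        tell_failed = True

os.remove(path)
print(repr(rest), tell_failed)
```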
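Or, if the junk lines all start with something in common, skip them instead:
import os
import pandas as pd

def skip_to(fle, junk, **kwargs):
    """Read fle with pandas, skipping the leading lines that begin with `junk`."""
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle) as f:
        pos = 0
        cur_line = f.readline()
        while cur_line.startswith(junk):
            pos = f.tell()  # remember where the next line begins
            cur_line = f.readline()
        f.seek(pos)  # rewind to the first line that is not junk
        return pd.read_csv(f, **kwargs)

df = skip_to("test.txt", "junk", sep="\t")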
Another simple way to get a dynamic skiprows is something like this, which worked for me:
import pandas as pd

# Open the file and read all of its lines
with open('test.txt', encoding='utf-8') as readfile:
    ls_readfile = readfile.readlines()
# Find the skiprows number: the index of the first line that starts with 'ID'
skip = next(filter(lambda x: x[1].startswith('ID'), enumerate(ls_readfile)))[0]
print(skip)
# Import the file with the separator \t, skipping the lines before the header
df = pd.read_csv('test.txt', skiprows=skip, sep='\t')
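If the file is large, the same idea can be done without loading every line into memory. The following is only a sketch: count_rows_before_header is a hypothetical helper name, and the temporary file just mirrors the mock data from the question.

```python
import os
import tempfile
import pandas as pd

def count_rows_before_header(path, header_prefix, encoding="utf-8"):
    """Return the index of the first line that starts with header_prefix."""
    with open(path, encoding=encoding) as f:
        for i, line in enumerate(f):
            if line.startswith(header_prefix):
                return i  # stop scanning as soon as the header is found
    raise ValueError(f"No line starts with {header_prefix!r}")

# Demo file modelled on the question's mock data (3 junk lines, then the table)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("unwantedjunkline1\nunwantedjunkline2\nunwantedjunkline3\n"
              "ID\tColumnA\tColumnB\tColumnC\n"
              "1\tA\tB\tC\n"
              "2\tA\tB\tC\n")
    path = tmp.name

skip = count_rows_before_header(path, "ID")
df = pd.read_csv(path, sep="\t", skiprows=skip)
os.remove(path)
print(skip, list(df.columns))
```

This assumes each junk line is a single physical line; junk containing quoted fields that span several lines could make the line count and pandas' row count disagree.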