
Pandas Bad Lines Warning Capture

Is there any way in Pandas to capture the warning produced by setting error_bad_lines = False and warn_bad_lines = True? For instance the following script:

import pandas as pd
from StringIO import StringIO
# Rows after the header are illustrative; line 4 has one field too many.
data = StringIO("""a,b,c
1,2,3
4,5,6
6,7,8,9""")
pd.read_csv(data, warn_bad_lines=True, error_bad_lines=False)

produces the warning:

Skipping line 4: expected 3 fields, saw 4

I'd like to store this output to a string so that I can eventually write it to a log file to keep track of records that are being skipped.

I tried using the warnings module, but this "warning" doesn't appear to be a warning in the traditional sense. I'm using Python 2.7 and Pandas 0.16.

Any help would be greatly appreciated.

eroma934 asked Sep 01 '15 15:09


People also ask

How do you skip bad lines in pandas?

In pandas 1.3 and later, error_bad_lines and warn_bad_lines are deprecated. Use on_bad_lines='warn' instead to skip bad lines while still reporting each one.
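
As a minimal sketch, assuming pandas 1.4 or later (where on_bad_lines also accepts a callable when engine="python"), the skipped rows can be collected in a list instead of only being reported. The sample data and the names record_bad_line and skipped are illustrative and not part of the original question.

```python
import io

import pandas as pd

# Illustrative data: the fourth line has one field too many.
data = io.StringIO("a,b,c\n1,2,3\n4,5,6\n6,7,8,9\n1,2,5\n")

skipped = []  # bad rows end up here, already split into fields


def record_bad_line(fields):
    """Remember the bad row, then return None so pandas drops it."""
    skipped.append(fields)
    return None


# A callable for on_bad_lines is only supported by the Python engine.
df = pd.read_csv(data, engine="python", on_bad_lines=record_bad_line)

for fields in skipped:
    print("Skipped row with %d fields: %r" % (len(fields), fields))
```

The contents of skipped can then be written to a log file with whatever logging setup is already in place.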

2 Answers

I can't help with Python versions older than 3, but I've had very good success with the following:

import pandas as pd
from contextlib import redirect_stderr
import io

# Redirect stderr to something we can report on.
# new_file_name, header_types and logger are defined elsewhere.
f = io.StringIO()
with redirect_stderr(f):
    df = pd.read_csv(
        new_file_name, header=None, error_bad_lines=False, warn_bad_lines=True, dtype=header_types
    )
if f.getvalue():
    logger.warning("Had parsing errors: {}".format(f.getvalue()))

I searched for this issue a number of times and kept being pointed to this question. Hope it helps someone else later on.
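
A rough sketch of why the warnings module sees nothing here while the redirect works, on the assumption that the message is written straight to sys.stderr rather than raised through warnings.warn. It uses only the standard library; noisy_parser is a hypothetical stand-in for the parser, not a pandas function.

```python
import io
import sys
import warnings
from contextlib import redirect_stderr


def noisy_parser():
    # Hypothetical stand-in: writes the message directly to stderr,
    # bypassing the warnings machinery entirely.
    sys.stderr.write("Skipping line 4: expected 3 fields, saw 4\n")


buffer = io.StringIO()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    with redirect_stderr(buffer):
        noisy_parser()

captured_text = buffer.getvalue()
print("warnings recorded:", len(caught))
print("stderr text captured:", repr(captured_text))
```

Nothing lands in caught because no warning object is ever created, while the redirected stream holds the full message.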

staylorx answered Nov 14 '22 18:11


I don't think this is implemented in pandas.
source1, source2

My solutions:

1. Pre- or post-processing

import pandas as pd
import csv

# Expected number of fields per row (assumed value; set it to match your file).
RECOMMENDED = 3

df = pd.read_csv('data.csv', warn_bad_lines=True, error_bad_lines=False)

# Compare the length of each row with the expected value:
with open('data.csv') as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    for row in reader:
        if len(row) != RECOMMENDED:
            print("Length of row is: %r" % len(row))
            print(row)

# Compare the length of each row with the number of columns in df:
lencols = len(df.columns)
print(lencols)

with open('data.csv') as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    for row in reader:
        if len(row) != lencols:
            print("Length of row is: %r" % len(row))
            print(row)
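
The same check can be wrapped in a small helper that returns the offending rows together with their line numbers, so they can be written to a log instead of printed. This is only a sketch using the standard csv module; find_bad_rows and the sample data are illustrative names and values, and the expected field count is assumed to be the length of the header row.

```python
import csv
import io


def find_bad_rows(lines):
    """Return (line_number, row) pairs whose field count differs from the header's."""
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return []  # empty input, nothing to check
    expected = len(header)
    return [(reader.line_num, row) for row in reader if len(row) != expected]


# Illustrative data: the fourth line has one field too many.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6\n6,7,8,9\n1,2,5\n")
bad_rows = find_bad_rows(sample)

# Build one string that can be handed to a logger or written to a file.
report = "\n".join(
    "line %d: saw %d fields: %r" % (num, len(row), row) for num, row in bad_rows
)
print(report)
```

Unlike the pandas option, this flags rows that are too short as well as rows that are too long.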

2. Replace sys.stdout and sys.stderr

import pandas as pd
import sys

class RedirectStdStreams(object):
    def __init__(self, stdout=None, stderr=None):
        self._stdout = stdout or sys.stdout
        self._stderr = stderr or sys.stderr

    def __enter__(self):
        self.old_stdout, self.old_stderr = sys.stdout, sys.stderr
        self.old_stdout.flush(); self.old_stderr.flush()
        sys.stdout, sys.stderr = self._stdout, self._stderr

    def __exit__(self, exc_type, exc_value, traceback):
        self._stdout.flush(); self._stderr.flush()
        sys.stdout = self.old_stdout
        sys.stderr = self.old_stderr

if __name__ == '__main__':

    # Replaces sys.stdout and sys.stderr, see http://stackoverflow.com/a/6796752/2901002
    # Everything written to either stream inside the block goes to log.txt.
    with open('log.txt', 'w') as log_file:
        with RedirectStdStreams(stdout=log_file, stderr=log_file):
            df = pd.read_csv('data.csv', warn_bad_lines=True, error_bad_lines=False)

jezrael answered Nov 14 '22 20:11
