I have a large dataframe containing, amongst other things, a (Norwegian) social security number. It is possible to get the date of birth out of this number via a special algorithm. However, every now and then an illegal social security number creeps into the database corrupting the calculation. What I would like to do is to tag every line having an illegal social security number, along with a log message showing the error raised. Consider the following, constructed, example <pre class="prettyprint"><code>import pandas as pd from datetime import date sample_data = pd.DataFrame({'id' : [1, 2, 3], \ 'sec_num' : [19790116, 19480631, 19861220]}) # The actual algorithm transforming the sec number is more complicated # this is just for illustration purposes def int2date(argdate: int): try: year = int(argdate / 10000) month = int((argdate % 10000) / 100) day = int(argdate % 100) return date(year, month, day) except ValueError: raise ValueError("Value:{0} not a legal date.".format(argdate)) </code></pre> I want to create the following output <pre class="prettyprint"><code> id sec_num date_of_birth is_in_error error_msg 0 1 19790116 1979-01-16 False 1 2 19480631 None True 19480631 is not a legal date 2 3 19861220 1986-12-20 False </code></pre> I have tried <pre class="prettyprint"><code>try: sample_data['date_of_birth'] = [int2date(sec_num) for \ sec_num in sample_data['sec_num']] sample_data['is_in_error'] = False sample_data['error_msg'] = '' except ValueError as e: sample_data['is_in_error'] = True sample_data['error_msg'] = str(e) </code></pre> but this produces this <pre class="prettyprint"><code> id sec_num is_in_error error_msg 0 1 19790116 True Value:19480631 not a legal date. 1 2 19480631 True Value:19480631 not a legal date. 2 3 19861220 True Value:19480631 not a legal date. </code></pre> I guess the problem is that I assign the <code>date_of_birth</code>-column in one operation and the errors in another. I am not sure how to catch and create the <code>is_in_error</code> and <code>error_msg</code> columns simulationously.

This happens because of the way you're populating the dataframe. <pre class="prettyprint"><code>sample_data['error_msg'] = str(e) </code></pre> Will actually overwrite the entire column with <code>str(e)</code>. This is probably the most efficient way to do it: <pre class="prettyprint"><code>def int2date(argdate: int): try: year = int(argdate / 10000) month = int((argdate % 10000) / 100) day = int(argdate % 100) return date(year, month, day) except ValueError as e: pass # you could write the row and the error to your logs here df['date_of_birth'] = df.sec_num.apply(int2date) df['is_in_error'] = df.date_of_birth.isnull() </code></pre> However if you also want to write the errors to the dataframe, you can use this approach although it might be much slower (there might be faster solutions to this). <pre class="prettyprint"><code>df['date_of_birth'] = None df['error_msg'] = None df['is_in_error'] = False for i, row in df.iterrows(): try: date_of_birth = int2date(row['sec_num']) df.set_value(i, 'date_of_birth', date_of_birth) except ValueError as e: df.set_value(i, 'is_in_error', True) df.set_value(i, 'error_msg', str(e)) </code></pre> This handles each row separately and will only write the error to the correct index instead of updating the entire column.

How to tag corrupted data in dataframe after an error has been raised

Tags:

python

pandas

error-handling

I have a large dataframe containing, amongst other things, a (Norwegian) social security number. It is possible to get the date of birth out of this number via a special algorithm. However, every now and then an illegal social security number creeps into the database corrupting the calculation.

What I would like to do is to tag every line having an illegal social security number, along with a log message showing the error raised.

Consider the following, constructed, example

import pandas as pd
from datetime import date

sample_data = pd.DataFrame({'id' : [1, 2, 3], \
                            'sec_num' : [19790116, 19480631, 19861220]})


# The actual algorithm transforming the sec number is more complicated
# this is just for illustration purposes
def int2date(argdate: int):

    try:
        year = int(argdate / 10000)
        month = int((argdate % 10000) / 100)
        day = int(argdate % 100)
        return date(year, month, day)
    except ValueError:
        raise ValueError("Value:{0} not a legal date.".format(argdate))

I want to create the following output

   id   sec_num date_of_birth  is_in_error                    error_msg
0   1  19790116    1979-01-16        False  
1   2  19480631          None         True 19480631 is not a legal date         
2   3  19861220    1986-12-20        False

I have tried

try:
    sample_data['date_of_birth'] = [int2date(sec_num) for \
                   sec_num in sample_data['sec_num']]
    sample_data['is_in_error'] = False
    sample_data['error_msg'] = ''
 except ValueError as e:
    sample_data['is_in_error'] = True
    sample_data['error_msg'] = str(e)

but this produces this

   id   sec_num  is_in_error                         error_msg
0   1  19790116         True  Value:19480631 not a legal date.
1   2  19480631         True  Value:19480631 not a legal date.
2   3  19861220         True  Value:19480631 not a legal date.

I guess the problem is that I assign the date_of_birth-column in one operation and the errors in another. I am not sure how to catch and create the is_in_error and error_msg columns simulationously.

469

asked Sep 13 '17 12:09

mortysporty

2 Answers

This happens because of the way you're populating the dataframe.

sample_data['error_msg'] = str(e)

Will actually overwrite the entire column with str(e).

This is probably the most efficient way to do it:

def int2date(argdate: int):

    try:
        year = int(argdate / 10000)
        month = int((argdate % 10000) / 100)
        day = int(argdate % 100)
        return date(year, month, day)
    except ValueError as e:
        pass # you could write the row and the error to your logs here

df['date_of_birth'] = df.sec_num.apply(int2date)
df['is_in_error'] = df.date_of_birth.isnull()

However if you also want to write the errors to the dataframe, you can use this approach although it might be much slower (there might be faster solutions to this).

df['date_of_birth'] = None
df['error_msg'] = None
df['is_in_error'] = False
for i, row in df.iterrows():
    try:
        date_of_birth = int2date(row['sec_num'])
        df.set_value(i, 'date_of_birth', date_of_birth)
    except ValueError as e:
        df.set_value(i, 'is_in_error', True)
        df.set_value(i, 'error_msg', str(e))

This handles each row separately and will only write the error to the correct index instead of updating the entire column.

answered Nov 14 '22 21:11

Jan Zeiseweis

You are in the realm of handling large data. Throwing exceptions out of a loop is often not the best idea there because it will normally abort the loop. As many others you do not seem to want that.

To achieve that a typical approach is to use a function which does not throw the exception but which returns it instead.

def int2date(argdate: int):
    try:
        year = int(argdate / 10000)
        month = int((argdate % 10000) / 100)
        day = int(argdate % 100)
        return date(year, month, day)
    except ValueError:
        return ValueError("Value:{0} not a legal date.".format(argdate))

With this you simply can map a list of values to the function and will receive the exceptions (which lack a trace of course, but in such a case this should not be a problem) as values in the result list:

You then can walk over the list, replace the found exceptions by None values and fill other columns instead with the message contained in the exception.

answered Nov 14 '22 21:11

Alfe

Related questions
                            
                                Get the auto id for inserted row into Redshift table using psycopg2 in Python
                            
                                Tensorflow Java API - sophisticated example
                            
                                Stop a thread: flag vs. Event [duplicate]
                            
                                why Cython compiler generates a so with suffix 'cpython-35m-x86_64-linux-gnu.so'
                            
                                query an object using uuid in django
                            
                                Color gets dull: opencv cv2.imread cv2.imwrite
                            
                                Best way to access and close a postgres database using python dataset
                            
                                Initializing Cython objects with existing C Objects
                            
                                InvalidArgumentError: Expected dimension in the range [-1, 1) but got 1
                            
                                Read a Google Sheet with Python using API Key instead of OAuth Client ID [Python]
                            
                                Python Reading from a file to create a weighted directed graph using networkx
                            
                                Getting error: Cannot reshape array of size 122304 into shape (52,28,28)
                            
                                Using Pandas json_normalize on nested Json with arrays
                            
                                Django Full Text Search Not Matching Partial Words
                            
                                Python Firebase issue No module named firebase_admin
                            
                                How can I JSON serialize an object from google's natural language API? (No __dict__ attribute)
                            
                                Django - missing 1 required positional argument: 'request'
                            
                                How can you implement a C callable from Numba for efficient integration with nquad?
                            
                                Numba: calling jit with explicit signature using arguments with default values
                            
                                Maximum value of timestamp

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With