Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to tag corrupted data in dataframe after an error has been raised

I have a large dataframe containing, amongst other things, a (Norwegian) social security number. It is possible to get the date of birth out of this number via a special algorithm. However, every now and then an illegal social security number creeps into the database corrupting the calculation.

What I would like to do is to tag every line having an illegal social security number, along with a log message showing the error raised.

Consider the following, constructed, example

import pandas as pd
from datetime import date

sample_data = pd.DataFrame({'id' : [1, 2, 3], \
                            'sec_num' : [19790116, 19480631, 19861220]})


# The actual algorithm transforming the sec number is more complicated
# this is just for illustration purposes
def int2date(argdate: int):

    try:
        year = int(argdate / 10000)
        month = int((argdate % 10000) / 100)
        day = int(argdate % 100)
        return date(year, month, day)
    except ValueError:
        raise ValueError("Value:{0} not a legal date.".format(argdate))

I want to create the following output

   id   sec_num date_of_birth  is_in_error                    error_msg
0   1  19790116    1979-01-16        False  
1   2  19480631          None         True 19480631 is not a legal date         
2   3  19861220    1986-12-20        False 

I have tried

try:
    sample_data['date_of_birth'] = [int2date(sec_num) for \
                   sec_num in sample_data['sec_num']]
    sample_data['is_in_error'] = False
    sample_data['error_msg'] = ''
 except ValueError as e:
    sample_data['is_in_error'] = True
    sample_data['error_msg'] = str(e)

but this produces this

   id   sec_num  is_in_error                         error_msg
0   1  19790116         True  Value:19480631 not a legal date.
1   2  19480631         True  Value:19480631 not a legal date.
2   3  19861220         True  Value:19480631 not a legal date.

I guess the problem is that I assign the date_of_birth-column in one operation and the errors in another. I am not sure how to catch and create the is_in_error and error_msg columns simulationously.

like image 469
mortysporty Avatar asked Sep 13 '17 12:09

mortysporty


People also ask

How do you handle corrupted data?

Try to get the file running again by trying to fix the problem. Use the build-in OS error tools to fix the disk and after successfully doing that, try to open it again. Use a data recovery tool. In some cases, using a software tool will be able to fix a corrupted file.

What happens when data corrupted?

Data corruption refers to any unwanted change that happens to a file during storage, transmission, or processing. A corrupted file can become unusable, inaccurate, unreadable, or in some way inaccessible to a user or a related app.

Why does my PS4 keep saying corrupted data?

The corrupted data PS4 is also caused by the installation process of the game being interrupted. Factors of this issue include your internet connection, the device's storage capacity, incompatible games, and many more. Delete the game, don't worry as your saved data will be safe.


2 Answers

This happens because of the way you're populating the dataframe.

sample_data['error_msg'] = str(e)

Will actually overwrite the entire column with str(e).

This is probably the most efficient way to do it:

def int2date(argdate: int):

    try:
        year = int(argdate / 10000)
        month = int((argdate % 10000) / 100)
        day = int(argdate % 100)
        return date(year, month, day)
    except ValueError as e:
        pass # you could write the row and the error to your logs here

df['date_of_birth'] = df.sec_num.apply(int2date)
df['is_in_error'] = df.date_of_birth.isnull()

However if you also want to write the errors to the dataframe, you can use this approach although it might be much slower (there might be faster solutions to this).

df['date_of_birth'] = None
df['error_msg'] = None
df['is_in_error'] = False
for i, row in df.iterrows():
    try:
        date_of_birth = int2date(row['sec_num'])
        df.set_value(i, 'date_of_birth', date_of_birth)
    except ValueError as e:
        df.set_value(i, 'is_in_error', True)
        df.set_value(i, 'error_msg', str(e))

This handles each row separately and will only write the error to the correct index instead of updating the entire column.

like image 66
Jan Zeiseweis Avatar answered Nov 14 '22 21:11

Jan Zeiseweis


You are in the realm of handling large data. Throwing exceptions out of a loop is often not the best idea there because it will normally abort the loop. As many others you do not seem to want that.

To achieve that a typical approach is to use a function which does not throw the exception but which returns it instead.

def int2date(argdate: int):
    try:
        year = int(argdate / 10000)
        month = int((argdate % 10000) / 100)
        day = int(argdate % 100)
        return date(year, month, day)
    except ValueError:
        return ValueError("Value:{0} not a legal date.".format(argdate))

With this you simply can map a list of values to the function and will receive the exceptions (which lack a trace of course, but in such a case this should not be a problem) as values in the result list:

You then can walk over the list, replace the found exceptions by None values and fill other columns instead with the message contained in the exception.

like image 41
Alfe Avatar answered Nov 14 '22 21:11

Alfe