I have a large dataframe containing, amongst other things, a (Norwegian) social security number. It is possible to get the date of birth out of this number via a special algorithm. However, every now and then an illegal social security number creeps into the database corrupting the calculation.
What I would like to do is to tag every line having an illegal social security number, along with a log message showing the error raised.
Consider the following, constructed, example
import pandas as pd
from datetime import date
sample_data = pd.DataFrame({'id' : [1, 2, 3], \
'sec_num' : [19790116, 19480631, 19861220]})
# The actual algorithm transforming the sec number is more complicated
# this is just for illustration purposes
def int2date(argdate: int):
try:
year = int(argdate / 10000)
month = int((argdate % 10000) / 100)
day = int(argdate % 100)
return date(year, month, day)
except ValueError:
raise ValueError("Value:{0} not a legal date.".format(argdate))
I want to create the following output
id sec_num date_of_birth is_in_error error_msg
0 1 19790116 1979-01-16 False
1 2 19480631 None True 19480631 is not a legal date
2 3 19861220 1986-12-20 False
I have tried
try:
sample_data['date_of_birth'] = [int2date(sec_num) for \
sec_num in sample_data['sec_num']]
sample_data['is_in_error'] = False
sample_data['error_msg'] = ''
except ValueError as e:
sample_data['is_in_error'] = True
sample_data['error_msg'] = str(e)
but this produces this
id sec_num is_in_error error_msg
0 1 19790116 True Value:19480631 not a legal date.
1 2 19480631 True Value:19480631 not a legal date.
2 3 19861220 True Value:19480631 not a legal date.
I guess the problem is that I assign the date_of_birth
-column in one operation and the errors in another. I am not sure how to catch and create the is_in_error
and error_msg
columns simulationously.
Try to get the file running again by trying to fix the problem. Use the build-in OS error tools to fix the disk and after successfully doing that, try to open it again. Use a data recovery tool. In some cases, using a software tool will be able to fix a corrupted file.
Data corruption refers to any unwanted change that happens to a file during storage, transmission, or processing. A corrupted file can become unusable, inaccurate, unreadable, or in some way inaccessible to a user or a related app.
The corrupted data PS4 is also caused by the installation process of the game being interrupted. Factors of this issue include your internet connection, the device's storage capacity, incompatible games, and many more. Delete the game, don't worry as your saved data will be safe.
This happens because of the way you're populating the dataframe.
sample_data['error_msg'] = str(e)
Will actually overwrite the entire column with str(e)
.
This is probably the most efficient way to do it:
def int2date(argdate: int):
try:
year = int(argdate / 10000)
month = int((argdate % 10000) / 100)
day = int(argdate % 100)
return date(year, month, day)
except ValueError as e:
pass # you could write the row and the error to your logs here
df['date_of_birth'] = df.sec_num.apply(int2date)
df['is_in_error'] = df.date_of_birth.isnull()
However if you also want to write the errors to the dataframe, you can use this approach although it might be much slower (there might be faster solutions to this).
df['date_of_birth'] = None
df['error_msg'] = None
df['is_in_error'] = False
for i, row in df.iterrows():
try:
date_of_birth = int2date(row['sec_num'])
df.set_value(i, 'date_of_birth', date_of_birth)
except ValueError as e:
df.set_value(i, 'is_in_error', True)
df.set_value(i, 'error_msg', str(e))
This handles each row separately and will only write the error to the correct index instead of updating the entire column.
You are in the realm of handling large data. Throwing exceptions out of a loop is often not the best idea there because it will normally abort the loop. As many others you do not seem to want that.
To achieve that a typical approach is to use a function which does not throw the exception but which returns it instead.
def int2date(argdate: int):
try:
year = int(argdate / 10000)
month = int((argdate % 10000) / 100)
day = int(argdate % 100)
return date(year, month, day)
except ValueError:
return ValueError("Value:{0} not a legal date.".format(argdate))
With this you simply can map a list of values to the function and will receive the exceptions (which lack a trace of course, but in such a case this should not be a problem) as values in the result list:
You then can walk over the list, replace the found exceptions by None
values and fill other columns instead with the message contained in the exception.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With