I get value errors when trying to read in the CSV file to my datatype. I need to be sure that it works and that every line is read in and is correct.
Errors are for example:
Pandas: ValueError: Integer column has NA values in column 2
I am trying to cast to integer in Pandas Python library, but there is a missing value.
However, the CSV file that I read in seems to have some erroneous entries, as it consists of manually entered test results.
I read in using this command:
test = pd.read_csv(
"test.csv",
sep=";",
names=pandasframe_names,
dtype=pandasframe_datatypes,
skiprows=1,
)
names is A, B, C, D and E and is defined correctly.
If there is an erroneous entry, I need a way of handling this without losing the full row.
So here is my case:
I have a Pandas dataframe that reads in a CSV table with 5 columns with the
headers A, B, C, D, E. I skip row one with the parameter skiprows=1
pandas_datatypes = {
"A": pd.np.int64,
"B": pd.np.int64,
"C": pd.np.float64,
"D": object,
"E": object,
}
My row has 5 column and the first 2 are int64 and the 3rd is float64 and the next 2 are object (e.g. string).
Those are equivalent to my dtype when I read it in. Meaning dtype=pandas_datatypes
Now I have entries like so:
entry 1: 5; 5; 2.2; pedagogy; teacher (correct)
entry 2: 8; 7.0; 2.2; pedagogy; teacher (incorrect, as second is float instead of int)
entry 3: NA; 5; 2.2; pedagogy; teacher (incorrect, as first value has entered NA as is missing)
entry 4: none; 5; 2.2; pedagogy; teacher (incorrect, as first value has entered none as is missing)
entry 5: 8; 5; 2; pedagogy; teacher (incorrect, as third is int instead of float)
How do I best handle this and what do I have to add to make this work for sure? In case that there is one incorrect entry, I don't want to lose the full line. Should I enter NULL? But then I would need to flag this for someone to manually look at it.
Pandas now has extension types, for which integer support NA values. You will get pd.NA in those fields.
https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes
Use Pandas Int64 type, you'll be fine!
pandas_datatypes={'A': 'Int64', 'B': 'Int64', 'C':pd.np.float64, 'D':object, 'E':object}
Just tested it with pandas 1.3.5, works like a charm.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With