Pandas: ValueError: Integer column has NA values in column 2

Question

I get a ValueError when I read a CSV file into a DataFrame with explicit dtypes. I need every line to be read in, and read in correctly.

Errors are for example:

Pandas: ValueError: Integer column has NA values in column 2

I am casting a column to integer with pandas, but the column contains a missing value.

The CSV file seems to contain some erroneous entries because it consists of manually entered test results.

I read in using this command:

test = pd.read_csv(
    "test.csv",
    sep=";",
    names=pandasframe_names,
    dtype=pandas_datatypes,
    skiprows=1,
)

pandasframe_names holds the column names A, B, C, D and E and is defined correctly.

If there is an erroneous entry, I need a way of handling this without losing the full row.

So here is my case: I read a CSV table with 5 columns, headed A, B, C, D, E, into a pandas DataFrame. I skip the first row with skiprows=1. The dtypes are:

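# String aliases replace pd.np.int64 / pd.np.float64: the pd.np alias
# was deprecated in pandas 1.0 and removed in 2.0.
pandas_datatypes = {
    "A": "int64",
    "B": "int64",
    "C": "float64",
    "D": object,
    "E": object,
}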

Each row has 5 columns: the first two are int64, the third is float64, and the last two are object (e.g. strings).

These match the dtype I pass when reading the file, i.e. dtype=pandas_datatypes.

Now I have entries like so:

entry 1: 5; 5; 2.2; pedagogy; teacher (correct)
entry 2: 8; 7.0; 2.2; pedagogy; teacher (incorrect: the second value is a float instead of an int)
entry 3: NA; 5; 2.2; pedagogy; teacher (incorrect: the first value is missing and was entered as NA)
entry 4: none; 5; 2.2; pedagogy; teacher (incorrect: the first value is missing and was entered as none)
entry 5: 8; 5; 2; pedagogy; teacher (incorrect: the third value is an int instead of a float)

How do I best handle this, and what do I have to add to make it work reliably? If a row contains one incorrect entry, I don't want to lose the whole row. Should I enter NULL instead? If so, I would need to flag the row so that someone can review it manually.

Etienne · Accepted Answer

Pandas now has extension types, including a nullable integer type that supports NA values. Missing entries in those columns come back as pd.NA.

https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes
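As a small illustration of the difference, the sketch below builds the same values twice: once with the default inference and once with the nullable "Int64" dtype. The sample list is made up for the example and is not taken from the question's file.

```python
import pandas as pd

# Made-up sample: two integers and one missing entry.
values = [5, None, 8]

# Default inference has no integer NA, so the column falls back to float64.
default_series = pd.Series(values)

# The nullable extension dtype keeps integers and stores the gap as pd.NA.
nullable_series = pd.Series(values, dtype="Int64")

print(default_series.dtype)
print(nullable_series.dtype)
print(nullable_series[1] is pd.NA)
```

Note the capital I: "Int64" names the nullable extension type, while "int64" is the plain NumPy dtype that cannot hold a missing value.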

Use the pandas Int64 type (capital I) and you'll be fine!

pandas_datatypes = {'A': 'Int64', 'B': 'Int64', 'C': 'float64', 'D': object, 'E': object}

Just tested it with pandas 1.3.5, works like a charm.
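The question also asks how to keep every row and flag bad entries for manual review. One possible way to do that, which is a different route from the dtype change above, is to read all fields as text first and convert afterwards. The following is only a sketch under several assumptions: the five sample rows from the question are inlined through io.StringIO in place of test.csv, skipinitialspace=True is added because those rows have a space after each semicolon, and the needs_review column name is hypothetical.

```python
import io

import pandas as pd

# The five sample rows from the question, inlined in place of test.csv.
raw_csv = """A; B; C; D; E
5; 5; 2.2; pedagogy; teacher
8; 7.0; 2.2; pedagogy; teacher
NA; 5; 2.2; pedagogy; teacher
none; 5; 2.2; pedagogy; teacher
8; 5; 2; pedagogy; teacher
"""

names = ["A", "B", "C", "D", "E"]

# Read every field as text so that no row is rejected at parse time.
# keep_default_na=False keeps markers such as "NA" as literal strings.
raw = pd.read_csv(
    io.StringIO(raw_csv),
    sep=";",
    names=names,
    dtype=object,
    skiprows=1,
    skipinitialspace=True,
    keep_default_na=False,
)

df = raw.copy()
needs_review = pd.Series(False, index=df.index)  # hypothetical flag column

# Convert the numeric columns; anything unparseable becomes NaN and is flagged.
for col in ["A", "B", "C"]:
    numeric = pd.to_numeric(raw[col], errors="coerce")
    needs_review |= numeric.isna()
    df[col] = numeric

# Integer columns: accept whole-number floats such as 7.0, flag real fractions.
for col in ["A", "B"]:
    fractional = df[col].notna() & (df[col] % 1 != 0)
    needs_review |= fractional
    df[col] = df[col].mask(fractional).astype("Int64")

df["needs_review"] = needs_review
print(df)
```

With these sample rows, entries 3 and 4 end up with pd.NA in column A and are flagged, while 7.0 in entry 2 and 2 in entry 5 are converted without a flag. All five rows are kept.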

Tags:

python

pandas

Smiley
