Mixed types when reading csv files. Causes, fixes and consequences

Tags: python, pandas, csv
What exactly happens when Pandas issues this warning? Should I worry about it?

In [1]: read_csv(path_to_my_file)
/Users/josh/anaconda/envs/py3k/lib/python3.3/site-packages/pandas/io/parsers.py:1139: 
DtypeWarning: Columns (4,13,29,51,56,57,58,63,87,96) have mixed types. Specify dtype option on import or set low_memory=False.              

  data = self._reader.read(nrows)

I assume that this means that Pandas is unable to infer the type from values on those columns. But if that is the case, what type does Pandas end up using for those columns?

Also, can the type always be recovered after the fact? (after getting the warning), or are there cases where I may not be able to recover the original info correctly, and I should pre-specify the type?

Finally, how exactly does low_memory=False fix the problem?

asked Aug 25 '14 by Amelio Vazquez-Reina


1 Answer

Revisiting mbatchkarov's link, low_memory is not deprecated. It is now documented:

low_memory : boolean, default True

Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser)
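As a minimal sketch of the documentation's advice (the column name `value` and the sample data are made up for illustration), pre-specifying the `dtype` parameter forces a single type for the whole column and suppresses the warning:

```python
from io import StringIO

import pandas as pd

# A column that would otherwise get mixed inference: numbers plus a string.
csv_data = StringIO("id,value\n1,10\n2,x\n3,30\n")

# Force the ambiguous column to str, so every entry has one type.
df_typed = pd.read_csv(csv_data, dtype={"value": str})

print(df_typed["value"].tolist())  # ['10', 'x', '30'] -- all str
```

The trade-off is that you must know (or not mind) the target type up front; `str` is the safe choice, since it loses no information.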

I have asked what resulting in mixed type inference means, and chris-b1 answered:

It is deterministic - types are consistently inferred based on what's in the data. That said, the internal chunksize is not a fixed number of rows, but instead bytes, so whether you get a mixed dtype warning or not can feel a bit random.

So, what type does Pandas end up using for those columns?

This is answered by the following self-contained example:

In [48]: import pandas as pd
         from io import StringIO

In [49]: df = pd.read_csv(StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string'])))
DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.

In [50]: type(df.loc[524287, '0'])
Out[50]: int

In [51]: type(df.loc[524288, '0'])
Out[51]: str

The first chunk of the csv data contained only integers, so it was converted to int; the second chunk also contained a string, so conversion failed and all of its entries were kept as str. The resulting column therefore holds both int and str objects, which is exactly the "mixed types" the warning refers to.

Can the type always be recovered after the fact? (after getting the warning)?

I guess re-exporting to csv and re-reading with low_memory=False should do the job.
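Alternatively, a sketch of an in-memory recovery (assuming the desired type is numeric): `pd.to_numeric` can convert the numeric entries of a mixed object column directly, without a round trip through csv:

```python
import pandas as pd

# A mixed column like the one read_csv produced above: ints and strings.
s = pd.Series([1, 2, "3", "a string"], dtype=object)

# Numeric-looking entries are converted; anything else becomes NaN.
recovered = pd.to_numeric(s, errors="coerce")

print(recovered.tolist())  # [1.0, 2.0, 3.0, nan]
```

With `errors="coerce"` the genuinely non-numeric entries are lost (turned into NaN), so this is only a full recovery when those entries were noise to begin with; otherwise keep the column as str.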

How exactly does low_memory=False fix the problem?

It reads all of the file before deciding the type, therefore needing more memory.
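To illustrate the whole-column inference that `low_memory=False` guarantees (a small made-up file, which would fit in one chunk either way, so this only demonstrates the resulting behavior, not the chunking difference):

```python
from io import StringIO

import pandas as pd

data = "col\n1\n2\na string\n"

# The whole column is type-inferred at once: since one value is not numeric,
# conversion to int fails for the column and every entry is kept as str.
df_whole = pd.read_csv(StringIO(data), low_memory=False)

print([type(v).__name__ for v in df_whole["col"]])  # ['str', 'str', 'str']
```

Every entry gets the same type, at the cost of holding the full file in memory before deciding.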

answered Oct 26 '22 by Robert Pollak