 

Data Type Recognition/Guessing of CSV data in Python


My problem is in the context of processing data from large CSV files.

I'm looking for the most efficient way to determine (that is, guess) the data type of a column based on the values found in that column. I'm potentially dealing with very messy data. Therefore, the algorithm should be error-tolerant to some extent.

Here's an example:

    arr1 = ['0.83', '-0.26', '-', '0.23', '11.23']               # ==> recognize as float
    arr2 = ['1', '11', '-1345.67', '0', '22']                    # ==> recognize as int
    arr3 = ['2/7/1985', 'Jul 03 1985, 00:00:00', '', '4/3/2011'] # ==> recognize as date
    arr4 = ['Dog', 'Cat', '0.13', 'Mouse']                       # ==> recognize as str

Bottom line: I'm looking for a Python package or an algorithm that can detect either

  • the schema of a CSV file, or even better
  • the data type of an individual column as an array

The question "Method for guessing type of data currently represented as strings" goes in a similar direction. I'm worried about performance, though, since I may be dealing with many large spreadsheets (which is where the data comes from).

poezn asked Jul 26 '11 03:07





1 Answer

You may be interested in this Python library, which does exactly this kind of type guessing on CSV and XLS files for you:

  • https://github.com/okfn/messytables
  • https://messytables.readthedocs.org/ - docs

It scales happily to very large files, to data streamed off the internet, and so on.
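As a rough illustration of why type guessing does not have to get slower as files grow: the guess can be made from a bounded sample of rows read lazily from a stream. The sketch below uses only the standard library and is not messytables code; the function name `sample_columns` and the `sample_size` parameter are made up for this example, and it assumes the first row is a header with unique column names.

```python
import csv
import io
from itertools import islice


def sample_columns(lines, sample_size=1000):
    """Read at most `sample_size` data rows and return {header: [values]}.

    `lines` is any iterable of text lines (an open file, a streamed
    response, ...), so the whole input is never loaded into memory.
    Assumes the first row is a header with unique names (hypothetical helper).
    """
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return {}
    columns = {name: [] for name in header}
    for row in islice(reader, sample_size):
        # zip() stops at the shorter of header/row, so ragged rows
        # contribute only the cells that line up with a header.
        for name, value in zip(header, row):
            columns[name].append(value)
    return columns


# Small in-memory stand-in for a large file or network stream.
data = io.StringIO("price,qty\n0.83,1\n-,11\n0.23,0\n")
cols = sample_columns(data, sample_size=2)
print(cols)
```

Each list in the result can then be handed to whatever per-column guesser you use; only `sample_size` rows are ever held in memory.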

There is also an even simpler wrapper library, dataconverters, which includes a command-line tool: http://okfnlabs.org/dataconverters/ (and an online service: https://github.com/okfn/dataproxy).

The core algorithm that does the type guessing is here: https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164
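To give a feel for the general idea without reading the library source, here is a minimal standard-library sketch of error-tolerant guessing by majority vote. It is not the messytables implementation: the names `guess_cell_type` and `guess_column_type`, the `MISSING` markers, the `DATE_FORMATS` list and the 0.75 threshold are all assumptions chosen to fit the arrays in the question.

```python
from collections import Counter
from datetime import datetime

# Assumed date formats; extend for your own data (month names via %b
# are locale-dependent).
DATE_FORMATS = ("%m/%d/%Y", "%b %d %Y, %H:%M:%S")
# Cells treated as missing and left out of the vote (an assumption).
MISSING = {"", "-"}


def _parses(func, value):
    """Return True if func(value) succeeds without raising ValueError."""
    try:
        func(value)
        return True
    except ValueError:
        return False


def guess_cell_type(value):
    """Return the most specific type name a single cell parses as."""
    if _parses(int, value):
        return "int"
    if _parses(float, value):  # note: also accepts 'nan' and 'inf'
        return "float"
    for fmt in DATE_FORMATS:
        if _parses(lambda v: datetime.strptime(v, fmt), value):
            return "date"
    return "str"


def guess_column_type(values, threshold=0.75):
    """Majority vote over non-missing cells; fall back to 'str'."""
    cells = [v.strip() for v in values if v.strip() not in MISSING]
    if not cells:
        return "str"
    votes = Counter(guess_cell_type(v) for v in cells)
    winner, count = votes.most_common(1)[0]
    # Tolerate some dirty cells, but require a clear majority.
    return winner if count / len(cells) >= threshold else "str"


arr1 = ['0.83', '-0.26', '-', '0.23', '11.23']
arr2 = ['1', '11', '-1345.67', '0', '22']
arr3 = ['2/7/1985', 'Jul 03 1985, 00:00:00', '', '4/3/2011']
arr4 = ['Dog', 'Cat', '0.13', 'Mouse']
for arr in (arr1, arr2, arr3, arr4):
    print(guess_column_type(arr))
```

With these settings the four arrays from the question come out as float, int, date and str. Note that the stray '-1345.67' in arr2 is simply outvoted, so a real pipeline still has to decide what to do with such cells when converting. Lowering the threshold makes the guess more tolerant of dirty data, at the cost of more wrong guesses.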

Rufus Pollock answered Sep 17 '22 11:09
