I've looked at a number of questions on this site and cannot find an answer to this one: how do you create multiple NEW tables in a database (PostgreSQL in my case) from multiple CSV source files, such that the new tables' columns accurately reflect the data in the CSV columns?
I can write the CREATE TABLE syntax just fine, and I can read the rows and values of a CSV file, but does a method already exist to inspect a CSV file and accurately determine each column's type? Before building my own, I wanted to check whether this already exists.
If it doesn't already exist, my idea would be to use Python with the csv and psycopg2 modules to build a script that would inspect each CSV file, infer a suitable type for each column, and then generate and execute the corresponding CREATE TABLE statements.
Does a tool like this already exist within SQL, PostgreSQL, or Python, or is there another application I should be using to accomplish this (similar to pgAdmin3)?
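To make the idea concrete, here is a rough sketch of the kind of script I have in mind; infer_pg_type, the sample size, and the file and table names are illustrative placeholders, not an existing API:

import csv

def infer_pg_type(values):
    """Very naive guess: INTEGER, then DOUBLE PRECISION, else TEXT.
    Empty strings are treated as NULLs and ignored."""
    values = [v for v in values if v.strip()]
    if not values:
        return "TEXT"
    for pg_type, parse in (("INTEGER", int), ("DOUBLE PRECISION", float)):
        try:
            for v in values:
                parse(v)
            return pg_type
        except ValueError:
            pass
    return "TEXT"

def create_table_sql(csv_path, table_name, sample_rows=1000):
    """Read the header plus a sample of rows and build a CREATE TABLE statement."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        sample = [row for row, _ in zip(reader, range(sample_rows))]
    cols = []
    for i, name in enumerate(header):
        col_values = [r[i] for r in sample if i < len(r)]
        cols.append('"%s" %s' % (name, infer_pg_type(col_values)))
    return 'CREATE TABLE "%s" (%s);' % (table_name, ", ".join(cols))

# Illustrative usage; "people.csv" and the connection string are placeholders.
# import psycopg2
# conn = psycopg2.connect("dbname=mydb")
# with conn, conn.cursor() as cur:
#     cur.execute(create_table_sql("people.csv", "people"))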
A CSV file typically stores tabular data (numbers and text) in plain text, with each line having the same number of fields. The CSV file format is not fully standardized.
I have been dealing with something similar, and ended up writing my own module to sniff datatypes by inspecting the source file. There is some wisdom among all the naysayers, but there are also cases where this is worth doing, particularly when you have no control over the input data format (e.g. working with government open data). The main lesson I learned in the process: if you can avoid automatic type detection, it's worth avoiding, but that's not always practical, so I hope the sketch below is of some help.
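My module isn't shown here, but this is the general shape such a sniffer can take; the NULL_TOKENS set, the single date format, and the candidate types are illustrative choices, not a standard:

from datetime import datetime

# Strings we treat as NULLs (an illustrative choice, not a standard).
NULL_TOKENS = {"", "NA", "N/A", "null", "NULL"}

def parses_as(value, pg_type):
    """True if `value` parses as the given PostgreSQL type."""
    try:
        if pg_type == "INTEGER":
            int(value)
        elif pg_type == "DOUBLE PRECISION":
            float(value)
        elif pg_type == "DATE":
            datetime.strptime(value, "%Y-%m-%d")  # one format; real data needs more
        return True
    except ValueError:
        return False

def sniff_column(values):
    """Keep a set of candidate types; each non-NULL value that fails to parse
    as a type eliminates that candidate. TEXT is always the fallback."""
    candidates = {"INTEGER", "DOUBLE PRECISION", "DATE"}
    saw_value = False
    for v in values:
        if v.strip() in NULL_TOKENS:
            continue              # NULLs carry no type information
        saw_value = True
        candidates = {t for t in candidates if parses_as(v, t)}
        if not candidates:
            return "TEXT"         # short-circuit once only TEXT is left
    if not saw_value:
        return "TEXT"             # an all-NULL column tells us nothing
    # Prefer the most specific surviving type.
    for t in ("INTEGER", "DATE", "DOUBLE PRECISION"):
        if t in candidates:
            return t
    return "TEXT"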
It seems that you need to know the structure up front. Just read the first line to see how many columns you have.
CSV does not carry any type information, so types have to be deduced from the data itself.
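For example (the file name is a placeholder), Python's csv module hands every field back as a string, so the first line gives you the column count and nothing gives you the types:

import csv

with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)                      # first line: how many columns
    first_row = next(reader)

print(len(header))                             # column count
print(first_row)                               # e.g. ['1', '3.14', '2021-01-01']
print(all(isinstance(v, str) for v in first_row))  # True -- types must be deduced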
Improving on the slightly wrong answer before: you can create a temporary table with the right number of TEXT columns, fill it with the data, and then process the data.
BEGIN;

-- One TEXT column per CSV column; ON COMMIT DROP removes the table automatically.
CREATE TEMPORARY TABLE foo(a TEXT, b TEXT, c TEXT, ...) ON COMMIT DROP;

-- Server-side COPY: 'file.csv' must be readable by the PostgreSQL server process.
COPY foo FROM 'file.csv' WITH CSV;

<do the work>

END;
A word of warning: with server-side COPY the file needs to be accessible by the PostgreSQL server process itself, which creates some security issues. The other option is to feed it through STDIN, which keeps the file on the client.
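As a rough sketch of the STDIN route with psycopg2 (connection string, file name, and the target column types are placeholders): copy_expert streams a client-side file over the connection, so the server never needs to read the file itself.

import psycopg2

conn = psycopg2.connect("dbname=mydb")         # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute("CREATE TEMPORARY TABLE foo (a TEXT, b TEXT, c TEXT)")
    with open("file.csv") as f:
        # COPY ... FROM STDIN reads the file on the client side; HEADER
        # assumes the first line is a header row and skips it.
        cur.copy_expert("COPY foo FROM STDIN WITH CSV HEADER", f)
    # "do the work": cast the TEXT columns into a properly typed table.
    cur.execute("""
        CREATE TABLE bar AS
        SELECT a::integer, b::date, c
        FROM foo
    """)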
HTH