 

Create SQL table with correct column types from CSV

I've looked at a number of questions on this site and cannot find an answer to this one: how do I create multiple NEW tables in a database (in my case PostgreSQL) from multiple CSV source files, such that the new table columns accurately reflect the data within the CSV columns?

I can write the CREATE TABLE syntax just fine, and I can read the rows/values of a CSV file(s), but does a method already exist to inspect the CSV file(s) and accurately determine the column type? Before I build my own, I wanted to check if this already existed.

If it doesn't exist already, my idea would be to use Python, CSV module, and psycopg2 module to build a python script that would:

  1. Read the CSV file(s).
  2. Based upon a subset of records (10-100 rows?), iteratively inspect each column of each row to determine the right column type automatically. For example, if row 1, column A held a value of 12345 (int), but row 2, column A held a value of ABC (varchar), the system would decide the column should be varchar(5), based upon the combination of data found in the first two passes. This process could repeat over as many rows as the user felt necessary to determine the likely type and size of the column.
  3. Build the CREATE TABLE query as defined by the column inspection of the CSV.
  4. Execute the create table query.
  5. Load the data into the new table.

Does a tool like this already exist within either SQL, PostgreSQL, or Python, or is there another application I should be using to accomplish this (similar to pgAdmin3)?

asked Nov 05 '12 by RyanKDalton



2 Answers

I have been dealing with something similar, and ended up writing my own module to sniff datatypes by inspecting the source file. There is some wisdom among all the naysayers, but there can also be good reasons to do this, particularly when you have no control over the input data format (e.g. working with government open data), so here are some things I learned in the process:

  1. Even though it's very time consuming, it's worth running through the entire file rather than a small sample of rows. More time is wasted by having a column flagged as numeric that turns out to have text every few thousand rows and therefore fails to import.
  2. If in doubt, fail over to a text type, because it's easier to cast those to numeric or date/time later than to try to recover data that was lost in a bad import.
  3. Check for leading zeroes in what appear otherwise to be integer columns, and import them as text if there are any - this is a common issue with ID / account numbers.
  4. Give yourself some way of manually overriding the automatically detected types for some columns, so that you can blend some semantic awareness with the benefits of automatically typing most of them.
  5. Date/time fields are a nightmare, and in my experience generally require manual processing.
  6. If you ever add data to this table later, don't attempt to repeat the type detection - get the types from the database to ensure consistency.

If you can avoid having to do automatic type detection it's worth avoiding it, but that's not always practical so I hope these tips are of some help.

answered Sep 18 '22 by Eldan Goldenberg


It seems that you need to know the structure up front. Just read the first line to know how many columns you got.

CSV does not carry any type information, so it has to be deduced from the context of data.

Improving on the slightly wrong answer before: you can create a temporary table with the right number of TEXT columns, fill it up with the data, and then process it.

BEGIN;
CREATE TEMPORARY TABLE foo(a TEXT, b TEXT, c TEXT, ...) ON COMMIT DROP;
COPY foo FROM 'file.csv' WITH CSV;
<do the work>
END;

A word of warning: with COPY ... FROM 'file.csv', the file needs to be readable by the PostgreSQL server process itself, which creates some security issues. The other option is to feed it through STDIN.

HTH

answered Sep 17 '22 by GregJaskiewicz