How to generate a schema from a CSV for a PostgreSQL Copy

Given a CSV with several dozen or more columns, how can a 'schema' be created that can be used in a CREATE TABLE SQL expression in PostgreSQL for use with the COPY tool?

I see plenty of examples for the COPY tool, and basic CREATE TABLE expressions, but nothing goes into detail about cases when you have a potentially prohibitive number of columns for manual creation of the schema.

asked Feb 06 '16 by DPSSpatial


2 Answers

If the CSV is not excessively large and is available on your local machine, then csvkit is the simplest solution. It also contains a number of other utilities for working with CSVs, so it is a useful tool to know in general.
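
csvkit is a Python package, so if it is not already on your machine it can typically be installed with pip:

$ pip install csvkit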

At its simplest typing into the shell:

$ csvsql myfile.csv 

will print out the required CREATE TABLE SQL command, which can be saved to a file using output redirection.
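
For example, a possible workflow looks like the following (the file name create_myfile.sql and the $MY_DB_URI connection string are only illustrative; by default csvsql names the table after the file minus its extension, here myfile):

$ csvsql -i postgresql myfile.csv > create_myfile.sql
$ psql "$MY_DB_URI" -f create_myfile.sql
$ psql "$MY_DB_URI" -c "\copy myfile from 'myfile.csv' with (format csv, header)"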

If you also provide a connection string csvsql will create the table and upload the file in one go:

$ csvsql --db "$MY_DB_URI" --insert myfile.csv 

There are also options to specify the flavor of SQL and CSV you are working with. They are documented in the builtin help:

$ csvsql -h
usage: csvsql [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u {0,1,2,3}] [-b]
              [-p ESCAPECHAR] [-z MAXFIELDSIZE] [-e ENCODING] [-S] [-H] [-v]
              [--zero] [-y SNIFFLIMIT]
              [-i {access,sybase,sqlite,informix,firebird,mysql,oracle,maxdb,postgresql,mssql}]
              [--db CONNECTION_STRING] [--query QUERY] [--insert]
              [--tables TABLE_NAMES] [--no-constraints] [--no-create]
              [--blanks] [--no-inference] [--db-schema DB_SCHEMA]
              [FILE [FILE ...]]

Generate SQL statements for one or more CSV files, create execute those
statements directly on a database, and execute one or more SQL queries.

positional arguments:
  FILE                  The CSV file(s) to operate on. If omitted, will accept
                        input on STDIN.

optional arguments:
  -h, --help            show this help message and exit
  -d DELIMITER, --delimiter DELIMITER
                        Delimiting character of the input CSV file.
  -t, --tabs            Specifies that the input CSV file is delimited with
                        tabs. Overrides "-d".
  -q QUOTECHAR, --quotechar QUOTECHAR
                        Character used to quote strings in the input CSV file.
  -u {0,1,2,3}, --quoting {0,1,2,3}
                        Quoting style used in the input CSV file. 0 = Quote
                        Minimal, 1 = Quote All, 2 = Quote Non-numeric, 3 =
                        Quote None.
  -b, --doublequote     Whether or not double quotes are doubled in the input
                        CSV file.
  -p ESCAPECHAR, --escapechar ESCAPECHAR
                        Character used to escape the delimiter if --quoting 3
                        ("Quote None") is specified and to escape the
                        QUOTECHAR if --doublequote is not specified.
  -z MAXFIELDSIZE, --maxfieldsize MAXFIELDSIZE
                        Maximum length of a single field in the input CSV
                        file.
  -e ENCODING, --encoding ENCODING
                        Specify the encoding the input CSV file.
  -S, --skipinitialspace
                        Ignore whitespace immediately following the delimiter.
  -H, --no-header-row   Specifies that the input CSV file has no header row.
                        Will create default headers.
  -v, --verbose         Print detailed tracebacks when errors occur.
  --zero                When interpreting or displaying column numbers, use
                        zero-based numbering instead of the default 1-based
                        numbering.
  -y SNIFFLIMIT, --snifflimit SNIFFLIMIT
                        Limit CSV dialect sniffing to the specified number of
                        bytes. Specify "0" to disable sniffing entirely.
  -i {access,sybase,sqlite,informix,firebird,mysql,oracle,maxdb,postgresql,mssql}, --dialect {access,sybase,sqlite,informix,firebird,mysql,oracle,maxdb,postgresql,mssql}
                        Dialect of SQL to generate. Only valid when --db is
                        not specified.
  --db CONNECTION_STRING
                        If present, a sqlalchemy connection string to use to
                        directly execute generated SQL on a database.
  --query QUERY         Execute one or more SQL queries delimited by ";" and
                        output the result of the last query as CSV.
  --insert              In addition to creating the table, also insert the
                        data into the table. Only valid when --db is
                        specified.
  --tables TABLE_NAMES  Specify one or more names for the tables to be
                        created. If omitted, the filename (minus extension) or
                        "stdin" will be used.
  --no-constraints      Generate a schema without length limits or null
                        checks. Useful when sampling big tables.
  --no-create           Skip creating a table. Only valid when --insert is
                        specified.
  --blanks              Do not coerce empty strings to NULL values.
  --no-inference        Disable type inference when parsing the input.
  --db-schema DB_SCHEMA
                        Optional name of database schema to create table(s)
                        in.

Several other tools also do schema inference, including:

  • Apache Spark
  • Pandas (Python)
  • Blaze (Python)
  • read.csv + your favorite db package in R

Each of these has functionality to read a CSV (and other formats) into a tabular data structure, usually called a DataFrame or similar, inferring the column types in the process. They then have other commands to either write out an equivalent SQL schema or upload the DataFrame directly into a specified database. The choice of tool will depend on the volume of data, how it is stored, the idiosyncrasies of your CSV, the target database and the language you prefer to work in.
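
As an illustration of the Pandas route, here is a minimal sketch. It assumes the same myfile.csv and that the MY_DB_URI environment variable holds a SQLAlchemy connection string; pandas.io.sql.get_schema is a lightly documented helper, so treat this as a sketch rather than a guaranteed API:

import os
import pandas as pd
from sqlalchemy import create_engine

# Read the CSV into a DataFrame; column types are inferred while parsing.
df = pd.read_csv("myfile.csv")

engine = create_engine(os.environ["MY_DB_URI"])

# Print the CREATE TABLE statement derived from the inferred dtypes...
print(pd.io.sql.get_schema(df, "myfile", con=engine))

# ...or create the table and insert the rows in one step.
df.to_sql("myfile", engine, index=False, if_exists="fail")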

answered Sep 28 '22 by Daniel Mahler

Basically you should prepare the data (including its structure) outside the database, with ready-made tools or using Python, Ruby or a language of your choice. However, if such tools are not available, you can do quite a lot with plpgsql.

Creating a table with text columns

Files in CSV format do not contain any information about column types, primary or foreign keys, etc. You can relatively easily create a table with text columns and copy the data into it. After that you should manually alter the column types and add constraints.

create or replace function import_csv(csv_file text, table_name text)
returns void language plpgsql as $$
begin
    -- load the raw file into a one-column temporary table
    create temp table import (line text) on commit drop;
    execute format('copy import from %L', csv_file);

    -- build the CREATE TABLE statement from the header line,
    -- turning every column name into a text column
    execute format('create table %I (%s);', 
        table_name, concat(replace(line, ',', ' text, '), ' text'))
    from import limit 1;

    -- copy the file again, this time into the newly created table
    execute format('copy %I from %L (format csv, header)', table_name, csv_file);
end $$;

Example data in the file c:\data\test.csv:

id,a_text,a_date,a_timestamp,an_array
1,str 1,2016-08-01,2016-08-01 10:10:10,"{1,2}"
2,str 2,2016-08-02,2016-08-02 10:10:10,"{1,2,3}"
3,str 3,2016-08-03,2016-08-03 10:10:10,"{1,2,3,4}"

Import:

select import_csv('c:\data\test.csv', 'new_table');

select * from new_table;

 id | a_text |   a_date   |     a_timestamp     | an_array  
----+--------+------------+---------------------+-----------
 1  | str 1  | 2016-08-01 | 2016-08-01 10:10:10 | {1,2}
 2  | str 2  | 2016-08-02 | 2016-08-02 10:10:10 | {1,2,3}
 3  | str 3  | 2016-08-03 | 2016-08-03 10:10:10 | {1,2,3,4}
(3 rows)

Large csv files

The above function imports the data twice (into the temporary table and then into the target table). For large files this may waste considerable time and put unnecessary load on the server. A solution is to split the CSV file into two files, one with the header and one with the data. Then the function should look like this:

create or replace function import_csv(header_file text, data_file text, table_name text)
returns void language plpgsql as $$
begin
    create temp table import (line text) on commit drop;
    execute format('copy import from %L', header_file);

    execute format('create table %I (%s);', 
        table_name, concat(replace(line, ',', ' text, '), ' text'))
    from import;

    execute format('copy %I from %L (format csv)', table_name, data_file);
end $$;
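
The header/data split itself is easy to do outside the database. A minimal sketch for a Unix-like shell follows (the file and table names are only illustrative; because copy ... from 'file' reads from the server's filesystem, the files must be accessible to the database server):

$ head -n 1 /data/big.csv > /data/header.csv
$ tail -n +2 /data/big.csv > /data/data.csv

select import_csv('/data/header.csv', '/data/data.csv', 'big_table');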

Altering column types

You can try to change the column types automatically based on their content. This can succeed if you are dealing with simple types and the data in the file consistently follows a specific format. In general, however, this is a complex task, and the functions listed below should be considered only as an example.

Determine a column type based on its content (edit the function to add desired conversions):

create or replace function column_type(val text) returns text language sql as $$
    select 
        case 
            when val ~ '^[\+-]{0,1}\d+$' then 'integer'
            when val ~ '^[\+-]{0,1}\d*\.\d+$' then 'numeric'
            when val ~ '^\d\d\d\d-\d\d-\d\d$' then 'date'
            when val ~ '^\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d$' then 'timestamp'
        end
$$;
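
A quick check of the inference (a value that matches none of the patterns yields NULL, so such a column will simply be left as text):

select column_type('42'), column_type('3.14'), column_type('2016-08-01'), column_type('foo');

 column_type | column_type | column_type | column_type 
-------------+-------------+-------------+-------------
 integer     | numeric     | date        | 
(1 row)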

Alter column types using the above function:

create or replace function alter_column_types(table_name text)
returns void language plpgsql as $$
declare
    rec record;
    qry text;
begin
    -- take the first row of the table as json and infer a type for each column
    for rec in
        execute format(
            'select key, column_type(value) ctype
            from (
                select row_to_json(t) a_row 
                from %I t 
                limit 1
            ) s, json_each_text (a_row)',
            table_name)
    loop
        -- append one ALTER TABLE ... ALTER COLUMN clause per recognized type
        if rec.ctype is not null then
            qry:= format(
                '%salter table %I alter %I type %s using %s::%s;', 
                qry, table_name, rec.key, rec.ctype, rec.key, rec.ctype);
        end if;
    end loop;
    execute(qry);
end $$;

Use:

select alter_column_types('new_table');

\d new_table
               Table "public.new_table"
   Column    |            Type             | Modifiers 
-------------+-----------------------------+-----------
 id          | integer                     | 
 a_text      | text                        | 
 a_date      | date                        | 
 a_timestamp | timestamp without time zone | 
 an_array    | text                        | 

(well, proper recognition of array types is quite complicated)

answered Sep 28 '22 by klin