Omitting columns when importing CSV into SQLite

Imagine that you have the following data in a CSV:

Name, Age, Gender
Jake, 40, M
Bill, 17, M
Suzie, 21, F

Is it possible to exclude the Age column when importing the above CSV? My current approach is simply to use the cut shell command.

Update

iluvcapra has a great answer for small CSVs. However, for very large CSVs this approach is inefficient. For example, imagine that the above CSV was very large, 30 GB let's say. Loading all that Age data only to immediately discard it is a waste of time. With this in mind, is there a more efficient way to load a subset of columns into an SQLite database?

I suspect that the best option is to use the shell command cut to cull the unnecessary columns. Is that intuition correct? Is it common to use shell commands to pre-process CSV files into more SQLite-friendly versions, along the lines of the sketch below?
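For concreteness, the kind of pre-processing I have in mind (a sketch; data.csv, demo.db, and the people table are placeholders, and it assumes no quoted fields containing commas):

cut -d, -f1,3 data.csv > tmp.csv
cat << EOF | sqlite3 demo.db
CREATE TABLE people (name text, gender text);
.mode csv
.import tmp.csv people
EOF

Note that .import will also load the header row as data on older sqlite3 shells; newer versions accept .import --csv --skip 1.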

asked Aug 05 '15 by Jacob H

2 Answers

Create a temporary table with the age column, and then use an INSERT... SELECT to move the data from the temporary table into your main one:

CREATE TEMP TABLE _csv_import (name text, age integer, gender text);
.separator ","
.import file.csv _csv_import

INSERT INTO names_genders (name, gender)
    SELECT name, gender FROM _csv_import;
DROP TABLE _csv_import;
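Run end to end, the whole import can be piped into the sqlite3 shell in one go (a sketch; file.csv, demo.db, and the names_genders schema are assumptions):

cat << EOF | sqlite3 demo.db
CREATE TABLE IF NOT EXISTS names_genders (name text, gender text);
CREATE TEMP TABLE _csv_import (name text, age integer, gender text);
.separator ","
.import file.csv _csv_import
INSERT INTO names_genders (name, gender) SELECT name, gender FROM _csv_import;
DROP TABLE _csv_import;
EOF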

EDIT: Alternatively, create a view with a phantom age column and import into that:

CREATE VIEW names_ages_genders AS
    SELECT name, 0 AS age, gender FROM names_genders;
CREATE TRIGGER lose_age
    INSTEAD OF INSERT ON names_ages_genders
    BEGIN
        INSERT INTO names_genders (name, gender)
            VALUES (NEW.name, NEW.gender);
    END;

This creates a view called names_ages_genders that reports everybody as zero years old, and silently drops the age field from any INSERT statement run against it. Not tested! (I'm actually not sure .import can import into views.)
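A quick way to sanity-check the trigger itself from the shell (a sketch; demo.db is a placeholder, and it restates the schema above so it runs standalone):

cat << EOF | sqlite3 demo.db
CREATE TABLE names_genders (name text, gender text);
CREATE VIEW names_ages_genders AS
    SELECT name, 0 AS age, gender FROM names_genders;
CREATE TRIGGER lose_age
    INSTEAD OF INSERT ON names_ages_genders
    BEGIN
        INSERT INTO names_genders (name, gender) VALUES (NEW.name, NEW.gender);
    END;
INSERT INTO names_ages_genders (name, age, gender) VALUES ('Jake', 40, 'M');
SELECT * FROM names_ages_genders;
EOF

This should print Jake|0|M: the row is stored without the age, and the view reports the phantom zero.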

answered by iluvcapra


If you wish to avoid reading more than necessary into SQLite, and if you wish to avoid the hazards of using standard text-processing tools (such as cut and awk) on CSV files, one possibility would be to use your favorite csv2tsv converter (*) along the following lines:

csv2tsv input.csv | cut -f 1,3- > tmp.tsv

cat << EOF | sqlite3 demo.db
drop table if exists demo;
.mode csv
.separator "\t"
.import tmp.tsv demo
EOF

/bin/rm tmp.tsv

Note, though, that if input.csv contains literal tabs, newlines, or escaped double quotes, whether the above has the desired effect will depend on the csv2tsv converter used.

(*) csv2tsv

In case you don't have ready access to a suitable csv2tsv converter, here is a simple python3 script that does the job, handling embedded literal newlines and tabs, as well as the two-character sequences "\t" and "\n", in the CSV:

#!/usr/bin/env python3

# Convert CSV to TSV, taking care of embedded tabs and newlines in the CSV.

import csv, re, sys

if len(sys.argv) > 3 or (len(sys.argv) > 1 and sys.argv[1] == '--help'):
    sys.exit("Usage: " + sys.argv[0] + " [input.csv [output.tsv]]")

csv.field_size_limit(sys.maxsize)

# Output: second argument if given, otherwise stdout.
if len(sys.argv) == 3:
    out = open(sys.argv[2], 'w')
else:
    out = sys.stdout

# Input: first argument if given, otherwise stdin.
if len(sys.argv) == 1:
    csvfile = sys.stdin
else:
    csvfile = open(sys.argv[1])

# First escape any pre-existing two-character "\t" and "\n" sequences,
# then encode literal tabs and newlines as "\t" and "\n" so the output
# stays one record per line.
def edit(s):
    s = re.sub(r'\\t', r'\\\\t', s)
    s = re.sub(r'\\n', r'\\\\n', s)
    s = re.sub('\t', r'\\t', s)
    return re.sub('\n', r'\\n', s)

reader = csv.reader(csvfile, dialect='excel')
for row in reader:
    print('\t'.join(edit(s) for s in row), file=out)
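Assuming the script is saved as csv2tsv.py (the name is a placeholder), it can be run with explicit file arguments or used as a filter in the pipeline shown at the top of this answer:

python3 csv2tsv.py input.csv output.tsv
python3 csv2tsv.py < input.csv | cut -f 1,3- > tmp.tsv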

answered by peak