I have a (syntactically-valid, not-huge) CSV file. I want to run an arbitrary SQL SELECT query on it; let's assume that it can be a complex query, with some inner queries, with self-joins, maybe with window functions etc.
Now, obviously, I could install a DBMS, run the server process, create a new DB, maybe create an appropriate table, load that CSV data into a table, then finally use a client for the DB to send the query and get the result.
But - that seems like quite a hassle, and I don't need the loaded DB table again for another query.
So, is there an easier way to do this, from a command-line/shell session, hopefully without so much overhead and waste of space?
Notes:
You can use sqlite for this:
sqlite3 :memory: -cmd '.import -csv taxi.csv taxi' \
'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count'
There are many wrappers that try to simplify this (e.g. csvsql --query, dsq or q), but you may find they have limitations of their own (e.g. q doesn't support CTEs, though sqlite obiously does).
duckdb, as an analytic-focused alternative to sqlite, makes this use-case very simple and fast out of the box, while providing some flexibility too:
duckdb -c "SELECT * FROM 'data_source.csv.gz'"
You can run complex SQL queries from data in a CSV file, if you have Java installed in your OS (that's pretty common) and by combining Ant (scripting) and H2 (in-memory database).
For example, if you have the file my_file.csv as:
"name", "sex", "age", "height", "weight"
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 66, 166
"Dave", "M", 39, 72, 167
"Elly", "F", 30, 70, 124
"Luke", "M", 34, 72, 163
"Omar", "M", 38, 69, 145
"Page", "F", 31, 67, 135
"Ruth", "F", 28, 65, 131
Then you can create the script1.xml Ant script:
<project default="sql">
<target name="sql">
<sql driver="org.h2.Driver" url="jdbc:h2:mem:" userid="sa" password=""
classpath="./h2-2.1.214.jar" onerror="abort" print="true"
showheaders="true" showtrailers="true" showWarnings="true" >
create table my_data as select * from csvread('my_file.csv');
select *
from my_data a
join (
select * from my_data where sex = 'M'
) b on b.height < a.height
where a.sex = 'F'
</sql>
</target>
</project>
Download and unzip Apache Ant 1.10.14 (or other version) under the folder apache-ant-1.10.14/. Then download the H2 database jar file in the folder. Your dir should look like:
apache-ant-1.10.14/
h2-2.1.214.jar
my_file.csv
script1.xml
Then, just run it:
apache-ant-1.10.14/bin/ant -f script1.xml
The first SQL statement loads the CSV data in memory. The second one is your "complex" query:
[sql] Executing commands
[sql] NAME,SEX,AGE,HEIGHT,WEIGHT,NAME,SEX,AGE,HEIGHT,WEIGHT
[sql] Elly,F,30,70,124,Bert,M,42,66,166
[sql] Elly,F,30,70,124,Omar,M,38,69,145
[sql] Page,F,31,67,135,Bert,M,42,66,166
[sql]
[sql] 2 of 2 SQL statements executed successfully
BUILD SUCCESSFUL
Total time: 0 seconds
That's it!
Notes:
For a variable query string, you could externalize it in a file and then use <transaction src="command.sql" /> instead of the embedded SQL I included. You can use multiple tags as needed.
For a variable CSV file name (or any parameter value) you can add -Dcsvfile=my_file.csv in the command line and then use it in the script as ${csvfile}. You can define many parameters, as needed.
H2 by definition is not a system-level installed database, but and embeded one. Nevertheless, you can place the JAR file anywhere you want, and then use <classpath> tags to refer to the system classpath, instead of the simpler classpath attrubute in this example. See Ant -> Ant Tasks -> List of Tasks -> Sql and
<classpath> for details.
As a database, H2 is quite powerful. I would qualify its SQL support to medium-high, so it implements most of the common SQL syntax and features you would find in expensive databases.
Note that the SQL query is embedded in the script using XML format. That's why the < symbol is escaped as <. If you don't like this, you can place the query in an external file (no escaping necessary) using the <transaction> tag.
You can also output the result as CSV or other format for further processing or to be consumed by other tools.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With