Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL query engine for text files on Linux?

We use grep, cut, sort, uniq, and join at the command line all the time to do data analysis. They work great, although there are shortcomings. For example, you have to give column numbers to each tool. We often have wide files (many columns) and a column header that gives column names. In fact, our files look a lot like SQL tables. I'm sure there is a driver (ODBC?) that will operate on delimited text files, and some query engine that will use that driver, so we could just use SQL queries on our text files. Since doing analysis is usually ad hoc, it would have to be minimal setup to query new files (just use the files I specify in this directory) rather than declaring particular tables in some config.

Practically speaking, what's the easiest? That is, the SQL engine and driver that is easiest to set up and use to apply against text files?

like image 808
mcassano Avatar asked Apr 22 '09 16:04

mcassano


2 Answers

David Malcolm wrote a little tool named "squeal" (formerly "show"), which allows you to use SQL-like command-line syntax to parse text files of various formats, including CSV.

An example on squeal's home page:

$ squeal "count(*)", source from /var/log/messages* group by source order by "count(*)" desc
count(*)|source              |
--------+--------------------+
1633    |kernel              |
1324    |NetworkManager      |
98      |ntpd                |
70      |avahi-daemon        |
63      |dhclient            |
48      |setroubleshoot      |
39      |dnsmasq             |
29      |nm-system-settings  |
27      |bluetoothd          |
14      |/usr/sbin/gpm       |
13      |acpid               |
10      |init                |
9       |pcscd               |
9       |pulseaudio          |
6       |gnome-keyring-ask   |
6       |gnome-keyring-daemon|
6       |gnome-session       |
6       |rsyslogd            |
5       |rpc.statd           |
4       |vpnc                |
3       |gdm-session-worker  |
2       |auditd              |
2       |console-kit-daemon  |
2       |libvirtd            |
2       |rpcbind             |
1       |nm-dispatcher.action|
1       |restorecond         |
like image 120
ephemient Avatar answered Sep 29 '22 05:09

ephemient


q - Run SQL directly on CSV or TSV files:

https://github.com/harelba/q

like image 43
bambam Avatar answered Sep 29 '22 05:09

bambam