We use grep, cut, sort, uniq, and join at the command line all the time for data analysis. They work great, but they have shortcomings. For example, you have to give column numbers to each tool, while our files are often wide (many columns) with a header row that gives column names. In fact, our files look a lot like SQL tables. I'm sure there is a driver (ODBC?) that can operate on delimited text files, and some query engine that can use that driver, so we could just run SQL queries against our text files. Since the analysis is usually ad hoc, the setup for querying new files would have to be minimal (just use the files I specify in this directory) rather than declaring particular tables in some config.
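For instance, counting distinct values in the third column of a tab-separated file currently looks something like this (the column number and file name are just placeholders):

$ cut -f3 data.tsv | sort | uniq -c | sort -rn

What I'd like is to write the equivalent GROUP BY query against the same file using its column names.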
Practically speaking, what's the easiest option? That is, which SQL engine and driver is simplest to set up and point at plain text files?
David Malcolm wrote a little tool named "squeal" (formerly "show"), which allows you to use SQL-like command-line syntax to parse text files of various formats, including CSV.
An example from squeal's home page:
$ squeal "count(*)", source from /var/log/messages* group by source order by "count(*)" desc
count(*)|source |
--------+--------------------+
1633 |kernel |
1324 |NetworkManager |
98 |ntpd |
70 |avahi-daemon |
63 |dhclient |
48 |setroubleshoot |
39 |dnsmasq |
29 |nm-system-settings |
27 |bluetoothd |
14 |/usr/sbin/gpm |
13 |acpid |
10 |init |
9 |pcscd |
9 |pulseaudio |
6 |gnome-keyring-ask |
6 |gnome-keyring-daemon|
6 |gnome-session |
6 |rsyslogd |
5 |rpc.statd |
4 |vpnc |
3 |gdm-session-worker |
2 |auditd |
2 |console-kit-daemon |
2 |libvirtd |
2 |rpcbind |
1 |nm-dispatcher.action|
1 |restorecond |
Another option is q, which runs SQL directly on CSV or TSV files:
https://github.com/harelba/q
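As a rough sketch of what that looks like (the file and column names here are just illustrative, not from the question): q treats the file name as the table name, -H tells it the first line is a header so columns can be referenced by name, and -d sets the field delimiter (it defaults to whitespace).

$ q -H -d , "SELECT city, COUNT(*) FROM ./people.csv GROUP BY city ORDER BY COUNT(*) DESC"

That gives you the cut | sort | uniq -c style aggregation, but with named columns and no per-table configuration.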