Does anyone know of any tools to provide simple, fast queries of flat files using a SQL-like declarative query language? I'd rather not pay the overhead of loading the file into a DB since the input data is typically thrown out almost immediately after the query is run.
Consider the data file, "animals.txt":
dog 15
cat 20
dog 10
cat 30
dog 5
cat 40
Suppose I want to extract the highest value for each unique animal. I would like to write something like:
cat animals.txt | foo "select $1, max(convert($2 using decimal)) group by $1"
I can get nearly the same result using sort:
cat animals.txt | sort -t " " -k1,1 -k2,2nr
And I can always drop into awk from there, but this all feels a bit awkward (couldn't resist) when a SQL-like language would seem to solve the problem so cleanly.
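Completing that pipeline by hand looks something like the sketch below (either dedupe the sorted output, or skip sort and track a running maximum; note that awk's for-in iteration order is unspecified):
# keep only the first (i.e. highest-value) line per animal after the descending sort
sort -t " " -k1,1 -k2,2nr animals.txt | awk '!seen[$1]++'
# or do the group-by max directly in awk, no sort needed
awk '!($1 in max) || $2+0 > max[$1]+0 { max[$1] = $2 } END { for (a in max) print a, max[a] }' animals.txt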
I've considered writing a wrapper for SQLite that would automatically create a table based on the input data, and I've looked into using Hive in single-processor mode, but I can't help but feel this problem has been solved before. Am I missing something? Is this functionality already implemented by another standard tool?
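Rolling the SQLite approach by hand for this one file would look something like the sketch below (table and column names are invented, since the file has no header row, and .separator/.import behavior varies slightly across sqlite3 versions); this per-file boilerplate is exactly what I'd like a tool to generate for me:
sqlite3 :memory: <<'EOF'
-- column names are made up; animals.txt has no header row
CREATE TABLE animals (animal TEXT, value INTEGER);
.separator " "
.import animals.txt animals
SELECT animal, MAX(value) FROM animals GROUP BY animal;
EOF
Using :memory: at least matches the "throw the data away afterwards" requirement, since nothing is written to disk.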
Halp!
I wrote TxtSushi mostly to do SQL selects on flat files. Here is the command chain for your example (all of these commands are from TxtSushi):
tabtocsv animals.txt | namecolumns - | tssql -table animals - \
'select col1, max(as_int(col2)) from animals group by col1'
namecolumns is only required because animals.txt doesn't have a header row. You can get a quick sense of what is possible by looking through the example scripts. There are also links to similar tools on the bottom of the main page.