Simple, fast SQL queries for flat files

Tags: linux, sql, sorting, flat-file

Does anyone know of any tools to provide simple, fast queries of flat files using a SQL-like declarative query language? I'd rather not pay the overhead of loading the file into a DB since the input data is typically thrown out almost immediately after the query is run.

Consider the data file, "animals.txt":

dog 15
cat 20
dog 10
cat 30
dog 5
cat 40

Suppose I want to extract the highest value for each unique animal. I would like to write something like:

cat animals.txt | foo 'select $1, max(convert($2 using decimal)) group by $1'

I can get nearly the same result using sort:

cat animals.txt | sort -t " " -k1,1 -k2,2nr

And I can always drop into awk from there, but this all feels a bit awkward (couldn't resist) when a SQL-like language would seem to solve the problem so cleanly.
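For reference, the awk route mentioned above can be sketched roughly as follows. This is only a minimal illustration, assuming space-separated input with a numeric second field; the function name `max_per_group` is made up for this example and is not part of any standard tool.

```shell
# max_per_group FILE (hypothetical helper): print "key max" for each
# distinct value of the first field, sorted by key.
max_per_group() {
  awk '
    # Skip lines with fewer than two fields; otherwise remember the
    # largest numeric second field seen so far for each first field.
    NF >= 2 && (!($1 in max) || $2 + 0 > max[$1]) { max[$1] = $2 + 0 }
    # awk does not guarantee array iteration order, so the output is
    # piped through sort afterwards.
    END { for (k in max) print k, max[k] }
  ' "$1" | sort
}
```

Calling `max_per_group animals.txt` on the sample data should print `cat 40` followed by `dog 15`. It works, but the grouping and aggregation logic is spelled out by hand, which is exactly what a declarative query would avoid.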

I've considered writing a wrapper for SQLite that would automatically create a table based on the input data, and I've looked into using Hive in single-processor mode, but I can't help but feel this problem has been solved before. Am I missing something? Is this functionality already implemented by another standard tool?
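A rough sketch of what such a SQLite wrapper might look like, assuming the `sqlite3` command-line shell is installed. The function name `sqlite_max_per_group`, the table name `animals`, and the column names `name` and `value` are all invented for this illustration, and the schema is hard-coded rather than inferred from the input.

```shell
# sqlite_max_per_group FILE (hypothetical helper): load a space-separated
# file into a throwaway in-memory SQLite table and query it with SQL.
sqlite_max_per_group() {
  sqlite3 :memory: <<EOF
-- Assumed schema: first field is text, second field is an integer.
CREATE TABLE animals (name TEXT, value INTEGER);
-- Use a single space as the column separator for import and output.
.separator " "
.import "$1" animals
SELECT name, max(value) FROM animals GROUP BY name ORDER BY name;
EOF
}
```

Because the database lives in `:memory:`, nothing is left on disk once the query finishes, although the whole file is still loaded before the query runs.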

Halp!

asked Feb 17 '10 02:02

plinehan

1 Answer

I wrote TxtSushi mostly to do SQL selects on flat files. Here is the command chain for your example (all of these commands are from TxtSushi):

tabtocsv animals.txt | namecolumns - | tssql -table animals - \
'select col1, max(as_int(col2)) from animals group by col1'

The namecolumns step is only required because animals.txt doesn't have a header row. You can get a quick sense of what is possible by looking through the example scripts. There are also links to similar tools at the bottom of the main page.

answered Sep 20 '22 05:09

Keith
