List comprehensions are very good, but some kind of "... join ..." would be very useful. Thanks. For example, I have a set A = {1, 0} and a list B = [[1,1],[2,3]]. I would like to find all rows in B where the second column is one of the values in A. Or, more generally: I have 2 CSV files and want to find all the rows where the values of some column in the two files match, just like some kind of 'join' of the two files. One of the files is gigabytes in size. sqldf is "SQL select on R data frames."
Pandasql is a Python library that allows manipulation of a pandas DataFrame using SQL. Under the hood, pandasql creates an SQLite table from the DataFrame of interest and allows users to query that SQLite table using SQL.
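The mechanism described above can be illustrated with the standard-library sqlite3 module alone. This is only a sketch of the idea (load the rows into an in-memory SQLite table, then query them with SQL), not pandasql's actual code. The table names a and b and the column names val, col1 and col2 are made up for illustration; the data is the set and list from the question.

```python
import sqlite3

# Data from the question: a set of allowed values and a list of rows.
a_values = {1, 0}
b_rows = [[1, 1], [2, 3]]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE a (val INTEGER)")
conn.execute("CREATE TABLE b (col1 INTEGER, col2 INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(v,) for v in a_values])
conn.executemany("INSERT INTO b VALUES (?, ?)", b_rows)

# Keep the rows of b whose second column appears in a.
matches = conn.execute(
    "SELECT b.col1, b.col2 FROM b JOIN a ON b.col2 = a.val ORDER BY b.col1"
).fetchall()
conn.close()
print(matches)
```

The same pattern works for rows read from CSV files, and an on-disk database can be used instead of ":memory:" when the data does not fit in RAM.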
Use pandasql to Run SQL Queries in Python
We will import the sqldf function from the pandasql module to run a query. Then we will call sqldf, which takes two arguments. The first argument is a SQL query in string format. The second argument is a dictionary of session/environment variables (globals() or locals()).
You can use pandasql, which allows for SQL style querying of pandas DataFrames. It's very similar to sqldf.
https://github.com/yhat/pandasql/
(full disclosure: I wrote it)
EDIT: blog post documenting some of the features found here: http://blog.yhathq.com/posts/pandasql-sql-for-pandas-dataframes.html
I'm unaware of a library doing what you ask (but I only glanced at the sqldf documentation). However, nothing you asked for really requires a library: these are one-liners in Python (and you could of course abstract the functionality into a function rather than a simple list comprehension...)
Set A= {1,0}, a list B = [[1,1],[2,3]]. I would like to find all rows in B where the second column is one of the values in A.
>>> a = set([1, 0])
>>> b = [[1,1],[2,3]]
>>> [l for l in b if l[1] in a]
[[1, 1]]
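If you prefer a reusable function over the inline comprehension, something like the following would do. It is a minimal sketch; the name rows_where_column_in and its parameters are invented here, not taken from any library.

```python
def rows_where_column_in(rows, column, allowed):
    """Yield each row whose value at index `column` is in `allowed`."""
    allowed = set(allowed)  # set membership tests are O(1) on average
    for row in rows:
        if row[column] in allowed:
            yield row

a = {1, 0}
b = [[1, 1], [2, 3]]
result = list(rows_where_column_in(b, 1, a))
print(result)
```

Because it is a generator function, it also works on an iterator of rows without building an intermediate list.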
I have 2 CSV files. I want to find out all the rows where the values of some column from the two files match.
>>> f1 = [[1, 2, 3], [4, 5, 6]]
>>> f2 = [[0, 2, 8], [7, 7, 7]]
>>> [tuple_ for tuple_ in zip(f1, f2) if tuple_[0][1] == tuple_[1][1]]
[([1, 2, 3], [0, 2, 8])]
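Note that zip pairs rows by position, so the comprehension above only finds matches that sit on the same line number in both files. A join on values, as in SQL, is usually done by indexing the smaller file in a dict and streaming the larger one past it. The following is a minimal sketch under the assumption that the smaller file fits in memory; the function name join_csv, the sample data, and the use of io.StringIO in place of real open files are all illustrative.

```python
import csv
import io
from collections import defaultdict

def join_csv(small_file, big_file, small_col, big_col):
    """Yield (small_row, big_row) pairs whose key columns are equal."""
    # Index the smaller file by its key column (held fully in memory).
    index = defaultdict(list)
    for row in csv.reader(small_file):
        index[row[small_col]].append(row)
    # Stream the larger file one row at a time; it is never fully loaded.
    for row in csv.reader(big_file):
        for match in index.get(row[big_col], ()):
            yield match, row

# Stand-ins for open(...) file objects; csv yields every field as a string.
small = io.StringIO("1,2,3\n4,5,6\n")
big = io.StringIO("7,7,7\n0,2,8\n9,5,1\n")
pairs = list(join_csv(small, big, 1, 1))
print(pairs)
```

With real files you would pass objects opened with open(path, newline="") and consume the generator row by row instead of calling list() on it.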
EDIT: If memory usage is a problem you should use generators instead of lists. For example, in Python 2 zip builds the whole list at once:
>>> zip(f1, f2)
[([1, 2, 3], [0, 2, 8]), ([4, 5, 6], [7, 7, 7])]
but using generators:
>>> import itertools as it
>>> gen = it.izip(f1, f2)
>>> gen
<itertools.izip object at 0x1f24ab8>
>>> next(gen)
([1, 2, 3], [0, 2, 8])
>>> next(gen)
([4, 5, 6], [7, 7, 7])
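The session above is Python 2. In Python 3, itertools.izip no longer exists and the built-in zip is already lazy, so the equivalent would look roughly like this (same f1 and f2 as above):

```python
f1 = [[1, 2, 3], [4, 5, 6]]
f2 = [[0, 2, 8], [7, 7, 7]]

gen = zip(f1, f2)   # Python 3: a lazy iterator, no list is built
first = next(gen)   # pairs are produced one at a time, on demand
second = next(gen)
is_lazy = not isinstance(zip(f1, f2), list)
print(first, second, is_lazy)
```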
And for the data source:
>>> [line for line in f1]
[[1, 2, 3], [4, 5, 6]]
translates to a generator expression as:
>>> gen = (line for line in f1)
>>> gen
<generator object <genexpr> at 0x1f159b0>
>>> next(gen)
[1, 2, 3]
>>> next(gen)
[4, 5, 6]
Before you can get the functionality of sqldf you need the functionality of 'df', i.e. data frames. Python has a cuddly version, pandas:
http://pandas.sourceforge.net/
Perhaps the section on joining and merging will help:
http://pandas.sourceforge.net/merging.html
I recommend you start with something smaller than your gigabyte files though!
There is a package available now which does exactly this! Check the link below:
pysqldf => https://pypi.org/project/pysqldf/
This package will allow you to query pandas DataFrames using SQL, just like sqldf did in R.