
Is there a module in Python that does something like "sqldf" for R?

List comprehensions are very good, but some kind of "... join ..." would be very useful. For example, I have a set A = {1, 0} and a list B = [[1,1],[2,3]]. I would like to find all rows in B where the second column is one of the values in A. Or, more generally, I have two CSV files and want to find all the rows where the values of some column match between the two files, like a 'join' of the two files. One of the files is gigabytes in size. (sqldf is "SQL select on R data frames.") Thanks.

gstar2002 asked Dec 24 '11 22:12


People also ask

Can I write SQL queries in pandas?

Pandasql is a Python library that allows manipulation of a pandas DataFrame using SQL. Under the hood, pandasql creates an SQLite table from the DataFrame of interest and allows users to query that SQLite table using SQL.
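The idea described above (load the rows into SQLite, then run ordinary SQL against them) can be sketched with nothing but the standard library. This only illustrates the approach and is not pandasql's actual code; the table names `a` and `b` and the column names are made up, and the data comes from the question.

```python
import sqlite3

# Data from the question: a set A and a list of rows B.
A = {1, 0}
B = [[1, 1], [2, 3]]

# An in-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (value INTEGER)")
conn.execute("CREATE TABLE b (col1 INTEGER, col2 INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(v,) for v in A])
conn.executemany("INSERT INTO b VALUES (?, ?)", B)

# Rows of b whose second column matches a value stored in a.
matches = conn.execute(
    "SELECT b.col1, b.col2 FROM b JOIN a ON b.col2 = a.value"
).fetchall()
conn.close()
print(matches)
```

SQLite returns each row as a tuple, so the matching row [1, 1] comes back as (1, 1).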

How do I run a SQL query on a Dataframe in Python?

Use pandasql to run SQL queries in Python: import the sqldf function from the pandasql module, then call it with two arguments. The first argument is a SQL query in string format. The second argument is a set of session/environment variables (globals() or locals()).


4 Answers

You can use pandasql, which allows for SQL style querying of pandas DataFrames. It's very similar to sqldf.

https://github.com/yhat/pandasql/

(full disclosure: I wrote it)

EDIT: a blog post documenting some of the features can be found here: http://blog.yhathq.com/posts/pandasql-sql-for-pandas-dataframes.html

Greg answered Oct 24 '22 09:10


I'm unaware of a library doing what you ask (though I only glanced at the sqldf documentation). However, nothing you asked for really requires a library: each task is a one-liner in Python, and you could of course wrap the logic in a function rather than using a bare list comprehension.

Set A= {1,0}, a list B = [[1,1],[2,3]]. I would like to find all rows in B where the second column is one of the values in A.

>>> a = {1, 0}
>>> b = [[1, 1], [2, 3]]
>>> [row for row in b if row[1] in a]
[[1, 1]]
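Wrapped in a function, as suggested above, the same filter might look like this. The name `rows_where_column_in` and its parameters are made up for illustration.

```python
def rows_where_column_in(rows, column, allowed):
    """Return the rows whose value at index `column` is a member of `allowed`."""
    allowed = set(allowed)  # set membership tests are O(1) on average
    return [row for row in rows if row[column] in allowed]


a = {1, 0}
b = [[1, 1], [2, 3]]
result = rows_where_column_in(b, 1, a)
print(result)
```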

I have 2 CSV files. I want to find out all the rows where the values of some column from the two files match.

>>> f1 = [[1, 2, 3], [4, 5, 6]]
>>> f2 = [[0, 2, 8], [7, 7, 7]]
>>> [tuple_ for tuple_ in zip(f1, f2) if tuple_[0][1] == tuple_[1][1]]
[([1, 2, 3], [0, 2, 8])]
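Note that zip only pairs up rows that sit at the same position in both lists, so a match on a different line is missed. For a SQL-style inner join on a column, one option is to index one table by its key and look up each row of the other. This is a sketch rather than a library function: `inner_join` and its parameters are hypothetical names.

```python
from collections import defaultdict


def inner_join(left, right, left_col, right_col):
    """Pair every row of `left` with every row of `right` whose key column matches."""
    # Index the right-hand rows by their key so each lookup is a dict access.
    index = defaultdict(list)
    for row in right:
        index[row[right_col]].append(row)
    # Emit one pair per match, regardless of where the rows sit in either list.
    return [
        (l_row, r_row)
        for l_row in left
        for r_row in index.get(l_row[left_col], [])
    ]


f1 = [[1, 2, 3], [4, 5, 6]]
f2 = [[7, 5, 7], [0, 2, 8]]  # same key values as f1, but in a different order
joined = inner_join(f1, f2, 1, 1)
print(joined)
```

Both rows of f1 find their partner here even though the matching rows of f2 are in the opposite order, which the zip version would not catch.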

EDIT: If memory usage is a problem you should use generators instead of lists. For example, this builds every pair in memory at once:

>>> list(zip(f1, f2))
[([1, 2, 3], [0, 2, 8]), ([4, 5, 6], [7, 7, 7])]

but iterating lazily produces one pair at a time (in Python 3 zip is already lazy; in Python 2 use itertools.izip instead):

>>> gen = zip(f1, f2)
>>> next(gen)
([1, 2, 3], [0, 2, 8])
>>> next(gen)
([4, 5, 6], [7, 7, 7])

And for the data source:

>>> [line for line in f1]
[[1, 2, 3], [4, 5, 6]]

translates to a generator as:

>>> gen = (line for line in f1)
>>> gen
<generator object <genexpr> at 0x1f159b0>
>>> next(gen)
[1, 2, 3]
>>> next(gen)
[4, 5, 6]
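
Applied to the original question (two CSV files, one of them gigabytes in size), the same lazy approach might look like the sketch below. It assumes the smaller file fits in memory. The function name `join_csv`, the file contents, and the column positions are made up, and io.StringIO stands in for real files.

```python
import csv
import io


def join_csv(small_file, large_file, small_col, large_col):
    """Lazily yield (large_row, small_row) pairs whose key columns match."""
    # Load only the smaller file into memory, indexed by its key column.
    index = {}
    for row in csv.reader(small_file):
        index.setdefault(row[small_col], []).append(row)
    # Stream the large file one row at a time; it is never held in memory whole.
    for row in csv.reader(large_file):
        for match in index.get(row[large_col], []):
            yield row, match


# io.StringIO stands in for files opened with open(path, newline="").
small = io.StringIO("0,2,8\n7,7,7\n")
large = io.StringIO("1,2,3\n4,5,6\n9,7,1\n")
pairs = join_csv(small, large, 1, 1)
first = next(pairs)
rest = list(pairs)
print(first, rest)
```

csv.reader yields every field as a string, so the keys are compared as text; convert them explicitly if numeric comparison matters.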

mac answered Oct 24 '22 10:10


Before you can get the functionality of sqldf you need the functionality of 'df', i.e. data frames. Python has a cuddly version, pandas:

http://pandas.sourceforge.net/

Perhaps the section on joining and merging will help:

http://pandas.sourceforge.net/merging.html

I recommend you start with something smaller than your gigabyte files though!
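As a small starting point, a join on a shared column in pandas might look like the sketch below. The frames, the column names (`key`, `left_val`, `right_val`), and the values are made up; with real files, pd.read_csv would load them instead.

```python
import pandas as pd

# Made-up frames standing in for the two CSV files.
left = pd.DataFrame({"key": [2, 5], "left_val": ["a", "b"]})
right = pd.DataFrame({"key": [2, 7], "right_val": ["x", "y"]})

# Inner join: keep only the rows whose "key" appears in both frames.
joined = pd.merge(left, right, on="key", how="inner")
print(joined)
```

For a file too large to load at once, pd.read_csv also accepts a chunksize argument, so the big file can be read and merged in pieces.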

Spacedman answered Oct 24 '22 11:10


There is a package available now which does exactly this! Check the link below:

pysqldf => https://pypi.org/project/pysqldf/

This package lets you query pandas DataFrames using SQL, just like sqldf did in R.

Gary answered Oct 24 '22 09:10