Is there a module in Python that does something like "sqldf" for R?

List comprehensions are very good, but some kind of "... JOIN ..." would be very useful. For example, I have a set A = {1, 0} and a list B = [[1,1],[2,3]]. I would like to find all rows in B where the second column is one of the values in A. Or, more generally, I have two CSV files and want to find all the rows where the values of some column match between the two files, like a 'join' of the two files. One of the files is gigabytes in size. For reference, sqldf is "SQL select on R data frames." Thanks.

asked Dec 24 '11 22:12

gstar2002

4 Answers

You can use pandasql, which allows for SQL style querying of pandas DataFrames. It's very similar to sqldf.

https://github.com/yhat/pandasql/

(full disclosure: I wrote it)

EDIT: a blog post documenting some of the features can be found here: http://blog.yhathq.com/posts/pandasql-sql-for-pandas-dataframes.html
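To illustrate the general idea of running SQL over in-memory data, here is a minimal sketch that uses only the standard library's sqlite3 module. It is not pandasql's API: the helper name rows_matching and the table and column names are made up for this example, and the data is the A and B from the question.

```python
import sqlite3

def rows_matching(b_rows, a_values):
    """Return rows of b whose second column appears in a_values (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE a (val INTEGER)")
        conn.execute("CREATE TABLE b (c1 INTEGER, c2 INTEGER)")
        conn.executemany("INSERT INTO a VALUES (?)", [(v,) for v in a_values])
        conn.executemany("INSERT INTO b VALUES (?, ?)", b_rows)
        # A plain SQL inner join between the two tables.
        cur = conn.execute(
            "SELECT b.c1, b.c2 FROM b JOIN a ON b.c2 = a.val ORDER BY b.c1, b.c2"
        )
        return cur.fetchall()
    finally:
        conn.close()

result = rows_matching([[1, 1], [2, 3]], {1, 0})
print(result)
```

Rows come back as tuples, so the matching row from B is returned as (1, 1).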

answered Oct 24 '22 09:10

Greg

I'm not aware of a library that does what you ask (though I only glanced at the sqldf documentation). However, nothing you asked for really requires a library: both tasks are one-liners in Python (and you could of course wrap the functionality in a function rather than using a bare list comprehension).

Set A= {1,0}, a list B = [[1,1],[2,3]]. I would like to find all rows in B where the second column is one of the values in A.

>>> a = set([1, 0])
>>> b = [[1,1],[2,3]]
>>> [l for l in b if l[1] in a]
[[1, 1]]

I have 2 CSV files. I want to find out all the rows where the values of some column from the two files match.

>>> f1 = [[1, 2, 3], [4, 5, 6]]
>>> f2 = [[0, 2, 8], [7, 7, 7]]
>>> [tuple_ for tuple_ in zip(f1, f2) if tuple_[0][1] == tuple_[1][1]]
[([1, 2, 3], [0, 2, 8])]
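Note that zip pairs rows by position, so the one-liner above only finds matches that sit at the same index in both inputs. A more general join can be sketched by indexing the smaller file in a dict and streaming the larger one. This is only a sketch under assumptions: the function name join_csv and the sample data are hypothetical, and io.StringIO stands in for real files opened with open().

```python
import csv
import io

def join_csv(small_file, big_file, small_col, big_col):
    """Yield (small_row, big_row) pairs whose chosen columns are equal."""
    # Index the smaller file in memory: column value -> list of rows.
    index = {}
    for row in csv.reader(small_file):
        index.setdefault(row[small_col], []).append(row)
    # Stream the larger file one row at a time and look up matches.
    for row in csv.reader(big_file):
        for match in index.get(row[big_col], []):
            yield match, row

# In-memory stand-ins for two CSV files (hypothetical sample data).
f1 = io.StringIO("1,2,3\n4,5,6\n")
f2 = io.StringIO("7,7,7\n0,2,8\n")
pairs = list(join_csv(f1, f2, 1, 1))
print(pairs)
```

Only the smaller file has to fit in memory, which matters when the other file is gigabytes in size. The csv module returns every field as a string, so convert values first if a numeric comparison is needed.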

EDIT: If memory usage is a problem you should use generators instead of lists. For example, this builds the whole list of pairs in memory:

>>> list(zip(f1, f2))
[([1, 2, 3], [0, 2, 8]), ([4, 5, 6], [7, 7, 7])]

but using a lazy iterator (in Python 3 zip is already lazy; in Python 2 use itertools.izip):

>>> gen = zip(f1, f2)
>>> gen
<zip object at 0x1f24ab8>
>>> next(gen)
([1, 2, 3], [0, 2, 8])
>>> next(gen)
([4, 5, 6], [7, 7, 7])

And for the data source:

>>> [line for line in f1]
[[1, 2, 3], [4, 5, 6]]

translates to a generator as:

>>> gen = (line for line in f1)
>>> gen
<generator object <genexpr> at 0x1f159b0>
>>> next(gen)
[1, 2, 3]
>>> next(gen)
[4, 5, 6]
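Applied to a real CSV source, the same lazy approach might look like the sketch below. The function name matching_rows and the sample data are assumptions for illustration, and io.StringIO stands in for a large file opened with open().

```python
import csv
import io

def matching_rows(lines, column, wanted):
    """Lazily yield parsed CSV rows whose `column` value is in `wanted`."""
    reader = csv.reader(lines)  # parses one line at a time
    return (row for row in reader if row[column] in wanted)

# Hypothetical stand-in for a large CSV file.
big = io.StringIO("1,1\n2,3\n5,0\n")
a = {"1", "0"}  # csv fields are strings, so compare against strings

gen = matching_rows(big, 1, a)
first = next(gen)  # only as much input as needed has been read so far
rest = list(gen)   # consume the remaining matches
print(first, rest)
```

Nothing is read until the generator is consumed, so memory use stays roughly constant regardless of file size.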

answered Oct 24 '22 10:10

mac

Before you can get the functionality of sqldf you need the functionality of 'df', i.e. data frames. Python has a cuddly version, pandas:

http://pandas.sourceforge.net/

Perhaps the section on joining and merging will help:

http://pandas.sourceforge.net/merging.html

I recommend you start with something smaller than your gigabyte files though!
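As a small starting point, a join on a shared column can be sketched with pandas.merge. The column names and values below are made up for illustration, and the example assumes pandas is installed.

```python
import pandas as pd

# Two tiny frames standing in for the two CSV files (hypothetical data).
left = pd.DataFrame({"id": [1, 4], "key": [2, 5], "x": [3, 6]})
right = pd.DataFrame({"key": [2, 7], "y": [8, 7]})

# Inner join: keep only rows whose "key" value appears in both frames.
merged = pd.merge(left, right, on="key", how="inner")
print(merged)
```

With real files, each frame would typically come from pd.read_csv instead of being built by hand.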

answered Oct 24 '22 11:10

Spacedman

There is a package available now which does exactly this! Check the link below:

pysqldf => https://pypi.org/project/pysqldf/

This package allows you to query pandas DataFrames using SQL, just like sqldf does in R.

answered Oct 24 '22 09:10

Gary

Tags: python, sql, dataframe, r