Create a pandas DataFrame from generator?

Tags: python, python-3.x, pandas

I've created a tuple generator that extracts information from a file, keeping only the records of interest and converting each one to a tuple that the generator yields.

I tried to create a DataFrame from it:

    import pandas as pd
    df = pd.DataFrame.from_records(tuple_generator, columns=tuple_fields_name_list)

but it throws an error:

    ...
    C:\Anaconda\envs\py33\lib\site-packages\pandas\core\frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
       1046                 values.append(row)
       1047                 i += 1
    -> 1048                 if i >= nrows:
       1049                     break
       1050

    TypeError: unorderable types: int() >= NoneType()

I got it to work by consuming the generator into a list first, but that uses twice the memory:

    df = pd.DataFrame.from_records(list(tuple_generator), columns=tuple_fields_name_list)

The files I want to load are big, and memory consumption matters. On my last attempt, my computer spent two hours growing virtual memory :(

The question: does anyone know a way to create a DataFrame directly from a record generator, without first converting it to a list?

Note: I'm using Python 3.3 and pandas 0.12 with Anaconda on Windows.

Update:

Reading the file is not the problem; my tuple generator handles that well. It scans a compressed text file of intermixed records line by line, converts only the wanted data to the correct types, and yields the fields as tuples. Some numbers: it scans 2111412 records in a 130MB gzip file (about 6.5GB uncompressed) in about a minute, using little memory.
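For readers who want something concrete, a generator of the kind described above might look roughly like this. It is only a sketch: the comma-separated layout, the record-type field, and the name `iter_records` are made up for illustration and are not the actual file format.

```python
import gzip


def iter_records(path, wanted_type="A"):
    """Yield (id, value) tuples for lines whose first field equals wanted_type.

    Hypothetical record layout: "<type>,<id>,<value>" on each line.
    """
    # "rt" opens the gzip file in text mode, so lines are decompressed lazily.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split(",")
            if fields[0] != wanted_type:
                continue  # skip records we are not interested in
            # Convert only the wanted fields to their target types.
            yield int(fields[1]), float(fields[2])
```

Because the file is read one line at a time and nothing is accumulated, memory use stays roughly constant regardless of file size.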

Pandas 0.12 does not accept generators. The dev version does, but it puts the whole generator into a list and then converts that list to a frame. That's not efficient, but it's something pandas has to deal with internally. Meanwhile, I'll have to think about buying some more memory.


asked Sep 20 '13 11:09

tinproject

2 Answers

You certainly can construct a pandas.DataFrame() from a generator of tuples, as of version 0.19 (and probably earlier). Don't use .from_records(); just use the constructor, for example:

    import pandas as pd
    someGenerator = ( (x, chr(x)) for x in range(48,127) )
    someDf = pd.DataFrame(someGenerator)

Produces:

    type(someDf)
    # pandas.core.frame.DataFrame

    someDf.dtypes
    # 0     int64
    # 1    object
    # dtype: object

    someDf.tail(10)
    #       0  1
    # 69  117  u
    # 70  118  v
    # 71  119  w
    # 72  120  x
    # 73  121  y
    # 74  122  z
    # 75  123  {
    # 76  124  |
    # 77  125  }
    # 78  126  ~
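If memory is the main concern, one option is to consume the generator in fixed-size batches and concatenate the pieces, so the full list of Python tuples never exists at once. This is a sketch, not something pandas does for you: `frame_from_generator` and `chunk_size` are hypothetical names, and the final `pd.concat` still needs room for the per-chunk frames plus the combined result.

```python
from itertools import islice

import pandas as pd


def frame_from_generator(rows, columns, chunk_size=100_000):
    """Build a DataFrame from an iterable of tuples, one batch at a time."""
    rows = iter(rows)
    chunks = []
    while True:
        # Only chunk_size tuples are held as Python objects at any moment.
        batch = list(islice(rows, chunk_size))
        if not batch:
            break
        chunks.append(pd.DataFrame.from_records(batch, columns=columns))
    if not chunks:
        # Empty input: return an empty frame with the requested columns.
        return pd.DataFrame(columns=columns)
    return pd.concat(chunks, ignore_index=True)


some_generator = ((x, chr(x)) for x in range(48, 127))
some_df = frame_from_generator(some_generator, ["code", "char"], chunk_size=10)
```

Each batch is converted to compact column arrays before the next one is read, which is usually much smaller than the equivalent tuples. Whether that is enough depends on the data, so measure it on a real file.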

answered Sep 22 '22 02:09

C8H10N4O2

You cannot create a DataFrame from a generator with pandas 0.12. One option is to upgrade to the development version: get it from GitHub and compile it. That is a little bit painful on Windows, but it's the option I would prefer.

Alternatively, since you said you are filtering the lines, you can filter them first, write them to a file, and then load that file using read_csv or something similar.
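A minimal sketch of that filter-then-load route, assuming the generator yields plain tuples of numbers and strings. The helper name `frame_via_csv` and the use of a temporary file are illustrative choices, not part of the original answer.

```python
import csv
import os
import tempfile

import pandas as pd


def frame_via_csv(rows, columns):
    """Stream tuples to a temporary CSV file, then load it with read_csv."""
    # delete=False so the file can be reopened by name (needed on Windows).
    with tempfile.NamedTemporaryFile(
        "w", suffix=".csv", newline="", encoding="utf-8", delete=False
    ) as tmp:
        writer = csv.writer(tmp)
        writer.writerow(columns)  # header row
        writer.writerows(rows)    # consumes the generator one row at a time
        path = tmp.name
    try:
        return pd.read_csv(path)
    finally:
        os.remove(path)


df = frame_via_csv(((x, chr(x)) for x in range(97, 100)), ["code", "char"])
```

Note that read_csv infers the column types again from the text, so any conversions done in the generator are repeated; pass `dtype=` if the inferred types are not the ones you want.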

If you want to get super complicated, you can create a file-like object that returns the lines:

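    def gen():
        # Stand-in for the real generator: yields CSV text one line at a time.
        lines = [
            'col1,col2\n',
            'foo,bar\n',
            'foo,baz\n',
            'bar,baz\n'
        ]
        for line in lines:
            yield line

    class Reader(object):
        # Minimal file-like wrapper: each read() call returns the next line.
        def __init__(self, g):
            self.g = g
        def read(self, n=0):
            # n is ignored; an empty string signals end of file.
            try:
                return next(self.g)
            except StopIteration:
                return ''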

And then pass it to read_csv:

    >>> pd.read_csv(Reader(gen()))
      col1 col2
    0  foo  bar
    1  foo  baz
    2  bar  baz

answered Sep 24 '22 02:09

Viktor Kerkez
