
Create a pandas DataFrame from generator?

I've created a tuple generator that extracts information from a file, keeps only the records of interest, and converts each one to a tuple that the generator yields.

I've tried to create a DataFrame from it:

    import pandas as pd
    df = pd.DataFrame.from_records(tuple_generator, columns=tuple_fields_name_list)

but it throws an error:

    ...
    C:\Anaconda\envs\py33\lib\site-packages\pandas\core\frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
       1046                 values.append(row)
       1047                 i += 1
    -> 1048                 if i >= nrows:
       1049                     break
       1050

    TypeError: unorderable types: int() >= NoneType()

I managed to get it working by consuming the generator into a list, but that uses twice the memory:

    df = pd.DataFrame.from_records(list(tuple_generator), columns=tuple_fields_name_list)

The files I want to load are big, and memory consumption matters. On my last attempt, my computer spent two hours trying to grow virtual memory :(

The question: does anyone know a way to create a DataFrame directly from a record generator, without first converting it to a list?

Note: I'm using Python 3.3 and pandas 0.12 with Anaconda on Windows.

Update:

Reading the file is not the problem; my tuple generator does that well. It scans a compressed text file of intermixed records line by line, converts only the wanted data to the correct types, and yields the fields as tuples. Some numbers: it scans 2111412 records in a 130MB gzip file (about 6.5GB uncompressed) in about a minute, using little memory.
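For illustration only, a generator of that shape might look like the sketch below. The real record layout is not shown in the question, so the comma-separated "type,id,value" format, the function name iter_records and the wanted_type parameter are all made-up assumptions.

```python
import gzip


def iter_records(path, wanted_type="A"):
    """Yield (id, value) tuples for lines whose first field equals wanted_type.

    The "type,id,value" layout is a hypothetical example format, not the
    format of the file described in the question.
    """
    # "rt" opens the gzip stream in text mode, so lines are decompressed lazily.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split(",")
            if len(fields) != 3 or fields[0] != wanted_type:
                continue  # skip other record types and malformed lines
            # Convert only the wanted fields to their target types.
            yield int(fields[1]), float(fields[2])
```

Because each line is read, filtered and converted before the next one is touched, memory use stays roughly constant regardless of file size.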

Pandas 0.12 does not accept generators. The dev version accepts them, but it puts the whole generator into a list and then converts that to a frame. It's not efficient, but that is something pandas has to deal with internally. Meanwhile, I'll have to think about buying some more memory.

asked Sep 20 '13 11:09 by tinproject




2 Answers

You certainly can construct a pandas.DataFrame() from a generator of tuples, as of version 0.19 (and probably earlier). Don't use .from_records(); just use the constructor, for example:

    import pandas as pd
    someGenerator = ((x, chr(x)) for x in range(48, 127))
    someDf = pd.DataFrame(someGenerator)

Produces:

    type(someDf)
    # pandas.core.frame.DataFrame

    someDf.dtypes
    # 0     int64
    # 1    object
    # dtype: object

    someDf.tail(10)
    #       0  1
    # 69  117  u
    # 70  118  v
    # 71  119  w
    # 72  120  x
    # 73  121  y
    # 74  122  z
    # 75  123  {
    # 76  124  |
    # 77  125  }
    # 78  126  ~
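If memory is the concern, one possible variation is to consume the generator in chunks, so that only one chunk at a time exists as Python tuples. The sketch below is an assumption-laden illustration, not something pandas does for you: the helper name frame_from_records, the chunk_size default and the column names are all made up here.

```python
from itertools import islice

import pandas as pd


def frame_from_records(records, columns, chunk_size=100_000):
    """Build a DataFrame from an iterator of tuples, chunk_size rows at a time."""
    records = iter(records)
    frames = []
    while True:
        # Only this chunk is held as a list of Python tuples.
        chunk = list(islice(records, chunk_size))
        if not chunk:
            break
        frames.append(pd.DataFrame(chunk, columns=columns))
    if not frames:
        return pd.DataFrame(columns=columns)
    # Concatenating still needs room for the chunk frames plus the result.
    return pd.concat(frames, ignore_index=True)


some_generator = ((x, chr(x)) for x in range(48, 127))
some_df = frame_from_records(some_generator, columns=["code", "char"], chunk_size=20)
```

The final concat still copies the data once, and dtypes are inferred per chunk, so a column that looks different in different chunks may end up with a wider dtype than expected.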
answered Sep 22 '22 02:09 by C8H10N4O2


You cannot create a DataFrame from a generator with the 0.12 version of pandas. You can either update to the development version (get it from GitHub and compile it, which is a little painful on Windows, but it's the option I would prefer).

Or, since you said you are filtering the lines, you can first filter them, write them to a file, and then load them using read_csv or something else...

If you want to get super complicated, you can create a file-like object that returns the lines:
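A minimal sketch of the filter-then-write approach, assuming the filtered records fit the standard csv module. The records() generator is only a stand-in for the real filtering generator, and the column names are invented for the example.

```python
import csv
import os
import tempfile

import pandas as pd


def records():
    # Stand-in for the filtering generator described in the question.
    yield ("foo", 1)
    yield ("bar", 2)
    yield ("baz", 3)


# Stream the filtered records to a temporary CSV file, one row at a time.
with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="", delete=False) as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["name", "value"])  # header row; column names are made up
    writer.writerows(records())  # consumes the generator lazily
    tmp_path = tmp.name

try:
    df = pd.read_csv(tmp_path)  # pandas parses the file into typed columns
finally:
    os.remove(tmp_path)  # clean up the intermediate file
```

This trades disk space and a second parsing pass for never holding the records as Python objects in memory.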

    def gen():
        lines = [
            'col1,col2\n',
            'foo,bar\n',
            'foo,baz\n',
            'bar,baz\n'
        ]
        for line in lines:
            yield line

    class Reader(object):
        def __init__(self, g):
            self.g = g
        def read(self, n=0):
            try:
                return next(self.g)
            except StopIteration:
                return ''

And then use read_csv:

    >>> pd.read_csv(Reader(gen()))
      col1 col2
    0  foo  bar
    1  foo  baz
    2  bar  baz
answered Sep 24 '22 02:09 by Viktor Kerkez