I've created a tuple generator that extracts information from a file, keeping only the records of interest and converting each one to a tuple, which the generator yields.
I tried to create a DataFrame from it:
    import pandas as pd
    df = pd.DataFrame.from_records(tuple_generator, columns = tuple_fields_name_list)
but it throws an error:
    ...
    C:\Anaconda\envs\py33\lib\site-packages\pandas\core\frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
       1046                 values.append(row)
       1047                 i += 1
    -> 1048                 if i >= nrows:
       1049                     break
       1050

    TypeError: unorderable types: int() >= NoneType()
I got it to work by consuming the generator into a list, but that uses twice the memory:
df = pd.DataFrame.from_records(list(tuple_generator), columns = tuple_fields_name_list)
The files I want to load are big, and memory consumption matters. On my last attempt, my computer spent two hours trying to grow virtual memory :(
The question: does anyone know a way to create a DataFrame directly from a record generator, without first converting it to a list?
Note: I'm using Python 3.3 and pandas 0.12 with Anaconda on Windows.
Update:
Reading the file is not the problem; my tuple generator does that well. It scans a compressed text file of intermixed records line by line, converts only the wanted data to the correct types, and yields the fields as tuples. Some numbers: it scans 2111412 records in a 130MB gzip file (about 6.5GB uncompressed) in about a minute, using little memory.
Pandas 0.12 does not accept generators. The dev version does, but it puts the whole generator into a list and then converts that to a frame. It's not efficient, but that is something pandas has to deal with internally. Meanwhile, I'll have to think about buying some more memory.
One way to create a pandas DataFrame is with the zip() function. You can zip several lists together to get a list of tuples, or build a dictionary from them, and then use that to construct a DataFrame. zip() returns an iterator that produces a single item at a time.
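As a rough illustration of the zip() approach described above, here is a minimal sketch. The data and the column names `name` and `age` are made up for the example, and both construction routes are shown side by side.

```python
import pandas as pd

# Hypothetical example data; the names and values are illustrative only.
names = ["alice", "bob", "carol"]
ages = [31, 27, 45]

# zip() pairs the lists element by element, producing one tuple at a time.
records = list(zip(names, ages))

# Alternatively, map column names to the column lists and build from a dict.
columns = dict(zip(["name", "age"], [names, ages]))

df_from_records = pd.DataFrame(records, columns=["name", "age"])
df_from_dict = pd.DataFrame(columns)
```

Both routes should give the same frame here. Note that `list(zip(...))` materializes every tuple, so on its own this does nothing for the memory problem in the question.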
To create a DataFrame, we first need to import pandas. A DataFrame is created with the pd.DataFrame() constructor. Its first argument is the data to fill the DataFrame with; further arguments, such as columns, are optional.
Since a DataFrame can be thought of as a two-dimensional data structure, we can create one from a two-dimensional NumPy array. For example, if A is a two-dimensional array with 4 rows and 3 columns, we can pass it to the DataFrame constructor.
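The array A itself is not shown in the text, so the following is only a sketch with made-up contents (the integers 0 to 11) and made-up column labels:

```python
import numpy as np
import pandas as pd

# A stand-in for the array "A": 4 rows and 3 columns of illustrative values.
A = np.arange(12).reshape(4, 3)

# Pass the array straight to the constructor; the column labels are optional.
df = pd.DataFrame(A, columns=["a", "b", "c"])
```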
You certainly can construct a pandas.DataFrame() from a generator of tuples, as of version 0.19 (and probably earlier). Don't use .from_records(); just use the constructor, for example:
    import pandas as pd

    someGenerator = ( (x, chr(x)) for x in range(48,127) )
    someDf = pd.DataFrame(someGenerator)
Produces:
    type(someDf)
    # pandas.core.frame.DataFrame

    someDf.dtypes
    # 0     int64
    # 1    object
    # dtype: object

    someDf.tail(10)
    #       0  1
    # 69  117  u
    # 70  118  v
    # 71  119  w
    # 72  120  x
    # 73  121  y
    # 74  122  z
    # 75  123  {
    # 76  124  |
    # 77  125  }
    # 78  126  ~
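If memory is the main concern, one possible workaround (not something pandas does for you, just a sketch) is to consume the generator in fixed-size chunks, so that only one chunk of Python tuples is alive at a time, and to concatenate the per-chunk frames at the end. The helper name `frame_from_generator` and the `chunk_size` default are hypothetical, chosen only for this example.

```python
from itertools import islice

import pandas as pd


def frame_from_generator(records, columns, chunk_size=100_000):
    """Build a DataFrame from an iterable of tuples, one chunk at a time.

    Hypothetical helper: only `chunk_size` tuples are held as Python
    objects at once; each chunk is converted to a frame before the
    next chunk is read.
    """
    records = iter(records)
    frames = []
    while True:
        chunk = list(islice(records, chunk_size))
        if not chunk:
            break
        frames.append(pd.DataFrame.from_records(chunk, columns=columns))
    if not frames:
        # Empty input: return an empty frame with the requested columns.
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)


# Same toy generator as in the answer above, with illustrative column names.
some_generator = ((x, chr(x)) for x in range(48, 127))
df = frame_from_generator(some_generator, ["code", "char"], chunk_size=10)
```

The final `pd.concat` still needs room for the chunk frames and the combined result at the same time, so this is not free. The saving, if any, comes from never holding the full list of tuples.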
You cannot create a DataFrame from a generator with pandas 0.12. You can either update to the development version (get it from GitHub and compile it, which is a little painful on Windows, but it is the option I would prefer).
Or, since you said you are filtering the lines, you can first filter them, write them to a file, and then load them using read_csv or something else...
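A minimal sketch of the filter-then-reload route, assuming the generator yields plain tuples that round-trip through CSV. The helper name `dump_and_load` and the sample records are hypothetical. Types are re-inferred by read_csv, so any conversion done in the generator may need to be repeated through read_csv options.

```python
import csv
import os
import tempfile

import pandas as pd


def dump_and_load(records, columns):
    """Stream tuples to a temporary CSV file, then load it with read_csv.

    Hypothetical helper: the generator is written row by row, so the
    full list of tuples is never held in memory.
    """
    with tempfile.NamedTemporaryFile(
        "w", suffix=".csv", newline="", delete=False
    ) as tmp:
        writer = csv.writer(tmp)
        writer.writerow(columns)   # header row
        writer.writerows(records)  # consumes the generator lazily
        path = tmp.name
    try:
        return pd.read_csv(path)
    finally:
        os.remove(path)


# Illustrative records standing in for the filtered tuple generator.
records = ((i, i * 0.5) for i in range(5))
df = dump_and_load(records, ["n", "half"])
```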
If you want to get super complicated, you can create a file-like object that will return the lines:
    def gen():
        lines = [
            'col1,col2\n',
            'foo,bar\n',
            'foo,baz\n',
            'bar,baz\n'
        ]
        for line in lines:
            yield line

    class Reader(object):
        def __init__(self, g):
            self.g = g

        def read(self, n=0):
            try:
                return next(self.g)
            except StopIteration:
                return ''
And then use read_csv:
    >>> pd.read_csv(Reader(gen()))
      col1 col2
    0  foo  bar
    1  foo  baz
    2  bar  baz