I have a large file (2GB) of categorical data (mostly "Nan"--but populated here and there with actual values) that is too large to read into a single data frame. I had a rather difficult time coming up with a object to store all the unique values for each column (Which is my goal--eventually I need to factorize this for modeling)
What I ended it up doing was reading the file in chunks into a dataframe and then get the unique values of each column and store them in a list of lists. My solution works, but seemed most un-pythonic--is there a cleaner way to accomplish this in Python (ver 3.5). I do know the number of columns (~2100).
import pandas as pd
#large file of csv separated text data
data=pd.read_csv("./myratherlargefile.csv",chunksize=100000, dtype=str)
collist=[]
master=[]
i=0
initialize=0
for chunk in data:
#so the first time through I have to make the "master" list
if initialize==0:
for col in chunk:
#thinking about this, i should have just dropped this col
if col=='Id':
continue
else:
#use pd.unique as a build in solution to get unique values
collist=chunk[col][chunk[col].notnull()].unique().tolist()
master.append(collist)
i=i+1
#but after first loop just append to the master-list at
#each master-list element
if initialize==1:
for col in chunk:
if col=='Id':
continue
else:
collist=chunk[col][chunk[col].notnull()].unique().tolist()
for item in collist:
master[i]=master[i]+collist
i=i+1
initialize=1
i=0
after that, my final task for all the unique values is as follows:
i=0
names=chunk.columns.tolist()
for item in master:
master[i]=list(set(item))
master[i]=master[i].append(names[i+1])
i=i+1
thus master[i] gives me the column name and then a list of unique values--crude but it does work--my main concern is building the list in a "better" way if possible.
Python provides an option of creating a list within a list. If put simply, it is a nested list but with one or more lists inside as an element. Here, [a,b], [c,d], and [e,f] are separate lists which are passed as elements to make a new list. This is a list of lists.
Using the range() function to create a list from 1 to 100 in Python. In Python, we can use the range() function to create an iterator sequence between two endpoints. We can use this function to create a list from 1 to 100 in Python. The function accepts three parameters start , stop , and step .
I would suggest instead of a list
of list
s, using a collections.defaultdict(set)
.
Say you start with
uniques = collections.defaultdict(set)
Now the loop can become something like this:
for chunk in data:
for col in chunk:
uniques[col] = uniques[col].union(chunk[col].unique())
Note that:
defaultdict
always has a set
for uniques[col]
(that's what it's there for), so you can skip initialized
and stuff.
For a given col
, you simply update the entry with the union of the current set (which initially is empty, but it doesn't matter) and the new unique elements.
Edit
As Raymond Hettinger notes (thanks!), it is better to use
uniques[col].update(chunk[col].unique())
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With