Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

The Pythonic way to grow a list of lists

I have a large file (2GB) of categorical data (mostly "Nan"--but populated here and there with actual values) that is too large to read into a single data frame. I had a rather difficult time coming up with a object to store all the unique values for each column (Which is my goal--eventually I need to factorize this for modeling)

What I ended it up doing was reading the file in chunks into a dataframe and then get the unique values of each column and store them in a list of lists. My solution works, but seemed most un-pythonic--is there a cleaner way to accomplish this in Python (ver 3.5). I do know the number of columns (~2100).

import pandas as pd
#large file of csv separated text data
data=pd.read_csv("./myratherlargefile.csv",chunksize=100000, dtype=str)

collist=[]
master=[]
i=0
initialize=0
for chunk in data:
    #so the first time through I have to make the "master" list
    if initialize==0:
        for col in chunk:
            #thinking about this, i should have just dropped this col
            if col=='Id':
                continue
            else:
                #use pd.unique as a build in solution to get unique values
                collist=chunk[col][chunk[col].notnull()].unique().tolist()
                master.append(collist)
                i=i+1
    #but after first loop just append to the master-list at
    #each master-list element
    if initialize==1:
        for col in chunk:
            if col=='Id':
                continue
            else:
                collist=chunk[col][chunk[col].notnull()].unique().tolist()
                for item in collist:
                    master[i]=master[i]+collist
                i=i+1
    initialize=1
    i=0 

after that, my final task for all the unique values is as follows:

i=0
names=chunk.columns.tolist()
for item in master:
     master[i]=list(set(item))
     master[i]=master[i].append(names[i+1])
     i=i+1

thus master[i] gives me the column name and then a list of unique values--crude but it does work--my main concern is building the list in a "better" way if possible.

like image 762
RDS Avatar asked Sep 27 '16 05:09

RDS


People also ask

Can you have a list of list of lists in Python?

Python provides an option of creating a list within a list. If put simply, it is a nested list but with one or more lists inside as an element. Here, [a,b], [c,d], and [e,f] are separate lists which are passed as elements to make a new list. This is a list of lists.

How do you create a list from 1 to 100 in Python?

Using the range() function to create a list from 1 to 100 in Python. In Python, we can use the range() function to create an iterator sequence between two endpoints. We can use this function to create a list from 1 to 100 in Python. The function accepts three parameters start , stop , and step .


1 Answers

I would suggest instead of a list of lists, using a collections.defaultdict(set).

Say you start with

uniques = collections.defaultdict(set)

Now the loop can become something like this:

for chunk in data: 
    for col in chunk:
        uniques[col] = uniques[col].union(chunk[col].unique())

Note that:

  1. defaultdict always has a set for uniques[col] (that's what it's there for), so you can skip initialized and stuff.

  2. For a given col, you simply update the entry with the union of the current set (which initially is empty, but it doesn't matter) and the new unique elements.

Edit

As Raymond Hettinger notes (thanks!), it is better to use

       uniques[col].update(chunk[col].unique())
like image 62
Ami Tavory Avatar answered Oct 09 '22 17:10

Ami Tavory