The Pythonic way to grow a list of lists

Tags: python, list, nested-lists

I have a large file (2 GB) of categorical data (mostly NaN, but populated here and there with actual values) that is too large to read into a single data frame. I had a rather difficult time coming up with an object to store all the unique values for each column, which is my goal (eventually I need to factorize this for modeling).

What I ended up doing was reading the file in chunks into a dataframe, getting the unique values of each column, and storing them in a list of lists. My solution works, but it seems most un-Pythonic. Is there a cleaner way to accomplish this in Python (version 3.5)? I do know the number of columns (~2100).

import pandas as pd

# Large CSV file, read lazily in chunks of 100,000 rows
data = pd.read_csv("./myratherlargefile.csv", chunksize=100000, dtype=str)

collist = []
master = []
i = 0
initialize = 0
for chunk in data:
    # The first time through, build the "master" list
    if initialize == 0:
        for col in chunk:
            # Thinking about this, I should have just dropped this column
            if col == 'Id':
                continue
            else:
                # Use the built-in unique() to get the unique non-null values
                collist = chunk[col][chunk[col].notnull()].unique().tolist()
                master.append(collist)
                i = i + 1
    # After the first chunk, extend each existing master-list element
    if initialize == 1:
        for col in chunk:
            if col == 'Id':
                continue
            else:
                collist = chunk[col][chunk[col].notnull()].unique().tolist()
                # Add this chunk's values once; duplicates are removed later
                master[i] = master[i] + collist
                i = i + 1
    initialize = 1
    i = 0

After that, my final step to get all the unique values is as follows:

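i = 0
names = chunk.columns.tolist()
for item in master:
    # list.append() returns None, so build the new entry explicitly;
    # names[i + 1] assumes 'Id' is the first column
    master[i] = [names[i + 1]] + list(set(item))
    i = i + 1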

Thus master[i] gives me the column name followed by the unique values. It is crude, but it works. My main concern is building the list in a "better" way, if possible.


asked Sep 27 '16 05:09

RDS

1 Answer

I would suggest instead of a list of lists, using a collections.defaultdict(set).

Say you start with

import collections

uniques = collections.defaultdict(set)

Now the loop can become something like this:

for chunk in data: 
    for col in chunk:
        uniques[col] = uniques[col].union(chunk[col].unique())

Note that:

A defaultdict(set) always supplies a set for uniques[col] (that is what it is there for), so you can skip the initialize flag and the index bookkeeping.
For a given col, you simply replace the entry with the union of the current set (initially empty, which doesn't matter) and the new unique elements.
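To see why no first-pass special case is needed, here is a minimal sketch that uses only the standard library. The `chunks` list is a made-up stand-in for the per-chunk unique values; in the real loop those would come from `chunk[col].unique()`.

```python
from collections import defaultdict

# Hypothetical stand-in for two chunks of per-column values.
chunks = [
    {"color": ["red", "blue"], "size": ["S"]},
    {"color": ["blue", "green"], "size": ["M", "S"]},
]

uniques = defaultdict(set)
for chunk in chunks:
    for col, values in chunk.items():
        # The first access to uniques[col] creates an empty set automatically,
        # so the first chunk is handled exactly like every later chunk.
        uniques[col] = uniques[col].union(values)

print(sorted(uniques["color"]))  # ['blue', 'green', 'red']
print(sorted(uniques["size"]))   # ['M', 'S']
```

Because the dictionary is keyed by column name, there is also no need for a parallel list of names or a running index.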

Edit

As Raymond Hettinger notes (thanks!), it is better to update the set in place, which avoids building a new set on every iteration:

        uniques[col].update(chunk[col].unique())
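Putting the pieces together, the whole job might look roughly like the sketch below. It is only an illustration: the in-memory `csv_text` is a hypothetical stand-in for the large file, `chunksize=2` is there just to force several chunks, and skipping the `Id` column and dropping nulls are carried over from the question rather than from the answer's loop.

```python
import io
from collections import defaultdict

import pandas as pd

# Hypothetical in-memory CSV standing in for the large file.
csv_text = """Id,color,size
1,red,
2,,S
3,blue,M
4,red,
5,green,S
"""

uniques = defaultdict(set)
# chunksize=2 is only used to force several chunks on this tiny sample.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2, dtype=str)
for chunk in reader:
    for col in chunk.columns.drop("Id"):
        # dropna() keeps NaN out of the sets, as notnull() did in the question.
        uniques[col].update(chunk[col].dropna().unique())

# Sort each set so the result is stable and easy to inspect.
result = {col: sorted(vals) for col, vals in uniques.items()}
print(result)
```

For the real file, replacing `io.StringIO(csv_text)` with the path and using a larger `chunksize` should be the only changes needed.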

answered Oct 09 '22 17:10

Ami Tavory
