
NLTK ConditionalFreqDist to Pandas dataframe

I am trying to work with the table generated by nltk.ConditionalFreqDist, but I can't find any documentation on writing the table to a CSV file or exporting it to other formats. I'd love to work with it as a pandas DataFrame, which is also easy to write to CSV. The only thread I could find recommended pickling the CFD object, which doesn't solve my problem.

I wrote the following function to convert an nltk.ConditionalFreqDist object to a pd.DataFrame:

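import pandas as pd

def nltk_cfd_to_pd_dataframe(cfd):
    """ Converts an nltk.ConditionalFreqDist object into a pandas DataFrame object. """

    df = pd.DataFrame()
    for cond in cfd.conditions():
        # One single-column frame per condition, indexed by sample
        col = pd.DataFrame(pd.Series(dict(cfd[cond])))
        col.columns = [cond]
        # Outer join keeps samples that occur under only some conditions
        df = df.join(col, how='outer')

    # Samples never seen under a condition get a count of 0
    df = df.fillna(0)

    return df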

But if I am going to do that, it might make more sense to write a new ConditionalFreqDist function that produces a pd.DataFrame in the first place. Before I reinvent the wheel, I wanted to check whether I am missing any tricks, in NLTK or elsewhere, for making the ConditionalFreqDist object work with other formats and, most importantly, for exporting it to CSV files.

Thanks.

primelens asked Feb 28 '13 20:02


3 Answers

You can treat a FreqDist as a dict and create a DataFrame from it using from_dict:

import nltk
import pandas as pd

fdist = nltk.FreqDist(...)
df_fdist = pd.DataFrame.from_dict(fdist, orient='index')
df_fdist.columns = ['Frequency']
df_fdist.index.name = 'Term'
print(df_fdist)
df_fdist.to_csv(...)

output:

                      Frequency
Term
is                    70464
a                     26429
the                   15079
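
The same dict-based approach appears to carry over to the conditional case from the question, since a ConditionalFreqDist behaves like a dict that maps each condition to a FreqDist. A minimal sketch, assuming a plain dict of collections.Counter objects as a stand-in for the CFD (the sample counts and the names cfd_like and df_cfd are made up for illustration):

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for an nltk.ConditionalFreqDist:
# a dict mapping each condition to a Counter of samples.
cfd_like = {
    "news": Counter({"the": 3, "is": 2}),
    "romance": Counter({"the": 1, "love": 4}),
}

# One column per condition, one row per sample. Samples missing
# under a condition come out as NaN, so fill with 0 and restore ints.
df_cfd = (
    pd.DataFrame({cond: dict(counts) for cond, counts in cfd_like.items()})
    .fillna(0)
    .astype(int)
)
df_cfd.index.name = "Term"

# With no path argument, to_csv returns the CSV text as a string.
csv_text = df_cfd.to_csv()
print(df_cfd)
```

With a real CFD, the comprehension could presumably iterate over cfd.conditions() and index cfd[cond] instead of calling .items() on the stand-in.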
David answered Nov 09 '22 16:11


pd.DataFrame(list(freq_dist.items()), columns=['word', 'frequency'])
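
A short sketch of how that one-liner might be used end to end, assuming a collections.Counter as a stand-in for the FreqDist (in NLTK 3, FreqDist subclasses Counter). The sample tokens and variable names are illustrative only:

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical stand-in for an nltk.FreqDist built from a token list.
freq_dist = Counter("the cat sat on the mat the end".split())

# Materialise the (word, count) pairs as a list of tuples.
df = pd.DataFrame(list(freq_dist.items()), columns=["word", "frequency"])

# Most frequent words first.
df = df.sort_values("frequency", ascending=False).reset_index(drop=True)

# Write CSV to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Unlike the from_dict approach, this keeps the words in an ordinary column rather than in the index, which is why index=False is passed to to_csv.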
Daniil Mashkin answered Nov 09 '22 14:11


OK, so I went ahead and wrote a conditional frequency distribution function that takes a list of tuples, as nltk.ConditionalFreqDist does, but returns a pandas DataFrame. It runs faster than converting the CFD object to a DataFrame:

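import pandas as pd

def cond_freq_dist(data):
    """ Takes a list of (condition, sample) tuples and returns a conditional frequency distribution as a pandas DataFrame. """

    cfd = {}
    for cond, sample in data:
        try:
            # Condition and sample have both been seen before
            cfd[cond][sample] += 1
        except KeyError:
            try:
                # Condition seen before, but the sample is new
                cfd[cond][sample] = 1
            except KeyError:
                # First time this condition appears
                cfd[cond] = {sample: 1}

    return pd.DataFrame(cfd).fillna(0)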
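
For comparison, pandas can also tabulate the pairs directly with pd.crosstab, which is a different technique from the hand-rolled counting loop above. A minimal sketch, assuming the input is a list of (condition, sample) tuples. The data and column names are invented for illustration, and nothing here was measured for speed:

```python
import pandas as pd

# Hypothetical (condition, sample) pairs, in the shape ConditionalFreqDist accepts.
pairs = [
    ("news", "the"),
    ("news", "the"),
    ("news", "is"),
    ("romance", "the"),
    ("romance", "love"),
]

pairs_df = pd.DataFrame(pairs, columns=["condition", "sample"])

# Rows are samples, columns are conditions, cells are co-occurrence counts.
# Combinations that never occur are filled with 0.
table = pd.crosstab(pairs_df["sample"], pairs_df["condition"])
print(table)
```

The result has the same orientation as the function above (samples as rows, conditions as columns), with integer counts instead of floats.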
primelens answered Nov 09 '22 14:11