Here is my problem:
I have a CSV file containing an articles data set with the columns ID, CATEGORY, TITLE, and BODY.
In Python, I read the file into a pandas DataFrame like this:
import pandas as pd
df = pd.read_csv('my_file.csv')
Now I need to somehow transform this df into a corpus object, let's call it my_corpus. But how exactly can I do that? I assume I need something like:
from nltk.corpus.reader import CategorizedCorpusReader
my_corpus = some_nltk_function(df) # <- what is the function?
In the end I want to use NLTK methods to analyze the corpus. For example:
import nltk
my_corpus.fileids() # <- I expect values from column ID
my_corpus.categories() # <- I expect values from column CATEGORY
my_corpus.words(categories='cat_A') # <- I expect values from column TITLE and BODY
my_corpus.sents(categories=['cat_A', 'cat_B', 'cat_C']) # <- I expect values from column TITLE and BODY
Please advise.
You need to do two things.
First, convert each row of your DataFrame df into a corpus file on disk. The following function should do it for you:
import os

def CreateCorpusFromDataFrame(corpusfolder, df):
    # Make sure the target folder exists before writing into it
    os.makedirs(corpusfolder, exist_ok=True)
    for index, r in df.iterrows():
        # One file per article, named '<CATEGORY>_<ID>.txt'
        fname = str(r['CATEGORY']) + '_' + str(r['ID']) + '.txt'
        # Open in 'w' mode so rerunning the function does not append duplicates
        with open(os.path.join(corpusfolder, fname), 'w') as corpusfile:
            corpusfile.write(str(r['BODY']) + ' ' + str(r['TITLE']))

CreateCorpusFromDataFrame('yourcorpusfolder/', df)
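For illustration, here is a minimal run on a made-up two-row DataFrame (the IDs, categories, and texts below are hypothetical):

import pandas as pd

toy = pd.DataFrame({
    'ID': [101, 102],
    'CATEGORY': ['cat_A', 'cat_B'],
    'TITLE': ['First title', 'Second title'],
    'BODY': ['Body of the first article.', 'Body of the second article.'],
})
CreateCorpusFromDataFrame('yourcorpusfolder/', toy)
# yourcorpusfolder/ now contains:
#   cat_A_101.txt -> 'Body of the first article. First title'
#   cat_B_102.txt -> 'Body of the second article. Second title'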
Second, read the files back from yourcorpusfolder with a categorized corpus reader and then do whatever NLTK processing you need:
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# cat_pattern extracts the category from each file name:
# everything before the last underscore in '<CATEGORY>_<ID>.txt'
my_corpus = CategorizedPlaintextCorpusReader('yourcorpusfolder/',
                                             r'.*',
                                             cat_pattern=r'(.*)_.*')
my_corpus.fileids()      # file names built from CATEGORY and ID, e.g. 'cat_A_101.txt'
my_corpus.categories()   # values from column CATEGORY
my_corpus.words(categories='cat_A')             # words from TITLE and BODY
my_corpus.sents(categories=['cat_A', 'cat_B'])  # sentences from TITLE and BODY
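Note that fileids() returns the generated file names, not the bare values from the ID column. If you need the original IDs back, you can recover them from the file names; a small sketch, assuming the IDs themselves contain no underscores:

import os

# Strip the '.txt' extension, then take what follows the last underscore
ids = [os.path.splitext(f)[0].rsplit('_', 1)[1] for f in my_corpus.fileids()]

Also, sents() relies on NLTK's punkt sentence tokenizer, so run nltk.download('punkt') once if you have not already.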