Get the document name in scikit-learn tf-idf matrix

I have created a tf-idf matrix, and now I want to retrieve the top 2 words for each document. I want to pass a document id and get back the top 2 words for that document.

Right now, I have this sample data:

from sklearn.feature_extraction.text import TfidfVectorizer

d = {'doc1':"this is the first document",'doc2':"it is a sunny day"} ### corpus

test_v = TfidfVectorizer(min_df=1)    ### applied the model
t = test_v.fit_transform(d.values())
feature_names = test_v.get_feature_names() ### list of words/terms (use get_feature_names_out() on scikit-learn >= 1.0)

>>> feature_names
['day', 'document', 'first', 'is', 'it', 'sunny', 'the', 'this']

>>> t.toarray()
array([[ 0.        ,  0.47107781,  0.47107781,  0.33517574,  0.        ,
         0.        ,  0.47107781,  0.47107781],
       [ 0.53404633,  0.        ,  0.        ,  0.37997836,  0.53404633,
         0.53404633,  0.        ,  0.        ]])

I can access the matrix by giving the row and column index, e.g.

 >>> t[0,1]
   0.47107781233161794

Is there a way to access this matrix by document id instead? In my case, 'doc1' and 'doc2'.

Thanks

asked Oct 10 '14 16:10

user1525721

1 Answer

By doing

t = test_v.fit_transform(d.values())

you lose any link to the document ids. Before Python 3.7 a dict was not guaranteed to be ordered, so you had no idea which value ended up in which row; and even with an ordered dict, the matrix itself stores no ids. This means that before passing the values to the fit_transform function you need to record which value corresponds to which id.

For example what you can do is:

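values = []   # documents, in the order they are passed to the vectorizer
key = {}      # document id -> row index in the tf-idf matrix

for counter, (k, v) in enumerate(d.items()):
    values.append(v)
    key[k] = counter

t = test_v.fit_transform(values)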

From there you can build a function to access this matrix by document id:

def get_doc_row(docid):
    rowid = key[docid]
    row = t[rowid,:]
    return row

answered Oct 31 '22 18:10

patapouf_ai

Tags:

python

machine-learning

matrix

scikit-learn

tf-idf
