
scikit-learn: output metrics.classification_report into CSV/tab-delimited format

I'm doing multiclass text classification in scikit-learn. I'm training a Multinomial Naive Bayes classifier on a dataset with hundreds of labels. Here's an extract from the script that fits the MNB model:

    from __future__ import print_function

    # read file.csv into a pandas DataFrame
    # (on_bad_lines='skip' replaces error_bad_lines=False in pandas >= 1.3)
    import pandas as pd
    path = 'data/file.csv'
    merged = pd.read_csv(path, on_bad_lines='skip', low_memory=False)

    # define X and y using the original DataFrame
    X = merged.text
    y = merged.grid

    # split X and y into training and testing sets
    # (sklearn.cross_validation was removed in 0.20; use sklearn.model_selection)
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    # import and instantiate CountVectorizer
    from sklearn.feature_extraction.text import CountVectorizer
    vect = CountVectorizer()

    # create document-term matrices using CountVectorizer
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)

    # import and instantiate MultinomialNB
    from sklearn.naive_bayes import MultinomialNB
    nb = MultinomialNB()

    # fit a Multinomial Naive Bayes model
    nb.fit(X_train_dtm, y_train)

    # make class predictions
    y_pred_class = nb.predict(X_test_dtm)

    # generate classification report
    from sklearn import metrics
    print(metrics.classification_report(y_test, y_pred_class))

A simplified version of the metrics.classification_report output on the command line looks like this:

               precision    recall  f1-score   support

           12       0.84      0.48      0.61      2843
           13       0.00      0.00      0.00        69
           15       1.00      0.19      0.32       232
           16       0.75      0.02      0.05       965
           33       1.00      0.04      0.07       155
            4       0.59      0.34      0.43      5600
           41       0.63      0.49      0.55      6218
           42       0.00      0.00      0.00       102
           49       0.00      0.00      0.00        11
            5       0.90      0.06      0.12      2010
           50       0.00      0.00      0.00         5
           51       0.96      0.07      0.13      1267
           58       1.00      0.01      0.02       180
           59       0.37      0.80      0.51      8127
            7       0.91      0.05      0.10       579
            8       0.50      0.56      0.53      7555

    avg/total       0.59      0.48      0.45     35919

Is there any way to get the report output into a standard CSV file with regular column headers?

When I redirect the command-line output into a CSV file, or copy and paste the screen output into a spreadsheet (OpenOffice Calc or Excel), all the results are lumped into a single column.

Seun AJAO, asked Sep 23 '16 13:09


1 Answer

As of scikit-learn v0.20, the easiest way to convert a classification report to a pandas DataFrame is to have the report returned as a dict:

    from sklearn.metrics import classification_report
    report = classification_report(y_test, y_pred, output_dict=True)

and then construct a DataFrame and transpose it:

    import pandas
    df = pandas.DataFrame(report).transpose()

From here on, you are free to use the standard pandas methods to generate your desired output formats (CSV, HTML, LaTeX, ...).

See the scikit-learn documentation for classification_report.
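To make the last step concrete, here is a minimal sketch of going from the report dict to a tab-delimited file. The toy label lists and the output file name classification_report.tsv are assumptions for illustration only; in the question's script you would pass y_test and y_pred_class from the fitted model instead.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Toy labels standing in for y_test and y_pred_class from the question.
y_test = [12, 12, 13, 15, 15, 15]
y_pred_class = [12, 13, 13, 15, 15, 12]

# output_dict=True (scikit-learn >= 0.20) returns nested dicts instead of text.
report = classification_report(y_test, y_pred_class, output_dict=True)

# After transposing, each row is a label or summary line and
# each column is a metric (precision, recall, f1-score, support).
df = pd.DataFrame(report).transpose()

# Write a tab-delimited file; drop sep="\t" for a comma-separated one.
# "classification_report.tsv" is a hypothetical file name.
df.to_csv("classification_report.tsv", sep="\t", index_label="label")
```

The resulting file has one header row and one row per label, so it should open in Calc or Excel with each metric in its own column. Note that the class labels end up as strings in the index (for example "12"), since the report dict is keyed by label name.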

janus235, answered Sep 25 '22 02:09