 

Labels of datasets imported with sklearn.datasets.load_files

I'm wondering how to match the labels produced by an SVM classifier with the ones in my dataset. Then I realized that the problem starts at the beginning: when I load the dataset, I get an object which in my case has the following properties:

.data = the news texts
.target_names = the labels used in the dataset, e.g. ["positive", "negative"]
.target = an array with one number per news item, encoding its label

But I'm wondering whether the order of target_names differs across datasets (with the same tags but different news), and whether the order of the .data elements influences it.

Is there an easy way to find out which label a number in the .target array stands for? (I mean, what does 0 or 1 represent in that array?)

Best,

gal007 asked Apr 10 '19 16:04




1 Answer

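Each value in .target is an index into .target_names, so the label of sample i is .target_names[.target[i]]. With the list from your example, .target_names[1] is "negative". Check the actual list on your dataset, though: load_files() sorts the folder names, so folders named exactly "negative" and "positive" would be listed as ["negative", "positive"].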
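As a minimal sketch of that lookup, the snippet below builds a tiny throwaway corpus in a temporary directory and loads it with load_files(). The folder names, file names and texts are made up for illustration; only load_files(), .target and .target_names come from scikit-learn.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Build a tiny throwaway corpus: one sub-folder per label (hypothetical data).
root = Path(tempfile.mkdtemp())
samples = {
    "positive": ["great news today", "markets rally"],
    "negative": ["terrible outcome"],
}
for label, texts in samples.items():
    (root / label).mkdir()
    for i, text in enumerate(texts):
        (root / label / f"doc{i}.txt").write_text(text, encoding="utf-8")

dataset = load_files(str(root), encoding="utf-8", shuffle=False)

# Each value in .target is an index into .target_names.
labels = [dataset.target_names[t] for t in dataset.target]

# Explicit number -> label lookup table.
index_to_label = dict(enumerate(dataset.target_names))
print(index_to_label)
```

The same list comprehension should work on the output of a classifier's predict(), as long as the classifier was trained on this dataset's .target.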

The order of the target names will be the same across different datasets, as long as the tags are exactly the same. This is because sklearn.datasets.load_files() creates the tags from the sorted folder names, as we can see in the source code (v0.20.x):

[...]
folders = [f for f in sorted(listdir(container_path))
           if isdir(join(container_path, f))]

if categories is not None:
    folders = [f for f in folders if f in categories]

for label, folder in enumerate(folders):
    target_names.append(folder)
[...]
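The effect of that sorting can be checked without scikit-learn. The following is a rough standard-library sketch that mimics the excerpt; folder_labels is a hypothetical helper written for this illustration, not part of the library, and the two corpora are empty placeholder folders.

```python
import os
import tempfile


def folder_labels(container_path):
    """Mimic the excerpt: sub-folder names in sorted order, index = label number."""
    return [
        name
        for name in sorted(os.listdir(container_path))
        if os.path.isdir(os.path.join(container_path, name))
    ]


# Two hypothetical corpora whose label folders are created in different orders.
corpus_a = tempfile.mkdtemp()
for name in ("positive", "negative"):
    os.mkdir(os.path.join(corpus_a, name))

corpus_b = tempfile.mkdtemp()
for name in ("negative", "positive"):
    os.mkdir(os.path.join(corpus_b, name))

names_a = folder_labels(corpus_a)
names_b = folder_labels(corpus_b)
print(names_a, names_b)
```

Both lists come out in the same alphabetical order, regardless of how the folders were created or what files they contain.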

I'd still suggest always retrieving the label from the target_names of the current dataset, to be on the safe side (implementations may change over time, etc.).
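If two datasets ever did end up with the same tags in a different order, matching by name rather than by number avoids mixing them up. A small sketch, assuming plain Python lists; remap_targets and the sample values are hypothetical and not part of scikit-learn:

```python
def remap_targets(targets, source_names, reference_names):
    """Translate label indices from one ordering to another, matching by name."""
    reference_index = {name: i for i, name in enumerate(reference_names)}
    return [reference_index[source_names[t]] for t in targets]


# Hypothetical case: the same tags, listed in a different order.
train_names = ["negative", "positive"]
other_names = ["positive", "negative"]
other_targets = [0, 0, 1]  # indices into other_names

aligned = remap_targets(other_targets, other_names, train_names)
print(aligned)
```

A tag that is missing from reference_names raises a KeyError here, which is arguably preferable to silently assigning the wrong label.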

rvf answered Oct 14 '22 01:10