I'm wondering how to match the labels produced by an SVM classifier with the ones in my dataset. Then I realized that the problem starts at the beginning: when I load the dataset, I get an object which, in my case, has the following properties:
.data = the news text
.target_names = the labels used in the dataset, e.g. ["positive", "negative"]
.target = an array with one number per news item, encoding its label.
But I'm wondering whether the order of the target_names can differ across datasets (with the same tags but different news), and whether the order of the .data elements influences that.
Is there an easy way to find out which label a number in the .target array stands for? (I mean, what does 0 or 1 represent in that array?)
Best,
scikit-learn comes with a few small standard datasets that do not require downloading any file from an external website. Load and return the Boston house-prices dataset (regression). Load and return the iris dataset (classification). Load and return the diabetes dataset (regression).
The sklearn.datasets package embeds some small toy datasets as introduced in the Getting Started section. This package also features helpers to fetch larger datasets commonly used by the machine learning community to benchmark algorithms on data that comes from the 'real world'.
By default all scikit-learn data is stored in '~/scikit_learn_data' subfolders.
Intro to Scikit-Learn's Datasets: Scikit-Learn provides seven datasets, which it calls toy datasets.
The label corresponding to a value i in .target is available as .target_names[i]. In your example, with target_names given as ["positive", "negative"], .target_names[1] is "negative".
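The lookup can be sketched as follows. This is only an illustration: the `dataset` object is a hypothetical stand-in built by hand rather than a real loaded dataset, and the `predicted` list holds made-up values instead of real classifier output. The same indexing should apply to the object returned by sklearn.datasets.load_files() and to the integers a classifier predicts.

```python
from types import SimpleNamespace

# Hypothetical stand-in for a loaded dataset (assumption for illustration).
# The target_names order is the one written in the question; a dataset
# loaded from folders may list its names in a different order.
dataset = SimpleNamespace(
    data=["stocks rally", "markets crash", "profits up"],
    target_names=["positive", "negative"],
    target=[0, 1, 0],
)

# Translate each numeric target into its label name.
labels = [dataset.target_names[t] for t in dataset.target]

# The same lookup applies to integer predictions from a classifier.
predicted = [1, 0, 0]  # illustrative values, not real model output
predicted_labels = [dataset.target_names[p] for p in predicted]

for text, label in zip(dataset.data, labels):
    print(f"{label}: {text}")
```

Looking names up through the dataset's own target_names avoids hard-coding what 0 or 1 means.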
The order of the target names will be the same across different datasets, as long as the tags (i.e. the folder names) are exactly the same. This is because sklearn.datasets.load_files() creates the tags from the sorted folder names, as we can see in the source code (v0.20.x):
    [...]
    folders = [f for f in sorted(listdir(container_path))
               if isdir(join(container_path, f))]

    if categories is not None:
        folders = [f for f in folders if f in categories]

    for label, folder in enumerate(folders):
        target_names.append(folder)
        [...]
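To see why the order of the files or the order in which folders were created does not matter, here is a small standard-library sketch that mimics only the folder-sorting step of the excerpt. The helper names `folder_labels` and `make_dataset` are hypothetical, and this is not the library's implementation, just the same sorting idea applied to two throwaway directories.

```python
import os
import tempfile


def folder_labels(container_path):
    """Mimic the excerpt: one label per subfolder, in sorted name order."""
    return [
        name
        for name in sorted(os.listdir(container_path))
        if os.path.isdir(os.path.join(container_path, name))
    ]


def make_dataset(root, folder_names):
    """Create empty category folders in the given (arbitrary) order."""
    for name in folder_names:
        os.makedirs(os.path.join(root, name))


with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    make_dataset(a, ["positive", "negative"])
    make_dataset(b, ["negative", "positive"])  # created in the opposite order
    names_a = folder_labels(a)
    names_b = folder_labels(b)

print(names_a)  # ['negative', 'positive']
print(names_b)  # ['negative', 'positive']
```

Both directories yield the same list, so index 0 and index 1 refer to the same tags in each. Note that with folders named "positive" and "negative", alphabetical sorting puts "negative" at index 0.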
I'd still suggest always retrieving the label from target_names of the current dataset to be on the safe side (implementations may change over time).