Trying to load custom data to perform NB Classification in Scikit. Need help in loading the sample data into Scikit and then perform NB. How to load categorical values for target.
Use the same data for Train and Test or use a complete set just for test.
Sl No,Member ID,Member Name,Location,DOB,Gender,Marital Status,Children,Ethnicity,Insurance Plan ID,Annual Income ($),Twitter User ID
1,70000001,Fly Dorami,New York,39786,M,Single,,Asian,2002,0,548900028
2,70000002,Bennie Ariana,Pennsylvania,6/24/1940,F,Single,,Caucasian,2002,66313,
3,70000003,Brad Farley,Pennsylvania,12001,F,Married,4,African American,2002,98444,
4,70000004,Daggoo Cece,Indiana,14032,F,Married,2,Hispanic,2001,41896,113481472.
Step 1: Calculate the prior probability for given class labels. Step 2: Find Likelihood probability with each attribute for each class. Step 3: Put these value in Bayes Formula and calculate posterior probability. Step 4: See which class has a higher probability, given the input belongs to the higher probability class.
To actually implement the naive Bayes classifier model, we're going to use scikit-learn, and we'll import our GaussianNB from sklearn. naive_bayes .
As a beginner, you might only know a single way to load data (normally in CSV) which is to read it using pandas. read_csv function.
The following should get you started you will need pandas and numpy. You can load your .csv into a data frame and use that to input into the model. You all so need to define targets (0 for negatives and 1 for positives, assuming binary classification) depending on what you are trying to separate.
from sklearn.naive_bayes import GaussianNB
import pandas as pd
import numpy as np
# create data frame containing your data, each column can be accessed # by df['column name']
df = pd.read_csv('/your/path/yourFile.csv')
target_names = np.array(['Positives','Negatives'])
# add columns to your data frame
df['is_train'] = np.random.uniform(0, 1, len(df)) <= 0.75
df['Type'] = pd.Factor(targets, target_names)
df['Targets'] = targets
# define training and test sets
train = df[df['is_train']==True]
test = df[df['is_train']==False]
trainTargets = np.array(train['Targets']).astype(int)
testTargets = np.array(test['Targets']).astype(int)
# columns you want to model
features = df.columns[0:7]
# call Gaussian Naive Bayesian class with default parameters
gnb = GaussianNB()
# train model
y_gnb = gnb.fit(train[features], trainTargets).predict(train[features])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With