I have a dataset with 21000 rows (data samples) and 102 columns (features). I would like to generate a larger synthetic dataset based on the current dataset, say with 100000 rows, so I can then use it for machine learning purposes.
I've been referring to the answer by @Prashant on this post https://stats.stackexchange.com/questions/215938/generate-synthetic-data-to-match-sample-data, but I am unable to get it to generate a larger synthetic dataset from my data.
import numpy as np
from random import randrange, choice
from sklearn.neighbors import NearestNeighbors
import pandas as pd
#referring to https://stats.stackexchange.com/questions/215938/generate-synthetic-data-to-match-sample-data
df = pd.read_pickle('df_saved.pkl')
df = df.iloc[:,:-1] # this gives me df, the final DataFrame which I would like to generate a larger dataset from. This is the smaller DataFrame with 21000 x 102 dimensions.
def SMOTE(T, N, k):
    # """
    # Returns (N/100) * n_minority_samples synthetic minority samples.
    #
    # Parameters
    # ----------
    # T : array-like, shape = [n_minority_samples, n_features]
    #     Holds the minority samples
    # N : percentage of new synthetic samples:
    #     n_synthetic_samples = N/100 * n_minority_samples. Can be < 100.
    # k : int. Number of nearest neighbours.
    #
    # Returns
    # -------
    # S : array, shape = [(N/100) * n_minority_samples, n_features]
    # """
    n_minority_samples, n_features = T.shape

    if N < 100:
        # create synthetic samples only for a subset of T.
        # TODO: select random minority samples
        N = 100
        pass

    if (N % 100) != 0:
        raise ValueError("N must be < 100 or multiple of 100")

    N = N/100
    n_synthetic_samples = N * n_minority_samples
    n_synthetic_samples = int(n_synthetic_samples)
    n_features = int(n_features)
    S = np.zeros(shape=(n_synthetic_samples, n_features))

    # Learn nearest neighbours
    neigh = NearestNeighbors(n_neighbors=k)
    neigh.fit(T)

    # Calculate synthetic samples
    for i in range(n_minority_samples):
        nn = neigh.kneighbors(T[i], return_distance=False)
        for n in range(N):
            nn_index = choice(nn[0])
            # NOTE: nn includes T[i], we don't want to select it
            while nn_index == i:
                nn_index = choice(nn[0])
            dif = T[nn_index] - T[i]
            gap = np.random.random()
            S[n + i * N, :] = T[i, :] + gap * dif[:]

    return S
df = df.to_numpy()
new_data = SMOTE(df, 50, 10) # this is where I call the function and expect new_data to be generated with a larger number of samples than the original df.
The traceback of the error I get is below:
Traceback (most recent call last):
File "MyScript.py", line 66, in <module>
new_data = SMOTE(df,50,10)
File "MyScript.py", line 52, in SMOTE
nn = neigh.kneighbors(T[i], return_distance=False)
File "/trinity/clustervision/CentOS/7/apps/anaconda/4.3.31/3.6-VE/lib/python3.5/site-packages/sklearn/neighbors/base.py", line 393, in kneighbors
X = check_array(X, accept_sparse='csr')
File "/trinity/clustervision/CentOS/7/apps/anaconda/4.3.31/3.6-VE/lib/python3.5/site-packages/sklearn/utils/validation.py", line 547, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
I know that this error (Expected 2D array, got 1D array) occurs on the line nn = neigh.kneighbors(T[i], return_distance=False). Specifically, when I call the function, T is the numpy array of shape (21000, 102), i.e. my data converted from a Pandas DataFrame to a numpy array. I know that this question may have some similar duplicates, but none of them answer my question. Any help in this regard would be highly appreciated.
Synthetic data is fake data that mimics real data. There are three major reasons to use it: you can generate as much synthetic data as you need, you can generate data that would be dangerous to collect in reality, and synthetic data comes automatically annotated.
So what T[i] is giving it is an array with shape (102, ).
What the function expects is an array with shape (1, 102).
You can get this by calling reshape on it:
nn = neigh.kneighbors(T[i].reshape(1, -1), return_distance=False)
In case you're not familiar with np.reshape: the 1 says that the first dimension should be of size 1, and the -1 says that the second dimension should be whatever size numpy can infer from the remaining elements; in this case the original 102.
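As a quick standalone illustration (the array here is just a dummy stand-in, not the question's data), the reshape turns the flat row into the 2-D shape that kneighbors expects:

import numpy as np

row = np.zeros(102)             # stand-in for T[i]; shape (102,)
row_2d = row.reshape(1, -1)     # one sample with 102 features
print(row.shape, row_2d.shape)  # (102,) (1, 102)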
The imblearn package may be of use for you: it has a scikit-learn-like API and provides SMOTE and lots of other advanced over-sampling techniques.
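For reference, here is a minimal sketch of how imblearn's SMOTE is typically called. The labels y are a made-up binary example, since imblearn's SMOTE oversamples minority classes and therefore needs class labels; check the imbalanced-learn documentation for the exact method names in your installed version.

import numpy as np
from imblearn.over_sampling import SMOTE

X = np.random.rand(21000, 102)            # stand-in for the feature matrix
y = np.random.randint(0, 2, size=21000)   # hypothetical class labels

sm = SMOTE(k_neighbors=10, random_state=42)
X_res, y_res = sm.fit_resample(X, y)      # older versions use fit_sample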