I have a dataset with 21000 rows (data samples) and 102 columns (features). I would like to generate a larger synthetic dataset based on the current dataset, say with 100000 rows, so I can then use it for machine learning purposes.
I've been referring to the answer by @Prashant on this post https://stats.stackexchange.com/questions/215938/generate-synthetic-data-to-match-sample-data, but I am unable to get it to generate a larger synthetic dataset from my data.
import numpy as np
from random import randrange, choice
from sklearn.neighbors import NearestNeighbors
import pandas as pd
#referring to https://stats.stackexchange.com/questions/215938/generate-synthetic-data-to-match-sample-data
df = pd.read_pickle('df_saved.pkl')
df = df.iloc[:,:-1] # this gives me df, the final DataFrame which I would like to generate a larger dataset from. This is the smaller DataFrame with 21000 x 102 dimensions.
def SMOTE(T, N, k):
    # """
    # Returns (N/100) * n_minority_samples synthetic minority samples.
    #
    # Parameters
    # ----------
    # T : array-like, shape = [n_minority_samples, n_features]
    #     Holds the minority samples
    # N : percentage of new synthetic samples:
    #     n_synthetic_samples = N/100 * n_minority_samples. Can be < 100.
    # k : int. Number of nearest neighbours.
    #
    # Returns
    # -------
    # S : array, shape = [(N/100) * n_minority_samples, n_features]
    # """
    n_minority_samples, n_features = T.shape

    if N < 100:
        # create synthetic samples only for a subset of T.
        # TODO: select random minority samples
        N = 100
        pass

    if (N % 100) != 0:
        raise ValueError("N must be < 100 or multiple of 100")

    N = N/100
    n_synthetic_samples = N * n_minority_samples
    n_synthetic_samples = int(n_synthetic_samples)
    n_features = int(n_features)
    S = np.zeros(shape=(n_synthetic_samples, n_features))

    # Learn nearest neighbours
    neigh = NearestNeighbors(n_neighbors=k)
    neigh.fit(T)

    # Calculate synthetic samples
    for i in range(n_minority_samples):
        nn = neigh.kneighbors(T[i], return_distance=False)
        for n in range(N):
            nn_index = choice(nn[0])
            # NOTE: nn includes T[i], we don't want to select it
            while nn_index == i:
                nn_index = choice(nn[0])
            dif = T[nn_index] - T[i]
            gap = np.random.random()
            S[n + i * N, :] = T[i, :] + gap * dif[:]

    return S
df = df.to_numpy()
new_data = SMOTE(df, 50, 10) # this is where I call the function and expect new_data to be generated with a larger number of samples than the original df.
The traceback of the error I get is below:
Traceback (most recent call last):
File "MyScript.py", line 66, in <module>
new_data = SMOTE(df,50,10)
File "MyScript.py", line 52, in SMOTE
nn = neigh.kneighbors(T[i], return_distance=False)
File "/trinity/clustervision/CentOS/7/apps/anaconda/4.3.31/3.6-VE/lib/python3.5/site-packages/sklearn/neighbors/base.py", line 393, in kneighbors
X = check_array(X, accept_sparse='csr')
File "/trinity/clustervision/CentOS/7/apps/anaconda/4.3.31/3.6-VE/lib/python3.5/site-packages/sklearn/utils/validation.py", line 547, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
I know that this error (Expected 2D array, got 1D array) occurs on the line nn = neigh.kneighbors(T[i], return_distance=False). Specifically, when I call the function, T is the numpy array of shape (21000, 102), i.e. my data converted from a Pandas DataFrame to a numpy array. I know that this question may have some similar duplicates, but none of them answer my question. Any help in this regard would be highly appreciated.
Synthetic data is fake data that mimics real data. There are three major reasons to use it: you can generate as much synthetic data as you need, you can generate data that would be dangerous to collect in reality, and synthetic data comes automatically annotated.
So what T[i] is giving it is an array with shape (102, ).
What the function expects is an array with shape (1, 102).
You can get this by calling reshape on it:
nn = neigh.kneighbors(T[i].reshape(1, -1), return_distance=False)
In case you're not familiar with np.reshape: the 1 says that the first dimension should be of size 1, and the -1 says that the second dimension should be whatever size numpy can infer from the remaining elements; in this case the original 102.
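As a quick standalone illustration (the array here is just a dummy stand-in, not the question's data), the reshape turns the flat row into the 2-D shape that kneighbors expects:

import numpy as np

row = np.zeros(102)             # stand-in for T[i]; shape (102,)
row_2d = row.reshape(1, -1)     # one sample with 102 features
print(row.shape, row_2d.shape)  # (102,) (1, 102)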
The imblearn package may be of use for you: it has a scikit-learn-like API and provides SMOTE and lots of other advanced over-sampling techniques.
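For reference, here is a minimal sketch of how imblearn's SMOTE is typically called. The labels y are a made-up binary example, since imblearn's SMOTE oversamples minority classes and therefore needs class labels; check the imbalanced-learn documentation for the exact method names in your installed version.

import numpy as np
from imblearn.over_sampling import SMOTE

X = np.random.rand(21000, 102)            # stand-in for the feature matrix
y = np.random.randint(0, 2, size=21000)   # hypothetical class labels

sm = SMOTE(k_neighbors=10, random_state=42)
X_res, y_res = sm.fit_resample(X, y)      # older versions use fit_sample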