How does one shuffle only one column of data in pandas?
I have a DataFrame with production data that I want to load onto dev for testing. However, the data contains personally identifiable information, so I want to shuffle those columns.
Columns: FirstName LastName Birthdate SSN OtherData
The original DataFrame is created by read_csv. I want to copy the data into a second DataFrame for SQL loading, but with first name, last name, and SSN shuffled. I would have expected to be able to do this:
if devprod == 'prod':
    # do not shuffle data
    df1['HS_FIRST_NAME'] = df[4]
    df1['HS_LAST_NAME'] = df[6]
    df1['HS_SSN'] = df[8]
else:
    df1['HS_FIRST_NAME'] = np.random.shuffle(df[4])
    df1['HS_LAST_NAME'] = np.random.shuffle(df[6])
    df1['HS_SSN'] = np.random.shuffle(df[8])
However, when I try that I get the following error:
A value is trying to be set on a copy of a slice from a DataFrame
The immediate error is a symptom of an inadvisable approach to working with DataFrames. np.random.shuffle works in-place and returns None, so assigning its return value to a column will not work. In fact, in-place operations are rarely required and often yield no material benefit.
Here, for example, you can use np.random.permutation, which returns a shuffled copy, and pass it NumPy arrays via pd.Series.values rather than Series:
if devprod == 'prod':
    # do not shuffle data
    df1['HS_FIRST_NAME'] = df[4]
    df1['HS_LAST_NAME'] = df[6]
    df1['HS_SSN'] = df[8]
else:
    df1['HS_FIRST_NAME'] = np.random.permutation(df[4].values)
    df1['HS_LAST_NAME'] = np.random.permutation(df[6].values)
    df1['HS_SSN'] = np.random.permutation(df[8].values)
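To see the difference between the two NumPy functions in isolation, here is a minimal sketch. The array contents and variable names are made up for illustration and are not taken from the question's data.

```python
import numpy as np

# Hypothetical sample values, for illustration only
values = np.array(["Ann", "Bob", "Cid", "Dee"])

# permutation returns a new shuffled array and leaves its input untouched
shuffled = np.random.permutation(values)

# shuffle reorders its argument in place and returns None,
# which is why assigning its result to a column stores nothing useful
scratch = values.copy()
result = np.random.shuffle(scratch)

print(result)    # None
print(shuffled)  # same four names, in some random order
```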
This also appears to do the job:
df1['HS_FIRST_NAME'] = df[4].sample(frac=1).values
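The trailing .values matters here: assigning a Series to a column aligns on the index, so when the target shares the source's index, the shuffled Series would be put back in its original order. Below is a rough sketch of the sample approach wrapped in a helper. The function name shuffle_columns, the seed parameter, and the demo data are all hypothetical; the column names follow the list given in the question. A single RandomState is shared across columns so that each column gets its own permutation; reusing the same integer seed per column would shuffle every column identically and keep each person's values paired.

```python
import numpy as np
import pandas as pd

def shuffle_columns(frame, columns, seed=None):
    """Return a copy of frame with each listed column shuffled independently."""
    out = frame.copy()
    # One shared generator: its state advances between calls,
    # so each column draws a different permutation
    rng = np.random.RandomState(seed)
    for col in columns:
        # to_numpy() drops the index so the assignment is positional
        out[col] = out[col].sample(frac=1, random_state=rng).to_numpy()
    return out

# Made-up demo data, not real PII
people = pd.DataFrame({
    "FirstName": ["Ann", "Bob", "Cid", "Dee"],
    "LastName": ["Ash", "Birch", "Cedar", "Elm"],
    "SSN": ["000-00-0001", "000-00-0002", "000-00-0003", "000-00-0004"],
    "OtherData": [10, 20, 30, 40],
})

masked = shuffle_columns(people, ["FirstName", "LastName", "SSN"], seed=0)
print(masked)
```

Note that shuffling only breaks the link between values within a row. Every real name and SSN is still present in the output, so whether this is sufficient protection for a dev environment is a separate judgment call.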