Sample two pandas dataframes the same way

Tags: python, pandas

I'm doing machine learning computations with two dataframes: one for the factors and the other for the target values. I have to split both into training and testing parts, using the same rows in each. I think I've found a way, but I'm looking for a more elegant solution. Here is my code:

import pandas as pd
import numpy as np
import random

df_source = pd.DataFrame(np.random.randn(5,2),index = range(0,10,2), columns=list('AB'))
df_target = pd.DataFrame(np.random.randn(5,2),index = range(0,10,2), columns=list('CD'))

rows = np.asarray(random.sample(range(0, len(df_source)), 2))

df_source_train = df_source.iloc[rows]
df_source_test = df_source[~df_source.index.isin(df_source_train.index)]
df_target_train = df_target.iloc[rows]
df_target_test = df_target[~df_target.index.isin(df_target_train.index)]

print('rows')
print(rows)
print('source')
print(df_source)
print('source train')
print(df_source_train)
print('source_test')
print(df_source_test)

---- edited - solution by unutbu (modified) ----

np.random.seed(2013)
percentile = .6
rows = np.random.binomial(1, percentile, size=len(df_source)).astype(bool)

df_source_train = df_source[rows]
df_source_test = df_source[~rows]
df_target_train = df_target[rows]
df_target_test = df_target[~rows]
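
Note that with a binomial mask the size of the training part is itself random (about 60% of the rows on average). If an exact split size is wanted, a minimal sketch using one shared permutation of row positions could look like the following. The names `rng`, `n_train`, `perm`, `train_pos` and `test_pos` are illustrative and not part of the code above.

```python
import numpy as np
import pandas as pd

# Same toy frames as in the question: both share the index 0, 2, 4, 6, 8.
df_source = pd.DataFrame(np.random.randn(5, 2), index=range(0, 10, 2), columns=list('AB'))
df_target = pd.DataFrame(np.random.randn(5, 2), index=range(0, 10, 2), columns=list('CD'))

rng = np.random.default_rng(2013)            # seeded so the split is repeatable
n_train = int(round(0.6 * len(df_source)))   # exact number of training rows
perm = rng.permutation(len(df_source))       # shuffled row positions 0..n-1

train_pos, test_pos = perm[:n_train], perm[n_train:]

# Positional indexing with the same positions keeps both frames in step.
df_source_train, df_source_test = df_source.iloc[train_pos], df_source.iloc[test_pos]
df_target_train, df_target_test = df_target.iloc[train_pos], df_target.iloc[test_pos]
```

Because the rows are picked by position, this only assumes that the two dataframes are in the same row order; their index labels do not need to be unique.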

469

asked Jun 23 '13 11:06

Viacheslav Nefedov

2 Answers

Below you can find my solution, which doesn't involve any extra variables.

Use the .sample method to draw a sample from the first dataframe
Use the .index attribute of the sample to get its index labels
Select the rows with those labels from the second dataframe

E.g. let's say you have X and y and you want a sample of 10 rows from each. It should be the same rows in both, of course.

X_sample = X.sample(10)
# .loc selects rows by label; plain y[...] would look up columns if y is a DataFrame
y_sample = y.loc[X_sample.index]
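
To make the idea concrete, here is a small self-contained sketch. The data is made up for illustration (100 rows, a column named `target`), and `random_state`, `X_rest` and `y_rest` are additions that are not in the snippet above. It assumes X and y share the same unique index.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 rows sharing the same default integer index.
X = pd.DataFrame(np.random.randn(100, 3), columns=list('abc'))
y = pd.DataFrame({'target': np.arange(100)})

X_sample = X.sample(10, random_state=0)   # random_state makes the draw repeatable
y_sample = y.loc[X_sample.index]          # same row labels, same order, taken from y

# The rows that were not sampled can serve as the other part of the split.
X_rest = X.drop(X_sample.index)
y_rest = y.drop(X_sample.index)
```

Dropping the sampled labels gives the complementary part, so the same two lines cover a train/test split as well as a plain sample.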

187

answered Oct 13 '22 00:10

Alexander Tverdohleb

I like Alexander's answer, but I would add an index reset before sampling. The full code:

# index reset
X.reset_index(inplace=True, drop=True)
y.reset_index(inplace=True, drop=True)
# sampling
X_sample = X.sample(10)
# .loc selects rows by label; plain y[...] would look up columns if y is a DataFrame
y_sample = y.loc[X_sample.index]

Resetting the index avoids problems when matching rows between the two dataframes, for example when the original index contains repeated labels.
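
As an illustration of the kind of mismatch this guards against, here is a sketch with made-up frames whose index contains repeated labels (as happens after `pd.concat` without `ignore_index=True`). The column names `a` and `t` and the variable `mismatched` are hypothetical.

```python
import pandas as pd

# Hypothetical frames with a repeated index 0, 1, 2, 0, 1, 2.
X = pd.concat([pd.DataFrame({'a': [1, 2, 3]}), pd.DataFrame({'a': [4, 5, 6]})])
y = pd.concat([pd.DataFrame({'t': [10, 20, 30]}), pd.DataFrame({'t': [40, 50, 60]})])

X_sample = X.sample(2, random_state=0)
# With repeated labels, a label lookup returns every row that carries the label,
# so y ends up with more rows than were sampled from X.
mismatched = y.loc[X_sample.index]

# After the reset the labels are unique (0..5) and the lookup is one-to-one.
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)
X_sample = X.sample(2, random_state=0)
y_sample = y.loc[X_sample.index]
```

The reset only helps if X and y are already in the same row order, because afterwards rows are matched purely by their position-based labels.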

answered Oct 13 '22 00:10

pplonski
