
Sample two pandas dataframes the same way

Tags: python, pandas

I'm doing some machine learning computations with two dataframes: one for the factors and another for the target values. I have to split both into training and testing parts, using the same rows for each. I've found a way that seems to work, but I'm looking for a more elegant solution. Here is my code:

import pandas as pd
import numpy as np
import random

df_source = pd.DataFrame(np.random.randn(5,2),index = range(0,10,2), columns=list('AB'))
df_target = pd.DataFrame(np.random.randn(5,2),index = range(0,10,2), columns=list('CD'))

rows = np.asarray(random.sample(range(0, len(df_source)), 2))

df_source_train = df_source.iloc[rows]
df_source_test = df_source[~df_source.index.isin(df_source_train.index)]
df_target_train = df_target.iloc[rows]
df_target_test = df_target[~df_target.index.isin(df_target_train.index)]

print('rows')
print(rows)
print('source')
print(df_source)
print('source train')
print(df_source_train)
print('source_test')
print(df_source_test)

---- edited - solution by unutbu (modified) ---

np.random.seed(2013)
percentile = .6
rows = np.random.binomial(1, percentile, size=len(df_source)).astype(bool)

df_source_train = df_source[rows]
df_source_test = df_source[~rows]
df_target_train = df_target[rows]
df_target_test = df_target[~rows]
Viacheslav Nefedov asked Jun 23 '13 11:06




2 Answers

Below you can find my solution, which doesn't involve any extra variables.

  1. Use the .sample method to draw a sample from the first dataframe
  2. Use the .index attribute of that sample to get the labels of the sampled rows
  3. Use .loc with those labels to select the same rows from the second dataframe

For example, say you have X and y and you want a sample of 10 rows from each. The two samples must contain the same rows, of course.

X_sample = X.sample(10)
y_sample = y.loc[X_sample.index]
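
The same three steps can be extended to a full train/test split. This is only a minimal sketch, not part of the original answer: the example frames, the 0.6 fraction, and the random_state value are assumptions chosen for illustration, and it relies on both frames sharing the same unique index.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: both frames share the same non-contiguous index.
X = pd.DataFrame(np.random.randn(5, 2), index=range(0, 10, 2), columns=list("AB"))
y = pd.DataFrame(np.random.randn(5, 2), index=range(0, 10, 2), columns=list("CD"))

# Step 1: sample the training rows from X (random_state makes the draw repeatable).
X_train = X.sample(frac=0.6, random_state=2013)

# Steps 2 and 3: reuse the sampled index labels to pick the same rows from y.
y_train = y.loc[X_train.index]

# Every row that was not sampled goes to the test set.
X_test = X.drop(X_train.index)
y_test = y.drop(X_train.index)
```

Dropping the sampled labels gives the complement without building a separate boolean mask, at the cost of requiring unique index labels.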
Alexander Tverdohleb answered Oct 13 '22 00:10


I like Alexander's answer, but I would add an index reset before sampling. The full code:

# index reset
X.reset_index(inplace=True, drop=True)
y.reset_index(inplace=True, drop=True)
# sampling
X_sample = X.sample(10)
y_sample = y.loc[X_sample.index]

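Resetting the index avoids matching problems: if the original index contains duplicate labels, or X and y are row-aligned but carry different labels, selecting by the sampled labels would not return the same rows.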
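
To illustrate when the reset matters, here is a small sketch with made-up data in which the index contains repeated labels (as can happen after concatenating frames). The frames, column names, and random_state are assumptions for the example, and the reset only helps if the rows of X and y are already in the same order.

```python
import pandas as pd

# Hypothetical frames whose index repeats labels, e.g. after pd.concat.
X = pd.DataFrame({"A": [1, 2, 3, 4]}, index=[0, 1, 0, 1])
y = pd.DataFrame({"C": [10, 20, 30, 40]}, index=[0, 1, 0, 1])

# Without a reset, one sampled label matches two rows of y.
X_sample = X.sample(1, random_state=0)
y_mismatch = y.loc[X_sample.index]  # two rows for one sampled row

# After a reset, every row has a unique label equal to its position.
X_r = X.reset_index(drop=True)
y_r = y.reset_index(drop=True)
X_sample_r = X_r.sample(2, random_state=0)
y_sample_r = y_r.loc[X_sample_r.index]  # exactly the sampled rows
```

Using reset_index without inplace=True, as here, leaves the original frames untouched, which may be preferable if the old index is needed later.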

pplonski answered Oct 13 '22 00:10