
Select rows randomly based on condition pandas python

I have a small test data sample:

import pandas as pd

df = {'ID': ['H900','H901','H902','','M1435','M149','M157','','M699','M920','','M789','M617','M991','H903','M730','M191'],
  'Clone': [0,1,2,2,2,2,2,2,3,3,3,4,4,4,5,5,6],
  'Length': [48,42  ,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48]}

df = pd.DataFrame(df)

It looks like this:

df
Out[4]: 
      Clone   ID  Length
0       0   H900      48
1       1   H901      42
2       2   H902      48
3       2             48
4       2  M1435      48
5       2   M149      48
6       2   M157      48
7       2             48
8       3   M699      48
9       3   M920      48
10      3             48
11      4   M789      48
12      4   M617      48
13      4   M991      48
14      5   H903      48
15      5   M730      48
16      6   M191      48

I want a simple script that picks, for example, 5 rows at random, but only from rows that contain an ID. Rows without an ID should never be selected.

My script:

import pandas as pd
import numpy as np

df = {'ID': ['H900','H901','H902','','M1435','M149','M157','','M699','M920','','M789','M617','M991','H903','M730','M191'],
  'Clone': [0,1,2,2,2,2,2,2,3,3,3,4,4,4,5,5,6],
  'Length': [48,42  ,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48]}

df = pd.DataFrame(df)

rows = np.random.choice(df.index.values, 5)
sampled_df = df.loc[rows]  # .loc replaces .ix, which was removed in pandas 1.0

sampled_df.to_csv('sampled_df.txt', sep = '\t', index=False)

But this script sometimes picks rows that do not contain an ID.

Jessica asked Jun 02 '16 13:06




2 Answers

I think you need to filter out the rows with an empty ID first, using boolean indexing:

import pandas as pd
import numpy as np

df = {'ID': ['H900','H901','H902','','M1435','M149','M157','','M699','M920','','M789','M617','M991','H903','M730','M191'],
  'Clone': [0,1,2,2,2,2,2,2,3,3,3,4,4,4,5,5,6],
  'Length': [48,42  ,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48]}

df = pd.DataFrame(df)
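print(df)
# Keep only the rows whose ID is not an empty string
df = df[df.ID != '']

# replace=False prevents the same row from being drawn more than once
rows = np.random.choice(df.index.values, 5, replace=False)
sampled_df = df.loc[rows]
print(sampled_df)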
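
The same filter can also be combined with DataFrame.sample, which draws without replacement by default. This is a minimal sketch using the sample data from the question. The names has_id and random_state=0 are my own additions: the seed is arbitrary and only there to make the draw repeatable.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['H900', 'H901', 'H902', '', 'M1435', 'M149', 'M157', '', 'M699',
           'M920', '', 'M789', 'M617', 'M991', 'H903', 'M730', 'M191'],
    'Clone': [0, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6],
    'Length': [48, 42, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48],
})

# Boolean mask: True where ID is a non-empty string
has_id = df['ID'] != ''

# sample() draws without replacement by default, so no row is repeated.
# random_state=0 is an arbitrary seed (an assumption, not part of the question).
sampled_df = df[has_id].sample(n=5, random_state=0)
print(sampled_df)
```

Dropping random_state gives a different selection on each run, which is probably what the question wants.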
jezrael answered Oct 07 '22 06:10


It is also possible to filter with query here and then call sample, like so:

df = df.query('(ID != "")').sample(n=5)
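
If the data is read from a file rather than typed in, a missing ID might appear as NaN or as whitespace instead of an empty string. That goes beyond what the question shows, so treat the following as a sketch under that assumption. The small frame and the names ids, valid and n are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: missing IDs appear as '', None and whitespace
df = pd.DataFrame({
    'ID': ['H900', '', None, '  ', 'M149', 'M157', 'M699', 'M920'],
    'Clone': [0, 1, 2, 2, 2, 3, 3, 4],
})

# Normalise: turn NaN/None into '' and strip surrounding whitespace
ids = df['ID'].fillna('').str.strip()

# Keep only the rows that still have a non-empty ID
valid = df[ids != '']

# Never ask for more rows than are available, otherwise sample() raises
n = min(5, len(valid))
sampled_df = valid.sample(n=n)
print(sampled_df)
```

The min() guard matters because sample(n=5) raises a ValueError when fewer than 5 rows survive the filter.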
DataBach answered Oct 07 '22 04:10