I have a small test data sample:
import pandas as pd
df = {'ID': ['H900','H901','H902','','M1435','M149','M157','','M699','M920','','M789','M617','M991','H903','M730','M191'],
'Clone': [0,1,2,2,2,2,2,2,3,3,3,4,4,4,5,5,6],
'Length': [48,42 ,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48]}
df = pd.DataFrame(df)
It looks like:
df
Out[4]:
    Clone     ID  Length
0       0   H900      48
1       1   H901      42
2       2   H902      48
3       2             48
4       2  M1435      48
5       2   M149      48
6       2   M157      48
7       2             48
8       3   M699      48
9       3   M920      48
10      3             48
11      4   M789      48
12      4   M617      48
13      4   M991      48
14      5   H903      48
15      5   M730      48
16      6   M191      48
I want a simple script that randomly picks, for example, 5 rows, but only from the rows that contain an ID; it should not include any row without an ID.
My script:
import pandas as pd
import numpy as np
df = {'ID': ['H900','H901','H902','','M1435','M149','M157','','M699','M920','','M789','M617','M991','H903','M730','M191'],
'Clone': [0,1,2,2,2,2,2,2,3,3,3,4,4,4,5,5,6],
'Length': [48,42 ,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48]}
df = pd.DataFrame(df)
rows = np.random.choice(df.index.values, 5)
sampled_df = df.loc[rows]  # .ix was removed in pandas 1.0; .loc selects by index label
sampled_df.to_csv('sampled_df.txt', sep = '\t', index=False)
But this script sometimes picks rows that do not contain an ID.
The easiest way to randomly select rows from a pandas DataFrame is to use the sample() method. For example, if your DataFrame is called df, df.sample(n=200) returns 200 randomly selected rows. Note that omitting the n parameter returns a single random row instead of multiple rows.
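As a small illustration of sample(), here is a sketch on a made-up ten-row frame. The frame `demo`, its column `value`, and the `random_state` seed are assumptions for the demo only; they are not part of the question's data.

```python
import pandas as pd

# Hypothetical demo frame, not the data from the question
demo = pd.DataFrame({"value": range(10)})

# n rows, drawn without replacement by default
three_rows = demo.sample(n=3, random_state=0)

# with n omitted, a single row is returned
one_row = demo.sample(random_state=0)

# frac selects a fraction of the rows instead of a fixed count
half = demo.sample(frac=0.5, random_state=0)

print(three_rows)
print(one_row)
print(half)
```

Passing random_state only makes the draw repeatable; leave it out to get a different sample on each run.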
Method 1: Select a single column at random. In this approach, the pandas package is imported first and the given CSV file is read with pd.read_csv(). The df.sample() method is then used to randomly select rows or columns.
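A minimal sketch of picking a column at random with sample(). It assumes a small in-memory frame (`demo`, with made-up columns a, b, c) in place of a CSV file, since no file is given here; with real data the frame would come from pd.read_csv().

```python
import pandas as pd

# Hypothetical stand-in for a frame loaded with pd.read_csv(...)
demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# axis=1 makes sample() draw from the columns instead of the rows
random_col = demo.sample(n=1, axis=1, random_state=0)

print(random_col)
```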
I think you need to filter out the empty IDs with boolean indexing:
import pandas as pd
import numpy as np
df = {'ID': ['H900','H901','H902','','M1435','M149','M157','','M699','M920','','M789','M617','M991','H903','M730','M191'],
'Clone': [0,1,2,2,2,2,2,2,3,3,3,4,4,4,5,5,6],
'Length': [48,42 ,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48]}
df = pd.DataFrame(df)
print (df)
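# keep only the rows whose ID is not an empty string
df = df[df.ID != '']
# replace=False so the same row cannot be drawn twice
rows = np.random.choice(df.index.values, 5, replace=False)
sampled_df = df.loc[rows]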
print (sampled_df)
It is also possible to use query in this case and then sample. Like so:
df = df.query('(ID != "")').sample(n=5)
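For completeness, here is the query-then-sample approach as a self-contained sketch on the data from the question. The intermediate name `with_id` and the `random_state` seed are assumptions added for illustration; drop the seed for a fresh draw each run.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['H900', 'H901', 'H902', '', 'M1435', 'M149', 'M157', '', 'M699',
           'M920', '', 'M789', 'M617', 'M991', 'H903', 'M730', 'M191'],
    'Clone': [0, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6],
    'Length': [48, 42, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48],
})

# drop the rows with an empty ID first, so they can never be sampled
with_id = df.query('ID != ""')

# sample() draws without replacement by default, so the 5 rows are distinct
sampled_df = with_id.sample(n=5, random_state=42)

print(sampled_df)
```

Note that this only removes IDs that are exactly the empty string; IDs stored as NaN or as whitespace would need their own filter.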