I'm trying to create a smaller, stratified sample to cut down on processing time.
I'm running this code:
    df_strat = stratified_sample(
        df,
        ["Parental Status", "Gender", "Age", "Geographical Residence",
         "Highest Level of Education", "Industry", "725", "899", "1125",
         "1375", "1625", "1875", "2500", "3000"],
        size=None,
        keep_index=True,
    )
This is the function:
    def stratified_sample(df, strata, size=None, seed=None, keep_index=True):
        population = len(df)
        size = __smpl_size(population, size)
        tmp = df[strata]
        tmp['size'] = 1
        tmp_grpd = tmp.groupby(strata).count().reset_index()
        tmp_grpd['samp_size'] = round(size/population * tmp_grpd['size']).astype(int)

        # controlling variable to create the dataframe or append to it
        first = True
        for i in range(len(tmp_grpd)):
            # query generator for each iteration
            qry = ''
            for s in range(len(strata)):
                stratum = strata[s]
                value = tmp_grpd.iloc[i][stratum]
                n = tmp_grpd.iloc[i]['samp_size']

                if type(value) == str:
                    value = "'" + str(value) + "'"

                if s != len(strata)-1:
                    qry = qry + stratum + ' == ' + str(value) + ' & '
                else:
                    qry = qry + stratum + ' == ' + str(value)

            # final dataframe
            if first:
                stratified_df = df.query(qry).sample(n=n, random_state=seed).reset_index(drop=(not keep_index))
                first = False
            else:
                tmp_df = df.query(qry).sample(n=n, random_state=seed).reset_index(drop=(not keep_index))
                stratified_df = stratified_df.append(tmp_df, ignore_index=True)

        return stratified_df
I'm getting this returned:
    File "<unknown>", line 1
    Parental Status =='False'and Gender =='F'and Age =='20-29'and Geographical Residence =='Adelaide'and Highest Level of Education =='1'and Industry =='A'and 725 ==13 and 899 ==14 and 1125 ==5 and 1375 ==0 and 1625 ==0 and 1875 ==0 and 2500 ==0 and 3000 ==0
    ^
    SyntaxError: Python keyword not valid identifier in numexpr query
Other people who hit this error traced it to special symbols, but my data is clean and every column is either object or int32.
Does anyone know what might be causing this issue?
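It appears the issue can be resolved by removing the spaces from the column headings. The query string is parsed as a Python expression, so a column name that contains a space is not read as a single identifier.
For example, renaming "Parental Status" to "Parental_Status" resolved the problem.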
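To see why the space matters, here is a minimal sketch that uses only the standard library's `ast` module to check whether a query string parses as a Python expression. It is an illustration, not pandas' own code: `parses_as_expression` and `to_identifier` are hypothetical helpers, and prefixing numeric labels such as "725" with an underscore is an assumption about how you might want to rename them (left as-is, `725 == 13` would be read as a comparison of two numbers rather than a column lookup).

```python
import ast
import keyword
import re


def parses_as_expression(expr: str) -> bool:
    """Return True if Python can parse `expr` as a single expression."""
    try:
        ast.parse(expr, mode="eval")
        return True
    except SyntaxError:
        return False


def to_identifier(name: str) -> str:
    """Hypothetical helper: turn a column label into a valid identifier."""
    # Replace every non-word character (spaces, dashes, ...) with "_".
    cleaned = re.sub(r"\W", "_", str(name).strip())
    # Labels such as "725" start with a digit, and keywords such as
    # "class" are reserved, so prefix them to make them usable as names.
    if not cleaned or cleaned[0].isdigit() or keyword.iskeyword(cleaned):
        cleaned = "_" + cleaned
    return cleaned


columns = ["Parental Status", "Gender", "Highest Level of Education", "725"]
renamed = {col: to_identifier(col) for col in columns}

# The original label cannot be parsed; the renamed one can.
print(parses_as_expression("Parental Status == 'False'"))
print(parses_as_expression("Parental_Status == 'False'"))
print(renamed)
```

With pandas, a mapping like `renamed` could be applied with `df.rename(columns=renamed)` before calling `stratified_sample` with the new names. If renaming is not an option, pandas 0.25 and later also accept backtick-quoted column names inside `query`, so the function could wrap each stratum in backticks when it builds the query string.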