
SyntaxError: Python keyword not valid identifier in numexpr query

I'm trying to create a smaller, stratified sample to cut down on processing time.

Running this code:

df_strat= stratified_sample(df, ["Parental Status","Gender", "Age", "Geographical Residence", "Highest Level of Education", "Industry","725", "899","1125", "1375", "1625", "1875", "2500","3000"], size=None, keep_index=True)

This is the function:

def stratified_sample(df, strata, size=None, seed=None, keep_index= True):
    population = len(df)
    size = __smpl_size(population, size)
    tmp = df[strata]
    tmp['size'] = 1
    tmp_grpd = tmp.groupby(strata).count().reset_index()
    tmp_grpd['samp_size'] = round(size/population * tmp_grpd['size']).astype(int)

    # controlling variable to create the dataframe or append to it
    first = True 
    for i in range(len(tmp_grpd)):
        # query generator for each iteration
        qry=''
        for s in range(len(strata)):
            stratum = strata[s]
            value = tmp_grpd.iloc[i][stratum]
            n = tmp_grpd.iloc[i]['samp_size']

            if type(value) == str:
                value = "'" + str(value) + "'"
            
            if s != len(strata)-1:
                qry = qry + stratum + ' == ' + str(value) +' & '
            else:
                qry = qry + stratum + ' == ' + str(value)
        
        # final dataframe
        if first:
            stratified_df = df.query(qry).sample(n=n, random_state=seed).reset_index(drop=(not keep_index))
            first = False
        else:
            tmp_df = df.query(qry).sample(n=n, random_state=seed).reset_index(drop=(not keep_index))
            stratified_df = stratified_df.append(tmp_df, ignore_index=True)
    
    return stratified_df

I'm getting this returned:

File "<unknown>", line 1
    Parental Status =='False'and Gender =='F'and Age =='20-29'and Geographical Residence =='Adelaide'and Highest Level of Education =='1'and Industry =='A'and 725 ==13 and 899 ==14 and 1125 ==5 and 1375 ==0 and 1625 ==0 and 1875 ==0 and 2500 ==0 and 3000 ==0
             ^
SyntaxError: Python keyword not valid identifier in numexpr query

Other people who hit this error had symbols in their data, but my data is clean and every column is either object or int32.

Does anyone know what might be causing this issue?

Beginner_Wallis asked Sep 13 '25 10:09



1 Answer

The issue appears to be resolved by removing the spaces from the column headings.
E.g. renaming "Parental Status" to "Parental_Status" resolved the problem.
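For anyone hitting the same error, here is a minimal sketch of two ways around it. The toy DataFrame and the helper name to_identifier are hypothetical and not part of the original code. The likely reason the rename helps is that df.query() parses its string as a Python expression, so a name with a space such as Parental Status is not read as a single identifier. Purely numeric names such as "725" would likely still be read as integer literals rather than column references, so the sketch also prefixes those. The second option relies on the backtick quoting that recent pandas versions document for DataFrame.query.

```python
import pandas as pd

# Hypothetical toy data; the column names mimic the ones in the question.
df = pd.DataFrame({
    "Parental Status": ["False", "True", "False"],
    "Gender": ["F", "M", "F"],
    "725": [13, 2, 13],
})


def to_identifier(name):
    """Replace spaces with underscores and prefix names that start with a digit."""
    name = str(name).strip().replace(" ", "_")
    return "c_" + name if name[0].isdigit() else name


# Option 1: rename the columns so each one is a valid Python identifier.
renamed = df.rename(columns=to_identifier)
hits_renamed = renamed.query(
    "Parental_Status == 'False' & Gender == 'F' & c_725 == 13"
)

# Option 2: keep the original names and wrap each one in backticks
# inside the query string.
hits_backtick = df.query(
    "`Parental Status` == 'False' & `Gender` == 'F' & `725` == 13"
)

print(hits_renamed)
print(hits_backtick)
```

With option 2, the loop in stratified_sample would build each term as "`" + stratum + "`" instead of the bare stratum, and the DataFrame itself would not need to change.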

Beginner_Wallis answered Sep 16 '25 00:09
