Replicating SAS' first and last functionality with Python

I have recently migrated to Python as my primary tool for analysis, and I would like to replicate the first. and last. functionality found in SAS. The SAS code would be as follows:

data data.out;
   set data.in;
   if first.ID then flag = 1;
   if last.ID then flag = 1;
run;

The output would be as follows:

ID     flag
AAAA   1
AAAA   0
AAAA   0
AAAA   1
BBBB   1
BBBB   0
BBBB   0
BBBB   1
CCCC   1
CCCC   0
CCCC   1

Any ideas about how to do this in Python?

Taylrl asked Sep 22 '17 12:09


4 Answers

If you're using Python and crunching numbers, this type of thing would typically be done using pandas:

pip install pandas

Assuming you have a CSV file, you can load in your data using pd.read_csv. I won't make assumptions about your input, so please take a look at the documentation. Once you load your dataframe, you can proceed.

import pandas as pd

df = pd.read_csv('file.csv')
df

      ID
0   AAAA
1   AAAA
2   AAAA
3   AAAA
4   BBBB
5   BBBB
6   BBBB
7   BBBB
8   CCCC
9   CCCC
10  CCCC

df['flag'] = ((df.ID != df.ID.shift()) | (df.ID != df.ID.shift(-1))).astype(int)
df
      ID  flag
0   AAAA     1
1   AAAA     0
2   AAAA     0
3   AAAA     1
4   BBBB     1
5   BBBB     0
6   BBBB     0
7   BBBB     1
8   CCCC     1
9   CCCC     0
10  CCCC     1
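
To see why the one-liner works, it may help to split it into two boolean columns that play the role of SAS's first.ID and last.ID. The following is only a sketch: the names is_first and is_last are introduced here for illustration, and the frame is built inline rather than read from a CSV file.

```python
import pandas as pd

# Same IDs as in the example above, built inline (assumption: no CSV needed).
df = pd.DataFrame({'ID': ['AAAA'] * 4 + ['BBBB'] * 4 + ['CCCC'] * 3})

# shift() moves every value down one row, so this compares each ID with the
# previous row's ID. Row 0 is compared with NaN and therefore counts as first.
is_first = df['ID'] != df['ID'].shift()

# shift(-1) moves every value up one row, so this compares each ID with the
# next row's ID. The final row is compared with NaN and counts as last.
is_last = df['ID'] != df['ID'].shift(-1)

# A row is flagged when it is first or last within its run of equal IDs.
df['flag'] = (is_first | is_last).astype(int)
print(df)
```

A group consisting of a single row would be both first and last, so it would also receive flag = 1.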

You could also do this using np.where (appreciated suggestion from Brad Solomon):

import numpy as np

df['flag'] = np.where((df.ID != df.ID.shift())
                      | (df.ID != df.ID.shift(-1)), 1, 0)
df
      ID  flag
0   AAAA     1
1   AAAA     0
2   AAAA     0
3   AAAA     1
4   BBBB     1
5   BBBB     0
6   BBBB     0
7   BBBB     1
8   CCCC     1
9   CCCC     0
10  CCCC     1

cs95 answered Nov 14 '22 08:11


Using pandas:

import pandas as pd
import numpy as np
df = pd.DataFrame(['AAAA', 'AAAA', 'AAAA', 'AAAA', 
                   'BBBB', 'BBBB', 'BBBB', 'BBBB', 'CCCC', 'CCCC', 'CCCC',],
                  columns=['ID'])

def firstlast(a):
    # For each group, create a 1-D array of 0s padded with a 1 at each end,
    #     equal in length to the group. Assumes the group has at least two rows.
    a = np.zeros(len(a)-2)
    a = np.pad(a, (1,1), 'constant', constant_values=(1,1))
    return a

s = df.ID
df['flag'] = (s.groupby(s).apply(firstlast).apply(pd.Series).stack()
                  .astype(int).values)

print(df)
      ID  flag
0   AAAA     1
1   AAAA     0
2   AAAA     0
3   AAAA     1
4   BBBB     1
5   BBBB     0
6   BBBB     0
7   BBBB     1
8   CCCC     1
9   CCCC     0
10  CCCC     1

Stealing a bit from @cᴏʟᴅsᴘᴇᴇᴅ on logic (which is much smarter than the above solution) but using numpy.where:

ids = df.ID
df['flag'] = np.where((ids!=ids.shift(1)) | (ids!=ids.shift(-1)), 1, 0)

print(df)
      ID  flag
0   AAAA     1
1   AAAA     0
2   AAAA     0
3   AAAA     1
4   BBBB     1
5   BBBB     0
6   BBBB     0
7   BBBB     1
8   CCCC     1
9   CCCC     0
10  CCCC     1

Brad Solomon answered Nov 14 '22 08:11


I feel like this is naturally a groupby concept and would ideally use a groupby-based approach, although there is certainly nothing wrong with a shift-based approach either (see the brief discussion at the end of this answer):

df.loc[ df.groupby('ID',as_index=False).nth([0,-1]).index, 'flag' ] = 1

nth(0) selects the first row of each group and nth(-1) the last, with nth([0,-1]) selecting both. That leaves flag missing (NaN) on the other rows, which can easily be filled with fillna(0).

df.flag = df.flag.fillna(0).astype(int)

      ID  flag
0   AAAA     1
1   AAAA     0
2   AAAA     0
3   AAAA     1
4   BBBB     1
5   BBBB     0
6   BBBB     0
7   BBBB     1
8   CCCC     1
9   CCCC     0
10  CCCC     1
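
If the intermediate NaN values are unwanted, one possible variation is to collect the index labels of the first and last row of each group and test membership directly. This is a sketch rather than part of the original answer; it uses head(1) and tail(1) instead of nth, assumes the frame has a unique index, and the name edge_index is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['AAAA'] * 4 + ['BBBB'] * 4 + ['CCCC'] * 3})

grouped = df.groupby('ID')

# head(1) / tail(1) return the first / last row of every group and keep
# the original index labels, so the labels can be combined with union().
edge_index = grouped.head(1).index.union(grouped.tail(1).index)

# Assumption: the index is unique, so isin() marks exactly the edge rows.
df['flag'] = df.index.isin(edge_index).astype(int)
print(df)
```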

With respect to the comment by @JonClements, note that using groupby results in an answer invariant to sort order whereas using the shift approach will depend on the sort order (either of which might be preferred depending on the specific situation).
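
The difference is easiest to see on data where equal IDs are not adjacent. The sketch below uses a small made-up frame; the groupby variant is written with cumcount purely for illustration and is not the nth-based code from this answer.

```python
import pandas as pd

# Hypothetical unsorted data: the A and B rows are interleaved.
df = pd.DataFrame({'ID': ['A', 'B', 'A', 'B', 'A']})

# Shift-based: every row differs from its neighbour, so each row is its own run.
ids = df['ID']
df['flag_shift'] = ((ids != ids.shift()) | (ids != ids.shift(-1))).astype(int)

# Groupby-based: cumcount() numbers rows within each ID regardless of position;
# counting from the front finds the first row, counting from the back the last.
grouped = df.groupby('ID')
is_first = grouped.cumcount() == 0
is_last = grouped.cumcount(ascending=False) == 0
df['flag_groupby'] = (is_first | is_last).astype(int)

print(df)
```

Here the shift-based flag is 1 on every row, while the groupby-based flag leaves the middle A row at 0. SAS BY-group processing expects the data to be sorted by the BY variable, so sorting first makes both approaches agree.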

JohnE answered Nov 14 '22 08:11


Sorry, late to the party. This is a variation on the original requirement: how can the SAS first. records be captured with a Python program? The sample below is based on https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sas.html

First, the SAS setup: the sample_dot_last and sample_dot_first datasets are what I need Python to produce.

    data sampledata;
    infile cards4;
    input ( x y ) ( 2*$8. )  z record_number;
    cards;
    A            I            10    1     
    A            I            11    2   
    A            I            11    3     
    A            J            15    4     
    B            K            9     5     
    B            K            10    6     
    B            K            10    7     
    B            L            14    8     
    C            I            7     9     
    C            I            19   10     
    C            K            3    11     
    C            K            5    12     
    ;;;;

    proc print data= sampledata;
    run;

    data sample_dot_last;
     set sampledata;
      by x y z;
      if last.y;
    run;

    proc print data= sample_dot_last;
    run;

    data sample_dot_first;
     set sampledata;
      by x y z;
      if first.y;
    run;

    proc print data= sample_dot_first;
    run;

Second, a sample CSV for Python:

    x,y,z,record number
    A,I,10,1
    A,I,11,2
    A,I,11,3
    A,J,15,4
    B,K,9,5
    B,K,10,6
    B,K,10,7
    B,L,14,8
    C,I,7,9
    C,I,19,10
    C,K,3,11
    C,K,5,12

Finally, the Python program. Note that dataframe.groupby([...]).last() or .first() selects the same records as SAS for this sample!

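    import os

    import pandas as pd

    # Show the working directory so it is clear where sampledata.csv is read from.
    cwd = os.getcwd()
    print("cwd={}".format(cwd))

    df1 = pd.read_csv('sampledata.csv')
    print(df1)

    # Equivalent of "if last.y" with "by x y".
    df2 = df1.groupby(['x', 'y']).last()
    print(df2)

    # Equivalent of "if first.y" with "by x y".
    df3 = df1.groupby(['x', 'y']).first()
    print(df3)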

Sorry, this is a different question and answer; I hope it is useful.

hsiwei_yu answered Nov 14 '22 07:11