I have recently migrated to Python as my primary tool for analysis, and I would like to replicate the first. and last. functionality found in SAS. The SAS code would be as follows:
data data.out;
set data.in;
by ID;
flag = 0;
if first.ID then flag = 1;
if last.ID then flag = 1;
run;
The output would be as follows:
ID flag
AAAA 1
AAAA 0
AAAA 0
AAAA 1
BBBB 1
BBBB 0
BBBB 0
BBBB 1
CCCC 1
CCCC 0
CCCC 1
Any ideas about how to do this in Python?
If you're using Python and crunching numbers, this type of thing would typically be done using pandas:
pip install pandas
Assuming you have a CSV file, you can load in your data using pd.read_csv. I won't make assumptions about your input, so please take a look at the documentation. Once you load your dataframe, you can proceed.
import pandas as pd
df = pd.read_csv('file.csv')
df
ID
0 AAAA
1 AAAA
2 AAAA
3 AAAA
4 BBBB
5 BBBB
6 BBBB
7 BBBB
8 CCCC
9 CCCC
10 CCCC
df['flag'] = ((df.ID != df.ID.shift()) | (df.ID != df.ID.shift(-1))).astype(int)
df
ID flag
0 AAAA 1
1 AAAA 0
2 AAAA 0
3 AAAA 1
4 BBBB 1
5 BBBB 0
6 BBBB 0
7 BBBB 1
8 CCCC 1
9 CCCC 0
10 CCCC 1
You could also do this using np.where (appreciated suggestion from Brad Solomon):
import numpy as np
df['flag'] = np.where((df.ID != df.ID.shift())
                      | (df.ID != df.ID.shift(-1)), 1, 0)
df
ID flag
0 AAAA 1
1 AAAA 0
2 AAAA 0
3 AAAA 1
4 BBBB 1
5 BBBB 0
6 BBBB 0
7 BBBB 1
8 CCCC 1
9 CCCC 0
10 CCCC 1
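To make the correspondence with SAS's separate first.ID and last.ID markers more explicit, here is a minimal sketch that builds each marker on its own before combining them. The column names first_id and last_id and the inline sample frame are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's ID column.
df = pd.DataFrame({"ID": ["AAAA"] * 4 + ["BBBB"] * 4 + ["CCCC"] * 3})

# A row starts a run when its ID differs from the previous row's ID,
# and ends a run when its ID differs from the next row's ID.
df["first_id"] = (df["ID"] != df["ID"].shift()).astype(int)
df["last_id"] = (df["ID"] != df["ID"].shift(-1)).astype(int)

# The combined flag is 1 when either marker is set.
df["flag"] = (df["first_id"] | df["last_id"]).astype(int)
print(df)
```

Keeping the two markers as separate columns can be handy when only the first or only the last row of each run is needed later.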
Using pandas:
import pandas as pd
import numpy as np
df = pd.DataFrame(['AAAA', 'AAAA', 'AAAA', 'AAAA',
'BBBB', 'BBBB', 'BBBB', 'BBBB', 'CCCC', 'CCCC', 'CCCC',],
columns=['ID'])
def firstlast(a):
    # For each group, create a 1-D array of 0s padded with a 1 at each
    # end, equal in length to the group (assumes at least two rows per group).
    a = np.zeros(len(a) - 2)
    a = np.pad(a, (1, 1), 'constant', constant_values=(1, 1))
    return a

s = df['ID']  # group the ID column by its own values
df['flag'] = (s.groupby(s).apply(firstlast).apply(pd.Series).stack()
              .astype(int).values)
print(df)
ID flag
0 AAAA 1
1 AAAA 0
2 AAAA 0
3 AAAA 1
4 BBBB 1
5 BBBB 0
6 BBBB 0
7 BBBB 1
8 CCCC 1
9 CCCC 0
10 CCCC 1
Stealing a bit from @cᴏʟᴅsᴘᴇᴇᴅ on logic (which is much smarter than the above solution) but using numpy.where:
ids = df.ID
df['flag'] = np.where((ids!=ids.shift(1)) | (ids!=ids.shift(-1)), 1, 0)
print(df)
ID flag
0 AAAA 1
1 AAAA 0
2 AAAA 0
3 AAAA 1
4 BBBB 1
5 BBBB 0
6 BBBB 0
7 BBBB 1
8 CCCC 1
9 CCCC 0
10 CCCC 1
I feel like this is naturally a groupby concept and ideally would use a groupby-based approach although there is certainly nothing wrong with a shift-based approach either (see the brief discussion of this below for more):
df.loc[ df.groupby('ID',as_index=False).nth([0,-1]).index, 'flag' ] = 1
nth(0) selects the first row of each group and nth(-1) the last, with nth([0,-1]) selecting both. That leaves the other rows missing, which can easily be filled with fillna(0).
df.flag = df.flag.fillna(0).astype(int)
ID flag
0 AAAA 1
1 AAAA 0
2 AAAA 0
3 AAAA 1
4 BBBB 1
5 BBBB 0
6 BBBB 0
7 BBBB 1
8 CCCC 1
9 CCCC 0
10 CCCC 1
With respect to the comment by @JonClements, note that using groupby results in an answer invariant to sort order whereas using the shift approach will depend on the sort order (either of which might be preferred depending on the specific situation).
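The difference in sort-order behavior can be seen in a small sketch on deliberately unsorted data. The sample frame and the column names flag_shift and flag_groupby are assumptions for illustration, and the groupby variant here uses cumcount rather than nth, as one possible way to express the same idea.

```python
import pandas as pd

# Hypothetical unsorted data: the AAAA rows are not contiguous.
df = pd.DataFrame({"ID": ["AAAA", "BBBB", "AAAA", "BBBB", "AAAA"]})

# Shift-based flag: marks every row whose neighbour has a different ID,
# so each contiguous run is treated as its own group.
ids = df["ID"]
df["flag_shift"] = ((ids != ids.shift()) | (ids != ids.shift(-1))).astype(int)

# Groupby-based flag: position within each ID group, counted from both ends,
# regardless of where the rows sit in the frame.
grouped = df.groupby("ID")
is_first = grouped.cumcount() == 0
is_last = grouped.cumcount(ascending=False) == 0
df["flag_groupby"] = (is_first | is_last).astype(int)
print(df)
```

Here the shift-based column flags every row, while the groupby-based column leaves the middle AAAA row (index 2) at 0.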
Sorry, late to the party. This is a variation on the original requirement: how to capture SAS first. and last. records with a Python program. The sample below is based on https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sas.html
First, the SAS setup: the sample_dot_last and sample_dot_first datasets are what I need Python to produce.
data sampledata;
infile cards4;
input ( x y ) ( 2*$8. ) z record_number;
cards;
A I 10 1
A I 11 2
A I 11 3
A J 15 4
B K 9 5
B K 10 6
B K 10 7
B L 14 8
C I 7 9
C I 19 10
C K 3 11
C K 5 12
;;;;
proc print data= sampledata;
run;
data sample_dot_last;
set sampledata;
by x y z;
if last.y;
run;
proc print data= sample_dot_last;
run;
data sample_dot_first;
set sampledata;
by x y z;
if first.y;
run;
proc print data= sample_dot_first;
run;
Second, the sample CSV for Python (saved as sampledata.csv):
x,y,z,record number
A,I,10,1
A,I,11,2
A,I,11,3
A,J,15,4
B,K,9,5
B,K,10,6
B,K,10,7
B,L,14,8
C,I,7,9
C,I,19,10
C,K,3,11
C,K,5,12
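Finally, the Python program. Note that dataframe.groupby( [ ... ] ).last() or .first() selects the same records as SAS for this sample, with x and y becoming the index of the result. Both methods skip missing values column by column by default, so the match with SAS holds exactly only when the data has no missing values.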
import numpy as np
import pandas as pd
import os
cwd= os.getcwd()
print( "cwd={}".format( cwd ))
df1= pd.read_csv( 'sampledata.csv')
print( df1 )
df2= df1.groupby( [ 'x', 'y' ]).last()
print( df2 )
df3= df1.groupby( [ 'x', 'y' ]).first()
print( df3 )
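As a hedged variation on the program above: groupby(...).head(1) and groupby(...).tail(1) return whole rows rather than per-column values, which may be closer to what `if first.y;` and `if last.y;` do when some values are missing. The sketch inlines the sample CSV with io.StringIO so that it runs without a file; the names csv_text, first_rows, and last_rows are assumptions for illustration.

```python
import io
import pandas as pd

# Sample CSV from the answer above, inlined so the sketch runs without a file.
csv_text = """x,y,z,record number
A,I,10,1
A,I,11,2
A,I,11,3
A,J,15,4
B,K,9,5
B,K,10,6
B,K,10,7
B,L,14,8
C,I,7,9
C,I,19,10
C,K,3,11
C,K,5,12
"""
df1 = pd.read_csv(io.StringIO(csv_text))

# First and last row of each (x, y) group; x and y stay ordinary columns
# and the original row order and index are preserved.
first_rows = df1.groupby(["x", "y"]).head(1)
last_rows = df1.groupby(["x", "y"]).tail(1)
print(first_rows)
print(last_rows)
```

For this sample, first_rows holds record numbers 1, 4, 5, 8, 9, 11 and last_rows holds 3, 4, 7, 8, 10, 12.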
Sorry, this is a different question and answer; I hope it is useful.