SAS PRX to extract substring please

Question

I am trying to use the SAS PRX function to extract a substring from my dataset. But it only returns the exact matches, whereas I need it to be more flexible and extract those that match a variety of conditions.

I have copied my data below. As you can see, one of the variables in my data is "brandmodel" which contains both the brand name and the model# of a particular camera. I need to have a separate column just for the model#s. So I am using the PRX function to extract them as they usually follow one of the following patterns:

For example: JX100 or JX10 or JX1 (i.e., 1-2 alphabets, followed immediately by 1-3 digits. This my program (copied below the data) can handle. But where I run into problems is: how to extract those model#'s where the alphabets are separated from the digits by a space or a hyphen, and how do I extract those into the same column "Model" as those with them altogether? Also, some of the observations do not have model#s, how can I get them to be set to missing instead of being dropped altogether?

Brandmodel|Price

iTwist F124 Digital Camera -red|49.00
Vivitar IF045 Digital Camera -Blue|72.83
Liquid Image Underwater Camera Mask|128.00
Impact Series Video Camera MX Gogglesâ„¢|188.00
Olympus VR 340  Silver|148.00
Olympus TG820 Digital Camera Black|278.00
Olympus VR 340 16MP 10x 3.0 LCD Red|148.00
Vivitar VX137-Pur Digital Camera|39.00

Olympus SZ-12 Digital Camera -Black|198.00
Olympus VG160 Digital Camera Red|98.00
Olympus VR340   Purple|148.00
Olympus TG820 Digital Camera Silver|298.00
Olympus TG820 Digital Camera Blue|278.00
Olympus VG160 Digital Camera    Orange|98.00
Olympus TG820 Digital Camera Red|298.00
Fujifilm FinePix AX500 Red|78.63
Canon A2300 Silver|98.63
Canon A810 Red|75.00
Nikon Coolpix S2600 Digital Camera - Red|88.00
Nikon Coolpix L25 Digital Camera - Silver|82.00
Casio Exilim ZS10BK|128.00

Olympus TG-310 14 MP blue Digital Camera|148.00
Hipstreet Kidz Digital Camera - Blue|14.93
Casio Exilim ZS10PK|128.00
Olympus TG-310 14 MP Digital Camera orange|148.00

SAS Program

data walnov21p2; 
 length brandmodel $ 80;
 infile "G:\File2\data\store_nov21\storenv21p2.csv" firstobs=2 dlm="|" dsd;
 input brandmodel price;
 re= prxparse('/[[:alpha:]]{1,3} \d{1,4}/');
 if prxmatch(re, brandmodel) then
 do;
   model=prxposn(re, 0, brandmodel);
   output;
 end;
run;

itzy · Accepted Answer

For your very last question (set variable to missing rather than dropping observation, remove the output statement from the conditional do at the end. Just change it to:

if prxmatch(re, brandmodel) then model=prxposn(re, 0, brandmodel);

This will cause all observations to be output, regardless of whether model is defined.

For the rest of your question, it is really about pattern matching with Perl regular expressions, and is not specific to SAS. It's also a tricky because some models have spaces in them. Try posting a different question asking about the Perl regular expression (with those tags) that would match what you want.

Also, post some examples of what you want the output to be. For example, what do you expect for input like this:

Olympus VR 340 16MP 10x 3.0 LCD Red|148.00 
Vivitar VX137-Pur Digital Camera|39.00

Joe · Answer

One approach you could use is to build a dictionary of Makes (Nikon, Olympus, etc.) and use that to remove the make from the string; then you have Olympus VR 340 16MP 10x 3.0 LCD Red -> VR 340 16MP... which is easier to parse.

Ultimately, though, this may be something not trivially solved with a regex, since the regex can't easily parse all possible situations (is "Nikon Coolpix" a make, or is "Coolpix S2600" the model?). The best solution for this sort of thing is to make a 'most of the time' regex, something like

/[a-zA-Z]{1,3}[ -]?[\d]{1,4}/

And then code the non-matches to go to a separate dataset, where you can look at them by hand and code them over (or add to your regex, if there is something fairly common).

To do that, you do something like this:

data walnov21p2 nomatch;
length brandmodel $ 80;
infile "G:\File2\data\store_nov21\storenv21p2.csv" firstobs=2 dlm="|" dsd;
input brandmodel price;
re= prxparse('/[[:alpha:]]{1,3} \d{1,4}/');
if prxmatch(re, brandmodel) then
do;
        model=prxposn(re, 0, brandmodel);
        output walnov21p2 ;
end;
else output nomatch;
run;

Then you can process nomatch separately and re-combine them later.

SAS PRX to extract substring please

Tags:

regex

sas

user1901322

2 Answers

itzy

Joe

Recent Activity

Donate For Us

SAS PRX to extract substring please

Tags:

regex

sas

user1901322

2 Answers

itzy

Joe

Related questions

Recent Activity

Donate For Us