A Blog on Analytics and Marketing

SAS, Marketing, Predictive Modeling, Statistics

Posts Tagged ‘PROC SQL’

SAS Macro – Get missing counts for all character variables

Posted by phillippeng on July 23, 2009

%macro charvar_nmiss(base,dsn, outds);
*Counting number of character variables in the dataset;
proc contents data=&base..&dsn out=dsn_vars(keep=name type)  noprint; run;

data dsn_vars; set dsn_vars; where type = 2; seqno = _n_; run;

proc sql;
select name into: varlist
separated by ‘ ‘
from dsn_vars;
quit;

proc sql noprint;
select count(*)into :nvars from dsn_vars;
quit;

*Get the missing counts for each character variable in the dataset;
%do i = 1 %to &nvars;
%LET varn=%SCAN(&varlist,&i,%STR( ));
proc sql noprint;
create table varnmiss&i as
select distinct &varn., count(*) as nmiss
from &base..&dsn.
where &varn. = ‘ ‘;
quit;
%end;

*Concatenating all the datasets;
data final(keep=id nmiss);
id=_n_;
set %do i= 1 %to &nvars;
varnmiss&i
%end;
;
run;

*Clean the space;
proc datasets nolist; %do i = 1 %to &nvars; delete varnmiss&i; %end; run;

*Final dataset with list of variables and missing counts;
proc sql;
create table &outds as
select a.name ,b.nmiss
from dsn_vars as a, final as b
where a.seqno=b.id;
quit;

%mend;

Posted in SAS | Tagged: , , | Leave a Comment »

Use PROC SQL to merge large datasets – Nodupkey

Posted by phillippeng on September 29, 2008

Generally, we have two approaches to merge two datasets: merge statement in a data step and proc sql. Recently, I realized the advantages of using proc sql to merge large datasets.

Merge statement in a data step approach is simple, but it requires you to sort the data first. For a large data set, it may be very time consuming. For example, I have a data set with 33mm records. It took me almost one hour to only sort the data on the server. Using PROC SQL, it only took 25 minutes to finish the whole merge process.

In this post, I am going to summarize the PROC SQL way to achieve the nodupkey option in data step.

proc sql;
create table comb as
select distinct a.*, b.id
from one a right join two b
on a.id= b.id
quit;

If you try above code, you won’t get distinct match from dataset a as you use the nodupkey in the data step. The reason is that distinct option in proc sql is not the same as nodupkey but as nodup. In order to get the same results as nodupkey, we need to use an unknown function “monotonic()”. Here’s how:

proc sql;
create table comb as
select a.*,  b.response, b.id, monotonic() as rowid
from one a right join two b
on a.id = b.id
group by b.id
having rowid=min(rowid);
quit;

What the above code does is to find all the matches from table a for each id in table b and then group them by id. Using function monotonic(), we create a field rowid and then take only the first record (“having rowid = min(rowid)”). This works great when you merge large datasets.

Posted in SAS, SQL | Tagged: , , | 1 Comment »