A Blog on Analytics and Marketing

SAS, Marketing, Predictive Modeling, Statistics

Archive for the ‘SAS’ Category

SAS tips and questions

Text Editor to Recommend: Notepad++

Posted by phillippeng on September 3, 2009

In my search for a free but powerful text editor, Notepad++ came out on top of the list. Notepad++ is widely used by advanced programmers.

Download Page:  http://download.cnet.com/Notepad/3000-2352_4-10327521.html

Great resource Link:  http://notepad-plus.sourceforge.net/uk/download.php

Notepad++ has built-in formatting modules for many programming languages; however, SAS is not one of them. You can add SAS as a user-defined language by following the steps below:

1. Download the user-defined language file to your computer.
2. Click Start, then Run, type (or paste in) %APPDATA%\Notepad++, and click OK.
3. Rename userDefineLang_SAS.xml to userDefineLang.xml.
4. Copy the userDefineLang.xml file into that directory. If you add more than one language, simply copy everything from each file, from <UserLang…> through </UserLang>, and paste it at the end of userDefineLang.xml, right before </NotepadPlus>.
5. Save the newly updated userDefineLang.xml.

The SAS language formatting file was originally created by Joakim and can be downloaded from SAS. I have made some changes to this file; if you are interested in it, please email me (ppeng08@gmail.com).

I’d like to call out some great features about Notepad++:

1. Syntax highlighting for your coding language.
2. Column selection and editing.
Method: click the starting position, then Alt+Shift+click the ending position.
You can then edit by column. It works great.
(The Column Editor, Alt+C on the menu, does not work the way I want.)
3. The ability to open a file that is still being written. For example, you may want to check the SAS log while your code is still running; a regular text editor simply cannot open such a file. Notepad++ reads whatever is currently in the file, and once more information is added to it, a window pops up asking whether you want to reload/refresh the file.
4. Advanced search and replace. When you search for a word, every occurrence of that word in the file is highlighted, which helps you check spelling and keyword references throughout the file.


You will find more great features as you begin to use it.


Posted in Computer, SAS, SQL | Leave a Comment »

SAS: Verify function to check invalid data

Posted by phillippeng on July 29, 2009

Syntax for the VERIFY function:
verify(character_variable, verify_string)

The VERIFY function returns the first position in character_variable that contains a character that is not in verify_string. If no invalid character exists, the returned value is 0.

Initial solution:  if verify(ID, 'XYZ012345') then …;
Suppose ID = 'X12B44'; the returned value is then 4.

Potential problems:
1. Varying length of the variable: there may be trailing blanks, which are counted as characters and are not in the verify_string.
verify(trim(ID), 'XYZ012345')
will return 0 for all valid values of ID, even when they are shorter than the specified length.
2. Missing values: the expression above will return 1 for a missing value of ID, because missing character values are treated the same as blanks.
Solution: use the MISSING function.

Final solution: if verify(trim(ID), 'XYZ012345') and not missing(ID) then …;

In SAS 9 or later, you can use the NOTDIGIT function:
notdigit(character_value) is equivalent to
verify(character_value, '0123456789')
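To see the pieces working together, here is a minimal sketch (the ID values in the datalines are made up for illustration):

```sas
data check;
   input ID $ 1-6;
   /* Flag IDs containing any character outside XYZ012345.          */
   /* TRIM removes trailing blanks, and MISSING excludes blank IDs, */
   /* which would otherwise be flagged because a blank is not in    */
   /* the verify_string.                                            */
   invalid = (verify(trim(ID), 'XYZ012345') > 0 and not missing(ID));
   datalines;
X12B44
XYZ123
;
run;
```

Here X12B44 should be flagged (position 4 holds 'B'), while XYZ123 passes.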

Posted in SAS | Leave a Comment »

SAS Macro – Get missing counts for all character variables

Posted by phillippeng on July 23, 2009

%macro charvar_nmiss(base,dsn, outds);
*Counting number of character variables in the dataset;
proc contents data=&base..&dsn out=dsn_vars(keep=name type)  noprint; run;

data dsn_vars; set dsn_vars; where type = 2; seqno = _n_; run;

proc sql noprint;
select name into :varlist
separated by ' '
from dsn_vars;
quit;

proc sql noprint;
select count(*) into :nvars from dsn_vars;
quit;

*Get the missing counts for each character variable in the dataset;
%do i = 1 %to &nvars;
%LET varn=%SCAN(&varlist,&i,%STR( ));
proc sql noprint;
create table varnmiss&i as
select distinct &varn., count(*) as nmiss
from &base..&dsn.
where &varn. = ' ';
quit;
%end;

*Concatenating all the datasets;
data final(keep=id nmiss);
id=_n_;
set %do i= 1 %to &nvars;
varnmiss&i
%end;
;
run;

*Clean the space;
proc datasets nolist; %do i = 1 %to &nvars; delete varnmiss&i; %end; run;

*Final dataset with list of variables and missing counts;
proc sql;
create table &outds as
select a.name ,b.nmiss
from dsn_vars as a, final as b
where a.seqno=b.id;
quit;

%mend;
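A hypothetical invocation (the library, dataset, and output names below are placeholders, not from the original post):

```sas
/* Count missing values for every character variable in work.mydata,
   writing one row per variable (name, nmiss) to work.miss_ct */
%charvar_nmiss(work, mydata, miss_ct);
proc print data=miss_ct; run;
```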

Posted in SAS | Tagged: , , | Leave a Comment »

SAS macro to Load a SAS dataset to SQL Server

Posted by phillippeng on July 22, 2009

%macro dbload(databasename, userid, password, sasdn, sqltable);
proc dbload dbms=odbc data=&sasdn.;
dsn=&databasename;
uid=&userid;
pwd=&password;
table="&sqltable.";
limit=0;
load;
run;
%mend dbload;
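A hypothetical call might look like this (the DSN, credentials, dataset, and table name are placeholders):

```sas
/* Load the SAS dataset work.campaign into the SQL Server table
   Campaign2009 through the ODBC data source "mydsn" */
%dbload(mydsn, myuser, mypass, work.campaign, Campaign2009);
```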

You can define a libname using ODBC and then load the data through a DATA step. However, PROC DBLOAD has advantages over the DATA step approach:

1. You can redefine the column data types for the loaded table.
2. You can use PROC DBLOAD with APPEND to append only new records to a table.

Posted in SAS, SQL | Leave a Comment »

SAS – Create a file list under the same folder

Posted by phillippeng on July 22, 2009

%let mydir=c:\working directory;
/* DOS command to get a list of all data files/programs */
filename mylist pipe "%str(dir %"&mydir%"\pre_* /B)" lrecl=5000;
/* The 'pre_' prefix lists only files whose names start with 'pre_'; to list all files, just use '*' */

/*put the list of files into a data set for macro processing*/
data mylist;
infile mylist truncover;
input filename $char50.;
put _infile_;
run;

proc sql noprint;
select substr(filename,1,10) into :filelist
separated by ' '
from mylist;
quit;

%put &filelist;
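Once &filelist is built, you can walk through the names with %SCAN. The sketch below just prints each name; replace the %put with whatever processing each file needs (the macro name is mine, not from the post):

```sas
%macro loop_files;
   %local i fname;
   %let i = 1;
   %let fname = %scan(&filelist, &i, %str( ));
   /* Loop until %scan runs past the last name and returns nothing */
   %do %while(%length(&fname) > 0);
      %put Processing file: &fname;
      %let i = %eval(&i + 1);
      %let fname = %scan(&filelist, &i, %str( ));
   %end;
%mend loop_files;
%loop_files
```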

Posted in SAS | Tagged: | Leave a Comment »

SAS: Merge a very large dataset and a small dataset efficiently

Posted by phillippeng on January 24, 2009

I need to append zip penetration data to a campaign file. This is a very typical data-appending job: the campaign file is very large, with over 24mm records, while the zip penetration file has only 13,000 records. PROC SQL can do the job, but it takes more than an hour to finish (running on a fast server). Using a hash table is a more efficient way to achieve the goal.

Macro Code:
%macro appnd_zip(dsin, dsout);
data home.&dsout.;
if 0 then set home.penetration_zip5(keep=zip5 zip_penetration);
declare hash zip(dataset: "home.penetration_zip5");
zip.definekey("zip5");
zip.definedata("zip_penetration");
zip.definedone();

do until(eof);
set home.&dsin.(rename=(ms403=zip5)) end=eof;
if zip.find(key:zip5) = 0 then output;
else do;
call missing(zip_penetration);
output;
end;
end;
stop;
run;

%mend;

In this macro, I first define a hash table from the small dataset (penetration at the zip5 level), then look up each record of the large dataset by the match key zip5. When there is a match, the penetration value is appended; when no match is found, a missing value is assigned. This is equivalent to a left join. If you want an inner join instead, output only when a match is found. The processing time is reduced to about 15 minutes.

Posted in SAS, Uncategorized | Leave a Comment »

SAS Tips: Reading in a special text file

Posted by phillippeng on November 19, 2008

Problem: read a somewhat special text file into SAS. I say it is special because it is at least not covered in the popular book "The Little SAS Book": each field in the input text file is enclosed in double quotation marks, and fields are separated by commas.

Challenge: get rid of the double quotation marks around each field

Solution:

filename file1 "..\file1.txt";
libname home "&path.\Data";

data home.data1;
infile file1 dsd dlm='09'x missover lrecl=225 firstobs=2;
format     var1    $90.
var2    $40.
var3    $25.
var4    $2.
var5    $5.
var6    $4.
var7    $4.
informat     var1    $90.
var2    $40.
var3    $25.
var4    $2.
var5    $5.
var6    $4.
var7    $4.
;
input @;
_infile_ = tranwrd(substr(_infile_, 2, length(_infile_) - 2),
                   '","',
                   '09'x);
input       var1      $
var2      $
var3      $
var4      $
var5      $
var6      $
var7      $
;
run;

Key point: use "_infile_" and the TRANWRD function.
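As a side note, the DSD option itself strips double quotation marks from quoted fields when the delimiter is a comma, so in many cases a simpler version may work without the _infile_ manipulation. A sketch, assuming the same seven-variable layout as above:

```sas
data home.data1_alt;
   /* DSD treats commas inside quotes as text and strips the
      surrounding double quotes from each field automatically */
   infile file1 dsd dlm=',' missover lrecl=225 firstobs=2;
   informat var1 $90. var2 $40. var3 $25. var4 $2.
            var5 $5. var6 $4. var7 $4.;
   input var1 var2 var3 var4 var5 var6 var7;
run;
```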

Posted in SAS | Tagged: , , , , | Leave a Comment »

Scoring observations using PROC FASTCLUS

Posted by phillippeng on October 6, 2008

PROC FASTCLUS can be used to perform k-means clustering of observations. All observations in the training dataset are assigned to clusters based on the parameterization of the procedure and on their variable values. Scoring the observations in a validation dataset with PROC FASTCLUS seems a bit challenging, because rerunning the procedure would let the new observations change the cluster assignment rules.

Scoring new observations without changing the cluster assignment rules can be achieved by using a SEED dataset in PROC FASTCLUS.

/*original clustering */

%let indsn = input;  *your input dataset;
%let nclus = maxclus; *number of clusters to request;
%let indvars = varlist; *independent variables to run proc fastclus on;
%let valid = val_data; *validation dataset to score;

proc fastclus data=&indsn maxclusters = &nclus outseed= clusterSeeds;
var &indvars;
run;

/*scoring new observations using the seed dataset */
proc fastclus data=&valid  out=&valid._scored seed = clusterSeeds maxclusters = &nclus maxiter = 0;
var &indvars;
run;

Reference:
"Data Preparation for Analytics Using SAS" by Gerhard Svolba, Ph.D.

Posted in Modeling, SAS | Tagged: , , , , | Leave a Comment »

PROC GAM – Detect the Seasonality

Posted by phillippeng on October 2, 2008

Fluctuations embedded in a data series over time may come from random noise or from latent seasonality. Very often we need to test whether there is a true seasonal trend and, if so, what the seasonal pattern looks like. The example at the following link provides a good illustration of how to examine seasonality using SAS PROC GAM.

proc gam
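For flavor, here is a minimal, hypothetical sketch of the idea (the dataset and variable names are my own assumptions, not from the linked example): fit a smoothing spline of the response on a time index, then inspect the fitted spline component for a recurring pattern.

```sas
/* Hypothetical example: look for a seasonal pattern in monthly
   sales by fitting a smoothing spline over the month index */
proc gam data=monthly_sales;
   model sales = spline(month_index) / dist=gaussian;
run;
```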

Posted in Modeling, SAS | Tagged: , , | Leave a Comment »

Use PROC SQL to merge large datasets – Nodupkey

Posted by phillippeng on September 29, 2008

Generally, we have two approaches to merging two datasets: the MERGE statement in a DATA step and PROC SQL. Recently, I came to appreciate the advantages of using PROC SQL to merge large datasets.

The MERGE statement approach is simple, but it requires you to sort the data first, which can be very time consuming for a large dataset. For example, I have a dataset with 33mm records; sorting it alone took almost an hour on the server. Using PROC SQL, the whole merge finished in only 25 minutes.

In this post, I am going to summarize the PROC SQL way to achieve the effect of the NODUPKEY option used in the DATA step approach.

proc sql;
create table comb as
select distinct a.*, b.id
from one a right join two b
on a.id = b.id;
quit;

If you try the code above, you won't get one match per key from dataset a the way NODUPKEY would give you. The reason is that DISTINCT in PROC SQL is not equivalent to NODUPKEY but to NODUP. To get the same results as NODUPKEY, we can use the undocumented function MONOTONIC(). Here's how:

proc sql;
create table comb as
select a.*,  b.response, b.id, monotonic() as rowid
from one a right join two b
on a.id = b.id
group by b.id
having rowid=min(rowid);
quit;

What the above code does is find all matches from table a for each id in table b and group them by id. Using MONOTONIC(), we create a rowid field and then keep only the first record per group ("having rowid = min(rowid)"). This works great when you merge large datasets. Keep in mind, though, that MONOTONIC() is undocumented and unsupported by SAS, so use it with care.

Posted in SAS, SQL | Tagged: , , | 1 Comment »