A Blog on Analytics and Marketing

SAS, Marketing, Predictive Modeling, Statistics

SAS: Merge a very large dataset and a small dataset efficiently

Posted by phillippeng on January 24, 2009

I need to append zip penetration data to a campaign file. This is a very typical data appending job, in which the campaign file is a very large file with over 24mm records while the zip penetration file has only 13,000 records. PROC SQL can make it, but it takes more than one hour to finish the process (running on a fast server). Using the hash table is a more efficient way to achieve the goal.

Macro Code:
%macro appnd_zip(dsin, dsout);
data home.&dsout.;
if 0 then set home.penetration_zip5(keep=zip5 zip_penetration);
declare hash zip(dataset: “home.penetration_zip5”);
zip.definekey(“zip5”);
zip.definedata(“zip_penetration”);
zip.definedone();

do until(eof);
set home.&dsin.(rename=(ms403=zip5)) end=eof;
if zip.find(key:zip5) = 0 then output;
else do;
call missing(zip_penetration);
output;
end;
end;
stop;
run;

%mend;

In this macro, I first define a hash table based on the small dataset (penetration at zip5 level). Then I match it against the large dataset based on the matchkey zip5. When there’s a match, append the data. When there’s no any match found, assign a missing value to it. This is equivalent to a left join. If you want to do an inner join, you need to output only when there’s a match found. The processing time is reduced to about 15 minutes.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: