A Blog on Analytics and Marketing

SAS, Marketing, Predictive Modeling, Statistics

SQL: Select the first/last row within a group in SQL

Posted by phillippeng on February 24, 2009

In SAS, you can first sort the dataset and then use first. or last. statement to select the first/last row within a group. There’s no such command in SQL. Below is my way to select the first/last row within a group in SQL.

Code:

select a.* into table_out
from table_in a,
(select id, min(rank) as minrank  from table_in group by id) b
where a.id = b.id and a.rank = b.minrank

I first create table b which is equivalent to sort the table_in and then match to itself. If I want to select the last row, I can then use max() function.

If min rank is not unique within each ID group, the above code will then choose all the rows with a rank equal to the min rank. To only select the first/last row within a group in SQL, we can create a global unique identifier and then match it. Below is the code:

select *, identity(int,1,1) as rowid into #temp from table_in
where condition_Expression order by id, rank

select a.*
into table_out
from table_in a,
(select id, min(rowid) as minrowid  from #temp group by id) b
where a.rowid = b.minrowid

In this example, we first use the indentity() function to create a new globally unique identifier rowid. Then we pick the row with minimum rowid. Be careful that you should sort your data first.

Added: March 9, 2009

We can use this to dedup the data within a group.

Posted in SQL | Tagged: , | 1 Comment »

SQL: Common Tasks

Posted by phillippeng on February 23, 2009

I use this  for SQL Server.

1. select a random sample from a table in SQL

select top 100 *
from table_name
order by newid()

Note: the trick is to use newid() function.

2. Rename a table in SQL:
use database_name;
Go
EXEC sp_rename ‘tablename_old’, ‘tablename_new’;
GO

3. Rename a column:
EXEC sp_RENAME ‘TableName.OldColumnName’ , ‘NewColumnName’, ‘COLUMN’

4. Drop a column:
alter table tablename
drop column col1, column2

Posted in SQL, Uncategorized | Tagged: , | Leave a Comment »

SAS: Merge a very large dataset and a small dataset efficiently

Posted by phillippeng on January 24, 2009

I need to append zip penetration data to a campaign file. This is a very typical data appending job, in which the campaign file is a very large file with over 24mm records while the zip penetration file has only 13,000 records. PROC SQL can make it, but it takes more than one hour to finish the process (running on a fast server). Using the hash table is a more efficient way to achieve the goal.

Macro Code:
%macro appnd_zip(dsin, dsout);
data home.&dsout.;
if 0 then set home.penetration_zip5(keep=zip5 zip_penetration);
declare hash zip(dataset: “home.penetration_zip5”);
zip.definekey(“zip5”);
zip.definedata(“zip_penetration”);
zip.definedone();

do until(eof);
set home.&dsin.(rename=(ms403=zip5)) end=eof;
if zip.find(key:zip5) = 0 then output;
else do;
call missing(zip_penetration);
output;
end;
end;
stop;
run;

%mend;

In this macro, I first define a hash table based on the small dataset (penetration at zip5 level). Then I match it against the large dataset based on the matchkey zip5. When there’s a match, append the data. When there’s no any match found, assign a missing value to it. This is equivalent to a left join. If you want to do an inner join, you need to output only when there’s a match found. The processing time is reduced to about 15 minutes.

Posted in SAS, Uncategorized | Leave a Comment »

Download them all!

Posted by phillippeng on January 14, 2009

We have a vertical list dry testing project. Content team posts all the vertical lists from different vendors on an internal portal. There are 94 vertical lists need to be downloaded. Oh! My Ghosh! Anybody who’s taking a lead on this would have a headache if he/she does it manually.

How about downloading them all at a time? Can we do that? Oh, yeah.

Here’s my personal recommendation: Use download software “flashget”. Here’s the download page of the English version:

http://www.flashget.com/en/download.htm

After you download and install it, you should reboot your computer before it really works for you fully. I thought I need the administrator authorization to install it. Out of my anticipation, it passes through that and get installed without bothering the IT buddy. Woh! It’s green.

After log into the download page of my vertical lists, I can right click and choose “Download all with Flashget”. No more headache with downloading many files.

Posted in Computer | Leave a Comment »

[Book Reading] Marketing Metrics: 50+ metrics every executive should master

Posted by phillippeng on January 10, 2009

“Marketing Metrics: 50+ metrics every executive should master” is published by University of Pennsylvania Wharton School Publishing in 2006. The authors are Paul W. Farris, Neil T. Bendle, Phillip E Pfeifer and David J. Reibstein.

As the title suggested, the book discusses over 50 marketing metrics. These metrics are divided into 9 categories based on what they are measuring:

1. Share of hearts, minds and markets. (competitive analysis)
2. Margins and profits. (Profitability analysis)
3. Product and portfolio management
4. Customer profitability
5. Sales force and channel management
6. Pricing strategy
7. Promotion
8. Advertising media and web metrics
9. Marketing and Finance

At the age of customer centric, I will focus on sharing their more detailed discussions on customer profitability later.

Posted in Marketing | Leave a Comment »

10 Ways To Make Windows XP Look And Feel Better

Posted by phillippeng on December 8, 2008

I came across the following article talking about customizing windows xp. I tried it and like it very much. Test it yourself.

http://www.pcmech.com/article/10-ways-to-make-windows-xp-look-and-feel-better/

Posted in Computer | Leave a Comment »

SAS Tips: Readin a special text file

Posted by phillippeng on November 19, 2008

Problem: Read in a kind of special text file into SAS. I say it is special because it’s at least not covered in the popular book “The Little SAS Book”. Each field in the input text file is enclosed with double quotation marks and is separated by comma.

Challenge: Get rid of the double quotation marks for each field

Solution:

filename file1 “..\file1.txt”;
libname home “&path.\Data”;

data home.data1;
infile file1 dsd dlm=’09’x missover lrecl=225 firstobs=2;
format     var1    $90.
var2    $40.
var3    $25.
var4    $2.
var5    $5.
var6    $4.
var7    $4.
informat     var1    $90.
var2    $40.
var3    $25.
var4    $2.
var5    $5.
var6    $4.
var7    $4.
;
input @;
_infile_ = tranwrd (   substr (   _infile_
, 2
, length(_infile_) – 2
)
, ‘”,”‘
, ’09’x
);
input       var1      $
var2      $
var3      $
var4      $
var5      $
var6      $
var7      $
;
run;

Key point: Use “_infile_” and tranwrd function.

Posted in SAS | Tagged: , , , , | Leave a Comment »

Time Series (1)

Posted by phillippeng on October 15, 2008

Text and Resources on time series:

1. Enders, Walter (2004). Applied Econometrics Time Series, 2nd edition, New York: Wiley & Sons, Inc.
2. Wei, William W.S. (1990). Time Series Analysis: Univariate and Multivariate Methods, New York: Addison-Wesley Publishing Co., Inc.
3. Box, G.E.P., G.M. Jenkins and G.C. Reinsel (1994). Time Series Analysis Forecasting and Control, Third Edition, San Francisco: Holden-Day, Inc.
4. Lutkepohl, Helmut (2005). New Introduction to Multiple Time Series Analysis, 3rd Edition, Springer Verlag.

Posted in Modeling | Tagged: , , | Leave a Comment »

Scoring observations using PROC FASTCLUS

Posted by phillippeng on October 6, 2008

PROC FASTCLUS can be used to perform a k-means clustering for observations. All the observations in the training dataset are assigned to clusters on the basis of the parameterization of the procedure and of their variable values. Scoring the observations in the validation dataset using PROC FASTCLUS seems a little bit challenging because the cluster assignment rules depend on new observations now.

Scoring new observations without changing the cluster assignment rules can be achieved by using a SEED dataset in PROC FASTCLUS.

/*original clustering */

%let indsn = input;  *your input dataset;
%let nclus = maxclus; *number of clusters to request;
%let indvars = varlist; *independent variables to run proc fastclus on;
%let valid = val_data; *validation dataset to score;

proc fastclus data=&indsn maxclusters = &nclus outseed= clusterSeeds;
var &indvars;
run;

/*scoring new observations using the seed dataset */
proc fastclus data=&valid  out=&valid._scored seed = clusterSeeds maxclusters = &nclus maxiter = 0;
var &indvars;
run;

Reference:
“Data Preparation for Analytics Using SAS” By Gerhard Svolba, Gerhard Svolba, Ph.D.

Posted in Modeling, SAS | Tagged: , , , , | Leave a Comment »

Time Management

Posted by phillippeng on October 3, 2008

This afternoon I attended the internal training session “Time Management Essentials” instructed by Susan Connors. Susan provided lots of tips and principles of managing times. It’s an informative session. Everybody needs to learn how to manage their time so that they can be more productive.

Time management is not about how much we know about time management, but about how much we put into action. In another word, it’s not about science but behaviour.

Here are some tips/views shared:

1. Manage your email at once. Don’t check your email every 3 to 5 minutes. However, we need to balance the client expectation and email management efficiency.

2. Retool the priority system. If we have the following labels:
Urgent and goal-related = A
Goal-related but not urgent = B
Urgent but not goal-related = C
we need to have have at least one “B” priority in our to-do list every day.

3. Use the primary energy time for priority tasks.

You can find more free resources through

http://site.ebrary.com/pub/mcgraw-hill

Posted in Uncategorized | Tagged: | Leave a Comment »