A Blog on Analytics and Marketing

SAS, Marketing, Predictive Modeling, Statistics

Archive for the ‘SQL’ Category

PROC SQL or SQL in more general sense.

Text Editor to Recommend: Notepad++

Posted by phillippeng on September 3, 2009

As I search for a free but powerful text editor tool, Notepad++ just comes on top of the list. Notepad++ is widely used by advanced programmers.

Download Page:  http://download.cnet.com/Notepad/3000-2352_4-10327521.html

Great resource Link:  http://notepad-plus.sourceforge.net/uk/download.php

Notepad++ has built in formatting modules for many programming languages. However, SAS is not one of them.  You can add SAS as a user-defined language following steps below:

1. Download the user-defined language to your computer
2. Click start, run, type (or paste in) %APPDATA%\Notepad++ then click ok
3. Rename userDefineLang_SAS.xml as userDefineLang.xml
4. Copy and paste userDefineLang.xml file into the directory. If you add more than one languages, simply
copy everything from each file starting at <UserLang…> to </UserLang> and paste it at the end of the userDefineLang.xml
right before </NotepadPlus>.
5. Save the newly improved userDefineLang.xml

SAS language formatting file was originally created by Joakim and can be downloaded from SAS. I have made some changes to this file. If you have interests in it, please email me (ppeng08@gmail.com).

I’d like to call out some great features about Notepad++:

1. Highlighting your coding language;
2. Column selection and edition;
method: click the starting position –> alt+shift+click the ending position
You can then edit by column. It works great.
Column edit (alt+C) from the menu does not work the way I want.
3. Able to open an incomplete file. For example, you want to check the SAS log file while you code is still running. Regular text editor just cannot open such a file. Notepad++ can read whatever in the current file. Once there are more information added to the file, a window will pop-up asking if you want to reload/refresh the file in the window.
4. Advanced search and replace. When you search for one word, the same word in the whole file will be highlighted. This helps check your spelling and the keyword referencing in the file.


You will find more great features as you begin to use it.


Posted in Computer, SAS, SQL | Leave a Comment »

How To Select All Columns Except One Column From A Table?

Posted by phillippeng on July 31, 2009

If you have a table with more than ten columns and you need to select all columns except one column, you may not want to type the column names one by one. Especially, it becomes very challenging if you have too many columns. Here’s my trick to do this.

1. Select all columns into another table and drop the column you don’t want.
select * into #temp from table_a
alter table #temp
drop column col1              (col1 is the column that you don’ t want to select.)

2. Rename the column and then drop.
When you want to join two tables by matching the ID column in each table, you want to use ‘select * from table_a, table_b by a.id = b.id’. However, an error message will tell you to select only a.id or b.id. It comes back to our old question. But we cannot drop the ID column first because we need to keep it for the Join purpose. The easy answer is to rename the column and then drop it.
sp_rename ‘Table_a.ID’, ‘ID_a’, ‘Column’

Posted in SQL | Tagged: , , | Leave a Comment »

SAS macro to Load a SAS dataset to SQL Server

Posted by phillippeng on July 22, 2009

%macro dbload(databasename, userid, password, sasdn, sqltable);
proc dbload dbms=odbc data=&sasdn.;
%mend dbload;

You can define a libname using odbc and then load the data through data step. However, there are advantages with PROC DBLOAD over data step:

1. You can re-define the column data type for the loaded table.
2. You can use PROC DBLOAD APPEND to just append new records to a table.

Posted in SAS, SQL | Leave a Comment »

Stratified sampling in SQL Server

Posted by phillippeng on July 9, 2009

As I know, there’s no well documented way to do stratified sampling in SQL server. Below is my solution to this need:

— randomly sort the table

select * into #temp1
from table_name
order by newid()

— create group_id
select row_number() over(partition by strata_name order by strata_name) as rowid, *
into #temp2
from #temp1

–set the sample size for each strata. If the sample size (set to 5 in this sample code) is greater than the available counts for a specific group, all the available records are pulled.
select strata_name, sampsize = case when counts > 5 then 5 else counts end
into #temp3
from (select strata_name, count(*) as counts
from #temp2 group by strata_name) a
–Pull the sample
select a.*
into table_name_sampled
from #temp2 a, #temp3 b
where a.strata_name = b.strata_name and a.rowid <= b.sampsize
— Clean up the temporary tables
drop table #temp1
drop table #temp2
drop table #temp3

Posted in SQL | Tagged: , , | Leave a Comment »

SQL: Select the first/last row within a group in SQL

Posted by phillippeng on February 24, 2009

In SAS, you can first sort the dataset and then use first. or last. statement to select the first/last row within a group. There’s no such command in SQL. Below is my way to select the first/last row within a group in SQL.


select a.* into table_out
from table_in a,
(select id, min(rank) as minrank  from table_in group by id) b
where a.id = b.id and a.rank = b.minrank

I first create table b which is equivalent to sort the table_in and then match to itself. If I want to select the last row, I can then use max() function.

If min rank is not unique within each ID group, the above code will then choose all the rows with a rank equal to the min rank. To only select the first/last row within a group in SQL, we can create a global unique identifier and then match it. Below is the code:

select *, identity(int,1,1) as rowid into #temp from table_in
where condition_Expression order by id, rank

select a.*
into table_out
from table_in a,
(select id, min(rowid) as minrowid  from #temp group by id) b
where a.rowid = b.minrowid

In this example, we first use the indentity() function to create a new globally unique identifier rowid. Then we pick the row with minimum rowid. Be careful that you should sort your data first.

Added: March 9, 2009

We can use this to dedup the data within a group.

Posted in SQL | Tagged: , | 1 Comment »

SQL: Common Tasks

Posted by phillippeng on February 23, 2009

I use this  for SQL Server.

1. select a random sample from a table in SQL

select top 100 *
from table_name
order by newid()

Note: the trick is to use newid() function.

2. Rename a table in SQL:
use database_name;
EXEC sp_rename ‘tablename_old’, ‘tablename_new’;

3. Rename a column:
EXEC sp_RENAME ‘TableName.OldColumnName’ , ‘NewColumnName’, ‘COLUMN’

4. Drop a column:
alter table tablename
drop column col1, column2

Posted in SQL, Uncategorized | Tagged: , | Leave a Comment »

Use PROC SQL to merge large datasets – Nodupkey

Posted by phillippeng on September 29, 2008

Generally, we have two approaches to merge two datasets: merge statement in a data step and proc sql. Recently, I realized the advantages of using proc sql to merge large datasets.

Merge statement in a data step approach is simple, but it requires you to sort the data first. For a large data set, it may be very time consuming. For example, I have a data set with 33mm records. It took me almost one hour to only sort the data on the server. Using PROC SQL, it only took 25 minutes to finish the whole merge process.

In this post, I am going to summarize the PROC SQL way to achieve the nodupkey option in data step.

proc sql;
create table comb as
select distinct a.*, b.id
from one a right join two b
on a.id= b.id

If you try above code, you won’t get distinct match from dataset a as you use the nodupkey in the data step. The reason is that distinct option in proc sql is not the same as nodupkey but as nodup. In order to get the same results as nodupkey, we need to use an unknown function “monotonic()”. Here’s how:

proc sql;
create table comb as
select a.*,  b.response, b.id, monotonic() as rowid
from one a right join two b
on a.id = b.id
group by b.id
having rowid=min(rowid);

What the above code does is to find all the matches from table a for each id in table b and then group them by id. Using function monotonic(), we create a field rowid and then take only the first record (“having rowid = min(rowid)”). This works great when you merge large datasets.

Posted in SAS, SQL | Tagged: , , | 1 Comment »