A Blog on Analytics and Marketing

SAS, Marketing, Predictive Modeling, Statistics

  • Categories

  • Archives

Posts Tagged ‘SQL’

Stratified sampling in SQL Server

Posted by phillippeng on July 9, 2009

As I know, there’s no well documented way to do stratified sampling in SQL server. Below is my solution to this need:

— randomly sort the table

select * into #temp1
from table_name
order by newid()

— create group_id
select row_number() over(partition by strata_name order by strata_name) as rowid, *
into #temp2
from #temp1

–set the sample size for each strata. If the sample size (set to 5 in this sample code) is greater than the available counts for a specific group, all the available records are pulled.
select strata_name, sampsize = case when counts > 5 then 5 else counts end
into #temp3
from (select strata_name, count(*) as counts
from #temp2 group by strata_name) a
–Pull the sample
select a.*
into table_name_sampled
from #temp2 a, #temp3 b
where a.strata_name = b.strata_name and a.rowid <= b.sampsize
— Clean up the temporary tables
drop table #temp1
drop table #temp2
drop table #temp3


Posted in SQL | Tagged: , , | Leave a Comment »

SQL: Select the first/last row within a group in SQL

Posted by phillippeng on February 24, 2009

In SAS, you can first sort the dataset and then use first. or last. statement to select the first/last row within a group. There’s no such command in SQL. Below is my way to select the first/last row within a group in SQL.


select a.* into table_out
from table_in a,
(select id, min(rank) as minrank  from table_in group by id) b
where a.id = b.id and a.rank = b.minrank

I first create table b which is equivalent to sort the table_in and then match to itself. If I want to select the last row, I can then use max() function.

If min rank is not unique within each ID group, the above code will then choose all the rows with a rank equal to the min rank. To only select the first/last row within a group in SQL, we can create a global unique identifier and then match it. Below is the code:

select *, identity(int,1,1) as rowid into #temp from table_in
where condition_Expression order by id, rank

select a.*
into table_out
from table_in a,
(select id, min(rowid) as minrowid  from #temp group by id) b
where a.rowid = b.minrowid

In this example, we first use the indentity() function to create a new globally unique identifier rowid. Then we pick the row with minimum rowid. Be careful that you should sort your data first.

Added: March 9, 2009

We can use this to dedup the data within a group.

Posted in SQL | Tagged: , | 1 Comment »

SQL: Common Tasks

Posted by phillippeng on February 23, 2009

I use this  for SQL Server.

1. select a random sample from a table in SQL

select top 100 *
from table_name
order by newid()

Note: the trick is to use newid() function.

2. Rename a table in SQL:
use database_name;
EXEC sp_rename ‘tablename_old’, ‘tablename_new’;

3. Rename a column:
EXEC sp_RENAME ‘TableName.OldColumnName’ , ‘NewColumnName’, ‘COLUMN’

4. Drop a column:
alter table tablename
drop column col1, column2

Posted in SQL, Uncategorized | Tagged: , | Leave a Comment »