Ramblings of a SQL Hillbilly

Temporary Query Items: Table Variables

2014-08-22T15:53:46-05:00

Note: This is part of a series on Temporary Query Items.

Previously we’ve talked about Temporary Query Items, what they are, why they matter, and the various factors that might cause you to choose one over another. We’ve also already talked about our first TQI, temporary tables. Without further ado, it’s time for our next one – the Table Variable.

A Little History

Table variables have been present in SQL Server since at least SQL Server 2000. You should be able to make use of them everywhere.

Location, Location, Location

Table variables live in TempDB. There’s an ancient myth that says table variables only live in memory – that’s not entirely true. Like any other data set in use, a table variable may be cached in memory, but its actual home address is TempDB. As with temporary tables, this means that table variables can more easily cause TempDB contention than other TQIs.

Unlike temporary tables, which can be global or local, table variables are always local in scope. You can only use them in the batch, stored procedure, or function that declares them.
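If you want to see the TempDB residency for yourself, here is a minimal sketch. The variable name is made up for this example, and I'm assuming you have permission to read the TempDB catalog views. Table variables generally show up there under a system-generated name (a # followed by hex digits) rather than the name you declared.

```sql
-- Hypothetical variable name, used only for this demonstration.
declare @JK_Where_Do_I_Live TABLE (
       myIntKey int not null
);

-- List the most recently created #-prefixed tables in TempDB.
-- The table variable should appear under a generated name,
-- not as JK_Where_Do_I_Live.
select top (5) t.name, t.create_date
  from tempdb.sys.tables as t
 where t.name like '#%'
 order by t.create_date desc;
```

On a busy server other sessions' temp objects will be in that list too, so look at the create_date to pick out yours.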

How Long Does It Live?

Table variables don’t last very long relative to temporary tables, but they do have a fair amount of longevity. When you create a table variable, the data contained within will exist until the completion of the batch (or stored procedure or function) that declared it. You can refer to it in multiple queries within that batch, but it will disappear at the end of the batch.

Indexing

Table variables do support indexes, but ONLY when the index is defined in the variable declaration. Before SQL Server 2014, that means whatever index comes along with a PRIMARY KEY or UNIQUE constraint. This actually improves somewhat in SQL Server 2014, where we have the ability to define non-clustered indexes directly in our table create statements – and therefore, we are able to define non-clustered indexes on table variables there.

One thing that table variables do NOT support is column statistics. This can be a pain point due to the fact that the query optimizer can’t determine how many rows to expect out of a table variable, so it assumes that the number will always be 1. 1 is a terrible default, but it’s better than all the other potential defaults. Large result sets stored in table variables can lead to performance problems depending on the execution plan that is chosen.

Copy That

Like temporary tables, when you create a table variable based on an existing data set, it is created as a separate physical copy of the data. Modifications to the table variable do not affect the original data source. This can be either a strength or a weakness, depending on your purposes for using a TQI.

But Does It Blend?

The data in a table variable is fully modifiable. However, modifications to the table variable result set do not automatically translate back to the original data used to populate it. This can definitely be either a point for or against your choice of TQI.

Show Me The Code!

Test Data Setup

I’ll be using a test table to do some demonstrations. You can create/populate it here:

create table dbo.JK_Temp_Stuff_Test_Table (
       myIntKey int PRIMARY KEY CLUSTERED,
       myVarchar varchar(10) not null
);

insert into dbo.JK_Temp_Stuff_Test_Table (myIntKey, myVarchar)
values (5, 'Wheeee');

create table dbo.JK_Temp_Stuff_Test_Table_2 (
       myIntKey int PRIMARY KEY CLUSTERED,
       myOtherVarchar varchar(10) not null
);

insert into dbo.JK_Temp_Stuff_Test_Table_2 (myIntKey, myOtherVarchar)
values (5, 'Waffle');

Creation

The creation syntax for a table variable is a hybrid between a CREATE TABLE statement and a DECLARE statement – which makes sense, as a table variable is a table that happens to be a variable.

declare @JK_Test_Table_Variable TABLE (
       myIntKey int PRIMARY KEY CLUSTERED,
       myVarchar varchar(10) not null
);

Note that you CANNOT do a SELECT INTO with a table variable.

Modification

Inserting into a table variable is just like inserting into a regular table. However, note that you have to do the INSERT in the same batch as the DECLARE.

declare @JK_Test_Table_Variable TABLE (
       myIntKey int PRIMARY KEY CLUSTERED,
       myVarchar varchar(10) not null
);

insert into @JK_Test_Table_Variable (myIntKey, myVarchar)
values (1, 'A'),
       (2, 'B'),
       (3, 'A'),
       (4, 'C');

Table variables are also updateable:

declare @JK_Test_Table_Variable TABLE (
       myIntKey int PRIMARY KEY CLUSTERED,
       myVarchar varchar(10) not null
);

insert into @JK_Test_Table_Variable (myIntKey, myVarchar)
values (1, 'A'),
       (2, 'B'),
       (3, 'A'),
       (4, 'C');

update tgt
   set myVarchar = 'D'
  from @JK_Test_Table_Variable tgt
 where myIntKey = 4;

select *
  from @JK_Test_Table_Variable;

Indexing

Unfortunately, you cannot index a table variable after creation. This means that prior to SQL Server 2014, you cannot add nonclustered indexes to table variables. You can, however, declare a table variable with a clustered or unique index on it:

declare @JK_Test_Table_Variable TABLE (
       myIntKey int PRIMARY KEY CLUSTERED,
       myVarchar varchar(10) not null
);

-- Fails
create nonclustered index JK_Test_Table_Variable_SIDX_1 on @JK_Test_Table_Variable (
       myVarchar asc
);
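For contrast, here is a rough sketch of the two ways to get a nonclustered index in at declaration time. The variable names and the index name IX_myVarchar are my own inventions for illustration. The second declaration uses the inline index syntax, so I'd only expect it to parse on SQL Server 2014 or later; on an older version, run the first declaration on its own.

```sql
-- Works on older versions too: a UNIQUE constraint in the declaration
-- is backed by an index (nonclustered by default here, because the
-- primary key already claimed the clustered index).
declare @JK_Test_Table_Variable_Unique TABLE (
       myIntKey int PRIMARY KEY CLUSTERED,
       myVarchar varchar(10) not null,
       UNIQUE (myVarchar, myIntKey)
);

-- SQL Server 2014 and later: inline nonclustered index.
-- IX_myVarchar is a hypothetical name chosen for this sketch.
declare @JK_Test_Table_Variable_Inline TABLE (
       myIntKey int PRIMARY KEY CLUSTERED,
       myVarchar varchar(10) not null,
       INDEX IX_myVarchar NONCLUSTERED (myVarchar)
);
```

Adding myIntKey to the UNIQUE constraint is just a trick to keep the constraint from rejecting duplicate myVarchar values.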

It is also not possible to create statistics on a table variable. Therefore, the query optimizer will always estimate that there is 1 row in the table variable, and therefore it may make some very bad plan choices. You’ll find that a lot of people insist that table variables are bad for performance – this is usually the reason. A table variable with millions of rows is almost always going to be slower than just about any other option.
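Here is a small sketch you can use to watch the estimate go wrong. The variable name is hypothetical, and sys.all_objects is only there as a convenient source of a couple thousand rows. I'm also including OPTION (RECOMPILE) as one commonly used way to let the optimizer see the real row count at execution time; treat that as an aside rather than a recommendation, and note that exact estimates can vary by SQL Server version.

```sql
-- Hypothetical variable name, used only for this demonstration.
declare @JK_Big_Table_Variable TABLE (
       myIntKey int PRIMARY KEY CLUSTERED
);

-- sys.all_objects is just a handy source of rows; any large table would do.
insert into @JK_Big_Table_Variable (myIntKey)
select row_number() over (order by (select null))
  from sys.all_objects;

-- Enable Execution Plan (actual), then compare Estimated vs Actual
-- row counts on the table variable scan for each statement.
select count(*)
  from @JK_Big_Table_Variable;

-- Same query, recompiled at run time so the row count is known.
select count(*)
  from @JK_Big_Table_Variable
option (recompile);
```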

Global vs Local

Table Variables are only local. They’re so local, actually, that their use is limited to the batch they’re declared in.

Longevity

No really, only the declaring batch. The GO ends the batch, so the insert will fail:

declare @JK_Test_Table_Variable TABLE (
       myIntKey int PRIMARY KEY CLUSTERED,
       myVarchar varchar(10) not null
);
go

-- Fails: the variable went out of scope when the batch ended at GO
insert into @JK_Test_Table_Variable (myIntKey, myVarchar)
values (1, 'A'),
       (2, 'B'),
       (3, 'A'),
       (4, 'C');
go

Source Data Dependence

As with temporary tables, data set modifications are done independently of the original data. This should be pretty clear from the INSERT INTO syntax:

declare @JK_Test_Table_Variable TABLE (
       myIntKey int PRIMARY KEY CLUSTERED,
       myVarchar varchar(10) not null
);

insert into @JK_Test_Table_Variable
select *
  from dbo.JK_Temp_Stuff_Test_Table;

update @JK_Test_Table_Variable
   set myVarchar = 'Whooo';

select * from @JK_Test_Table_Variable;
select * from dbo.JK_Temp_Stuff_Test_Table;

Cleanup

And here’s the cleanup code for this exercise:

drop table dbo.JK_Temp_Stuff_Test_Table;
drop table dbo.JK_Temp_Stuff_Test_Table_2;

Moving On

We’ve taken some time to look at some attributes of Table Variables, including strengths and weaknesses. As with Temporary Tables, you can’t make a decision about using them in a vacuum. In particular, I’ve cautioned against the use of table variables for large data sets – but for very small data sets, table variables can be very useful. I particularly like them for use with the OUTPUT clause of the MERGE statement when trying to log row counts in SSIS.
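Since I brought up the OUTPUT clause of MERGE, here is a minimal sketch of the idea. All three table variable names are made up for this example, and I'm using table variables for the source and target only so the snippet stands alone; in real life those would be your actual tables.

```sql
-- Hypothetical names throughout; only @JK_Merge_Log is the point of the demo.
declare @JK_Merge_Target TABLE (
       myIntKey int PRIMARY KEY CLUSTERED,
       myVarchar varchar(10) not null
);

declare @JK_Merge_Source TABLE (
       myIntKey int PRIMARY KEY CLUSTERED,
       myVarchar varchar(10) not null
);

-- One row per merged row, holding the action that was taken.
declare @JK_Merge_Log TABLE (
       mergeAction nvarchar(10) not null
);

insert into @JK_Merge_Target (myIntKey, myVarchar)
values (1, 'A'),
       (2, 'B');

insert into @JK_Merge_Source (myIntKey, myVarchar)
values (2, 'Z'),
       (3, 'C');

merge @JK_Merge_Target as tgt
using @JK_Merge_Source as src
   on tgt.myIntKey = src.myIntKey
 when matched then
      update set myVarchar = src.myVarchar
 when not matched by target then
      insert (myIntKey, myVarchar)
      values (src.myIntKey, src.myVarchar)
output $action into @JK_Merge_Log (mergeAction);

-- Row counts per action, ready to hand off to whatever does your logging.
select mergeAction, count(*) as rowsAffected
  from @JK_Merge_Log
 group by mergeAction;
```

With the sample rows above, key 2 matches and key 3 does not, so the log should show one UPDATE and one INSERT.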

Next up: Views.

Temporary Query Items: Temporary Tables

2014-08-22T08:09:16-05:00

Note: This is part of a series on Temporary Query Items.

Previously, on TQI…

We’ve talked previously about what a Temporary Query Item is and how scope, indexing, statistics, TempDB, and memory can all affect your choice of TQI for your queries. It’s all been important discussion, if a little unsexy – go back and read it if you haven’t, or you may get a little lost. If you have (no really, read it first!), then you’re ready to walk through our first Temporary Query Item – Temporary Tables.

A Little History

Temporary Tables have been present in SQL Server since at least SQL Server 2000. You should be able to make use of them everywhere.

Location, Location, Location

Temporary Tables live in TempDB. This is not necessarily a performance problem, as even objects in TempDB can be cached in memory. However, if too much is going on in TempDB, this can cause a performance loss due to contention. If WAY too much is going on in TempDB, use of a temporary table can actually cause you to blow TempDB, which means your query will fail. Don’t cause your query to fail. Using a temporary table certainly doesn’t guarantee problems, but you do want to be careful.

Regarding location scope, temporary tables give you a little bit of flexibility. You can declare your temporary table as global or local. A global temporary table can be accessed from other connections than the creating one. A local temporary table can only be accessed from the creating connection, although it can be used and reused as long as the connection is open.

Global

Global temporary tables are named with a ## prefix. So, if I want a global temporary table to hold pickles, I would name it ##Pickles. Once created, this table will hang around as long as something is using it, and anyone can make use of it.

Local

Local temporary tables are named with a # prefix. My same temp table for holding delicious pickles would be named #Pickles. Once created, this table can only be accessed from the creating connection. Other connections can create their own #Pickles table for THEIR delicious pickles, because behind the scenes SQL Server will postfix your local temporary table name with a long string of underscores and a hexadecimal number, and postfix theirs with a different long string of underscores and a hexadecimal number. BIG HAIRY DEAL: Named constraints on local temp tables (such as a named PRIMARY KEY backing your clustered index) are global in scope. This means that if two or more people attempt to create local temporary tables with identically named constraints, the runner-up loses and cannot create the table.
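A rough sketch of the difference, using made-up table and constraint names. The assumption here is that the collision comes from the explicit constraint name, so leaving the name off and letting SQL Server generate one should sidestep it.

```sql
-- Named constraint: the name PK_Pickles_Named has to be unique in TempDB,
-- so a second connection running this same statement while the first
-- table still exists would be expected to fail.
create table #Pickles_Named (
       pickleId int not null,
       pickleName varchar(20) not null,
       CONSTRAINT [PK_Pickles_Named] PRIMARY KEY CLUSTERED (pickleId ASC)
);

-- Unnamed constraint: SQL Server generates a unique constraint name,
-- so each connection can create its own copy of this table.
create table #Pickles_Unnamed (
       pickleId int not null PRIMARY KEY CLUSTERED,
       pickleName varchar(20) not null
);

drop table #Pickles_Named;
drop table #Pickles_Unnamed;
```

To try the collision, run the first CREATE in two query windows without dropping the table in between.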

How Long Does It Live?

Temporary tables present a fair amount of longevity. If SQL Server stops, then your temp table will go away. If you close the creating connection, the temp table will go away. If you drop the temp table, it will also go away. Otherwise, temporary tables will happily sit there forever occupying TempDB. This makes them one of the longest-lived TQIs, which is one of their biggest strengths.

I should note here that if a global temporary table is actively being referenced when dropped or the creating connection is closed, the table will actually persist until it is no longer being actively referenced. This is a little nicer than getting an error because the object suddenly doesn’t exist in the middle of your query. It could also mean that you insert into something that immediately stops existing, though.

Indexing

Another strength of temporary tables is that they can be indexed. Specifically, temporary tables can have both clustered and nonclustered indexes applied to them. They can also be created with a clustered index already on them. I’ve used this to my advantage before by making a temporary table with a copy of the data I want, then indexing it to fit my series of queries.

Copy That

When you create a temporary table based on an existing data set, it is created as a separate physical copy of the data. Modifications to the temp table do not affect the original data source. This can be either a strength or a weakness, depending on your purposes for using a TQI.

But Does It Blend?

The data contained within a temporary table is modifiable. You can insert, update, or delete records within. And you can dance if you want to.

As mentioned above, modifications to the temporary table do not automatically translate back to the original data used to populate it.

Show Me The Code!

Test Data Setup

I’ll be using a test table to do some demonstrations. You can create/populate it here:

create table dbo.JK_Temp_Stuff_Test_Table (
       myIntKey int PRIMARY KEY CLUSTERED,
       myVarchar varchar(10) not null
);

insert into dbo.JK_Temp_Stuff_Test_Table (myIntKey, myVarchar)
values (5, 'Wheeee');

create table dbo.JK_Temp_Stuff_Test_Table_2 (
       myIntKey int PRIMARY KEY CLUSTERED,
       myOtherVarchar varchar(10) not null
);

insert into dbo.JK_Temp_Stuff_Test_Table_2 (myIntKey, myOtherVarchar)
values (5, 'Waffle');

Creation

Temporary tables can be created in a very similar way to plain old tables. You can do a standard create statement:

create table ##JK_Test_Temp_Table (
       myIntKey int not null,
       myVarchar varchar(10) not null,
CONSTRAINT [PK_JK_Test_Temp_Table] PRIMARY KEY CLUSTERED
(
  [myIntKey] ASC
)
);

As it turns out, you can also do SELECT INTO with a temp table. I’ll show that later.

Modification

Inserting into a temporary table is just like regular tables as well.

insert into ##JK_Test_Temp_Table (myIntKey, myVarchar)
values (1, 'A'),
       (2, 'B'),
       (3, 'A'),
       (4, 'C');

Temporary tables are also updateable:

update tgt
   set myVarchar = 'D'
  from ##JK_Test_Temp_Table tgt
 where myIntKey = 4;

select *
  from ##JK_Test_Temp_Table;

Indexing

You can create indexes on your temp table after the fact. Run the SELECT with the execution plan, create the index, then rerun the SELECT and note the change in execution plan. BIG HAIRY WARNING: If you attempt to create an index on a temp table with the execution plan enabled, SSMS will blow up.

-- Enable Execution Plan
select *
  from ##JK_Test_Temp_Table
where myVarchar = 'A';

-- Disable Execution Plan
create nonclustered index JK_Test_Temp_Table_SIDX_1 on ##JK_Test_Temp_Table(
       myVarchar asc
);

-- Enable Execution Plan
select *
  from ##JK_Test_Temp_Table
where myVarchar = 'A';

Global vs Local

So far, we’ve been using a global temporary table. Here, we’ll look at local temporary tables – run the CREATE and INSERT in one window, then run the SELECT in another:

create table #JK_Test_Temp_Table_2 (
       myIntKey int not null,
       myVarchar varchar(10) not null,
CONSTRAINT [PK_JK_Test_Temp_Table_2] PRIMARY KEY CLUSTERED
(
  [myIntKey] ASC
)
);

insert into #JK_Test_Temp_Table_2 (myIntKey, myVarchar)
values (1, 'A'),
       (2, 'B'),
       (3, 'A'),
       (4, 'C');

select *
  from #JK_Test_Temp_Table_2
where myVarchar = 'A';

Longevity

The temporary table goes away when you close the creating connection. Close your original window, then run this:

select *
  from ##JK_Test_Temp_Table
where myVarchar = 'A';

Source Data Dependence

And finally, we can see modifying the data in the temporary table is independent of the data that created it. Note the SELECT INTO syntax:

select *
  into #JK_Test_Temp_Table_3
 from dbo.JK_Temp_Stuff_Test_Table;

update #JK_Test_Temp_Table_3
   set myVarchar = 'Whooo';

select * from #JK_Test_Temp_Table_3;
select * from dbo.JK_Temp_Stuff_Test_Table;

Cleanup

And here’s the cleanup code for this exercise:

drop table dbo.JK_Temp_Stuff_Test_Table;
drop table dbo.JK_Temp_Stuff_Test_Table_2;
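The temp tables will clean themselves up when you close the creating connection, but if you plan to keep your query window open, something like the following sketch drops whichever of them still exist. It assumes you run it from the connection that created the local temp tables.

```sql
-- object_id returns NULL when the temp table is not visible to this
-- connection, so each DROP only runs if the table is still there.
if object_id('tempdb..##JK_Test_Temp_Table') is not null
    drop table ##JK_Test_Temp_Table;

if object_id('tempdb..#JK_Test_Temp_Table_2') is not null
    drop table #JK_Test_Temp_Table_2;

if object_id('tempdb..#JK_Test_Temp_Table_3') is not null
    drop table #JK_Test_Temp_Table_3;
```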

Moving On

In this post, we’ve focused on some of the strengths and weaknesses of temporary tables. While this is not sufficient information to decide that a temporary table would be a more or less advantageous choice of Temporary Query Item in all cases, it should give you some ideas on how to make use of them in your queries. Next up, we’ll take a look at our next TQI: Table Variables.

Temporary Query Items: TempDB and Memory

2014-08-19T11:42:11-05:00

Note: This is part of a series on Temporary Query Items.

In previous posts, I’ve talked some about what Temporary Query Items are and why you might need them. I’ve also given an overview of scope and how it might affect your choices of TQIs for your own queries. Likewise, I’ve worked through indexing and statistics, a reasonable understanding of which can help to understand the advantages and disadvantages of some TQIs. Now, I’d like to talk a bit about TempDB and memory, and how they can also affect your choice of Temporary Query Item.

Memory: Where The Party Is

Memory is the most critical hardware component when it comes to SQL Server. If you have more memory than you have data, then you won’t have to worry about disk latency (much.) Memory is incredibly fast compared to every other piece of hardware you have (excepting CPU caches, but those are generally out of your control.) You want more memory. You always want more memory. More RAM.

Right. So more RAM is mo’ betta. But we have a 2 terabyte data warehouse – that really doesn’t fit into RAM. And memory isn’t just for storing data like it is stored on disk – we also have to store data in the form that we’re querying. That means that we need a way to handle running out of memory for a query without crashing the whole system. What we have is TempDB.

TempDB: The Public Toilet of SQL Server

Brent Ozar describes TempDB as the public toilet of SQL Server. Everything that needs to temporarily dump to disk in SQL Server uses TempDB. This includes index rebuilds, the version store, and even innocent little SELECT statements.

TempDB is slow, even if you put it on SSDs. Memory is a TON faster. You could theoretically allocate a RAMDisk for TempDB, but…

Why Does It Matter?

Okay, so memory is fast, TempDB is slow, but TempDB is there when we run out of memory. What does that have to do with Temporary Query Items? Good question!

Some of our TQIs live in TempDB. Some of them don’t. What this means for you, is that your choice of Temporary Query Item might be influenced by how it uses or abuses TempDB.
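If you're curious how much TempDB your own connection has been using, here is a minimal sketch against one of the space usage DMVs. I'm assuming you have VIEW SERVER STATE permission, and the counts are in 8 KB pages.

```sql
-- TempDB page allocations for the current session.
-- user_objects covers things like temp tables and table variables;
-- internal_objects covers work SQL Server does on your behalf
-- (sorts, spools, hash spills, and so on).
select su.session_id,
       su.user_objects_alloc_page_count,
       su.user_objects_dealloc_page_count,
       su.internal_objects_alloc_page_count,
       su.internal_objects_dealloc_page_count
  from sys.dm_db_session_space_usage as su
 where su.session_id = @@spid;
```

Run it before and after creating a temp table or table variable with a decent number of rows and compare the user object counts.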

Moving On

The only major concern that I haven’t touched on when it comes to the differences among Temporary Query Items is syntax. Each one has unique ways in which to invoke it, and I could theoretically dedicate a post just to that. However, you’ve been waiting for specifics, and specifics you shall have! I’ll cover syntax of each TQI individually as I come to them.

First up: temporary tables.

Temporary Query Items: Indexing and Statistics

2014-08-18T21:17:53-05:00

Note: This is part of a series on Temporary Query Items.

In my previous posts, I’ve talked about what Temporary Query Items are and how scope helps to define the similarities and differences among the different TQIs. Before we can launch into an in-depth discussion of our first Temporary Query Item, though, we need to take a little time to talk about indexing and statistics.

Index? What, like the card?

Many people have made many comparisons of indexes in SQL Server to more familiar real-life objects. The obvious analog is to an index in a book, where you look up certain terms and the index points you to the page numbers in the book where the term can be found – here’s Technet. Others have compared indexes to entries in the phone book, like Brent Ozar. Mark Solomon suggests that non-clustered indexes are like miniature tables (SQL Server tables, not dining room tables.)

Analogies are nice and all, but what’s the reality? In essence, indexes do two things: they hold (and potentially organize) data, and they hold pointers to the physical location of data. Pointers, as you’ll recall from your hours of C coding earlier today, contain an address to a location in memory. In SQL Server indexing, a pointer contains the address of a row in the table.

Before we move on with that idea though, we should talk about the two types of indexes in SQL Server: Clustered and non-clustered. They’re both types of indexes, of course, but they tend to be defined by their differences more than their similarities.

Clustered

A clustered index is the technical term for an index that defines the physical structure of a table. You can have up to one clustered index per table – though technically, a table without a clustered index isn’t really a table, it’s a heap. When you have a clustered index, then the data in the table will be ordered on disk according to the definition of the index, which is a large part of why there can only be one clustered index.

Non-clustered

A non-clustered index does not enforce any ordering on the data in the table. Instead, each row within the non-clustered index contains a pointer to the original row in the clustered index. Any columns contained within the non-clustered index will be stored along with the pointer (so, those columns are essentially duplicated.) There can be up to 999 non-clustered indexes on a table (according to Denny Cherry that’s not a challenge), and non-clustered indexes can be applied even when there is no clustered index. When it comes to heaps, the pointer to the clustered index is replaced by a pointer to a RowID in the heap.
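To make the heap / clustered / non-clustered distinction a bit more concrete, here is a small sketch. The table and index names are invented for this example, and I'm using a temp table so nothing permanent gets left behind.

```sql
-- With no clustered index, this table starts life as a heap.
create table #JK_Index_Demo (
       myIntKey int not null,
       myVarchar varchar(10) not null
);

-- Adding a clustered index turns the heap into a clustered table.
create clustered index JK_Index_Demo_CIDX on #JK_Index_Demo (
       myIntKey asc
);

-- A nonclustered index stores myVarchar plus a pointer back to the row.
create nonclustered index JK_Index_Demo_SIDX_1 on #JK_Index_Demo (
       myVarchar asc
);

-- One row per index, showing CLUSTERED vs NONCLUSTERED.
select i.name, i.type_desc
  from tempdb.sys.indexes as i
 where i.object_id = object_id('tempdb..#JK_Index_Demo');

drop table #JK_Index_Demo;
```

If you run the SELECT before creating the clustered index, you should see a single row with a type_desc of HEAP instead.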

Okay, so why do we have them again?

Both kinds of indexes in SQL Server are there to help support queries. Without indexing, finding data in our tables would require scanning the whole table every time. With indexes, we can know exactly where to look for what data, and so we can be much more efficient. If you want more on that, Brent Ozar has a fantastic course that walks through how this works called How To Think Like SQL Server, which for $29 is a steal.

Statistics. What is this, baseball?

I love baseball statistics. Not because I like baseball, but because you really start to see what can happen when you try very hard to identify new outliers time after time. We’re about to the point that “Most pitches thrown left-handed by a right-handed pitcher against a switch-hitting batter during a September game at 50 degrees when the catcher’s name begins with M” will be a thing that is said and celebrated by some announcer somewhere. That is RIDICULOUSLY specific.

Statistics in SQL Server aren’t nearly that bad, though. Where indexes contain information about the intersection of values in columns with rows, statistics contain information about the distribution of values in columns. Statistics contain some information about cardinality and ranges of values as well. I’m a big fan of Erin Stellato’s Statistics Starters presentation for more information on statistics.
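If you'd like to peek at statistics objects yourself, here is a rough sketch. The table name is made up, and I'm assuming the default setting where SQL Server is allowed to create column statistics automatically; if that's turned off on your instance, you'll only see the statistics that belong to the primary key.

```sql
-- Hypothetical temp table, used only for this demonstration.
create table #JK_Stats_Demo (
       myIntKey int not null PRIMARY KEY CLUSTERED,
       myVarchar varchar(10) not null
);

insert into #JK_Stats_Demo (myIntKey, myVarchar)
values (1, 'A'),
       (2, 'B'),
       (3, 'A'),
       (4, 'C');

-- Filtering on myVarchar gives the optimizer a reason to want
-- statistics on that column.
select *
  from #JK_Stats_Demo
 where myVarchar = 'A';

-- auto_created = 1 marks statistics SQL Server built on its own.
select s.name, s.auto_created
  from tempdb.sys.stats as s
 where s.object_id = object_id('tempdb..#JK_Stats_Demo');

drop table #JK_Stats_Demo;
```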

What does this have to do with Temporary Query Thingies?

The SQL Server optimizer uses indexes and statistics to help it make the right decisions about turning your queries into execution plans. This means that understanding how TQIs interact with indexes and statistics can help you make better decisions about which ones are right for you and what you may be doing at any given moment.

Moving On

We’re much closer to talking about Temporary Tables, our first Temporary Query Item, in depth. However, there’s one more topic I really need to cover before that – TempDB and Memory.

Temporary Query Items: Scope

2014-08-17T01:01:51-05:00

Note: This is part of a series on Temporary Query Items.

Last time we took a look at what a temporary query item is, and why it is useful. In this post, I’ll be looking at how Scope plays into the similarities and differences among Temporary Query Items. An understanding of scope and how it relates to each TQI is crucial to being able to choose the best one for your situation, which is why I’ve dedicated an entire post to it.

What is Scope?

Wikipedia has a great article about scope from a generic computer science standpoint. My “quick and dirty” definition of scope is “The Time and Place of validity for a temporary item.” It’s a decent definition, and it tells us that we’re concerned with two things: When and Where.

When

The Temporary part of Temporary Query Items tells us that they only exist or are only accessible for a limited period of time. This isn’t to say that there’s a time limit on the existence of any of the TQIs in the traditional sense of the word – on the contrary, in all cases the data will hang around at least until the end of execution of the query you’re using it in. Not all TQIs are created equal in this respect, though – some of them are only valid for a single use, others are valid until manually dropped or until the connection is closed, and one even survives reboots. Likewise, the actual data contained within the TQI may or may not change over time based on how the underlying data changes. In a way, that’s a whole nother aspect of scope – can the data get stale?

Where

The Where aspect of Temporary Query Items is mostly related to your ability to reuse TQIs across multiple queries, executions, or connections. Some TQIs can only be used in one query, while others can be used multiple times during a single execution. Others can be used universally within the database or server.

Moving on

This was a relatively short post. I believe for simplicity’s sake that I’ll leave a discussion of how these aspects of scope play out in each TQI for posts dedicated to each TQI. Before we move on to our first TQI, however, we need to talk a little about indexing and statistics.

Temporary Query Items: Introduction

2014-08-16T22:43:47-05:00

Note: This is part of a series on Temporary Query Items.

While working with one of my junior developers on a query one day, I discovered that she had inadvertently deleted a few records due to not understanding how a CTE maps back to the underlying data. I was able to quickly resolve the issue thanks to how our testing environment is set up, but it prompted me to create a presentation on what I call Temporary Query Items. What I was really looking for was different ways to use and reuse transformed, abbreviated, or combined data sets in queries. Turns out, in SQL Server we have plenty of options for that, depending on what your requirements are, and so I was able to put together a pretty good presentation on the upsides and downsides of each. In this series of posts, I’ll be exploring what I cover in my presentation. This post: An intro to Temporary Query Items.

But Why Male Models Temporary Query Items?

I’d like to clarify my terminology a little bit here. My original inclination was to call them Temporary Objects, but I wanted to be careful because some of them… well… aren’t. Objects in SQL Server are usually the kinds of things you can find in sys.objects, and some of my TQIs can’t be found there due to scoping – but that’s a Good Thing™ because it means that sys.objects doesn’t get all gunked up with stuff that’s transient anyway. And that’s where the Temporary part of the nomenclature comes in – everything I’m going to talk about is transient in some way. These are the sand castles and seasonal flowers of T-SQL, the stuff that doesn’t need to live long. Every last one of them is meant to support Queries, and only to support queries.

And as for Items? Well, I’ve already told you that these aren’t all Objects. I thought about Fluff, except that title doesn’t really convey how useful they are. They aren’t really Clauses, and they don’t fit too well as Sub-Queries or Statements. So Items they are.

Who Are You Calling Temporary?

While researching potential candidates for my list, I found five things that I would call useful for some form of transient use while querying. I’ll go more in-depth on these in later blog posts.

Temporary Tables
Table Variables
Views
Common Table Expressions (aka CTEs)
Derived Tables

But Why Male Models Temporary Query Items?

Okay, so now we know what a TQI is, and it’s somewhat clear WHY they’re called TQIs. But what do we need them for? Why even have them?

Reuse, Reduce, Recycle

One of the most useful attributes of a TQI is data reuse in a query. Let’s say that you’ve got two related tables that you want to join together and pull some columns from. Now let’s say also that you’ll want to take the resulting data set, and join it to itself three times. Without a TQI, you’re looking at a six-table join, with multiple join conditions repeated over and over. With one of the Temporary Query Items, you can cut the amount of code you’re working with down quite a bit – six joins becomes three or four joins, depending on the approach used. I’ll cover how this works in particular in posts for each of the TQIs.

Alternatively, maybe you have multiple different queries, all doing some of the same things. With a TQI, you can wrap parts of those queries into one easily-referenced item that you can keep using without having to duplicate the code everywhere.
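As a quick taste of the reuse idea, here is a simplified sketch using a Common Table Expression. The names and rows are invented, and the inline VALUES list stands in for the join of your two related tables so that the snippet runs on its own.

```sql
-- EmployeeNames is a hypothetical data set defined once...
with EmployeeNames as (
    select e.employeeId, e.managerId, e.employeeName
      from (values (1, cast(null as int), 'Ann'),
                   (2, 1,                 'Bob'),
                   (3, 2,                 'Cy')
           ) as e (employeeId, managerId, employeeName)
)
-- ...and referenced three times without repeating its definition.
select emp.employeeName,
       mgr.employeeName  as managerName,
       boss.employeeName as skipLevelName
  from EmployeeNames emp
  left join EmployeeNames mgr
    on mgr.employeeId = emp.managerId
  left join EmployeeNames boss
    on boss.employeeId = mgr.managerId;
```

Without the CTE, whatever query builds that data set would have to be written out once for each of the three references.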

Code Simplification With Limited Duplication Duration

That title is a mouthful, so I’ll break this down. Let’s say that you have a query that gets run quite often, building a customized view of a dataset (whether you’re transforming the data, limiting columns, limiting rows, or even combining tables.) It’s a query that performs pretty well, and it’s really not worth duplicating the data in a permanent form such as a table. A TQI could be just the ticket for making things easier on your users or your automated process.

Consistent View with Automated Cleanup

Again, all TQIs are transient. Most of them come with some automated form of self-cleanup, meaning that you avoid the issue of forgetting to drop a table that you only needed for a bit.

Location, Location, Location

Most of these TQIs live in memory, although some can live partially in tempdb. This could help reduce the impact of disk i/o (ask Brent Ozar, Disk I/O is Bad.)

Getting On With It

“Okay Mister Slacker, you’ve piqued my interest – bring on the Temporary Tables!” I love your enthusiasm! But, we’re not quite ready for those details yet – first, we have to talk about scope.

T-SQL Tuesday: Assumptions

2014-07-07T17:51:28-05:00

For this month’s T-SQL Tuesday, Dev Nambi has chosen the topic of assumptions. He assumed I would have a big assumption at work to talk about. Good assumption!

Probably the biggest assumption we deal with in my unit at work is Things Are The Way They Are For A Reason, And Therefore We Should Continue To Do Them That Way (hereafter abbreviated Things Are And Should Be.) I love talking about this assumption because I have a long and storied history with it. Well, I have a storied history with it, anyway. And length of time is all in the eye of the beholder. I know, because I used to be able to count to approximately a million in fifteen minutes, and now I can spend fifteen minutes ordering breakfast and not notice. But I digress (which surprises nobody!)

When I began working as a Data Warehouse Forklift Operator, I entered a culture that had deliberately taken a “Produce First, Understand Later” approach. If it could be templatized it was, and if it couldn’t be templatized, it was maintained by one of the more savvy and experienced developers. We were in the process of moving from an ETL tool that did everything for you (badly) to a tool that only did some things for you (SSIS). The old tool transformed data in DB2, and now we wanted to transform the data on SQL Server. For the first several months I worked on code, I assumed that Things Are And Should Be. After all, I was new to the data warehousing world, and we had experienced Architects and Designers who Knew What They Were Doing, right?

Fast forward about six months. I had discovered many cases in which the way we did things was suboptimal, circuitous, and even Not Quite What We Wanted. I began to suspect that my Big Assumption was perhaps not too helpful. I spent a lot of time waiting on code that worked, but took its sweet time with the large quantities of data it was processing. Some of the architecture of what I was working with seemed overly complicated. Bits and pieces were clumsy. But, I liked what I did and I was still new, after all, so what did I know?

After a year or so of running the Data Forklift, I began to wonder why the code I was modifying was written like it was. After all, surely I could do my job better if I knew what exactly I was trying to accomplish. So, I began asking questions – and found that, more often than I was comfortable with, the answer was I Don’t Know, We’ve Just Always Done It That Way. So I began to trace execution paths and transformations through the code. Many post-it notes and spreadsheets were sacrificed to the study of the code in order to understand. There were flashes of inspiration in the shower, Aha! moments while driving home, and many an evening’s sleep interrupted as I determined the answer to another question of Why Would You Do It That Way? While I was left scratching my head, I also determined that I would pursue the path of the architect, understanding the process so deeply that I could see my way clear to improvements.

It’s now been three years, and I now have a half dozen junior developers depending on me to teach them… well, pretty much everything about what we do, why, and how! Err… And how. Anyway, I find them asking many of the same questions, and I’ve had to resist the urge to give them the same old answers. It’s easy to default back to Things Are And Should Be. After all, much of the code they’re now questioning is code that I’ve modified from the way we used to do things! However, I believe now that the best thing I can do for the team is to head Things Are And Should Be off at the pass. If I can pass on all I know to them, then we’ll all be better. A developer who questions nothing will never become an asset to the team – and my team has too much potential to settle for mediocrity.

T-SQL Tuesday: An Interview Invitation

2014-05-12T22:43:05-05:00

For this month’s T-SQL Tuesday, Boris Hristov has chosen the topic of interviews.

If I could give one piece of advice to beginning developers looking to do well in an interview (and I can, because this is the internet after all and why are you reading my blog if you don’t believe that?), then it would be to remember that honesty is key. I don’t really care as much about what you know as I do about your ethics. My team is currently looking to fill three positions, and I would love to have them filled soon. However, I won’t even suggest applying to a candidate who I am not sure I can trust. If you don’t have fanatical levels of integrity, I don’t want you on my team. I can and will do everything I can to help you become competent, but I’m not going to waste my time teaching you to be honest (unless you want my help with that, in which case I’ll do what I can.)

Otherwise, I’m a technical kind of guy. If I’m interviewing you, then you’ll quickly discover that I’m going to make it very difficult to bluster about your skill level and knowledge of certain tools. Probably my favorite question is “Can you name two things you hate about [Technology X]?” because it’s an excellent gauge of actual familiarity. I’m certainly not looking for fanboys of this platform or that environment or those manufacturers – the reality is that all software sucks, some just work better than others. And to follow up “How would you do it better?” As a college student, I used to think that my dream job would involve a cubicle in the corner where I got requirements and pizza as input, and produced code as output. What I’ve found, though, is that writing code is a small part of how I earn my paycheck even as a developer. What is more beneficial to my company and to my personal growth is the time I spend architecting ways to do this or that better, and I can stay sharp on that by architecting out how I would rewrite functionality in the software I use day to day.

So wake up every day, resolve to do the right thing, and find a new problem to solve. I believe that’ll take you a long way towards an interview matching its intended function of matching the right people to the right job.

T-SQL Tuesday: Why So Serious?

2014-04-08T09:27:00-05:00

For this month’s T-SQL Tuesday, Matt Velic has chosen the topic of Dirty Little Tricks you can play on your coworkers/developers/those $@%! consultants using T-SQL. I’m a bit impish, so naturally it was time to dust off the ol’ blog for this topic!

Anybody who follows me on Twitter knows that I’m often faced with problems of assumed ordering. (Remember kids – if you didn’t order it, it ain’t ordered! It just looks like it is!) So, my Dirty Little T-SQL Trick involves sneakily breaking SSIS.

Let’s say that I have a Data Flow Task that does some ETL for me. It pulls my source data, does a key lookup against the target, and then routes any no matches through a surrogate key generator. After any necessary key generation, everything is pulled back together in a Merge Join, then inserted or updated into the target.

Now, it’s important to note that you can’t just feed any old thing into a Merge Join. No, the inputs have to be sorted – or at least, they have to TELL you that they’re sorted. You can do this with a Sort Transformation, though that’s generally slower than sorting in your source. Now, you CAN use the Advanced Editor for your source, to set the IsSorted property of the Output to True. You also have to set one or more of the output columns as a sort key, but once you’ve done that, your Merge Join will believe that the input is sorted, and sorted just like you tell it.

So where’s the trick? Well, I mentioned before that you can’t just feed any old thing into a Merge Join. See, what it TELLS you is that it’s doing a join. But it’s not a SQL join, because in SQL order is not guaranteed. In SQL Server, Table A inner join Table B will give you an inner join, regardless of physical or logical ordering of the data. In SSIS, a Merge Join is essentially an array comparison that starts at the beginning of both arrays and reads through to the end. If your data is out of order, tough luck! And if the arrays are ordered correctly but in opposite order (ascending versus descending) then a very interesting thing happens – the Merge Join will match exactly one row!

So here’s my trick: Change the ORDER BY clause on the source query to order opposite of what the Merge Join expects. It won’t complain, and the solution is non-obvious (at least, non-obvious if you’ve never encountered it before). And it will leave your coworkers scratching their heads when they get only one row!
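As a rough illustration of the swap (the table and column names below are hypothetical, not taken from any real package), assume the source output has IsSorted set to True with CUSTOMER_KEY as an ascending sort key:

```sql
-- Hypothetical source query that matches what the Merge Join was told:
-- sort key CUSTOMER_KEY, ascending.
SELECT CUSTOMER_KEY, CUSTOMER_NAME
  FROM dbo.STG_CUSTOMER
 ORDER BY CUSTOMER_KEY ASC;

-- The sabotaged version: same rows, opposite order.
-- SSIS does not verify the IsSorted claim, so the package still validates
-- and runs; the Merge Join simply stops finding matches.
SELECT CUSTOMER_KEY, CUSTOMER_NAME
  FROM dbo.STG_CUSTOMER
 ORDER BY CUSTOMER_KEY DESC;
```

Only the ORDER BY direction changes, which is why the difference is so easy to miss in a code review.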

Accidents - Happy and Otherwise

2013-10-22T21:40:00-05:00

I have been blessed throughout my life by an abundance of good fortune. You might call it serendipity – a series of happy accidents, all leading me to my current position. This is not to discount my natural talent nor my hard work, but I do admit that it all seems a little strange looking back. Perhaps I’m too quick to forget my failures, or at least the ones that didn’t direct me to where I am now.

A short history of how I got where I am would of course include a retelling of taking Computer Science 101 because I thought Intro to Computing would bore the life out of me. It would also include how I accidentally became the dialog editor of an open-source game because I was the first contributor who spoke English natively, and how that led me to be a Google Summer of Code mentor for the same project. And it would of course include how I accidentally got into the SQL Server and Data Warehousing worlds when the internship I thought I was applying for led me to the internship in the Data Warehouse.

I suppose I should be more careful using the term “accident” (and I know I should be more careful using the term “We” but that’s a different chat.) It kind of has a connotation in my mind of just falling into events and places Mr. Bean-style. I’m really not oblivious, perhaps I’m just happier with the idea that I’m not completely in control.

Either way, some of my happiest accidents yet happened last week, at the PASS Summit. There, I met some people who challenged my preconceptions of what a community can be. Special thanks go out to Mickey Stuewe and Allen Kinsel for their support in not only making me feel welcome, but in introducing me around to others in the community until I felt comfortable enough to begin introducing myself. I also met some people who surprised me by their depth of knowledge and speed of thought – I want to point out Rob Farley in particular here, who I look forward to misunderstanding for many years to come. I also had the opportunity to speak to such fine folks as Merrill Aldrich, Stuart Miller, and Andy Yun who helped me to understand that others faced many of the same challenges as I face every day – I’m not alone. I had the chance to talk to great consultants like David Stein, Bill Fellows, Tim Radney, Brent Ozar, Kendra Little, and Jes Borland, who really solidified in my mind that there are short-term contractors out there who really do care to see clients learn, grow, and succeed – as it turns out, they’re not all the bargain-basement quality “consultants” I’m used to dealing with at all. And finally, special thanks go out to folks like Gail Shaw, who made it clear that the heroes of the SQL Server world are not all highly driven, self-seeking individuals, but instead regular folk who get kind of embarrassed when mobbed by adoring fans.

Unfortunately, into each life a little rain must fall, and not all accidents are happy. If I’ve introduced myself to you or been introduced to you, you’ll know that I go by JK. These are my first and middle initials, and I chose this nickname because of a collision between my first and last name with someone else where I work, as well as a collision between my first name and that of another member of my team. I’m not trying to hide my identity, it just started as a way to minimize confusion (indeed, JK is a nickname my mother developed for me in college when she hired another person with my first name.) I was a bit stunned this week to discover that I had recently been confused with a local registered sex offender who shares my real first and last name, but not my middle name or initial. Let me make this clear – I am not a child molester. The idea makes me sick. And anyone who suggests that I am is confused or misled.

And now, back to your normally scheduled nonsense.

Summit 2013 Reflections

2013-10-21T21:01:00-05:00

It’s now the Monday night after the PASS Summit and it’s time for this slacker to get back into the rhythm of blogging. I’m working on Big Big Things™ for the blog in the near future, but for now I’d like to do a multi-part series of reflections on my experiences at the Summit.

Anyone who has not been to the PASS Summit most likely thinks it’s just a big training event, a place where you can pay a lot of money to go learn about SQL Server. If you think that, you don’t know what you’re missing.

The most valuable part of the Summit, in my opinion, is the amazing amount of networking and community building you get to do. Folks on Twitter are fond of using the #sqlfamily hashtag to refer to their friends and colleagues in the SQL Server community. That sounds cute and trite, but the reality is that the PASS Summit is like the biggest and most intimate family reunion you’ve ever experienced. I’m kind of a weird guy – I dress funny, look awkward, and tend to talk too much. But I was not once made to feel like I didn’t belong last week. And the best part is, you don’t have to explain what it is you do to your grandmother again. We understand you. We know what struggles you deal with. We don’t care what you look like, where you’re from, or who you know. And for once in my life, I’m using We correctly.

The second most valuable part of the Summit is the SQL Server Clinic held by Microsoft. This is your opportunity to go ask questions of the people who support and maintain SQL Server, in person, for absolutely free. This is a pretty unique opportunity and something everyone should take advantage of (except me – I only have really strange and awkward questions.)

And finally, you have the opportunity to learn about the newest and most exciting changes coming down the pike in the SQL Server world. This year, Dr. DeWitt gave a fascinating keynote on the changes coming to Hekaton in SQL Server 2014. Hekaton is their name for their in-memory OLTP solution – it looks like it’s going to rock the OLTP performance world. I only wish we could take better advantage of it in the data warehousing world!

Oh, and there’s some training, but the cheapest way to get that is to spend about $250 and get the sessions on USB. But the real draw is the three things I mentioned above.

Slide Deck: Deployment WORST Practices

2013-10-09T22:41:00-05:00

As promised, here is the slide deck from my presentation last night to the Southwest Missouri SQL Server User’s Group on Deployment WORST Practices. I had a lot of fun putting this together (thanks go out to Grant Fritchey and Matt Velic for the inspiration!) Do note that it’s in OpenDocument Presentation format – I use LibreOffice Impress to do my presentations. You shouldn’t have any trouble opening it in whatever you use, as far as I know.

Update: Apparently the ODP wasn’t working right for some people. Here is a PowerPoint version that will hopefully be better.

T-SQL Tuesday: SQL Swag Wishlist

2013-10-08T12:24:00-05:00

For this month’s T-SQL Tuesday topic, Kendal Van Dyke chose the topic of the best SQL Server Swag you’ve received. Unfortunately, I haven’t snagged any yet – that’s what I’m hoping for at the PASS Summit next week. Here, in no particular order, is the Swag I’m hoping to Snag at the Summit:

A SQL Sentry tea cozy
Brent Ozar Unlimited boxer shorts, signed by the team
An official Red Gate secret decoder ring
A copy of Deadpool volume 3 issue #3 signed by Grant Fritchey and Neil Hambly
An illustrated guide to pronouncing Mladen Prajdic’s name
SQL Ferret
A grappling hook
One of those tiny umbrellas
Knighthood
A pair of solid gold six-shooters
A bottle of water from Mars
My own Comedy Central Special
18 mosquito nets

Troubleshooting Empty String Comparison Issues

2013-10-07T21:25:00-05:00

While testing an SSIS package today, I discovered that I was missing some rows I expected to build. Digging in revealed that the missing rows had a column in the key that looked like an empty string to me, but that didn’t seem to be comparing correctly. SQL Server is usually pretty good about this, so I had a mystery to solve!

The first thing to realize about SQL Server is that an empty string and a string composed of all spaces are considered to be equivalent. That is, the following code will return Yes:

SELECT CASE
       WHEN '' = '    '
       THEN 'Yes'
       ELSE 'No'
       END as Are_empty_strings_equivalent_to_spaces

Note that the empty string ‘’ and a string with a NULL value are not the same thing! An empty string is just a string with a length of zero. So, if I compare what appears to be two CHAR(22) fields that are composed of spaces for equality, I should get a value of TRUE returned. The fact that I wasn’t seeing this was an indicator that something weirder was going on.
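A small sketch of those distinctions, using only literals so it can be run anywhere: equality ignores trailing spaces, DATALENGTH() still sees them, and a comparison against NULL never comes back true.

```sql
-- Equality pads the shorter string with spaces, but the stored bytes differ.
SELECT CASE WHEN '' = '    ' THEN 'Yes' ELSE 'No' END AS equal_under_comparison -- Yes
     , DATALENGTH('')                                 AS empty_bytes            -- 0
     , DATALENGTH('    ')                             AS spaces_bytes           -- 4
     , LEN('    ')                                    AS spaces_len             -- 0 (LEN ignores trailing spaces)
     , CASE WHEN '' = NULL THEN 'Yes' ELSE 'No' END   AS equal_to_null;         -- No
```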

The first tool I pulled out of the box was the trusty pair of LTRIM() and RTRIM(). LTRIM() removes leading spaces from the passed string, and RTRIM() removes trailing spaces. I really didn’t expect this to work, but gave it a try anyway because I tend to stumble upon weird stuff like this from time to time where SQL Server doesn’t behave as I expect. That’s not to say that it behaves incorrectly – I’ll believe my understanding is at fault first! Anyway, I did check my strings for their LEN() attributes when wrapped in LTRIM() and RTRIM(), and discovered that one version of the string showed length 0 (the empty string), and the other showed length 20! Curiouser and curiouser.
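The check looked roughly like the sketch below. The two variables are stand-ins I made up for the real columns: one holds 22 spaces, the other mimics the troublesome value with 20 NUL characters followed by 2 spaces.

```sql
-- Hypothetical stand-ins for the two CHAR(22) key columns.
DECLARE @all_spaces CHAR(22) = REPLICATE(' ', 22);
DECLARE @suspect    CHAR(22) = REPLICATE(CHAR(0), 20) + '  ';

-- LTRIM() and RTRIM() only strip spaces, so anything else that merely
-- looks blank survives the trim and still counts towards LEN().
SELECT LEN(LTRIM(RTRIM(@all_spaces))) AS all_spaces_len  -- 0
     , LEN(LTRIM(RTRIM(@suspect)))    AS suspect_len;    -- compare against 0
```

Any non-zero length for a value that displays as blank means something other than spaces is hiding in there.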

My next stop on this train was to make use of the ASCII function to examine what was actually in those strings. This function tells you the ASCII code of the first character in the string you pass it. When combined with SUBSTRING, you can examine any character in your string to see what it is. I was looking for ASCII codes of 32 (space), but to my surprise the source column had 0 (NUL) in 20 of the 22 characters! As it turns out, at some point in the past, our DB2 source-side system put those NUL terminators into the first 20 bytes of that column, and our previous ETL solution maintained the NUL bytes into the data warehouse source layer. However, SSIS quite helpfully substituted space characters for the NUL characters, meaning that when the time came to join back to the original table, the comparison failed and we lost rows.
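One way to do that character-by-character inspection in a single query is sketched here. The variable is a made-up stand-in for the real column, and sys.all_objects is used only as a convenient source of rows for generating positions 1 through 22.

```sql
-- Hypothetical stand-in for the CHAR(22) key column: 20 NULs, then 2 spaces.
DECLARE @suspect CHAR(22) = REPLICATE(CHAR(0), 20) + '  ';

-- Generate one row per character position, then read each character's code.
WITH positions AS (
    SELECT TOP (22) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS pos
      FROM sys.all_objects
)
SELECT pos
     , ASCII(SUBSTRING(@suspect, pos, 1)) AS ascii_code  -- 32 = space, 0 = NUL
  FROM positions
 ORDER BY pos;
```

Against a real table, the same idea works by cross joining the positions to the rows you are suspicious of.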

In our case, this is old data so we’re just going to plug it. However, I could have stuck a REPLACE(colname,CHAR(0),CHAR(32)) in the staging query and solved the problem as well. If you run into a similar problem, that may be the solution you want, though you may be dealing with different characters. Hope it helps!
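For reference, a sketch of what that staging query might look like. The table and column names are hypothetical. The COLLATE clause is an assumption on my part: REPLACE() is reported to behave inconsistently with CHAR(0) under some Windows collations, and forcing a binary collation for the comparison is a common workaround, so test it against your own collation before relying on it.

```sql
-- Hypothetical staging query: swap NUL characters for spaces on the way in.
-- The binary collation is an assumed workaround, not a documented requirement.
SELECT REPLACE(SRC.KEY_TEXT COLLATE Latin1_General_BIN, CHAR(0), CHAR(32)) AS KEY_TEXT
  FROM dbo.SRC_TABLE AS SRC;
```

Note that the result column then carries the binary collation, which may matter if it is compared to other columns later in the same query.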

Why Maintenance Is Important

2013-10-05T22:31:00-05:00

This evening I got into my vehicle to come home from about 45 minutes away to discover, much to my chagrin, that the truck wouldn’t start. I had jumper cables and friends nearby to get me going, but I wasn’t sure I would make it home – there was a cable with a bit of a loose connection and nothing I could really do to fix it at the time. I finally adjusted the cable sufficiently to keep the truck running and made it home, but it left me thinking – why don’t I have a maintenance plan for my truck?

The idea behind maintenance is that if you sacrifice a little bit of time regularly, you’re less likely to have a major malfunction later down the road. This obviously applies to cars, but it also applies to SQL Server databases. Brad McGehee has written a great book on SQL Server maintenance plans – it sits on my bookshelf at work and taunts me about how I need to figure out how to apply it to developer work. Brad’s book is targeted at DBAs. I do have an idea or two about how an ETL developer can plan for regular maintenance:

Plan performance reviews for your packages in Production. Look at what’s running the longest, and what individual steps within each package are running the longest. Watch especially for big changes in runtime and map out what changed to cause the new runtimes.
Periodically review what individual pieces of code do. Rejustify it and analyze whether it could be rewritten to better fit business needs.
Read lots of blog posts and consider how each new technique could enhance your ETL process.
Review index usage stats in your databases to ensure that you have the correct indexes and that you don’t have any taking up space but not helping out.
Take the time to review your architecture and make sure it’s still working for you instead of against you.

Those are just a few ideas, but I hope they help you to start thinking about what you can do to maintain what you build. Maintenance isn’t often fun or glamorous, but it can reduce the number of emergencies you have.
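For the index usage review mentioned in the list above, a starting point might look like the following sketch. It uses the sys.indexes catalog view and the sys.dm_db_index_usage_stats DMV for the current database; how you interpret the numbers (and what counts as "unused") is up to you.

```sql
-- Usage counters for every index on user tables in the current database.
-- Indexes with no row in the DMV have not been touched since the counters
-- were last reset (they reset when the instance restarts).
SELECT OBJECT_SCHEMA_NAME(i.object_id) AS schema_name
     , OBJECT_NAME(i.object_id)        AS table_name
     , i.name                          AS index_name
     , us.user_seeks
     , us.user_scans
     , us.user_lookups
     , us.user_updates                 -- maintenance cost paid on writes
  FROM sys.indexes AS i
  LEFT JOIN sys.dm_db_index_usage_stats AS us
    ON us.object_id   = i.object_id
   AND us.index_id    = i.index_id
   AND us.database_id = DB_ID()
 WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
   AND i.index_id > 0                  -- skip heaps
 ORDER BY schema_name, table_name, index_name;
```

An index with plenty of user_updates and no seeks, scans, or lookups over a representative period is a candidate for a closer look.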

Harassment and Culture

2013-10-04T10:53:00-05:00

When and where I grew up, it was far from uncommon to hear this phrase: “If I didn’t pick on you, you’d think I didn’t like you anymore.” I received and gave a lot of good-natured ribbing, which certainly helped with my sense of humor and my tolerance of others. It’s a philosophy that has served me well, but I’m also aware that it’s a philosophy that, taken to an extreme, could result in bullying or harassment.

This morning, Denise McInerny posted a blog post about the PASS Anti-Harassment Policy as a reminder of the kind of behavior that will not be tolerated at the PASS Summit in just over a week. I applaud the intent of such a policy while lamenting its necessity. I also don’t anticipate any violations – in my time in the SQL Server community I have met nothing but class acts.

With that in mind, though, I do want to comment a bit on some things that the AHP doesn’t specifically cover. In the past, I have had the opportunity to work with people from diverse geographical and cultural backgrounds, as well as a diverse set of ages and experience levels. The truth is that what is perfectly acceptable or even encouraged in one culture or amongst one generation is seen as impolite or even taboo in another. I was reminded of this last night at a non-SQL Server event when a young lady was recounting how some of her friends had spent quite a lot of time discussing the size of a part of her body in a way that was meant to be complimentary, but that would have been considered offensive when and where I was her age, and Denise’s blog post this morning served as additional reinforcement of the idea that perhaps it’s better to be overly explicit in what’s acceptable than to have to worry about charges of harassment being thrown around (merited or otherwise.)

With that in mind, here’s my personal checklist of things to watch out for in your own behavior if you’re attending the Summit (or even just interacting with others on Twitter):

Please refrain from comments on body parts (yours or those of others). What you mean as a compliment could very well be seen as harassment.
You want to sell yourself well, but too much self-promotion is often seen as bragging.
If someone else has done the work, do not try to pass it off as your own. In American culture, there exists a fairly strong ethic that copying the work of others and passing it off as your own is bad – we call it plagiarism. This isn’t a concept that exists as strongly in some other cultures, but it’s important to be aware of it when dealing with professionals from a multi-national perspective.
Don’t force your presence on others. There are hundreds of professionals attending PASS – you want to network as much as possible. If someone doesn’t seem overly interested in your attention, excuse yourself and find someone else to talk to.
Try to limit your use of ‘acceptable’ stereotype humor. I’m a hillbilly from Missouri, and I make plenty of jokes about people from Kansas, Arkansas, and Illinois. I’ve run most of them by people from those areas, so I’m fairly comfortable telling those jokes. My mother, who is naturally blonde, tells more dumb blonde jokes than anyone else I’ve met. This topical humor is alright for the places I call home, but I’ll be leaving these jokes home when I go to the Summit.
Don’t make the mistake of looking down on someone because of where they come from, what they do, or how long they’ve been doing it. We all had to start somewhere, and people will pick up on disdain more easily than you think. If you’re friendly with me but harsh with others, we won’t become fast friends.
Be very, very careful about physical contact and personal space issues. Americans tend to stand further apart and speak louder than in some other cultures. Some people enjoy physical cues of affection like patting or squeezing the shoulder, others don’t. When in doubt, don’t.
Never assume that the interactions of others can be emulated by you. If two old friends slap each other on the back and call each other names, that is not an invitation for you to do the same.
Above all else, prepare to have a good time and meet lots of neat people who also want to have a good time. Just don’t let your good time become someone else’s bad night.

Meta: Blogging is Hard

2013-10-03T23:16:00-05:00

I have no doubt about it: Blogging is hard.

I’ve been attempting to blog since the early days of blogging, when your choices were LiveJournal or Xanga and Geocities pages were all the rage. I say attempting because I never really could seem to stick with it. I’d be interested for a few days or weeks, but then I’d fall off as I got busy or just bored. It’s easy to come up with excuses, but at the end of the day I was just not motivated or disciplined enough to make it happen.

It’s with this understanding in mind that I’ve promised myself that I will blog once a day through the end of 2013. The important thing is not to be read, or to be popular, or even to be helpful. The important thing is to do it. No excuses about lots happening at work. No excuses about it being my vacation. No excuses about coming home tired after a long day at a cowboy action shoot. I’m not committing to a certain length or depth of topic, as long as I write something of substance every day without fail. Blogging is hard, but they say that adversity breeds character, and I’ve always held to the philosophy that nothing worth doing is easy. Repetition will become discipline, and all the nights spent pondering what to write will remind me that there is so much more to learn out there.

To anyone who is reading this and considering starting blogging or considering picking up one that you started and quit on before, I’d encourage you to commit to a blogging challenge. In the end you’ll learn more than you could have imagined. I just don’t recommend an every day thing – three times a week for a month is probably plenty. Shoot, ping me on Twitter when you write a new post and I’ll be happy to read it.

Twitter Tip: Be Kind To Your Followers

2013-10-02T20:47:00-05:00

If you’ve been involved with the SQL Server community at all, you should know that one of the best ways to keep in touch is through Twitter. While there are a number of things you can learn about how to use Twitter from the basic mechanics to how to leverage its strengths (I recommend this free ebook from Brent Ozar), I’d just like to give a couple of tips on how to be kind to your followers of various interests.

We’ve all been there. An insightful or funny person we follow suddenly goes on a tweeting overload about politics, or a sporting event, or live-tweeting an event they’re thrilled to be attending but that we’re not quite so thrilled to be hearing about. It’s not that they’re a bad person, they’re just a normal person with diverse interests, and sometimes those interests don’t sync up with our own. So, how do you avoid being that person? Sure, you’ve got like 30 followers, but you still want to be respectful of their time and their Tweetstream. I have two suggestions that can help you be an unsung hero of Twitter.

My first suggestion is for topics that are either controversial or will produce a large number of tweets over time. Politics and religion are usually two of the more egregious ones, but if you find yourself letting loose with 200 tweets per Cardinals game or 90 pictures per day from your vacation you should also consider this tip: Make a separate Twitter account.

Story time: At one point on my personal Twitter account, I found myself developing more of an interest in political activism, which actually cost me some of my hard-won followers. I found it to be easier to move all my political content to a new account, which meant signing up with a new email address. Considering that Gmail, Hotmail, and many other webmail services are free, this really shouldn’t be a barrier. What I found was that I gained many, many new followers on my politics-oriented account, and meanwhile I built back up my follower count on my personal account by not “spamming” them with what is admittedly a controversial topic. In much the same way, I consider the @sqlslacker account to be my professional branding account, and so I try to be more careful about what I say and the topics I cover there.

My second suggestion, which can be either substituted for the first or supplement the first, is to make use of hashtags whenever you’re engaging in a particular topic. The primary advantage of this approach is that many Twitter clients today allow you to filter out certain hashtags. If I put a filter on #NFL for example, then I won’t be inundated by football-related tweets. I can also reject event-related tweets that I really don’t care about without unfollowing someone who usually generates great content. A side benefit of using hashtags is that it will allow any community formed around that topic to see your input and interact with you without having to seek out your account in particular.

Happy Tweeting!

T-SQL Trickery: An Alternative To OR

2013-10-01T20:39:00-05:00

One of the less pleasant aspects of doing ETL coding is dealing with requirements that don’t allow for straightforward, well-performing code. One of the trickier aspects of pulling from multiple source tables is determining whether those tables have changed. Many times, you have to make use of staging tables to avoid having to string together several performance-killing OR conditions, but sometimes there is an easier way…

Let’s say you have the following predicates in your WHERE clause:

AND (A.ETL_LOAD_TIMESTAMP > ETL.LAST_LOADED_TIMESTAMP
  OR B.ETL_LOAD_TIMESTAMP > ETL.LAST_LOADED_TIMESTAMP
  OR C.ETL_LOAD_TIMESTAMP > ETL.LAST_LOADED_TIMESTAMP
  OR D.ETL_LOAD_TIMESTAMP > ETL.LAST_LOADED_TIMESTAMP
  OR E.ETL_LOAD_TIMESTAMP > ETL.LAST_LOADED_TIMESTAMP)

Depending on indexing and the query plan SQL Server chooses, this can perform pretty badly. I dealt with one query this summer that did this for fourteen tables – and the “developer lead” I was working with couldn’t figure out why it ran so slow!

Anyway, after a little bit of deliberation, I came up with an alternative approach which took our runtime down from 20 minutes to about 10:

AND ETL.LAST_LOADED_TIMESTAMP <
    (SELECT MAX(ETL_LOAD_TIMESTAMP) FROM (
            SELECT A.ETL_LOAD_TIMESTAMP
             UNION ALL
            SELECT B.ETL_LOAD_TIMESTAMP
             UNION ALL
            SELECT C.ETL_LOAD_TIMESTAMP
             UNION ALL
            SELECT D.ETL_LOAD_TIMESTAMP
             UNION ALL
            SELECT E.ETL_LOAD_TIMESTAMP) CANDIDATE_TIMESTAMPS )

This is a T-SQL design pattern which I’ve found handy time and again. If you don’t care if all the timestamps are greater as long as one or more is greater, then this is the ticket. It’s also handy for working with date overlaps – you can take the highest begin and the lowest end from all choices, which gives you the full overlap as long as your joins are correct. You can use this in WHERE clauses or in SELECTs. And note that I used UNION ALL instead of UNION – in this case, we’d lose more from the SORT and the DISTINCT functionality of UNION than we’d gain from not dealing with duplicate values.
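Here is a small sketch of the date overlap variant. The two inline row sets and their column names are made up for illustration; in practice A and B would be your joined tables.

```sql
-- Two hypothetical date ranges for the same key, built inline for the demo.
SELECT A.KEY_ID
     , (SELECT MAX(BEGIN_DATE)
          FROM (SELECT A.BEGIN_DATE
                 UNION ALL
                SELECT B.BEGIN_DATE) CANDIDATE_BEGINS) AS OVERLAP_BEGIN  -- latest begin
     , (SELECT MIN(END_DATE)
          FROM (SELECT A.END_DATE
                 UNION ALL
                SELECT B.END_DATE) CANDIDATE_ENDS)     AS OVERLAP_END    -- earliest end
  FROM (VALUES (1, CAST('2013-01-01' AS DATE), CAST('2013-06-30' AS DATE)))
       AS A (KEY_ID, BEGIN_DATE, END_DATE)
  JOIN (VALUES (1, CAST('2013-03-01' AS DATE), CAST('2013-12-31' AS DATE)))
       AS B (KEY_ID, BEGIN_DATE, END_DATE)
    ON B.KEY_ID = A.KEY_ID
   AND A.BEGIN_DATE <= B.END_DATE     -- only keep ranges that really overlap
   AND B.BEGIN_DATE <= A.END_DATE;
```

With the sample rows above, the overlap runs from 2013-03-01 through 2013-06-30. The join conditions do the work of making sure an overlap exists; the MAX and MIN only trim it to size.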

Hope it helps!

Test SQL: INFORMATION_SCHEMA For Fun And Profit

2013-09-30T19:45:00-05:00

Part of the joy of working as an ETL developer is that you get to spend a lot of time testing your code by validating large amounts of data. We have a testing tool with several built-in tests (which I maintain), but we periodically discover a new scenario which needs to be added to our toolbox. The fastest way I’ve found of mocking up these tests is with dynamic SQL using the INFORMATION_SCHEMA views.

Firstly, what are the INFORMATION_SCHEMA views? These are an ISO standard way of querying database metadata that stays relatively static throughout the life of the DBMS. There can be improvements and changes in the underlying system tables, but INFORMATION_SCHEMA should remain relatively unchanged. I like them because they bring together information from various system tables and views such as sys.indexes or sys.tables in a way that doesn’t require me to write the join logic as often. For mocking up a test query, they’re fantastic.

Let’s say that you have a column in most of the tables in your database called ETL_PROCESSED_TIMESTAMP that does what it says on the tin, and an ACTIVE_ROW_FLAG that tells you which row is the most recent look for that entity. You’ve discovered that Randy the unlucky intern accidentally deleted a row in one of your ETL control tables on the development server, and some of your test loads may not have any active rows loaded for that cycle. Randy doesn’t remember touching that table, and you tested fifteen table load packages over the past week. How do you go about determining which tables are affected?

You could write fifteen queries to see if you have any active rows that were touched during the latest load, but that’s a pain with three tables, and gets worse as it scales up. Also, it took you a week to discover this, and it could potentially happen again in the future. A better solution is to generate the query you need automatically, which will make it easy to stick in a testing tool – no custom code needed, just plug in some T-SQL.

That’s where the INFORMATION_SCHEMA views come in. We can make use of INFORMATION_SCHEMA.COLUMNS to find tables where we have both an ETL_PROCESSED_TIMESTAMP and an ACTIVE_ROW_FLAG, then build a query based off that and INFORMATION_SCHEMA.TABLES to tell us when there are no new active rows in a table. The first place to start is finding tables with the interesting columns:

select *
  from INFORMATION_SCHEMA.TABLES A
 where exists (
       select 1
         from INFORMATION_SCHEMA.COLUMNS C
        where C.TABLE_SCHEMA = A.TABLE_SCHEMA -- these views expose TABLE_SCHEMA, not SCHEMA_NAME
          and C.TABLE_NAME = A.TABLE_NAME
          and C.COLUMN_NAME = 'ETL_PROCESSED_TIMESTAMP'
       )
   and exists (
       select 1
         from INFORMATION_SCHEMA.COLUMNS C
        where C.TABLE_SCHEMA = A.TABLE_SCHEMA
          and C.TABLE_NAME = A.TABLE_NAME
          and C.COLUMN_NAME = 'ACTIVE_ROW_FLAG'
       )

That’s a good start to any testing query. Looking at the raw output can let you determine where you need to make any modifications, such as only querying a certain schema, or tables with a certain naming scheme. For my purposes, I will assume I want to work with everything I see here, so the next step is to write some dynamic SQL that will generate my test for me:

-- Generate one count query per qualifying table.
-- quotename() brackets each identifier so unusual names still work.
select 'select count(*) from ' + quotename(TABLE_SCHEMA) + '.'
     + quotename(TABLE_NAME) + ' where ACTIVE_ROW_FLAG = 1 and '
     + 'ETL_PROCESSED_TIMESTAMP = (SELECT MAX(ETL_PROCESSED_TIMESTAMP) '
     + 'from ' + quotename(TABLE_SCHEMA) + '.' + quotename(TABLE_NAME) + ');'
  from INFORMATION_SCHEMA.TABLES A
 where exists (
       select 1
         from INFORMATION_SCHEMA.COLUMNS
        where TABLE_SCHEMA = A.TABLE_SCHEMA
          and TABLE_NAME = A.TABLE_NAME
          and COLUMN_NAME = 'ETL_PROCESSED_TIMESTAMP'
       )
   and exists (
       select 1
         from INFORMATION_SCHEMA.COLUMNS
        where TABLE_SCHEMA = A.TABLE_SCHEMA
          and TABLE_NAME = A.TABLE_NAME
          and COLUMN_NAME = 'ACTIVE_ROW_FLAG'
       )

This is a fairly good start, and will give the output as a series of SQL queries that can be copied and run individually or run as a whole in SSMS. You could write a cursor to go through the result set and call exec() on each query, or you could get a little fancier:

-- Number the qualifying tables so we know which one comes last.
with basequery as (
select ROW_NUMBER() OVER (ORDER BY TABLE_SCHEMA, TABLE_NAME) AS RANKING
      , TABLE_SCHEMA
      , TABLE_NAME
  from INFORMATION_SCHEMA.TABLES A
 where exists (
       select 1
         from INFORMATION_SCHEMA.COLUMNS
        where TABLE_SCHEMA = A.TABLE_SCHEMA
          and TABLE_NAME = A.TABLE_NAME
          and COLUMN_NAME = 'ETL_PROCESSED_TIMESTAMP'
       )
   and exists (
       select 1
         from INFORMATION_SCHEMA.COLUMNS
        where TABLE_SCHEMA = A.TABLE_SCHEMA
          and TABLE_NAME = A.TABLE_NAME
          and COLUMN_NAME = 'ACTIVE_ROW_FLAG'
       )
)

-- TABLE is a reserved word, so the generated column is aliased TABLE_NAME.
select 'select ''' + TABLE_SCHEMA + '.' + TABLE_NAME + ''' AS TABLE_NAME, '
     + 'count(*) AS ACTIVE_ROWS from ' + quotename(TABLE_SCHEMA) + '.'
     + quotename(TABLE_NAME) + ' where ACTIVE_ROW_FLAG = 1 and '
     + 'ETL_PROCESSED_TIMESTAMP = (SELECT MAX(ETL_PROCESSED_TIMESTAMP) '
     + 'from ' + quotename(TABLE_SCHEMA) + '.' + quotename(TABLE_NAME) + ') '
     + case when exists (
                 select 1
                   from basequery
                  where RANKING > A.RANKING)
            -- every row is already distinct, so UNION ALL skips the dedupe
            then 'union all '
            else ''
            end
  from basequery A
 order by RANKING

That guy will build you an all-in-one query that pulls from all tables and combines everything into one result set, with the number of rows it found and the name of the table it queried. I added some logic to order all the rows so that it automatically stops adding UNIONs when it gets to the last row. The thing to keep in mind here is that if there are a lot of tables and these columns are not indexed, the generated query could take a while. You could add some where clauses to limit it to certain schemas or even particular tables. You could also build a CTE where you pull the results and then only show the rows where the count equals zero to zoom in on troublemakers. It’s also possible to have it build in GOs for you, which means you start getting rows earlier but go back to getting them one at a time. These are all left as exercises for the reader.
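If you’d rather not copy and paste generated SQL at all, here is a rough sketch of the cursor route mentioned earlier, combined with the "only show me the zero counts" idea. Treat it as one possible approach rather than the definitive one. The names @results and table_cursor are made up for illustration, and it assumes ACTIVE_ROW_FLAG can be compared to 1 (as in the queries above) and that you have permission to read every table it finds.

```sql
-- Sketch only: run the active-row check table by table and report
-- the tables with zero active rows for their latest load.
-- @results and table_cursor are hypothetical names.
declare @results table (TABLE_NAME nvarchar(300), ACTIVE_ROWS int);
declare @schema sysname, @table sysname, @sql nvarchar(max), @rows int;

-- Base tables that contain both columns of interest.
declare table_cursor cursor local fast_forward for
select C.TABLE_SCHEMA, C.TABLE_NAME
  from INFORMATION_SCHEMA.COLUMNS C
  join INFORMATION_SCHEMA.TABLES T
    on T.TABLE_SCHEMA = C.TABLE_SCHEMA
   and T.TABLE_NAME = C.TABLE_NAME
 where T.TABLE_TYPE = 'BASE TABLE'
   and C.COLUMN_NAME in ('ETL_PROCESSED_TIMESTAMP', 'ACTIVE_ROW_FLAG')
 group by C.TABLE_SCHEMA, C.TABLE_NAME
having count(distinct C.COLUMN_NAME) = 2;

open table_cursor;
fetch next from table_cursor into @schema, @table;

while @@FETCH_STATUS = 0
begin
    -- Same test as the generated queries, with the count returned
    -- through an output parameter instead of a result set.
    set @sql = N'select @rows = count(*) from '
             + quotename(@schema) + N'.' + quotename(@table)
             + N' where ACTIVE_ROW_FLAG = 1 and ETL_PROCESSED_TIMESTAMP = '
             + N'(select max(ETL_PROCESSED_TIMESTAMP) from '
             + quotename(@schema) + N'.' + quotename(@table) + N');';

    exec sp_executesql @sql, N'@rows int output', @rows = @rows output;

    insert into @results (TABLE_NAME, ACTIVE_ROWS)
    values (@schema + N'.' + @table, @rows);

    fetch next from table_cursor into @schema, @table;
end

close table_cursor;
deallocate table_cursor;

-- Only the troublemakers.
select TABLE_NAME, ACTIVE_ROWS
  from @results
 where ACTIVE_ROWS = 0
 order by TABLE_NAME;
```

The trade-off is the same one mentioned above for GOs: each table is queried on its own, so nothing is combined into a single plan, but one slow or broken table is easier to spot. An empty table also shows up with a count of zero, which is probably what you want for this test.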