I was teaching a class today on more advanced SQL queries and we were discussing if you could use preceding and following in a single windowing function.
If you’re not familiar with windowing functions, they’re really cool. You can perform all kinds of trickery and magic with them. Let’s start with a simple example.
The dataset I’m using has flight data from the early 2000’s.
My first query is this:
SELECT DESTINATION_ID ,sum(ARRIVAL_DELAY) AS [Total Delay on arrival] FROM FACT.FLIGHTDATA FFD INNER JOIN DIM.DATE D ON FFD.DATEID = D.DIMDATEID WHERE D.Month = 1 AND D.Year = 2003 AND d.DayOfMonth = 1 GROUP BY DESTINATION_ID ORDER BY DESTINATION_ID
This query returns the destination ID’s and the sum of the delays on january first 2003. Nothing special here.
But what if we want the running sum of delays. Not just the total sum but the delays need to add up. For this we can use a windowing function.
SELECT DESTINATION_ID ,ARRIVAL_DELAY ,SUM(ARRIVAL_DELAY) OVER ( PARTITION BY DESTINATION_ID ORDER BY DESTINATION_ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS RUNNING_DELAY FROM FACT.FLIGHTDATA FFD INNER JOIN DIM.DATE D ON FFD.DATEID = D.DIMDATEID WHERE D.Month = 1 AND D.Year = 2003 AND d.DayOfMonth = 1 ORDER BY DESTINATION_ID
This query will return the running delay on every flight of the day. The way it works is that SQL picks up the data for the first destination id, orders it by destination id (a useless order but it’s just a demo, leave me alone) and then starts adding up the delay as each row gets processed. When the first destination is done, it resets the sum and starts all over again.
Running Sum on previous and current row
So far so good. Another technique is to look back in the rows. You could show a values from records ‘behind’ of the current record, you can also do some calculations with them. In this first case, I’m going to calculate my running sum on the current and previous row.
SELECT DESTINATION_ID ,ACTUAL_DEPARTURE_TIME ,ARRIVAL_DELAY ,SUM(ARRIVAL_DELAY) OVER ( PARTITION BY ACTUAL_DEPARTURE_TIME ORDER BY DESTINATION_ID ROWS BETWEEN 1 PRECEDING AND CURRENT ROW ) AS RUNNING_DELAY FROM FACT.FLIGHTDATA FFD INNER JOIN DIM.DATE D ON FFD.DATEID = D.DIMDATEID WHERE D.Month = 1 AND D.Year = 2003 AND d.DayOfMonth = 1 ORDER BY DESTINATION_ID
The query above is very much like the previous query, the difference is that there’s no unbounded preceding (meaning read everything from the current data window), but look one row back.
In this result set, you’ll notice the delays are quite steady. The delays continue to be calculated over the current row and the previous row. The keen eyed people will notice that the sorting being weird, for ID 6 and departure time 1318 the first row is at the bottom of the result set. My only explanation is that the rows are processed by the windowing function in a different order than my order by is. Something for another blog to dive deeper into.
Running sum with preceding and following
As I wrote earlier, you can go both ways, looking ahead and behind you. This query will look something like this
SELECT DESTINATION_ID ,ACTUAL_DEPARTURE_TIME ,ARRIVAL_DELAY ,SUM(ARRIVAL_DELAY) OVER ( PARTITION BY DESTINATION_ID, ACTUAL_DEPARTURE_TIME ORDER BY DESTINATION_ID ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING ) AS RUNNING_DELAY FROM FACT.FLIGHTDATA FFD INNER JOIN DIM.DATE D ON FFD.DATEID = D.DIMDATEID WHERE D.Month = 1 AND D.Year = 2003 AND d.DayOfMonth = 1 ORDER BY DESTINATION_ID, ACTUAL_DEPARTURE_TIME
This was the question from class, will it work? My answer, inspired by other great teachers, is to try it out for yourself. Learn by trying it out and explaining the results. I don’t know if they did it, but I found it fun enough to give it a shot as I wasn’t sure about the answer.
Now, I had to think a bit about the result set, because it looks quite weird.
On row one, it can’t look back, but it can look forward two rows with the current row included. -19 times 3 = -57.
The second row can look one row back, and the result is -19 times 4 = -76.
The third row is equally calculated.
Row 4 can’t look enough any more because the actual departure time rows are running out for 950. So the sum drops back to -57.
Row 5 drops even back further because there are no following rows any more, just the current row and the preceding row.
This means you can look both ways but remember that the current row always comes along. So when you’re looking back one row, it’s the previous row and the current row. If you need to calculate a running sum over 6 rows, make sure your preceding or following statement is 5 to take the current row into account.
This was an excellent question from my class and one of the reasons I just love teaching. Classes come up with the best questions that will trigger you to dig deeper into something, enabling you to learn more.
If you think my explanations are not clear (enough) or I’ve missed a point, please let me know!
Thanks for reading!