laika222
1/5/2018 - 1:44 AM

Window Functions

/* WINDOW FUNCTIONS - operates on a window (set) of rows. Includes ranking, offest, aggregate, and distribution functions. The OVER clause defines a window or user-specified set of rows within a query result set, which is used before the window function then computes a value for each row in the window.

Some notable window functions:

RANKING FUNCTIONS:

RANK - returns the rank of each row within the partition of a result set. May include ties and gaps.

DENSE_RANK - returns the rank of each row within the partition of a result set. May include ties but will not include gaps.

ROW_NUMBER - returns a unique sequential row number within partition based on current order.

NTILE - distributes the rows in an ordered partition into a specified number of groups. Returns the number of the group to which the current row belongs.

OFFSET FUNCTIONS:

LAG - returns an expression from a previous row that is a defined offset from the current row. Returns NULL if no row at specified position.

LEAD - returns an expression from a later row that is a defined offset from the curret row. Returns NULL if no row at specified position.

FIRST_VALUE - returns the first value in the current window frame. Requires window ordering to be meaningful.

LAST_VALUE - returns the last value in the current window frame. Requires window ordering to be meaningful.

*/

-- RANK example, which looks at values and gives them a numeric ranking based on the value.
SELECT TOP 3 ProductID, Name, ListPrice
	RANK() OVER(ORDER BY ListPrice DESC) AS RankByPrice
FROM products1
ORDER BY RankByPrice;

-- Using ROW_NUMBER to count the rows, and RANK to rank them along with a PARTION (the ranking restarts each time a new SalesPersonID comes up in the result set: TITION (allows you to reset the numbering each time a new partition comes up. 
SELECT SalesPersonID, TotalDue, 
  ROW_NUMBER() OVER(ORDER BY SalesPersonID, TotalDue) AS 'Row Count',
  RANK() OVER(PARTITION BY SalesPersonID ORDER BY SalesPersonID, TotalDue) AS 'Rank by SalesPerson'
FROM Sales.SalesOrderHeader
ORDER BY SalesPersonID;

/* Excerpt from results. Note how ROW_NUMBER is not partitioned, so it counts the rows throughout entire result set. RANK is partioned by SalesPersonID, so it restarts the ranking when a new SalesPersonID comes up:

SalesPersonID TotalDue      Row Count   Rank by SalesPerson
...
274           95426.0186    46          46
274           97248.3525    47          47
274           126852.1615   48          48
275           6.3484        49          1
275           12.0152       50          2
275           18.3792       51          3
278           22.3061       52          1
278           25.3868       53          2
278           27.4401       54          3
278           27.4401       54          4
...

*/

-- using an aggregate function with OVER will give you an aggregate for that window. Below, SUM on Amount ORDERed BY Territory will show total sales for that Territory
SELECT *,
  SUM(Amount) OVER(ORDER BY Territory) AS 'Total Sales for Territory'
FROM windowTester
ORDER BY Territory, SalesPerson, SalesYear;

/* Results - note how 'Total Sales for Territory' shows total sales per Territory:

SalesPerson Territory SalesYear Amount    Total Sales for Territory
12          1         2011      89.20     9288.48
12          1         2012      239.30    9288.48
12          1         2012      37.98     9288.48
12          1         2013      8922.00   9288.48
819         3         2011      353.20    3140.80
819         3         2014      2523.20   3140.80
819         3         2017      22.20     3140.80
819         3         2018      242.20    3140.80
224         5         2010      10.00     14931.48
224         5         2011      11.20     14931.48
2522        5         2012      347.20    14931.48
2522        5         2013      455.20    14931.48
2522        5         2013      555.20    14931.48
2522        5         2017      777.20    14931.48
2522        5         2019      346.20    14931.48

*/

-- creating running total using partion on Territory, so the cumulative count will restart at each new Territory.
SELECT *,
	SUM(Amount) OVER(
              PARTITION BY Territory
              ORDER BY Territory, SalesYear, Amount
              ) AS 'Running Total Sales'
FROM windowTester
ORDER BY Territory, SalesPerson, SalesYear;

/* Results - note how 'Running Total Sales' shows cumulative total sales, and the count restarts each time a new Territory comes about:

SalesPerson Territory SalesYear Amount    Running Total Sales
12          1         2011      89.20     89.20
12          1         2012      37.98     127.18
12          1         2012      239.30    366.48
12          1         2013      8922.00   9288.48
819         3         2011      353.20    353.20
819         3         2014      2523.20   2876.40
819         3         2017      22.20     2898.60
819         3         2018      242.20    3140.80
224         5         2010      10.00     10.00
224         5         2011      11.20     21.20
2522        5         2012      347.20    368.40
2522        5         2013      455.20    823.60
2522        5         2013      555.20    1378.80
2522        5         2017      777.20    2156.00
2522        5         2019      346.20    2502.20

*/

-- since you're ORDERing BY Territory, SalesYear, and Amount, you don't have to use ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. That clause is implied, and the running total will increase each time a new combination of Territory, SalesYear, and Amount comes up, which in this case is each line in the results set.
SELECT *,
	SUM(Amount) OVER(
              PARTITION BY Territory
              ORDER BY Territory, SalesYear, Amount
              ) AS 'Running Total Sales'
FROM windowTester
ORDER BY Territory, SalesPerson, SalesYear;

-- example of running total that uses less ordering which will make the running total increase less frequently. In this case, the total will partition by Territory (it'll restart with each new Territory), and it'll only increase when a new SalesYear comes up per the ORDER BY clause.
SELECT *,
  SUM(Amount) OVER(
              PARTITION BY Territory
              ORDER BY SalesYear
              ) AS 'Running Total Sales'
FROM windowTester
ORDER BY Territory, SalesPerson, SalesYear;

/* Results - note how 'Running Total Sales' shows cumulative total sales, but the count only increases when a new Territory and SalesYear combination comes up. Note how there is the same 'Running Total' number for both Territory 1 SalesYear 2012 lines: 

SalesPerson Territory SalesYear Amount    Running Total Sales
12          1         2011      89.20     89.20
12          1         2012      239.30    366.48
12          1         2012      37.98     366.48
12          1         2013      8922.00   9288.48
819         3         2011      353.20    353.20
819         3         2014      2523.20   2876.40
819         3         2017      22.20     2898.60
819         3         2018      242.20    3140.80
224         5         2010      10.00     10.00
224         5         2011      11.20     21.20
2522        5         2012      347.20    368.40
2522        5         2013      455.20    1378.80
2522        5         2013      555.20    1378.80
2522        5         2017      777.20    2156.00
2522        5         2019      346.20    2502.20

*/

-- how to use aggregate with OVER to calculate a percentage of the window, in this case, what percentage each amount constitutes for the Total Sales for the Territory.
SELECT *,
  SUM(Amount) OVER(ORDER BY Territory) AS 'Total Sales for Territory',
  CAST(Amount / SUM(Amount) OVER(ORDER BY Territory) *100 AS DECIMAL(5,2)) AS 'Percentage of Territory'
FROM windowTester
ORDER BY Territory, SalesPerson, SalesYear;

/*Results:

SalesPerson Territory   SalesYear   Amount      Total Sales for Territory Percentage of Territory
----------- ----------- ----------- ----------- ------------------------- -----------------------
12          1           2011        89.20       9288.48                   0.96
12          1           2012        239.30      9288.48                   2.57
12          1           2012        37.98       9288.48                   0.40
12          1           2013        8922.00     9288.48                   96.05
819         3           2011        353.20      12429.28                  2.84
819         3           2014        2523.20     12429.28                  20.30
819         3           2017        22.20       12429.28                  0.17
819         3           2018        242.20      12429.28                  1.94
224         5           2010        10.00       14931.48                  0.06
224         5           2011        11.20       14931.48                  0.07
2522        5           2012        347.20      14931.48                  2.32
2522        5           2013        455.20      14931.48                  3.04
2522        5           2013        555.20      14931.48                  3.71
2522        5           2017        777.20      14931.48                  5.20
2522        5           2019        346.20      14931.48                  2.31

*/ 

-- how to use AVG with OVER to calculate a moving average. In this case, the average between the calues is calculated with each new SalesYear and Amount combination comes up, and the average resets with each new Territory that comes up.
SELECT *,
  AVG(Amount) OVER(Partition BY Territory ORDER BY SalesYear, Amount) AS 'Moving Average for Territory'
FROM windowTester
ORDER BY Territory, SalesPerson, SalesYear;

/*Results:

SalesPerson Territory   SalesYear   Amount                Moving Average for Territory
----------- ----------- ----------- --------------------- ----------------------------
12          1           2011        89.20                 89.20
12          1           2012        37.98                 63.59
12          1           2012        239.30                122.16
12          1           2013        8922.00               2322.12
819         3           2011        353.20                353.20
819         3           2014        2523.20               1438.20
819         3           2017        22.20                 966.20
819         3           2018        242.20                785.20
224         5           2010        10.00                 10.00
224         5           2011        11.20                 10.60
2522        5           2012        347.20                122.80
2522        5           2013        455.20                205.90
2522        5           2013        555.20                275.76
2522        5           2017        777.20                359.3333
2522        5           2019        346.20                357.4571

*/ 

-- normally the OVER clause acts over the rows specified in the ORDER BY clause (it will recalculate each time a new combination in the ORDER BY comes up), and it will partition by the PARTITION BY clause. You can also manually specify the rows the OVER claue will act upon by including a ROWS clause. In the example below, it SUMs the current row and the 2 rows after. Note how it only acts upon rows within the partition. Two rows from the end of the partition, it only counts that row and one after, and in the last row only counts that row (since the following rows are outside the partition).
SELECT *,
	SUM(Amount) OVER(
              PARTITION BY Territory
              ORDER BY Territory, SalesYear, Amount
              ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
              ) AS 'Running Total Sales'
FROM windowTester
ORDER BY Territory, SalesPerson, SalesYear;

/*Results:

SalesPerson Territory   SalesYear   Amount                Running Total Sales
----------- ----------- ----------- --------------------- ---------------------
12          1           2011        89.20                 366.48
12          1           2012        37.98                 9199.28
12          1           2012        239.30                9161.30
12          1           2013        8922.00               8922.00
819         3           2011        353.20                2898.60
819         3           2014        2523.20               2787.60
819         3           2017        22.20                 264.40
819         3           2018        242.20                242.20
224         5           2010        10.00                 368.40
224         5           2011        11.20                 813.60
2522        5           2012        347.20                1357.60
2522        5           2013        455.20                1787.60
2522        5           2013        555.20                1678.60
2522        5           2017        777.20                1123.40
2522        5           2019        346.20                346.20

*/ 

-- creating a running total through entire result set by adding ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Note how 'Running Total Sales' shows cumulative total. 
SELECT *,
	SUM(Amount) OVER(
                  ORDER BY Territory 
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                  ) AS 'Running Total Sales'
FROM windowTester
ORDER BY Territory, SalesPerson, SalesYear;

/* Results - note how 'Running Total Sales' shows total sales per Territory:

SalesPerson Territory SalesYear   Amount    Running Total Sales
12          1         2011        89.20     89.20
12          1         2012        239.30    328.50
12          1         2012        37.98     366.48
12          1         2013        8922.00   9288.48
819         3         2011        353.20    9641.68
819         3         2014        2523.20   12164.88
819         3         2017        22.20     12187.08
819         3         2018        242.20    12429.28
224         5         2010        10.00     14920.28
224         5         2011        11.20     14931.48
2522        5         2012        347.20    12776.48
2522        5         2013        455.20    13231.68
2522        5         2013        555.20    13786.88
2522        5         2017        777.20    14564.08
2522        5         2019        346.20    14910.28

*/

-- LAG, allows you to compare other rows farther back in the result set. Format is LAG(expression to look at, what the offset is, default value if lagged row can't be found). In the example below, the LAG looks at the Amount column, and the 2 offset tells it to look two rows back (ORDERed BY Territory and the SalesYear). In the first two rows (before it's able to find a row 2 rows back), it returns the default of 0.
SELECT *,
  LAG(Amount, 2, 0) OVER(
                        ORDER BY Territory, SalesYear
                        ) AS 'Amount from 2 Rows Back'
FROM windowTester
ORDER BY Territory, SalesPerson, SalesYear;

/*Results:

SalesPerson Territory   SalesYear   Amount                Amount from 2 Rows Back
----------- ----------- ----------- --------------------- -----------------------
12          1           2011        89.20                 0.00
12          1           2012        239.30                0.00
12          1           2012        37.98                 89.20
12          1           2013        8922.00               239.30
819         3           2011        353.20                37.98
819         3           2014        2523.20               8922.00
819         3           2017        22.20                 353.20
819         3           2018        242.20                2523.20
224         5           2010        10.00                 22.20
224         5           2011        11.20                 242.20
2522        5           2012        347.20                10.00
2522        5           2013        455.20                11.20
2522        5           2013        555.20                347.20
2522        5           2017        777.20                455.20
2522        5           2019        346.20                555.20

*/ 

-- LEAD does the opposite of LAG, allows you to compare other rows ahead in the result set. Format is LEAD(expression to look at, what the offset is, default value if lagged row can't be found). In the example below, the LEAD looks at the Amount column, and the 3 offset tells it to look two rows ahead (ORDERed BY Territory and the SalesYear). In the last three rows (before it's able to find a row 3 rows ahead), it returns the default of 0.
SELECT *,
  LEAD(Amount, 3, 0)  OVER(
                          ORDER BY Territory, SalesYear
                          ) AS 'Amount from 3 Rows Ahead'
FROM windowTester
ORDER BY Territory, SalesPerson, SalesYear;

/*Results:

SalesPerson Territory   SalesYear   Amount                Amount from 3 Rows Ahead
----------- ----------- ----------- --------------------- ------------------------
12          1           2011        89.20                 8922.00
12          1           2012        239.30                353.20
12          1           2012        37.98                 2523.20
12          1           2013        8922.00               22.20
819         3           2011        353.20                242.20
819         3           2014        2523.20               10.00
819         3           2017        22.20                 11.20
819         3           2018        242.20                347.20
224         5           2010        10.00                 455.20
224         5           2011        11.20                 555.20
2522        5           2012        347.20                777.20
2522        5           2013        455.20                346.20
2522        5           2013        555.20                0.00
2522        5           2017        777.20                0.00
2522        5           2019        346.20                0.00

*/ 

/*
LAG and LEAD can be used to evalutate one row based on the value of the previous row. You just have to make sure that the ORDER BY in the ORDER clause of the LAG/LEAD window funciton
matched the ORDER BY clause of the entire query (so that the window function is determining order in the same way that you)
*/
SELECT customer_no
  , key_value
  -- show the previous row's value (just for illustrative purposes, not needed for the row_next_value window function that follows)
  , previous_row_value = lag(key_value, 1, 'not matched') OVER(PARTITION BY customer_no ORDER BY customer_no, last_update_dt)
  -- flag the current row based on the value if the previous row, ORDER BY must match the ORDER BY of the entire query
  , row_value_repeat = 
	  CASE WHEN key_value = lag(key_value, 1, 'not matched') OVER(PARTITION BY customer_no ORDER BY customer_no, last_update_dt) 
	  	THEN 'SAME' 
	  	ELSE 'NOT SAME' 
	  END
FROM TX_CUST_KEYWORD where customer_no IN(
ORDER BY customer_no, last_update_dt

/* Results in the row_value_repeat column:

customer_no key_value                      previous_row_value             row_value_repeat
----------- ------------------------------ ------------------------------ ----------------
236097      1946                           GRADUATE/PROFESSIONAL DEGREE   NOT SAME
236108      2214764                        not matched                    NOT SAME
236144      360                            not matched                    NOT SAME
239830      4250                           4250                           SAME
239830      4250                           4250                           SAME
239830      4250                           4250                           SAME
239830      4250                           4250                           SAME
239830      4250                           4250                           SAME
239830      4250                           4250                           SAME
236144      2292975                        360                            NOT SAME
236144      360                            2292975                        NOT SAME
236144      360                            360                            SAME
236144      360                            360                            SAME
236144      360                            360                            SAME
236144      360                            360                            SAME
236144      360                            360                            SAME
236144      360                            360                            SAME
236144      360                            360                            SAME
236144      360                            360                            SAME
236144      360                            360                            SAME
236144      360                            360                            SAME
236144      360                            360                            SAME
*/

-- how to segment by quartile by using NTILE. Format is NTILE(number of segments), so NTILE(4) will break the results into quartiles, NTILE(10) will break them into deciles, etc.
SELECT *,
  NTILE(10) OVER(ORDER BY Amount) AS 'Quartile'
FROM windowTester
ORDER BY Quartile, Amount;

/* Results:

SalesPerson Territory   SalesYear   Amount                Quartile
----------- ----------- ----------- --------------------- --------------------
224         5           2010        10.00                 1
224         5           2011        11.20                 1
819         3           2017        22.20                 1
12          1           2012        37.98                 1
12          1           2011        89.20                 2
12          1           2012        239.30                2
819         3           2018        242.20                2
2522        5           2019        346.20                2
2522        5           2012        347.20                3
819         3           2011        353.20                3
2522        5           2013        455.20                3
2522        5           2013        555.20                3
2522        5           2017        777.20                4
819         3           2014        2523.20               4
12          1           2013        8922.00               4

(15 row(s) affected)

*/

-- example using NTILE to assign by decile
SELECT *,
  NTILE(10) OVER(ORDER BY Amount) AS 'Decile'
FROM windowTester
ORDER BY Decile, Amount;

/* Results:

SalesPerson Territory   SalesYear   Amount                Decile
----------- ----------- ----------- --------------------- --------------------
224         5           2010        10.00                 1
224         5           2011        11.20                 1
819         3           2017        22.20                 2
12          1           2012        37.98                 2
12          1           2011        89.20                 3
12          1           2012        239.30                3
819         3           2018        242.20                4
2522        5           2019        346.20                4
2522        5           2012        347.20                5
819         3           2011        353.20                5
2522        5           2013        455.20                6
2522        5           2013        555.20                7
2522        5           2017        777.20                8
819         3           2014        2523.20               9
12          1           2013        8922.00               10

(15 row(s) affected)

*/

-- how to calculate the percentile of an array of values using PERCENTILE_DISC and PERCENTILE_CONT. PERCENTILE_DISC will find the percentile by using a value in the array, and PERCENTILE_CONT will interpolate the value (meaning if the median is between two values in the array, it will take the average of the two values). Examples of both PERCENTILE_DISC and PERCENTILE_CONT are below using the median percentile value 0.5. While the two PERCENTILE clauses leave a blank in-between the () in the OVER clause, note that you can insert a PARTION BY clause in-between the () to partition the clause - however, you can't use an ORDER BY clause within the () of the OVER clause when using one of the PERCENTILE functions.
SELECT Product, VendorID, Price, 
	PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Price)   
                            OVER () AS MedianCont,
	PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY Price)   
                            OVER () AS MedianDisc
	FROM products1  
	ORDER BY Price, VendorID;
	
/* Results - note that PERCENTILE_CONT shows the actual median which is the average of 27.30 and 200.00, while PERCENTILE_DISC returns 27.30 which is an actual value in the array
	
Product           VendorID    Price   MedianCont  MedianDisc
----------------  ----------- ------- ----------- ------------
Computer          1           1.20    113.65      27.30
Soda              1           5.78    113.65      27.30
Seed              5           12.50   113.65      27.30
Keyboard          2           17.75   113.65      27.30
Keyboard Stand    2           27.30   113.65      27.30
Baseball Bat      3           200.00  113.65      27.30
Flag              2           287.50  113.65      27.30
Guitar            5           300.75  113.65      27.30
Computer Desk     4           500.00  113.65      27.30
Guitar            1           2320.20 113.65      27.30

(10 row(s) affected)

*/