- ROLLUP / CUBE / GROUPING
- CUME_DIST
- FIRST_VALUE
- LAST_VALUE
- RATIO_TO_REPORT
- ROW_NUMBER
- NTILE
- WIDTH_BUCKET
- RANK
- DENSE_RANK
- PERCENT_RANK
- PERCENTILE_CONT
- LAG
- LEAD
- SUM, AVG, MIN, MAX, VARIANCE, STDDEV, COUNT
- KEEP
- OLAP Windows

ROLLUP (a,b,c) is conceptually equivalent to:

SELECT a,b,c,<aggregates>       FROM t GROUP BY a,b,c
UNION ALL
SELECT a,b,NULL,<aggregates>    FROM t GROUP BY a,b,NULL
UNION ALL
SELECT a,NULL,NULL,<aggregates> FROM t GROUP BY a,NULL,NULL
UNION ALL
SELECT NULL,NULL,NULL,<aggregates> FROM t GROUP BY NULL,NULL,NULL

but the rows will be nicely ordered, i.e.

SQL> SELECT
  2    a, b, SUM(b),
  3    DECODE(GROUPING(A), 1, 'SUM OF B FOR ALL A') title1,
  4    DECODE(GROUPING(B), 1, 'SUM OF B FOR A = ' || a) title2
  5  FROM t
  6  GROUP BY ROLLUP (B,A)
  7  /

         A          B     SUM(B) TITLE1               TITLE2
---------- ---------- ---------- -------------------- --------------------
         1         10         10
         1         20         20
         1                    30                      SUM OF B FOR A = 1
         2         20         20
         2                    20                      SUM OF B FOR A = 2
         3         30         30
         3                    30                      SUM OF B FOR A = 3
         4         40         40
         4                    40                      SUM OF B FOR A = 4
                             120 SUM OF B FOR ALL A   SUM OF B FOR A =

10 rows selected.

Note the use of the GROUPING function, which we can use to determine whether the current row is one of the aggregated "extra" rows produced by the ROLLUP operation.

CUBE will create every combination -- you'll get the aggregates by: a,b,c; a,b; a,c; b,c; a; b; c; and the grand total (all NULLs). However, whereas ROLLUP guarantees a nice "report", i.e. aggregate totals at the "end" of the result set, CUBE does not.

The following example shows this :
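The difference in the grouping sets the two operators expand to can be sketched in Python (an illustration of the semantics above, not Oracle code):

```python
# Sketch (not Oracle internals): enumerate the grouping sets that
# ROLLUP and CUBE conceptually expand to for columns (a, b, c).
from itertools import combinations

def rollup_sets(cols):
    # ROLLUP(a,b,c) -> (a,b,c), (a,b), (a), ()  -- one per prefix
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def cube_sets(cols):
    # CUBE(a,b,c) -> every subset of the columns (2^n grouping sets)
    sets = []
    for r in range(len(cols), -1, -1):
        sets.extend(combinations(cols, r))
    return sets

cols = ("a", "b", "c")
print(rollup_sets(cols))  # 4 grouping sets
print(cube_sets(cols))    # 8 grouping sets
```

The prefix-only expansion is why ROLLUP output naturally arrives in "report" order, while CUBE's full power set does not.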

SQL> SELECT a, CUME_DIST() OVER (ORDER BY a)
  2  FROM t;

         A CUME_DIST()OVER(ORDERBYA)
---------- -------------------------
        10                .333333333
        20                .666666667
        30                         1

3 rows selected.

This can be read as: 2/3 of the values come before (or include) the value 20, and so on.

SQL> SELECT a, FIRST_VALUE(a) OVER ()
  2  FROM t;

         A FIRST_VALUE(A)OVER()
---------- --------------------
        10                   10
        20                   10

2 rows selected.

SQL> SELECT a, LAST_VALUE(a) OVER ()
  2  FROM t;

         A LAST_VALUE(A)OVER()
---------- -------------------
        10                  20
        20                  20

2 rows selected.

Note that it is important to specify the "windowing clause" of the OLAP functions when you use an ORDER BY in the OVER () clause with functions such as LAST_VALUE, which need to see rows "further on" in the set than the current row.

The reason is that the default windowing clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, and this will not take into account rows after the current row. For example :

SQL> SELECT a, LAST_VALUE(a) OVER (ORDER BY a)
  2  FROM t;

         A LAST_VALUE(A)OVER(ORDERBYA)
---------- ---------------------------
        10                          10
        20                          20

See how the value seems incorrect. What is actually being executed is:

SQL> SELECT a, LAST_VALUE(a) OVER (ORDER BY a
  2         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
  3  FROM t;

         A LAST_VALUE(A)OVER(ORDERBYAROWSBETWEENUNBOUNDEDPRECEDINGANDCURRENTROW)
---------- ---------------------------------------------------------------------
        10                                                                    10
        20                                                                    20

(Strictly, the default window is RANGE rather than ROWS, but with no duplicate values the two behave identically.) You need to think carefully about the OLAP window in this situation.

Note, in this case, the correct statement is :

SQL> SELECT a, LAST_VALUE(a) OVER (ORDER BY a
  2         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) x
  3  FROM t;

         A          X
---------- ----------
        10         20
        20         20
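Outside SQL, the effect of the two window definitions on LAST_VALUE can be sketched in Python (the helper names are illustrative, not Oracle APIs):

```python
# Sketch of the two window definitions. Rows are ordered by value.

def last_value_default(rows):
    # RANGE/ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:
    # the window ends at the current row, so LAST_VALUE is just the
    # current row's own value (given no ties).
    ordered = sorted(rows)
    return [ordered[i] for i in range(len(ordered))]

def last_value_full(rows):
    # ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING:
    # the window is the whole partition, so every row sees the
    # partition's genuine last value.
    ordered = sorted(rows)
    return [ordered[-1]] * len(ordered)

print(last_value_default([10, 20]))  # [10, 20]
print(last_value_full([10, 20]))     # [20, 20]
```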

Its syntax is:

RATIO_TO_REPORT ( expr ) OVER ( [query_partition_clause] )

SQL> CREATE TABLE t ( a NUMBER(10) );

Table created.

SQL> INSERT INTO t VALUES (10);

1 row created.

SQL> /

1 row created.

SQL> /

1 row created.

SQL> SELECT a, RATIO_TO_REPORT(a) OVER ()
  2  FROM T;

         A RATIO_TO_REPORT(A)OVER()
---------- ------------------------
        10               .333333333
        10               .333333333
        10               .333333333

3 rows selected.

You can mimic the functionality of RATIO_TO_REPORT by using SUM (which is what RATIO_TO_REPORT does anyway), but it's quicker to let the built-in do the division, and it's easier to read and understand what's going on, i.e.

SQL> SELECT a, a / SUM(a) OVER ()
  2  FROM t;

         A A/SUM(A)OVER()
---------- --------------
        10     .333333333
        10     .333333333
        10     .333333333

3 rows selected.

Note that the comparison between SUM and RATIO_TO_REPORT is not strictly speaking correct (although it explains the situation well); the difference is that RATIO_TO_REPORT will return NULL if SUM(a) OVER () is zero, so the correct comparison (guarding against a zero divisor) is:

RATIO_TO_REPORT(a) OVER () = a / DECODE(SUM(a) OVER (), 0, NULL, SUM(a) OVER ())

which assumes that a divided by NULL is NULL (which it is on most database platforms), but to be strictly correct:

RATIO_TO_REPORT(a) OVER () = DECODE(SUM(a) OVER (), 0, NULL, a / SUM(a) OVER ())

which, again, DRAMATICALLY shows the improvement in readability of RATIO_TO_REPORT over SUM.
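The zero-sum guard can be sketched in Python (illustrative, not Oracle code):

```python
# Sketch of RATIO_TO_REPORT's semantics: it returns NULL (None here)
# when the report total is zero, rather than failing on divide-by-zero,
# which is what the DECODE guard above reproduces.

def ratio_to_report(values):
    total = sum(values)
    if total == 0:
        return [None] * len(values)
    return [v / total for v in values]

print(ratio_to_report([10, 10, 10]))  # three thirds
print(ratio_to_report([0, 0]))        # [None, None]
```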

SQL> SELECT * FROM T;

A
----------
X
X
Y
Y

4 rows selected.

SQL> SELECT
  2    A,
  3    ROWNUM,
  4    ROW_NUMBER() OVER (PARTITION BY A ORDER BY A) partitioned,
  5    ROW_NUMBER() OVER (ORDER BY A) non_partitioned
  6  FROM T
  7  /

A              ROWNUM PARTITIONED NON_PARTITIONED
---------- ---------- ----------- ---------------
X                   1           1               1
X                   2           2               2
Y                   3           1               3
Y                   4           2               4

4 rows selected.

Note that a ROW_NUMBER() window over the entire result set matches ROWNUM.

So, when do you use ROWNUM vs ROW_NUMBER()?

Here is how to decide:

- If you need the TOP-N rows out of a set (e.g. show me the highest paid person in the COMPANY), ROWNUM is the best thing to use.
- If you need the TOP-N rows out of groups within a set (e.g. show me the highest paid person in each department), ROW_NUMBER() (or DENSE_RANK, etc.) is the ONLY thing to use (ROWNUM cannot do it).
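The per-group numbering that ROWNUM cannot express can be sketched in Python (illustrative only, not Oracle code):

```python
# Sketch of ROW_NUMBER() OVER (PARTITION BY key ORDER BY value):
# each group's counter restarts at 1, which is what makes
# "top-N per group" queries possible.

def row_number_per_group(rows):
    # rows: list of (group_key, value); returns (group, value, rn)
    out, seen = [], {}
    for key, value in sorted(rows):
        seen[key] = seen.get(key, 0) + 1
        out.append((key, value, seen[key]))
    return out

rows = [("X", 1), ("X", 2), ("Y", 3), ("Y", 4)]
print(row_number_per_group(rows))
# [('X', 1, 1), ('X', 2, 2), ('Y', 3, 1), ('Y', 4, 2)]
```

Filtering on rn == 1 would then give the "first" row of every group in one pass.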

NTILE divides an ordered set into expr buckets of (as near as possible) equal size. The remainder rows (the number of rows modulo the number of buckets) are distributed 1 per bucket, starting with bucket 1. If expr is greater than the number of rows, a number of buckets equal to the number of rows will be filled, and the remaining buckets will be empty.

NTILE( expr ) [ OVER ( query_partition_clause ORDER_BY_clause ) ]

Let's say I have a table of data and I apply a basic NTILE with 2 buckets to that set; I get:

SQL> SELECT a, NTILE(2) over ( order by a )
  2  FROM temp;

         A NTILE(2)OVER(ORDERBYA)
---------- ----------------------
         1                      1
         2                      1
         3                      2
         4                      2

We can see that 4 values have been spread over 2 "buckets".

SQL> SELECT a, NTILE(3) over ( order by a )
  2  from temp;

         A NTILE(3)OVER(ORDERBYA)
---------- ----------------------
         1                      1
         2                      1
         3                      2
         4                      3

4 rows selected.

This can be a bit confusing to get your head around, but fortunately there's an easy way to understand it. Take a result set, say 1,1,2,3,4,4,5, and apply NTILE(n) to it with n=3. First work out the base bucket size: the largest integer that fits into the row count (x) n times. Here x=7, and the largest integer that fits into 7 three times is 2, since 2+2+2=6. That leaves a remainder of 1 (7-6), which is placed in the left-most (first) bucket, so we have the following:

a:      1  1  2  3  4  4  5
bucket: 1  1  1  2  2  3  3
        ^
        remainder value

And hence to prove it:

SQL> SELECT a, NTILE(3) OVER (ORDER BY a)
  2  FROM t;

         A NTILE(3)OVER(ORDERBYA)
---------- ----------------------
         1                      1
         1                      1
         2                      1
         3                      2
         4                      2
         4                      3
         5                      3

7 rows selected.

Note that NTILE (in OLAP parlance) produces equiheight buckets -- each bucket holds (roughly) the same number of rows -- whereas WIDTH_BUCKET, below, produces equiwidth buckets.
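The bucket arithmetic can be sketched in Python (an illustration of NTILE's distribution rule, not Oracle code):

```python
# Sketch of NTILE's bucket assignment: base size = rows // buckets,
# and the remainder (rows % buckets) is spread one extra row per
# bucket, starting from bucket 1.

def ntile(values, buckets):
    ordered = sorted(values)
    n = len(ordered)
    base, rem = divmod(n, buckets)
    result, i = [], 0
    for b in range(1, buckets + 1):
        size = base + (1 if b <= rem else 0)
        result.extend([b] * min(size, n - i))  # guard when buckets > rows
        i += size
    return result

print(ntile([1, 1, 2, 3, 4, 4, 5], 3))  # [1, 1, 1, 2, 2, 3, 3]
print(ntile([1, 2, 3, 4], 3))           # [1, 1, 2, 3]
```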

Ideally each bucket is a "closed-open" interval of the real number line. For example, a bucket can be assigned to scores between 10.00 and 19.999... to indicate that 10 is included in the interval and 20 is excluded. This is sometimes denoted [10, 20).

WIDTH_BUCKET(expr, min_value, max_value, num_of_buckets)

Example:

SQL> SELECT a, WIDTH_BUCKET(a, 1, 21, 5)
  2  FROM t
  3  /

         A WIDTH_BUCKET(A,1,21,5)
---------- ----------------------
        10                      3
        10                      3
        20                      5
        15                      4
        15                      4

5 rows selected.

This can be just as confusing as NTILE to understand (probably because they do very similar things), but the way to think about this one is :

Say I have the above example. In t, a has values of 10,10,15,15,20, and I want to "split" these values over 5 buckets. Now, min_value is inclusive, but max_value is exclusive, so :

Therefore, the range of values in each histogram is 4 (well, actually 3.9999..., since each bucket m -> n is exclusive of n).

Therefore:

histogram 0 = < 1
histogram 1 = 1  -> 4.9999...
histogram 2 = 5  -> 8.9999...
histogram 3 = 9  -> 12.9999...
histogram 4 = 13 -> 16.9999...
histogram 5 = 17 -> 20.9999...
histogram 6 = >= 21

And hence, a value of 10 is in histogram (bucket) #3, 15 is in bucket #4 and 20 is in bucket #5.
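The same arithmetic can be sketched in Python (illustrative, not Oracle's implementation):

```python
# Sketch of WIDTH_BUCKET's arithmetic: the [min_value, max_value)
# range is cut into num_buckets equal-width buckets; values below
# min_value fall into bucket 0, values >= max_value into
# num_buckets + 1.

def width_bucket(value, min_value, max_value, num_buckets):
    if value < min_value:
        return 0
    if value >= max_value:
        return num_buckets + 1
    width = (max_value - min_value) / num_buckets
    return int((value - min_value) / width) + 1

for v in (10, 15, 20):
    print(v, width_bucket(v, 1, 21, 5))  # 10 -> 3, 15 -> 4, 20 -> 5
```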

Syntax:

RANK() OVER ( query_partition_clause order_by_clause )

Example:

SQL> SELECT deptno, ename, sal, comm, RANK() OVER (PARTITION BY deptno ORDER BY sal DESC, comm) as rk
  2  FROM emp;

    DEPTNO ENAME             SAL       COMM         RK
---------- ---------- ---------- ---------- ----------
        10 KING             5000                     1
        10 CLARK            2450                     2
        10 MILLER           1300                     3
        20 SCOTT            3000                     1
        20 FORD             3000                     1
        20 JONES            2975                     3
        20 ADAMS            1100                     4
        20 SMITH             800                     5
        30 BLAKE            2850                     1
        30 ALLEN            1600        300          2
        30 TURNER           1500          0          3
        30 WARD             1250        500          4
        30 MARTIN           1250       1400          5
        30 JAMES             950                     6

In the above query, in the case of DEPTNO 20 both SCOTT and FORD have the same salary, so both are assigned rank 1. The resulting rank 2 is skipped and rank 3 is assigned to JONES.

DENSE_RANK, in contrast, does not skip ranks after ties:

SQL> SELECT a, RANK() OVER (ORDER BY a), DENSE_RANK() OVER (ORDER BY a)
  2  FROM t;

         A RANK()OVER(ORDERBYA) DENSE_RANK()OVER(ORDERBYA)
---------- -------------------- --------------------------
        10                    1                          1
        20                    2                          2
        20                    2                          2
        30                    4                          3
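The two ranking rules can be sketched outside SQL; this Python fragment (illustrative only) reproduces the output above:

```python
# Sketch of RANK vs DENSE_RANK: RANK uses the row's position among
# ALL rows (so ties leave gaps), DENSE_RANK uses its position among
# DISTINCT values (so there are no gaps).

def rank_and_dense_rank(values):
    ordered = sorted(values)
    distinct = sorted(set(ordered))
    rank = [ordered.index(v) + 1 for v in ordered]    # first position of v
    dense = [distinct.index(v) + 1 for v in ordered]  # position among distinct
    return rank, dense

print(rank_and_dense_rank([10, 20, 20, 30]))
# ([1, 2, 2, 4], [1, 2, 2, 3])
```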

It is calculated as (rank of the row - 1) divided by (number of rows in the set - 1).

Therefore, PERCENT_RANK returns values in the range of 0 - 1. Row(s) with a RANK of 1 will have a PERCENT_RANK of 0, i.e.

SQL> SELECT a, PERCENT_RANK() OVER ( ORDER BY a ), RANK() OVER (ORDER BY a)
  2  FROM t
  3  /

         A PERCENT_RANK()OVER(ORDERBYA) RANK()OVER(ORDERBYA)
---------- ---------------------------- --------------------
        10                            0                    1
        20                           .5                    2
        30                            1                    3

3 rows selected.
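The formula can be sketched in Python (illustrative, not Oracle code):

```python
# Sketch of PERCENT_RANK: (rank - 1) / (rows - 1), so rank 1 maps
# to 0 and the highest rank maps to 1. Ties share the lowest
# position, as with RANK().

def percent_rank(values):
    ordered = sorted(values)
    n = len(ordered)
    out = []
    for v in ordered:
        rank = ordered.index(v) + 1  # RANK(): ties take the first position
        out.append((rank - 1) / (n - 1) if n > 1 else 0)
    return out

print(percent_rank([10, 20, 30]))  # [0.0, 0.5, 1.0]
```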

What this basically boils down to is that given a percentile (x), this function will work out (for the current partition window) what value would fall at that percentile, linearly interpolating between rows where necessary.

Syntax:

PERCENTILE_CONT ( x ) WITHIN GROUP ( order_by_clause ) [ OVER ([ query_partition_clause ]) ]

SQL> SELECT a FROM t;

         A
----------
        10
        20
        30
        40

4 rows selected.

SQL> SELECT PERCENTILE_CONT(0.5) WITHIN GROUP ( ORDER BY a )
  2  FROM t
  3  /

PERCENTILE_CONT(0.5)WITHINGROUP(ORDERBYA)
-----------------------------------------
                                       25

1 row selected.

Because this is a statistical distribution, the "first" row over the ordered set is at 0th percentile, the "last" row is the 100th percentile, i.e.

SQL> SELECT PERCENTILE_CONT(0) WITHIN GROUP ( order by a )
  2  FROM t;

PERCENTILE_CONT(0)WITHINGROUP(ORDERBYA)
---------------------------------------
                                     10

1 row selected.

SQL> SELECT PERCENTILE_CONT(1) WITHIN GROUP ( order by a )
  2  FROM t
  3  /

PERCENTILE_CONT(1)WITHINGROUP(ORDERBYA)
---------------------------------------
                                     40

1 row selected.
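The interpolation can be sketched in Python (illustrative, not Oracle code):

```python
# Sketch of PERCENTILE_CONT's linear interpolation: for percentile p
# over n ordered rows, the target position is p * (n - 1) (0-based);
# a fractional position interpolates between the two nearest rows.

def percentile_cont(values, p):
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    lo = int(pos)
    frac = pos - lo
    if frac == 0:
        return ordered[lo]
    return ordered[lo] + frac * (ordered[lo + 1] - ordered[lo])

data = [10, 20, 30, 40]
print(percentile_cont(data, 0.5))  # 25.0 (halfway between 20 and 30)
print(percentile_cont(data, 0))    # 10
print(percentile_cont(data, 1))    # 40
```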

Syntax:

LAG(<column> [, n]) OVER ( query_partition_clause order_by_clause )

n - number of rows to "lag" by. Defaults to 1.

Example:

SQL> SELECT
  2    a,
  3    LAG(a) OVER (ORDER BY a)
  4  FROM
  5  ( SELECT 'x' a FROM dual
  6    UNION ALL
  7    SELECT 'y' a FROM dual
  8    UNION ALL
  9    SELECT 'z' a FROM dual
 10  )
 11  /

A L
- -
x
y x
z y

3 rows selected.

Syntax:

LEAD(<column> [, n]) OVER ( query_partition_clause order_by_clause )

n - number of rows to "lead" by. Defaults to 1.

Example:

SQL> SELECT
  2    a,
  3    LEAD(a) OVER (ORDER BY a)
  4  FROM
  5  ( SELECT 'x' a FROM dual
  6    UNION ALL
  7    SELECT 'y' a FROM dual
  8    UNION ALL
  9    SELECT 'z' a FROM dual
 10  )
 11  /

A L
- -
x y
y z
z
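Both functions can be sketched in Python (illustrative, not Oracle code):

```python
# Sketch of LAG/LEAD over an ordered list: LAG looks n rows back,
# LEAD looks n rows forward, returning None (SQL NULL) off the ends.

def lag(values, n=1):
    return [values[i - n] if i - n >= 0 else None
            for i in range(len(values))]

def lead(values, n=1):
    return [values[i + n] if i + n < len(values) else None
            for i in range(len(values))]

print(lag(["x", "y", "z"]))   # [None, 'x', 'y']
print(lead(["x", "y", "z"]))  # ['y', 'z', None]
```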

They're good for an EXTREMELY wide variety of uses, but the following example shows the use of SUM() OVER (...) for computing running totals.

Syntax:

function( [ALL | DISTINCT] expr ) OVER ( query_partition_clause order_by_clause )

Note the use of ROWID to ensure the sum is worked out PER ROW (not PER a) (see below).

Example:

-- This is the data set.
SQL> SELECT * FROM T;

         A
----------
         1
         2
         3
         3

SQL> SELECT
  2    a,
  3    SUM(a) OVER (ORDER BY a, rowid)
  4  FROM t
  5  /

         A SUM(A)OVER(ORDERBYA,ROWID)
---------- --------------------------
         1                          1
         2                          3
         3                          6
         3                          9

4 rows selected.

Note, if the analytic ORDER BY does not describe a unique ordering, then the function value is shared across the ENTIRE set of tied rows; i.e. if the above example used just ( ORDER BY a ), with no ROWID, then the result is:

SQL> SELECT a, SUM(a) OVER (ORDER BY a)
  2  FROM t;

         A SUM(A)OVER(ORDERBYA)
---------- --------------------
         1                    1
         2                    3
         3                    9  <-- since A has only 3 distinct values (and 4 rows),
         3                    9      both a=3 rows get the SUM over ALL the a=3 rows

The query_partition_clause has to be used if you want the values to be "reset" for a changing column, consider :

SQL> SELECT A FROM T
  2  ORDER BY A;

         A
----------
        10
        10
        10
        20
        30

If I want running totals that "reset" for every value of A, the PARTITION BY clause is used, i.e.

SQL> SELECT
  2    A,
  3    SUM(A) OVER (PARTITION BY A ORDER BY A, ROWID)
  4  FROM
  5    T;

         A SUM(A)OVER(PARTITIONBYAORDERBYA,ROWID)
---------- --------------------------------------
        10                                     10
        10                                     20
        10                                     30
        20                                     20
        30                                     30
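The partition "reset" can be sketched in Python (illustrative, not Oracle code):

```python
# Sketch of SUM(...) OVER (PARTITION BY key ORDER BY ...): the
# running total restarts whenever the partition key changes.

def partitioned_running_sum(rows):
    # rows: ordered list of (partition_key, value)
    totals, running, current = [], 0, object()  # sentinel: no key yet
    for key, value in rows:
        if key != current:
            current, running = key, 0  # new partition: reset the total
        running += value
        totals.append(running)
    return totals

rows = [(10, 10), (10, 10), (10, 10), (20, 20), (30, 30)]
print(partitioned_running_sum(rows))  # [10, 20, 30, 20, 30]
```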

Note, you can only use the KEEP syntax with the OLAP-enabled aggregate functions, i.e. SUM, AVG, MIN, MAX, VARIANCE, STDDEV, COUNT.

Here's an example. Let's say we have a table of employees.

SQL> SELECT * FROM t;

     EMPNO     DEPTNO     SALARY   COMM_PCT
---------- ---------- ---------- ----------
         1         10       1000         40
         2         10       2000         20
         3         10       5000         30
         4         20       4000         40
         5         20       8000         50
         6         20       8000         60

6 rows selected.

Now, we ask the question: show me the minimum salary for each department:

SQL> SELECT
  2    deptno,
  3    MIN(salary)
  4  FROM t
  5  GROUP BY deptno
  6  /

    DEPTNO MIN(SALARY)
---------- -----------
        10        1000
        20        4000

But now let's say I wanted to ask: show me, for each department, the minimum salary among the rows ranked first by commission percentage:

SQL> SELECT
  2    deptno,
  3    MIN(salary) KEEP (DENSE_RANK FIRST ORDER BY comm_pct)
  4  FROM t
  5  GROUP BY deptno
  6  /

    DEPTNO MIN(SALARY)KEEP(DENSE_RANKFIRSTORDERBYCOMM_PCT)
---------- -----------------------------------------------
        10        2000
        20        4000

So now the result for department 10 changes. Only the rows where comm_pct ranked first were kept; the following query shows the results of the ranking:

SQL> SELECT
  2    deptno,
  3    comm_pct,
  4    DENSE_RANK() OVER (PARTITION BY deptno ORDER BY comm_pct)
  5  FROM t;

    DEPTNO   COMM_PCT DENSE_RANK()OVER(PARTITIONBYDEPTNOORDERBYCOMM_PCT)
---------- ---------- --------------------------------------------------
        10         20                                                  1
        10         30                                                  2
        10         40                                                  3
        20         40                                                  1
        20         50                                                  2
        20         60                                                  3

6 rows selected.

As you can see, because the commission percentage value of 20 ranks first within department 10, only that record is taken into account, and the MINIMUM over the records with comm_pct = 20 is 2000.

Of course, you can specify LAST instead of FIRST, and show the lowest ranked records, i.e.

SQL> SELECT * FROM t;

     EMPNO     DEPTNO     SALARY   COMM_PCT
---------- ---------- ---------- ----------
         1         10       1000         40
         2         10       2000         20
         3         10       5000         30
         4         20       4000         40
         5         20       8000         50
         6         20       8000         60

6 rows selected.

SQL> SELECT
  2    deptno,
  3    MIN(salary) KEEP (DENSE_RANK LAST ORDER BY comm_pct)
  4  FROM t
  5  GROUP BY deptno
  6  /

    DEPTNO MIN(SALARY)KEEP(DENSE_RANKLASTORDERBYCOMM_PCT)
---------- ----------------------------------------------
        10        1000
        20        8000
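The KEEP FIRST/LAST behaviour can be emulated in Python (a sketch of the semantics, not Oracle internals):

```python
# Sketch of MIN(salary) KEEP (DENSE_RANK FIRST/LAST ORDER BY comm_pct):
# keep only the rows whose comm_pct ranks first (or last) within the
# group, then take MIN(salary) over just those rows.

def min_keep(rows, keep="FIRST"):
    # rows: list of (salary, comm_pct) for ONE group (department)
    pick = min if keep == "FIRST" else max
    best = pick(pct for _, pct in rows)          # rank-1 (or last) comm_pct
    return min(sal for sal, pct in rows if pct == best)

dept10 = [(1000, 40), (2000, 20), (5000, 30)]
dept20 = [(4000, 40), (8000, 50), (8000, 60)]
print(min_keep(dept10, "FIRST"))  # 2000
print(min_keep(dept20, "FIRST"))  # 4000
print(min_keep(dept10, "LAST"))   # 1000
print(min_keep(dept20, "LAST"))   # 8000
```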

Note, that according to the documentation http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96540/functions45a.htm, only DENSE_RANK can currently be used in the KEEP clause.

This is borne out by the error message when trying to use other functions, i.e.

SQL> SELECT
  2    deptno,
  3    MIN(salary) KEEP (RANK LAST ORDER BY comm_pct)
  4  FROM t
  5  GROUP BY deptno
  6  /
SELECT
*
ERROR at line 1:
ORA-02000: missing DENSE_RANK

The following SQL asks the question, "how many rows exist from 1 row back from the current row to the end of the set?".

SQL> SELECT COUNT(a) OVER (ORDER BY a
  2         ROWS BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING)
  3  FROM mc1;

COUNT(A)OVER(ORDERBYAROWSBETWEEN1PRECEDINGANDUNBOUNDEDFOLLOWING)
----------------------------------------------------------------
                                                               3
                                                               3
                                                               2

3 rows selected.

Similarly, the following SQL asks "how many rows exist from two rows back to 1 row forward?"

SQL> SELECT COUNT(a) OVER (ORDER BY a
  2         ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING)
  3  FROM mc1;

COUNT(A)OVER(ORDERBYAROWSBETWEEN2PRECEDINGAND1FOLLOWING)
--------------------------------------------------------
                                                       2
                                                       3
                                                       3

3 rows selected.

You can use CURRENT ROW to specify that the current row defines the end of the set.

SQL> SELECT COUNT(a) OVER (ORDER BY a
  2         ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
  3  FROM mc1;

COUNT(A)OVER(ORDERBYAROWSBETWEEN1PRECEDINGANDCURRENTROW)
--------------------------------------------------------
                                                       1
                                                       2
                                                       2

3 rows selected.

This SQL asks the question, "how many rows are within 1 of the current value going back, and within 1 of the current value going forward?"

SQL> SELECT a, COUNT(a) OVER (ORDER BY a
  2         RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING)
  3  FROM mc1;

         A COUNT(A)OVER(ORDERBYARANGEBETWEEN1PRECEDINGAND1FOLLOWING)
---------- ---------------------------------------------------------
         9                                                         2
        10                                                         3
        11                                                         3
        12                                                         2
        20                                                         1

5 rows selected.

The following SQL uses CURRENT ROW to ask "how many rows are within 1 of the current value going backwards?"

SQL> SELECT a, COUNT(a) OVER (ORDER BY a
  2         RANGE BETWEEN 1 PRECEDING AND CURRENT ROW)
  3  FROM mc1;

         A COUNT(A)OVER(ORDERBYARANGEBETWEEN1PRECEDINGANDCURRENTROW)
---------- ---------------------------------------------------------
         9                                                         1
        10                                                         2
        11                                                         2
        12                                                         2
        20                                                         1

5 rows selected.

Using RANGE on DATE datatypes treats the offset as DAYS, by default.
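The ROWS vs RANGE distinction can be sketched in Python (illustrative, not Oracle code): ROWS counts physical neighbours, RANGE counts rows whose VALUE is within the offset.

```python
# Sketch of the two window kinds over an ordered set.

def count_rows_window(values, preceding, following):
    # ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING
    n = len(values)
    return [min(n - 1, i + following) - max(0, i - preceding) + 1
            for i in range(n)]

def count_range_window(values, preceding, following):
    # RANGE BETWEEN <preceding> PRECEDING AND <following> FOLLOWING:
    # membership is decided on the values, not the row positions.
    ordered = sorted(values)
    return [sum(1 for w in ordered if v - preceding <= w <= v + following)
            for v in ordered]

data = [9, 10, 11, 12, 20]
print(count_range_window(data, 1, 1))  # [2, 3, 3, 2, 1]
print(count_range_window(data, 1, 0))  # [1, 2, 2, 2, 1]
```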

SQL> SELECT c, COUNT(c) OVER (ORDER BY c
  2         RANGE BETWEEN 1 PRECEDING
  3         AND CURRENT ROW) range_count
  4  FROM mc1;

C         RANGE_COUNT
--------- -----------
15-OCT-04           1
16-OCT-04           2
17-OCT-04           2

3 rows selected.

In order to get a more granular level of time, you need to use the INTERVAL syntax.

For example, here is how to go one hour or one second back :

SQL> SELECT c, COUNT(c) OVER (ORDER BY c
  2         RANGE BETWEEN INTERVAL '1' HOUR PRECEDING
  3         AND CURRENT ROW) range_count
  4  FROM mc1;

C         RANGE_COUNT
--------- -----------
15-OCT-04           1
16-OCT-04           1
17-OCT-04           1

SQL> SELECT c, COUNT(c) OVER (ORDER BY c
  2         RANGE BETWEEN INTERVAL '1' SECOND PRECEDING
  3         AND CURRENT ROW) range_count
  4  from mc1;

C         RANGE_COUNT
--------- -----------
15-OCT-04           1
16-OCT-04           1
17-OCT-04           1

3 rows selected.

Note, the INTERVAL syntax is quite extensive; you can, for example, specify 4 years and 2 months via INTERVAL '4-2' YEAR TO MONTH, or 11 hours and 20 minutes via INTERVAL '11:20' HOUR TO MINUTE.

Note, most of the INTERVAL types (DAY, MONTH, YEAR, etc.) have an optional precision which defaults to 2. You have to manually specify this if the literal needs more than 2 digits, i.e. 100 years has to be specified as:

SQL> SELECT c, COUNT(c) OVER (ORDER BY c
  2         RANGE BETWEEN INTERVAL '100' YEAR(3) PRECEDING
  3         AND CURRENT ROW)
  4  FROM mc1;

Trying to omit the precision results in ORA-01873:

SQL> SELECT c, COUNT(c) OVER (ORDER BY c
  2         RANGE BETWEEN INTERVAL '100' YEAR PRECEDING
  3         AND CURRENT ROW)
  4  FROM mc1;
SELECT c, COUNT(c) OVER (ORDER BY c RANGE BETWEEN INTERVAL '100' YEAR PRECEDING AND CURRENT ROW)
                                                           *
ERROR at line 1:
ORA-01873: the leading precision of the interval is too small

See http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96540/sql_elements3a.htm#38599 for further information on INTERVAL literals.