CTEs And Window Functions

I am a big fan of code readability, but being a fan and expecting the code I write to be readable are two different things. Readable code is easier to understand and much more straightforward to grasp.

One significant difference I find while comparing readable and unreadable code, apart from the obvious, is how many different variables I have to hold in my head at a time while I try to grapple my head around it. If this task is simple, I believe that this is not because of some random act of brilliance but due to careful code construction.

CTEs helps achieve this by its ability to separate data into independent logical entities according to the context of the query.

WITH popular_posts 
     AS (SELECT id, 
                title, 
                body 
         FROM   posts 
         WHERE  posts.upvote > 5) 
SELECT * 
FROM   popular_posts; 
popular_posts: any post with 6 or more upvotes

A random use case.  

Let's look at a random use case and check how the query performance and readability can be improved. For a particular report in our e-commerce app, we were querying for each item’s latest n review-comments, in our case, we can set n = 2.

For the first iteration, we can write a simple join and then filter that in our application language (Ruby, Go, PHP etc.)

SELECT items.id             AS item_id, 
       review_comments.id   AS comment_id, 
       review_comments.body AS comment_body
FROM   items 
       LEFT OUTER JOIN review_comments 
                    ON items.id = review_comments.item_id; 
a join table: item and its review comments

For an e-commerce system like Amazon, this is not going to scale since the sheer number of items and the associated review comments are guaranteed to bog down the dbms.

Lets introduce a Window Function

In general, a Window Function, is a mathematical function that is zero valued outside of some interval. In case of SQL engines, they operate on rows that are related to the current row, in our case, this can be the item_id. All rows of the join that has a common item_id acts as a partition over which the window function acts.

We rewrite our query to do the following:

  • Window function should operate on a partition of same item_ids
  • Each of these partitions should be ordered by created time stamp of review_comment.
  • Select only the top two from each partition
SELECT 
  item_id, 
  comment_id, 
  comment_body 
FROM 
  (
    SELECT 
      items.id AS item_id, 
      review_comments.id AS comment_id, 
      review_comments.body AS comment_body, 
      Row_number() OVER(
        partition BY item_id 
        ORDER BY 
          review_comments.created_at DESC
      ) AS comment_row_number, 
    FROM 
      items 
      LEFT OUTER JOIN review_comments ON items.id = review_comments.item_id
  ) AS items_ranked_by_comments 
WHERE 
  comment_row_number < 3;

/* Result:
+----------+------------+--------------------+
|  item_id | comment_id |    comment_body    |
+----------+------------+--------------------+
|      900 |       8789 | This product is..  |
|      900 |      12301 | Fantastic produ..  |
|     1200 |      13466 | Great to see cu..  |
|     1200 |      19023 |                    | 
|       .. |         .. |                ..  |
+----------+------------+--------------------+
*/
item with at-least two review comments 

The sub-query is resulting in a full join table, but now we have comment_row_number field to let us know that the newest review_comment has a value of 1. All we need to do now is to filter using that column.

We have used row_number() function from PostgreSQL as the window function here. According to the docs, any built-in or user-defined aggregate function can be used as a window function.

Putting It All Together

Even though we have our answer, our query has become pretty hard to read. We can get better readability by introducing a Common Table Expression.

WITH items_ranked_by_comments AS (
  SELECT 
    items.id AS item_id, 
    review_comments.id AS comment_id, 
    review_comments.body AS comment_body, 
    Row_number() OVER(
      partition BY item_id 
      ORDER BY 
        review_comments.created_at DESC
    ) AS comment_row_number, 
  FROM 
    items 
    LEFT OUTER JOIN review_comments ON items.id = review_comments.item_id
) 
SELECT 
  item_id, 
  comment_id, 
  comment_body 
FROM 
  items_ranked_by_comments 
WHERE 
  comment_row_number < 3;
CTE implementation of item with at-least 2 review comments.

Well, that is it folks. We have put together two concepts, readability and performance to create something which is the best of both worlds.