Recommendation system

This tutorial shows you how to build a user behavior-based recommendation system, 100% in SQL in ClickHouse, with Big Data support via `MATERIALIZED VIEW'. (https://clickhouse.com/docs/materialized-view/incremental-materialized-view)

📌 Objective

For each content viewed (‘content_uuid’), the most frequently viewed content is recommended next to by other users in a sliding window (for example, the following 10 views).

📦 Step 1 – Create the source table

DROP TABLE IF EXISTS views;

CREATE TABLE views (
    id UInt64,
    user_uuid String,
    content_uuid String
) ENGINE = MergeTree()
ORDER BY (user_uuid, id);

This table stores user journeys, with each row representing a view.

  • id: chronological order (auto-increment)

  • user_uuid: user ID

  • content_uuid: accessed content

🧪 Step 2 - Test data insertion

INSERT INTO views (id, user_uuid, content_uuid) VALUES
    (1, 'user1', 'A'),
    (2, 'user1', 'B'),
    (3, 'user1', 'C'),
    (4, 'user1', 'D'),
    (5, 'user1', 'E'),
    (6, 'user2', 'A'),
    (7, 'user2', 'C'),
    (8, 'user2', 'E'),
    (9, 'user2', 'F'),
    (10, 'user2', 'G'),
    (11, 'user3', 'B'),
    (12, 'user3', 'C'),
    (13, 'user3', 'D'),
    (14, 'user3', 'A'),
    (15, 'user3', 'E');

⚙️ Step 3 - Create MATERIALIZED VIEW

DROP TABLE IF EXISTS transitions_mv;

CREATE MATERIALIZED VIEW transitions_mv
ENGINE = SummingMergeTree()
ORDER BY (from_content, to_content)
POPULATE
AS
SELECT
    a.content_uuid AS from_content,
    b.content_uuid AS to_content,
    count() AS transition_count
FROM views AS a
JOIN views AS b
    ON a.user_uuid = b.user_uuid
   AND b.id > a.id
   AND b.id <= a.id + 10  -- 10 c'est une valeur brute, stv 20 faut update ça
GROUP BY from_content, to_content;

✅ The keyword POPULATE is used to fill the view with existing data.

  • from_content: initial content

  • to_content: content then seen in the window

  • transition_count: number of times this transition was observed

🔍 Step 4 - Query the view

Query to retrieve the top 20 contents to recommend after viewing 'A'

SELECT
    to_content,
    transition_count
FROM transitions_mv
WHERE from_content = 'A'
ORDER BY transition_count DESC
LIMIT 20;

Probability calculation (bonus)

If you want a normalized score (probability that to_content follows from_content):

SELECT
    from_content,
    to_content,
    transition_count,
    round(transition_count / sum(transition_count) OVER (PARTITION BY from_content), 3) AS probability
FROM transitions_mv
WHERE from_content = 'A'
ORDER BY probability DESC
LIMIT 20;

Last updated