STRING Deep Search

When setting up an internal search engine based solely on ClickHouse, a central issue is the ability to manage both:

  • the exact phrase searches (e.g. "love chocolate")

  • Scattered keyword searches (e.g. like AND the chocolate AND)

  • lack of clear possibility of logical comparators (OR, AND, !=, ...)

Including large documents that may contain several hundred pages. However, the conventional indexing on raw content (content) does not allow to effectively capture the presence of keywords spread over several sections of a same document. So you need to cook a new solution

1. Dependencies.

Our problem is that we are dependent on too many factors, especially with the variable and often substantial text size. In a first time, it is necessary to cut the project into several parts (table sure Clickhouse hein).

  • The "simple" text part with short information, but global documents

  • The part with "chunks", such that all chunks are normalized and semi-equal parts of text portion.

  • The token part where we will play with the words themselves of the text in tables.

Which gives us the following tables

doc_id
title
....
url

jnbhghhze

L'amour du chocolats

https://...

ugyiazjdkf

je sais pas

https://...

zefhiuoje

un nom de livre

https://...

2. Architecture

This architecture allows us to split the problem into several parts, because then we have almost a table dedicated to each problem.

So we have the possibility SQL to search for plain text

SELECT DISTINCT doc_id
FROM doc_chunks
WHERE content ILIKE '%aimer le chocolat%';

But suddenly also the fact of searching in the array of tokens

-- ligne 1
SELECT doc_id FROM document_tokens_advanced
WHERE token_length = 2 AND hasAll(tokens, ['je', 'il'])

INTERSECT

-- ligne 2
SELECT doc_id FROM document_tokens_advanced
WHERE token_length = 4 AND hasAll(tokens, ['manger'])

INTERSECT

-- ligne 3
SELECT doc_id FROM document_tokens_advanced
WHERE token_length = 6 AND hasAll(tokens, ['chocolat'])

3. Merge

Good, we can merge like this

-- Partie 1 : doc_id des chunks contenant la phrase exacte
SELECT DISTINCT doc_id
FROM doc_chunks
WHERE content ILIKE '%aimer le chocolat%'

INTERSECT

-- Partie 2.1 : doc_id avec tokens de taille 2 ['je', 'il']
SELECT doc_id
FROM document_tokens_advanced
WHERE token_length = 2 AND hasAll(tokens, ['je', 'il'])

INTERSECT

-- Partie 2.2 : tokens de taille 6 ['manger']
SELECT doc_id
FROM document_tokens_advanced
WHERE token_length = 6 AND hasAll(tokens, ['manger'])

INTERSECT

-- Partie 2.3 : tokens de taille 8 ['chocolat']
SELECT doc_id
FROM document_tokens_advanced
WHERE token_length = 8 AND hasAll(tokens, ['chocolat']);

4. Build language

What to do: a trick that builds a SQL query with a natural language including the conditions of type AND, OR, ==, ! = and all the rest

Last updated