STRING Deep Search
When setting up an internal search engine based solely on ClickHouse, a central issue is the ability to manage both:
the exact phrase searches (e.g. "love chocolate")
Scattered keyword searches (e.g. like AND the chocolate AND)
lack of clear possibility of logical comparators (OR, AND, !=, ...)
Including large documents that may contain several hundred pages. However, the conventional indexing on raw content (content) does not allow to effectively capture the presence of keywords spread over several sections of a same document. So you need to cook a new solution
1. Dependencies.
Our problem is that we are dependent on too many factors, especially with the variable and often substantial text size. In a first time, it is necessary to cut the project into several parts (table sure Clickhouse hein).
The "simple" text part with short information, but global documents
The part with "chunks", such that all chunks are normalized and semi-equal parts of text portion.
The token part where we will play with the words themselves of the text in tables.
Which gives us the following tables
jnbhghhze
L'amour du chocolats
https://...
ugyiazjdkf
je sais pas
https://...
zefhiuoje
un nom de livre
https://...
2. Architecture
This architecture allows us to split the problem into several parts, because then we have almost a table dedicated to each problem.
So we have the possibility SQL to search for plain text
SELECT DISTINCT doc_id
FROM doc_chunks
WHERE content ILIKE '%aimer le chocolat%';
But suddenly also the fact of searching in the array of tokens
-- ligne 1
SELECT doc_id FROM document_tokens_advanced
WHERE token_length = 2 AND hasAll(tokens, ['je', 'il'])
INTERSECT
-- ligne 2
SELECT doc_id FROM document_tokens_advanced
WHERE token_length = 4 AND hasAll(tokens, ['manger'])
INTERSECT
-- ligne 3
SELECT doc_id FROM document_tokens_advanced
WHERE token_length = 6 AND hasAll(tokens, ['chocolat'])
3. Merge
Good, we can merge like this
-- Partie 1 : doc_id des chunks contenant la phrase exacte
SELECT DISTINCT doc_id
FROM doc_chunks
WHERE content ILIKE '%aimer le chocolat%'
INTERSECT
-- Partie 2.1 : doc_id avec tokens de taille 2 ['je', 'il']
SELECT doc_id
FROM document_tokens_advanced
WHERE token_length = 2 AND hasAll(tokens, ['je', 'il'])
INTERSECT
-- Partie 2.2 : tokens de taille 6 ['manger']
SELECT doc_id
FROM document_tokens_advanced
WHERE token_length = 6 AND hasAll(tokens, ['manger'])
INTERSECT
-- Partie 2.3 : tokens de taille 8 ['chocolat']
SELECT doc_id
FROM document_tokens_advanced
WHERE token_length = 8 AND hasAll(tokens, ['chocolat']);
4. Build language
What to do: a trick that builds a SQL query with a natural language including the conditions of type AND, OR, ==, ! = and all the rest
Last updated