There are many ways to apply weighting in search. We can apply weights during
indexing, based on which field a token comes from and what kind of tokenization
was used. Maybe ngrams should count for less than full words. In addition, we
can apply weights at search time, which allows us to offer different types of
searches from the same index.
In the first post about how full-text search works, we
built a single large index. That limits our ability to apply weighting at query
time, since we’ve already lost some information about the tokens in the index.
A strategy that gives us more flexibility is to build several smaller indexes
and tune searches by combining them in different ways.
Here’s a refresher of the data we’re indexing:
var data = [
  {
    id: "a1",
    title: "Searching for Humor Among Clowns",
    description: "About clown research’s exciting insights into the humor of the future."
  },
  {
    id: "a2",
    title: "The Role of Comical Feet in Humor",
    description: "How clown feet affect our experience of humor."
  },
  {
    id: "a3",
    title: "Surprising Discoveries in Clown Research",
    description: "Discover the unexpected findings that clown researchers have made."
  }
];
Last time, we indexed the titles, the descriptions, and 2-character ngrams from
the titles. Now let’s do it again — but this time, we’ll place them into three
separate indexes:
{
  "title": {
    "searching": ["a1"],
    "for": ["a1"],
    "humor": ["a1", "a2"],
    "among": ["a1"],
    "clowns": ["a1"],
    "the": ["a2"],
    "role": ["a2"],
    "of": ["a2"],
    "comical": ["a2"],
    "feet": ["a2"],
    "in": ["a2", "a3"],
    "surprising": ["a3"],
    "discoveries": ["a3"],
    "clown": ["a3"],
    "research": ["a3"]
  },
  "description": {
    "about": ["a1"],
    "clown": ["a1", "a2", "a3"],
    "researchs": ["a1"],
    "exciting": ["a1"],
    "insights": ["a1"],
    "into": ["a1"],
    "the": ["a1", "a1", "a3"],
    "humor": ["a1", "a2"],
    "of": ["a1", "a2"],
    "future": ["a1"],
    "how": ["a2"],
    "feet": ["a2"],
    "affect": ["a2"],
    "our": ["a2"],
    "experience": ["a2"],
    "discover": ["a3"],
    "unexpected": ["a3"],
    "findings": ["a3"],
    "that": ["a3"],
    "researchers": ["a3"],
    "have": ["a3"],
    "made": ["a3"]
  },
  "titleNgram": {
    "se": ["a1", "a3"],
    "ea": ["a1", "a3"],
    "ar": ["a1", "a3"],
    "rc": ["a1", "a3"],
    "ch": ["a1", "a3"],
    "hi": ["a1"],
    "in": ["a1", "a2", "a3", "a3"],
    "ng": ["a1", "a1", "a3"],
    "fo": ["a1"],
    "or": ["a1", "a1", "a2"],
    "hu": ["a1", "a2"],
    "um": ["a1", "a2"],
    "mo": ["a1", "a1", "a2"],
    "am": ["a1"],
    "on": ["a1"],
    "cl": ["a1", "a3"],
    "lo": ["a1", "a3"],
    "ow": ["a1", "a3"],
    "wn": ["a1", "a3"],
    "ns": ["a1"],
    "th": ["a2"],
    "he": ["a2"],
    "ro": ["a2"],
    "ol": ["a2"],
    "le": ["a2"],
    "of": ["a2"],
    "co": ["a2", "a3"],
    "om": ["a2"],
    "mi": ["a2"],
    "ic": ["a2"],
    "ca": ["a2"],
    "al": ["a2"],
    "fe": ["a2"],
    "ee": ["a2"],
    "et": ["a2"],
    "su": ["a3"],
    "ur": ["a3"],
    "rp": ["a3"],
    "pr": ["a3"],
    "ri": ["a3", "a3"],
    "is": ["a3", "a3"],
    "si": ["a3"],
    "di": ["a3"],
    "sc": ["a3"],
    "ov": ["a3"],
    "ve": ["a3"],
    "er": ["a3"],
    "ie": ["a3"],
    "es": ["a3", "a3"],
    "re": ["a3"]
  }
}
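The tokenizers and the indexing helper come from the previous post. For reference, here’s a minimal sketch of what they might look like — simplified compared to the original, and using the native flatMap rather than lodash:

```javascript
// Lowercase, strip punctuation and split on whitespace.
// "clown research’s" -> ["clown", "researchs"]
function tokenize(s) {
  return s
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "")
    .split(/\s+/)
    .filter(t => t.length > 0);
}

// Every n-character window of every word: "feet" -> ["fe", "ee", "et"]
function tokenizeNgrams(s, n) {
  return tokenize(s).flatMap(word => {
    var grams = [];
    for (var i = 0; i + n <= word.length; i++) {
      grams.push(word.slice(i, i + n));
    }
    return grams;
  });
}

// Map every token to the list of ids of the documents it occurs in.
function indexDocuments(docs, getTokens) {
  var index = {};
  docs.forEach(doc => {
    getTokens(doc).forEach(token => {
      index[token] = (index[token] || []).concat(doc.id);
    });
  });
  return index;
}
```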
Let’s adjust the search function. Previously, it accepted an index and a search
string, performed a hardcoded tokenization before looking up in the index, and
used a hardcoded threshold for how many search tokens had to match in order to
include an ID. This time, we’ll parameterize all of those.
function search({index, requiredMatches, tokenizer, boost}, q) {
  var tokens = tokenizer(q);
  var hits = _.flatMap(tokens, t => lookupToken(index, t));
  var results = _.groupBy(hits, r => r.id);
  var n = Math.floor(tokens.length * requiredMatches || 1);
  return Object.keys(results)
    .filter(r => isRelevantResult(results[r], n))
    .map(id => getScoredResult(id, results[id], boost))
    .toSorted((a, b) => b.score - a.score);
}
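search leans on a few small helpers that aren’t shown here. Their exact implementations aren’t the point, but sketched out, they could look something like this — the third argument to getScoredResult is a weight that defaults to 1:

```javascript
// Find the documents a token occurs in; one hit object per occurrence.
function lookupToken(index, token) {
  return (index[token] || []).map(id => ({id: id}));
}

// A result is relevant when at least n of the search tokens matched it.
function isRelevantResult(hits, n) {
  return hits.length >= n;
}

// Score a result by its number of hits, scaled by an optional weight.
function getScoredResult(id, hits, weight) {
  return {id: id, score: hits.length * (weight || 1)};
}
```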
To recreate the search from last time, we can call it like so:
search({
  index: index,
  tokenizer: tokenize,
  requiredMatches: 0.8
}, "humor feet");
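It’s worth pausing on what requiredMatches: 0.8 does here. The query "humor feet" tokenizes to two tokens, so the threshold works out as:

```javascript
var tokens = ["humor", "feet"];  // tokenize("humor feet")
var n = Math.floor(tokens.length * 0.8 || 1);
console.log(n); //=> 1
```

So a document is included if at least one of the two tokens matches. Note that when requiredMatches is left out entirely, the expression evaluates to Math.floor(NaN || 1), which is also 1.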
Searching Across Multiple Indexes
Now that we have multiple indexes, we can perform multiple searches and combine
the results. Let’s write a small helper function to assist with this:
function searchAll({queries}, q) {
  var results = _.groupBy(
    _.flatMap(queries, config => search(config, q)),
    r => r.id
  );
  return Object.keys(results)
    .map(id => getScoredResult(id, results[id], 1));
}
This function is quite similar to the search function and works in much the same
way, but in practice it performs an OR across all the queries that are passed
in. Now we can specify that a search should look in multiple indexes and combine
the results:
searchAll({
  queries: [
    {
      index: index.titleNgram,
      requiredMatches: 0.8,
      tokenizer: s => tokenizeNgrams(s, 2)
    },
    {
      index: index.title,
      tokenizer: tokenize
    }
  ]
}, "research humor");
Weighting
Now that we can search by querying the different indexes with different
parameters, we can finally introduce weighting. By adding a boost to a query,
all matches from that query will be multiplied by that number:
searchAll({
  queries: [
    {
      index: index.titleNgram,
      requiredMatches: 0.8,
      tokenizer: s => tokenizeNgrams(s, 2)
    },
    {
      index: index.title,
      tokenizer: tokenize,
      boost: 20
    },
    {
      index: index.description,
      tokenizer: tokenize,
      boost: 5
    }
  ]
}, "research humor");
Now we’re searching across all indexes, but we give higher weight to matches on
whole words, especially those that appear in the title.
Auto-complete
Last time, I mentioned the possibility of using so-called edge ngrams for
auto-complete searches. Let’s build one more index:
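tokenizeEdgeNgrams isn’t defined in this post either. A sketch of the idea, assuming the same word splitting as before: emit every prefix of each word between a minimum and a maximum length, so "role" becomes "ro", "rol" and "role".

```javascript
// Edge ngrams: every prefix of each word, from minLength up to
// maxLength characters. "role" with (2, 15) -> ["ro", "rol", "role"]
function tokenizeEdgeNgrams(s, minLength, maxLength) {
  var words = s.toLowerCase().split(/\s+/).filter(w => w.length > 0);
  return words.flatMap(word => {
    var grams = [];
    for (var n = minLength; n <= Math.min(word.length, maxLength); n++) {
      grams.push(word.slice(0, n));
    }
    return grams;
  });
}
```

Since we index prefixes, a partial query like "rol" produces tokens that are guaranteed to be in the index whenever a title word starts with them, which is exactly what auto-complete needs.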
var index = {
  title: indexDocuments(data, a => tokenize(a.title)),
  description: indexDocuments(data, a => tokenize(a.description)),
  titleNgram: indexDocuments(data, a => tokenizeNgrams(a.title, 2)),
  titleEdgeGrams: indexDocuments(data, a => tokenizeEdgeNgrams(a.title, 2, 15))
};
searchAll({
  queries: [
    {
      index: index.titleEdgeGrams,
      tokenizer: s => tokenizeEdgeNgrams(s, 2, 15)
    }
  ]
}, "the rol");
//=> "The Role of Comical Feet in Humor"
And with that, we’ve got a fairly powerful little search engine in just over
100 lines of plain JavaScript – and that’s even with the three lodash functions
inlined, making the code completely dependency-free. Not bad!
We use a search engine like this to search through 2,000 food items on the new
matvaretabellen.no. To reduce the amount of work done on the client, we serve a
pre-chewed index from the server (a statically generated one, of course). You
can judge for yourself how well it works, but we think it turned out quite nice.