Building a Full-Text Search Engine in a Single File with SQLite
SQLite's FTS5 plugin allows you to add full-text search to your apps with support for word stemming, ranking functions, and an advanced query syntax.
What is SQLite?
SQLite is a full, self-contained SQL database engine that stores information in a single file. It’s probably already installed on every device you own, and it’s built and maintained by a small team of only three developers.
If you aren’t familiar with it, I highly recommend you read their about page.
When should you use it?
You can use SQLite for most small to medium-scale web applications (in addition to many other documented uses) without facing any performance issues. For most read-heavy applications, SQLite is all you’ll ever need! If you need to scale your backend horizontally, you could still stick with SQLite if you wanted to via LiteFS or migrate to a client/server database like PostgreSQL.
Easysearch
If you are interested in a project built on the same concepts that this article explores, check out Easysearch. It is an open-source project that I built that crawls your site, adds pages to a search index, and exposes a JSON API and a prebuilt search interface for you.
Enabling FTS5
FTS5
(“Full Text Search 5”) is a built-in SQLite extension that allows you to create full-text search indexes on your data.
To enable it, you can pass a compile-time option, but you are most likely using a prebuilt version of SQLite with a library for your favorite programming language.
In most distributions, it’s enabled by default. You can check if it’s enabled by running
This opens an in-memory database. Then, run
If ENABLE_FTS5
is present in the output, then you’re good to go!
A specific case that I came across: if you’re using mattn/go-sqlite3, pass the fts5
compile tag to include the extension when compiling:
Creating a Table
To get started, let’s create a table for our documents:
Then, let’s create a virtual table for the FTS5
extension to maintain a search index:
This statement specifies the columns we want to be indexed (url
, title
, description
, and content
) and where the content is coming from (the pages
table).
Even though we specify that content comes from the pages
table, we still have to update the virtual table manually whenever we update the content table to keep the two in sync. This can be automated with triggers (see below).
Aside: why not just use one table?
We could use UNINDEXED
columns in an FTS5
virtual table instead of creating a separate table for metadata.
To do this, you would create the table like this:
Keeping all of our information in one table gives us simplicity at the cost of control.
- Pro: This way, you don’t have to keep two tables in sync, and your search index will never be out of sync with your content.
- Con: In FTS tables, all of your columns are of the
TEXT
type. - Con: You can’t add indexes to FTS5 virtual tables (besides the default full-text search index, of course). This means that filtering using
WHERE
clauses that don’t useMATCH
could cause a full table scan, greatly reducing performance. - Con: With a one-table setup, you have less control over when the index is updated, which can be computationally expensive. By separating content and metadata, you can decide to (for example) update metadata without updating content or update content in bulk after a certain amount of batched updates.
For these reasons, if you have any metadata that doesn’t need to be added to the full-text search index, I recommend going with a two-table approach.
Keeping Your Content Up-to-Date
The FTS5 documentation has an example set of triggers to keep your FTS table in sync with your content table.
Here it is:
In this example, the tbl_ai
(ai
= “auto insert”) trigger inserts a row into the FTS table after a row is inserted into the content table. The tbl_ad
(“auto delete”) and tbl_au
(“auto update”) triggers do the same.
Here are a few FTS-specific items to note:
- The auto-update trigger (
tbl_au
) has to manually remove and reinsert the row because there is no way toUPDATE
a row in an FTS virtual table. - The
INSERT INTO ... VALUES ('delete', ...)
lines invoke the special FTS5delete
command. Notice that, to delete a row, you must provide the values of all of its columns.
Now that the FTS virtual table will be updated automatically whenever we insert data, we can pretend that the it doesn’t exist until we need to query it.
Updating the index manually
You can also forgo the triggers, but you would have to update your FTS table every time you update the content table to keep the index in sync.
For example (using the same table names from the previous example):
Querying
Now that you’ve added your content, it’s time to run some search queries! You can search for content in your FTS5 virtual table using MATCH
expressions:
FTS5 has a custom query syntax with a few operators:
- Column specifiers: only search one column for matches
- AND and OR: combine clauses
- NOT: exclude matches
- NEAR: find matches within a certain proximity to other matches
- “quotes”: find a full phrase in a document
- *: wildcard
All of these are optional. If you want to, you can expose these options to your users in a search interface, but you need to be careful! Unclosed quotes or other incorrect syntax will cause your query to fail.
The simplest type of query is a bare word:
This searches for all items that contain the word “hello” in any column. You can search specfic columns using a column specifier:
…or, if you want to search multiple columns, format them as a space-separated list inside curly braces:
You can negate a column specifier using a -
before the expression:
Search for matches at the start of a column value using the ^
character, just like in a regular expression:
Note that, contrary to what you might expect, you cannot use $
for a string-end match.
If your phrase contains multiple words, make sure to surround it with quotes or join the words with the +
operator!
Combine different search queries into one using the boolean operators, AND
, OR
, and NOT
:
One of the most useful operators is *
, which marks the preceding token as a prefix token, matching the start of a word.
Note that this cannot be used at the beginning of a token. It only works for the end of a phrase. If it’s included anywhere other than the end of a phrase, the FTS5 extension will either ignore it or include it in the query, depending on the current tokenizer.
Finally, you can search for words based on their proximity to other words using the NEAR
function.
Note that there is no comma between the two phrases, but there is one before the number.
This query searches for documents matching "hello"
within 10 tokens of a match for "world"
.
Escaping User Queries
If you need to escape users’ search queries, perform some sanitization along these lines:
- Split words by word boundaries (
\W
in a regular expression) - Remove empty words (optional)
- Surround each word with double quotes
- Join the words together with spaces in between
- Add a
*
to the end of the string to help complete partial queries (optional)
Splitting words by word boundaries will also remove quotes from each word, so you don’t need to worry about stripping those manually.
As an example, here is what I do for Easysearch:
Ranking
If you are building a search engine, you will obviously want to sort results by relevance. With FTS5, you can do this by sorting by the rank
column:
If you SELECT
the rank
column, you can see the bm25
scores for each result. Lower numbers indicate better matches because the default sorting order is ascending.
Column Weighting
If some columns should be weighted higher than others in your results, you can use the built-in bm25
ranking function instead of the default rank
column.
This query ranks the first column in the table twice as heavily as the second column. Add as many numeric parameters as you have columns in your FTS table; if you don’t have enough, the remaining ones default to 1.0
.
Highlighting Matches in Returned Results
FTS5 provides two auxiliary functions to help expose the matched phrases in search results: highlight
and snippet
.
highlight
The highlight
function returns the entire column value for each row. When a match is encountered, it is surrounded by the characters passed as the third (starting character) and fourth (ending character) arguments.
The first argument is the name of the FTS table, and the second argument is the 0-based index of the column to match.
This allows you to emphasize the matches for the user’s query in your search results page.
snippet
The snippet
function is just like highlight
, but it only returns a short section before and after the matched text. This is helpful when dealing with long text content in a column since highlight
automatically performs truncation that you would likely have to do otherwise.
It accepts 6 arguments:
- The name of the FTS table
- The column index, or pass a negative number to automatically choose a column
- A prefix string for each match
- A suffix string for each match
- A string to add at the beginning and/or end when there is additional content that was truncated; I recommend ellipses (
…
) - The maximum number of tokens (≈ number of words) to include in the snippet; up to 64
Here’s an example:
This selects rows matching the search term “hello”, highlights the exact matches in the 2nd column by surrounding them with <b>
and </b>
tags, truncates the match to at most 8 tokens, and adds ellipses where the snippet was truncated. (In a real project, you wouldn’t use HTML tags like this since it would open you up to XSS attacks, but it’s a good demonstrative example.)
In Easysearch’s FTS queries, I use highlight
for the page title (since I always want the full title to be returned) and snippet
for the description and page content (for relevant search result snippets).
FTS5 Commands
FTS5 allows you to administrate your virtual table using a number of special commands. These are formatted as INSERT
statements — likely to avoid adding any new SQL keywords (maybe SQLite extensions aren’t allowed to do this?), but I couldn’t find a reliable source that explains this decision.
Here are some of the most useful commands:
optimize
This command merges all separate B-trees into one, making your database smaller and your queries faster.
If you want to reclaim disk space, you will have to run VACUUM
.
rebuild
If your content table gets out of sync with your FTS virtual table, you can use the rebuild
command to recreate your search index from your content.
integrity-check
This command allows you to ensure your search index is consistent internally and matches your content table. If there is a problem, these commands will fail with a SQLITE_CORRUPT_VTAB
error, which can be fixed by running the rebuild
command as described above.
More Commands
For a complete command reference, see the official documentation.
Further reading
- The official SQLite FTS5 documentation: https://sqlite.org/fts5.html
- Writing a custom scoring algorithm on top of SQLite’s FTS4 extension: https://simonwillison.net/2019/Jan/7/exploring-search-relevance-algorithms-sqlite/
- Building a SQLite-powered search engine with Python and Datasette: https://simonwillison.net/2018/Dec/19/fast-autocomplete-search/
- An article exploring the different ways to implement full-text search and their tradeoffs: https://medium.com/dev-channel/how-to-add-full-text-search-to-your-website-4e9c80ce2bf4
- A SQLite extension that helps account for misspellings in search queries: https://sqlite.org/spellfix1.html
- My open-source web crawler that builds an FTS5 index on your website: https://github.com/FluxCapacitor2/easysearch/