PostgreSQL and Sphinx Search Seamless Integration

Full-text search is one of the most common tasks in web development. Popular RDBMSs, such as MySQL and PostgreSQL, have built-in mechanisms for it. Unfortunately, their implementations are far from perfect. Simple word splitting is not suitable for highly inflected languages such as Russian.

This is where specialized tools that cater to the specifics of different languages come in handy. The Sphinx search server has proven itself well, and it supports multiple languages. Sphinx even offers such an interesting option as phonetic search, meaning that you can look for words like "ya", "luv", "kool", "K" and find "you", "love", "cool", and "ok".

Unfortunately, this impressive ability to search (and find) almost arbitrary text also comes at the cost of flexibility.

How it works

The traditional workflow with a search server is as follows:

  1. the indexing program accesses the database directly, scans the required tables, and builds the index files;
  2. the search service (daemon) is started;
  3. the application sends a search query to the service and receives the results.

If the data is updated (which is most likely), the indexes need to be rebuilt, and here the scheme is far from perfect:

  1. the indexing program builds a new index in intermediate files;
  2. a signal is sent to the search service, and it switches to the new indexes;
  3. the old indexes are removed.

The indexer has to be run independently (e.g. from cron), which means that data reaches the search indexes with a delay. During an update, the index requires twice as much disk space. Incremental updates are possible, but they require an additional procedure to merge indexes. It is also necessary to somehow determine which data has already been indexed and which has not; there are many approaches here: remembering the last inserted id, a boolean flag, and so on. Whatever the approach, it requires writing extra code. Once you take into account that data can not only be added but also updated and even removed, the picture turns depressing.

Fortunately, there is a solution: Sphinx offers a great feature called real-time (RT) indexes. Their main difference is that RT-indexes do not require an indexer and allow you to add, update, and delete data by talking to the search service directly.
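For reference, an RT index is declared directly in the Sphinx configuration; no data source and no indexer section are needed. A minimal sketch of what the blog_posts index used below might look like (the path and the field list are assumptions for illustration):

    index blog_posts
    {
        type         = rt
        path         = /var/lib/sphinx/blog_posts

        # full-text fields
        rt_field     = title
        rt_field     = content

        # attribute available for filtering, e.g. by author
        rt_attr_uint = author_id
    }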

In the traditional setup, the indexer queries the database on its own; with RT-indexes, the application has to push the updates itself. Is it feasible to combine the two approaches by shifting the responsibility to the database?

The answer is yes: modern RDBMSs can monitor data changes and react to them with triggers, which means triggers can be used to keep the search indexes up to date. However, triggers cannot open network connections on their own, so all the backend work has to be done in user-defined functions (UDFs). For decent performance these belong in a compiled shared library rather than in an interpreted language such as PL/Python. Better still would be a ready-made solution for integrating PostgreSQL with Sphinx.

Let's move from words to deeds!

Pg-sphinx, an extension linking Sphinx search with PostgreSQL, allows you to update indexes directly from triggers, stored procedures, and any other place where a function call is possible. For example, the SQL code below updates (or inserts, if it did not exist yet) record no. 3 in the blog_posts index.

    SELECT sphinx_replace('blog_posts', 3, ARRAY[
      'title', 'Report',
      'content', 'Here are photos from the last journey'
      ]);

Deletion is even easier:

    SELECT sphinx_delete('blog_posts', 3);
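These calls fit naturally into triggers, so the index stays in sync without any application code. Below is a sketch of such a trigger, assuming a posts table with id, title, and content columns (the table and column names are illustrative, not part of pg-sphinx):

    CREATE OR REPLACE FUNCTION posts_sphinx_sync() RETURNS trigger AS $$
    BEGIN
      IF TG_OP = 'DELETE' THEN
        PERFORM sphinx_delete('blog_posts', OLD.id);
        RETURN OLD;
      END IF;
      -- INSERT or UPDATE: replace the document in the index
      PERFORM sphinx_replace('blog_posts', NEW.id, ARRAY[
        'title',   NEW.title,
        'content', NEW.content
      ]);
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER posts_sphinx_sync
      AFTER INSERT OR UPDATE OR DELETE ON posts
      FOR EACH ROW EXECUTE PROCEDURE posts_sphinx_sync();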

Well, since we have an extension, why not go further? Besides updating data, we can also run search queries right from the database.

    SELECT * FROM sphinx_search(
                    'blog_posts',         /* index */
                    'recipes mayonnaise', /* query */
                    'author_id = 361',    /* additional condition */
                    '@relevance DESC',    /* sort order */
                    0,                    /* offset */
                    3,                    /* limit */
                    null);                /* options */

Such a request will return something like this:

     id   weight
    144     1661
    135     1644
    130     1640

There is nothing fancy so far: we could get the same result in the application by contacting the search service directly. However, having such a function inside the SQL server opens up wide opportunities, because it can be used in more complex queries. The first and easiest thing that comes to mind is to use it as a source for an INNER JOIN.

    SELECT posts.*, ss.weight
    FROM posts
    INNER JOIN sphinx_search(
                        'blog_posts',         /* index */
                        'recipes mayonnaise', /* query */
                        'author_id = 361',    /* additional condition */
                        '@relevance DESC',    /* sort order */
                        0,                    /* offset */
                        3,                    /* limit */
                        null)                 /* options */
        AS ss ON ss.id = posts.id;

Such a request does not just find identifiers; it also fetches the records themselves.

     id   title                          content                                                             weight
    144   Senseless and merciless        Why you should stuff meatballs with pasta...                          1661
    135   Treasury of ideas              Hopefully this recipe collection will refresh your imagination...     1644
    130   When the soul needs a holiday  I would like a carnival, brazilian...                                 1640

This can already be used in the application: the result of such a query is easily fed to an ORM.

Summary

Let's sum up the main aspects of the described approach to connecting Sphinx with PostgreSQL.

What did we get?

  1. We got rid of running the indexer, so there are no periodic load peaks on the database.
  2. The data in the search index is always up to date.
  3. The application does not care about updating indexes; that is the responsibility of the RDBMS.
  4. The search is performed on the database server side, so the application does not need to maintain a connection to the search engine.
  5. Search queries can be freely mixed with each other and with queries against the data itself.
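To illustrate the last point: since sphinx_search is an ordinary set-returning function, its result can take part in joins and aggregation like any other relation. A sketch, assuming posts has an author_id column and an authors table exists (both are illustrative; whether null is accepted for the condition and sort arguments is also an assumption):

    /* How many matching posts does each author have? */
    SELECT authors.name, count(*) AS found_posts
    FROM sphinx_search('blog_posts', 'recipes mayonnaise',
                       null, null, 0, 100, null) AS ss
    JOIN posts   ON posts.id = ss.id
    JOIN authors ON authors.id = posts.author_id
    GROUP BY authors.name
    ORDER BY found_posts DESC;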

What is missing? What is not implemented?

  1. We made the implicit assumption that all data is stored in UTF-8; support for other encodings was deliberately left out.
  2. Highlighting of the found words is not implemented.
  3. Reconfiguration of connection to the search server requires recompilation.
  4. We did not implement features such as transactions (currently AUTOCOMMIT by default) or user-defined functions (expressions have to be written out explicitly).
  5. ...

Nonetheless, despite these shortcomings, the extension can already be used in the majority of simple applications.

Hopefully, the ideas listed above have given you some useful insights on how to integrate Sphinx with PostgreSQL for the benefit of your application.

References:

  1. Sphinx
  2. PostgreSQL
  3. pg-sphinx