RUM: improved inverted index for full-text search based on GIN index
RUM is an extension which adds a RUM index to Postgres.
RUM index is based on GIN that stores additional per-entry information in a posting tree. For example, positional information of lexemes or timestamps. In comparison to GIN it can use this information to make faster index-only scans for:
- Phrase search
- Text search with ranking by text distance operator
- Text
SELECTs with ordering by some non-indexed additional column e.g. by timestamp.
RUM works best in scenarios when the possible keys are highly repeatable. I.e. all texts are composed of a limited amount of words, so per-lexeme indexing gives significant speed-up in searching texts containing word combinations or phrases.
Main operators for ordering are:
tsvector <=> tsquery | float4 | Distance between tsvector and tsquery.
value <=> value | float8 | Distance between two values.
Where value is timestamp, timestamptz, int2, int4, int8, float4, float8, money and oid
Usage#
Enable the extension#
You can get started with rum by enabling the extension in your Supabase dashboard.
- Go to the Database page in the Dashboard.
- Click on Extensions in the sidebar.
- Search for "rum" and enable the extension.
Syntax#
For type: tsvector#
To understand the following you may need first to see Official Postgres documentation on text search
rum_tsvector_ops
1CREATE TABLE test_rum(t text, a tsvector);23CREATE TRIGGER tsvectorupdate4BEFORE UPDATE OR INSERT ON test_rum5FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger('a', 'pg_catalog.english', 't');67INSERT INTO test_rum(t) VALUES ('The situation is most beautiful');8INSERT INTO test_rum(t) VALUES ('It is a beautiful');9INSERT INTO test_rum(t) VALUES ('It looks like a beautiful place');1011CREATE INDEX rumidx ON test_rum USING rum (a rum_tsvector_ops);And we can execute tsvector selects with ordering by text distance operator:
1SELECT t, a `<=>` to_tsquery('english', 'beautiful | place') AS rank2 FROM test_rum3 WHERE a @@ to_tsquery('english', 'beautiful | place')4 ORDER BY a `<=>` to_tsquery('english', 'beautiful | place');5 t | rank6---------------------------------+---------7 It looks like a beautiful place | 8.224678 The situation is most beautiful | 16.44939 It is a beautiful | 16.449310(3 rows)rum_tsvector_addon_ops
1CREATE TABLE tsts (id int, t tsvector, d timestamp);2CREATE INDEX tsts_idx ON tsts USING rum (t rum_tsvector_addon_ops, d)3 WITH (attach = 'd', to = 't');Now we can execute the selects with ordering distance operator on attached column:
1SELECT id, d, d `<=>` '2016-05-16 14:21:25' FROM tsts WHERE t @@ 'wr&qh' ORDER BY d `<=>` '2016-05-16 14:21:25' LIMIT 5;2 id | d | ?column?3-----+---------------------------------+---------------4 355 | Mon May 16 14:21:22.326724 2016 | 2.6732765 354 | Mon May 16 13:21:22.326724 2016 | 3602.6732766 371 | Tue May 17 06:21:22.326724 2016 | 57597.3267247 406 | Wed May 18 17:21:22.326724 2016 | 183597.3267248 415 | Thu May 19 02:21:22.326724 2016 | 215997.3267249(5 rows)For type: anyarray#
rum_anyarray_ops
This operator class stores anyarray elements with length of the array. It supports operators &&, @>, <@, =, % operators. It also supports ordering by <=> operator.
1CREATE TABLE test_array (i int2[]);2INSERT INTO test_array VALUES ('{}'), ('{0}'), ('{1,2,3,4}'), ('{1,2,3}'), ('{1,2}'), ('{1}');3CREATE INDEX idx_array ON test_array USING rum (i rum_anyarray_ops);Now we can execute the query using index scan:
1SELECT * FROM test_array WHERE i && '{1}' ORDER BY i `<=>` '{1}' ASC;2 i3-----------4 {1}5 {1,2}6 {1,2,3}7 {1,2,3,4}8(4 rows)rum_anyarray_addon_ops
The does the same with anyarray index as rum_tsvector_addon_ops i.e. allows to order select results using distance
operator by attached column.
Limitations#
RUM has slower build and insert times than GIN due to:
- It is bigger due to the additional attributes stored in the index.
- It uses generic WAL records.