Твиты

Таблица:

=> \d tweets
                         Table "public.tweets"
   Column   |           Type           | Collation | Nullable | Default 
------------+--------------------------+-----------+----------+---------
 id         | bigint                   |           |          | 
 coord      | point                    |           |          | 
 hashtags   | text[]                   |           |          | 
 message    | text                     |           |          | 
 created_at | timestamp with time zone |           |          | 
 tweet      | jsonb                    |           |          | 
Indexes:
    "tweets_tweet_gin" gin (tweet)
    "tweets_tweet_path" gin (tweet jsonb_path_ops)

Столбец tweet содержит исходный JSON, часть полей из которого вынесена в отдельные столбцы.


Размер таблицы:

=> SELECT count(*) FROM tweets;
 count  
--------
 582643
(1 row)

=> SELECT pg_size_pretty(pg_table_size('tweets'));
 pg_size_pretty 
----------------
 1575 MB
(1 row)


Вот наши точки:

=> SELECT id, coord FROM tweets WHERE coord IS NOT NULL LIMIT 10;
         id         |              coord              
--------------------+---------------------------------
 952268627343888385 | (-84.3879824,33.7489954)
 952268343175598080 | (-104.8771726,39.5807452)
 953431814328107008 | (-117.94764155,33.71010495)
 943747045000544256 | (145.053135344,-37.97256651425)
 953420017630109697 | (-77.0143985,38.89860285)
 951179952065425409 | (-73.9894807,40.7354199)
 951159735033643009 | (-75.3599105,40.0462208)
 951151295779414017 | (-77.4310992,38.8942786)
 940564700500180993 | (-80.2781262,36.1341553)
 940564668644450304 | (-77.7199932,39.6417629)
(10 rows)


Есть ли кто из Москвы и окрестностей?

=> SELECT id, left(message,80), hashtags, tweet #>> '{place,name}'
FROM tweets WHERE coord <@ box '36.62, 54.75, 38.62, 56.75' \gx
-[ RECORD 1 ]------------------------------------------------------------------------------
id       | 943030276300705792
left     | #EU considers launching database of Bitcoin owners to crack down on criminals' |
hashtags | {EU}
?column? | Moscow
-[ RECORD 2 ]------------------------------------------------------------------------------
id       | 942987014651211782
left     | @leastfixedpoint The actual git database can represent empty dirs … no problem t
hashtags | {}
?column? | Moscow
-[ RECORD 3 ]------------------------------------------------------------------------------
id       | 951035902003998720
left     | Why Microsoft’s Cosmos DB may displace AWS’s cloud databases https://t.co/EZGYKa
hashtags | {}
?column? | Красногорск


Последовательное сканирование

Пока нет индекса, возможно только последовательное сканирование таблицы:

=> EXPLAIN (costs off,analyze)
SELECT * FROM tweets WHERE coord <@ box '36.62, 54.75, 38.62, 56.75';
                           QUERY PLAN                            
-----------------------------------------------------------------
 Seq Scan on tweets (actual time=17.683..202.791 rows=3 loops=1)
   Filter: (coord <@ '(38.62,56.75),(36.62,54.75)'::box)
   Rows Removed by Filter: 582640
 Planning time: 0.143 ms
 Execution time: 203.314 ms
(5 rows)


GiST

Создадим индекс GiST.

Тип индекса указывается в предложении USING (а по умолчанию - B-дерево).

=> CREATE INDEX tweets_coord_gist ON tweets USING gist(coord);
CREATE INDEX

=> EXPLAIN (costs off,analyze)
SELECT * FROM tweets WHERE coord <@ box '36.62, 54.75, 38.62, 56.75';
                                       QUERY PLAN                                       
----------------------------------------------------------------------------------------
 Bitmap Heap Scan on tweets (actual time=0.668..0.674 rows=3 loops=1)
   Recheck Cond: (coord <@ '(38.62,56.75),(36.62,54.75)'::box)
   Heap Blocks: exact=3
   ->  Bitmap Index Scan on tweets_coord_gist (actual time=0.658..0.658 rows=3 loops=1)
         Index Cond: (coord <@ '(38.62,56.75),(36.62,54.75)'::box)
 Planning time: 9.273 ms
 Execution time: 1.924 ms
(7 rows)

Теперь запрос выполняется с помощью индекса.


GiST: k-NN search

Как найти 10 твитов, сделанных из точек, максимально близких к Москве?

Можно постепенно увеличивать радиус... но на сколько?


GiST дает лучшее решение с помощью упорядочивающего оператора <->.

=> SELECT tweet #>> '{place,name}' AS location, left(message,60) AS message
FROM tweets ORDER BY coord <-> point(37.62, 55.75) LIMIT 10;
      location       |                           message                            
---------------------+--------------------------------------------------------------
 Moscow              | @leastfixedpoint The actual git database can represent empty
 Moscow              | #EU considers launching database of Bitcoin owners to crack 
 Красногорск         | Why Microsoft’s Cosmos DB may displace AWS’s cloud databases
 Коммунар            | @_vpol_ нашел уже, graph database
 Ukraine             | @StayDashy but seriously you must go to pornhub and make a d
 Ukraine             | We are proud to announce that United States #Patent and Trad
 Ukraine             | #ArcelorMittalKryvyiRih will invest 1.5 bln USD for moderniz
 Republic of Belarus | @abeisgreat @FirebaseRelease @Firebase i.e. all my database?
 Joensuu             | 1.4 Billion Clear Text #Credentials Discovered in a Single D
 Joensuu             | some of the last entries added to the leaked database. It’s 
(10 rows)


Запрос выполняется с помощью индекса:

=> EXPLAIN (costs off,analyze)
SELECT * FROM tweets ORDER BY coord <-> point(37.62, 55.75) LIMIT 10;
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Limit (actual time=0.065..0.164 rows=10 loops=1)
   ->  Index Scan using tweets_coord_gist on tweets (actual time=0.064..0.158 rows=10 loops=1)
         Order By: (coord <-> '(37.62,55.75)'::point)
 Planning time: 0.079 ms
 Execution time: 0.190 ms
(5 rows)


GiST как B-дерево?

Далеко не так эффективно, но иногда полезно. Например, хотим показать твиты отсортированными по дате.

=> EXPLAIN (costs off,analyze)
SELECT * FROM tweets
WHERE coord <@ '-74.0, 40.75, 2'::circle
ORDER BY created_at DESC LIMIT 10;
                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Limit (actual time=1.102..1.109 rows=10 loops=1)
   ->  Sort (actual time=1.101..1.103 rows=10 loops=1)
         Sort Key: created_at DESC
         Sort Method: top-N heapsort  Memory: 23kB
         ->  Bitmap Heap Scan on tweets (actual time=0.223..0.844 rows=498 loops=1)
               Recheck Cond: (coord <@ '<(-74,40.75),2>'::circle)
               Heap Blocks: exact=460
               ->  Bitmap Index Scan on tweets_coord_gist (actual time=0.167..0.167 rows=498 loops=1)
                     Index Cond: (coord <@ '<(-74,40.75),2>'::circle)
 Planning time: 0.103 ms
 Execution time: 1.151 ms
(11 rows)


=> CREATE INDEX ON tweets USING gist(coord,created_at);
ERROR:  data type timestamp with time zone has no default operator class for access method "gist"
HINT:  You must specify an operator class for the index or define a default operator class for the data type.

Встроенной поддержки обычных типов данных для GiST нет...


...но есть расширение:

=> CREATE EXTENSION btree_gist;
CREATE EXTENSION
=> CREATE INDEX tweets_coord_created_gist ON tweets USING gist(coord,created_at);
CREATE INDEX

Твиты в окрестности Нью-Йорка, "ближайшие к сегодня":

=> SELECT tweet #>> '{place,name}', created_at, left(message,40) FROM tweets
WHERE coord <@ '-74.0, 40.75, 2'::circle
ORDER BY created_at <-> now() LIMIT 10;
   ?column?    |       created_at       |                   left                   
---------------+------------------------+------------------------------------------
 Trenton       | 2018-01-24 07:06:19+03 | S978 [NEW] "New Jersey All-Payer Claims 
 Manhattan     | 2018-01-24 05:28:13+03 | We are hopefully moving on past “build a
 Manhattan     | 2018-01-24 00:35:29+03 | Want to work at Actus Consulting Group? 
 New Jersey    | 2018-01-23 23:32:32+03 | Want to work in #Fairfield, NJ? View our
 Manhattan     | 2018-01-23 22:52:26+03 | Join the Accenture team! See our latest 
 Philadelphia  | 2018-01-23 22:04:23+03 | See our latest #Philadelphia, PA #job an
 Kearny        | 2018-01-23 19:50:27+03 | We're #hiring! Read about our latest #jo
 Brooklyn      | 2018-01-23 19:26:22+03 | This #job might be a great fit for you: 
 Trenton       | 2018-01-23 18:26:18+03 | A1098 [NEW] Requires newly manufactured 
 New Brunswick | 2018-01-23 17:21:49+03 | Want to work at Kelly Services? We're #h
(10 rows)


Запрос снова выполняется с помощью индекса.

=> EXPLAIN (costs off,analyze)
SELECT * FROM tweets
WHERE coord <@ '-74.0, 40.75, 2'::circle
ORDER BY created_at <-> now() LIMIT 10;
                                              QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
 Limit (actual time=0.161..0.442 rows=10 loops=1)
   ->  Index Scan using tweets_coord_created_gist on tweets (actual time=0.159..0.435 rows=10 loops=1)
         Index Cond: (coord <@ '<(-74,40.75),2>'::circle)
         Order By: (created_at <-> now())
 Planning time: 0.124 ms
 Execution time: 0.494 ms
(6 rows)


SP-GiST (quadtree)

=> CREATE INDEX tweets_coord_quad_spgist ON tweets USING spgist(coord);
CREATE INDEX

Запрос использует созданный индекс:

=> EXPLAIN (costs off,analyze)
SELECT * FROM tweets WHERE coord <@ box '36.62, 54.75, 38.62, 56.75';
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tweets (actual time=0.020..0.023 rows=3 loops=1)
   Recheck Cond: (coord <@ '(38.62,56.75),(36.62,54.75)'::box)
   Heap Blocks: exact=3
   ->  Bitmap Index Scan on tweets_coord_quad_spgist (actual time=0.014..0.014 rows=3 loops=1)
         Index Cond: (coord <@ '(38.62,56.75),(36.62,54.75)'::box)
 Planning time: 0.177 ms
 Execution time: 0.046 ms
(7 rows)


SP-GiST (kd-tree)

Чтобы создать k-D-дерево с помощью SP-GiST, нужно явно указать класс операторов:

=> CREATE INDEX tweets_coord_kd_spgist ON tweets USING spgist(coord kd_point_ops);
CREATE INDEX
=> EXPLAIN (costs off,analyze)
SELECT * FROM tweets WHERE coord <@ box '36.62, 54.75, 38.62, 56.75';
                                         QUERY PLAN                                          
---------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tweets (actual time=0.028..0.031 rows=3 loops=1)
   Recheck Cond: (coord <@ '(38.62,56.75),(36.62,54.75)'::box)
   Heap Blocks: exact=3
   ->  Bitmap Index Scan on tweets_coord_kd_spgist (actual time=0.022..0.022 rows=3 loops=1)
         Index Cond: (coord <@ '(38.62,56.75),(36.62,54.75)'::box)
 Planning time: 0.160 ms
 Execution time: 0.059 ms
(7 rows)

В нашем случае разница между классами не ощущается.


Сравним размеры трех индексов:

=> SELECT pg_size_pretty(pg_total_relation_size('tweets_coord_gist')) as gist,
pg_size_pretty(pg_total_relation_size('tweets_coord_quad_spgist')) as quad_spgist,
pg_size_pretty(pg_total_relation_size('tweets_coord_kd_spgist')) as kd_spgist;
 gist  | quad_spgist | kd_spgist 
-------+-------------+-----------
 20 MB | 14 MB       | 14 MB
(1 row)


Конец демонстрации.