Таблица:
=> \d tweets
Table "public.tweets"
Column | Type | Collation | Nullable | Default
------------+--------------------------+-----------+----------+---------
id | bigint | | |
coord | point | | |
hashtags | text[] | | |
message | text | | |
created_at | timestamp with time zone | | |
tweet | jsonb | | |
Indexes:
"tweets_tweet_gin" gin (tweet)
"tweets_tweet_path" gin (tweet jsonb_path_ops)
Столбец tweet содержит исходный JSON, часть полей из которого вынесена в отдельные столбцы.
Размер таблицы:
=> SELECT count(*) FROM tweets;
count -------- 582643 (1 row)
=> SELECT pg_size_pretty(pg_table_size('tweets'));
pg_size_pretty ---------------- 1575 MB (1 row)
Вот наши точки:
=> SELECT id, coord FROM tweets WHERE coord IS NOT NULL LIMIT 10;
id | coord
--------------------+---------------------------------
952268627343888385 | (-84.3879824,33.7489954)
952268343175598080 | (-104.8771726,39.5807452)
953431814328107008 | (-117.94764155,33.71010495)
943747045000544256 | (145.053135344,-37.97256651425)
953420017630109697 | (-77.0143985,38.89860285)
951179952065425409 | (-73.9894807,40.7354199)
951159735033643009 | (-75.3599105,40.0462208)
951151295779414017 | (-77.4310992,38.8942786)
940564700500180993 | (-80.2781262,36.1341553)
940564668644450304 | (-77.7199932,39.6417629)
(10 rows)
Есть ли кто из Москвы и окрестностей?
=> SELECT id, left(message,80), hashtags, tweet #>> '{place,name}' FROM tweets WHERE coord <@ box '36.62, 54.75, 38.62, 56.75' \gx
-[ RECORD 1 ]------------------------------------------------------------------------------
id | 943030276300705792
left | #EU considers launching database of Bitcoin owners to crack down on criminals' |
hashtags | {EU}
?column? | Moscow
-[ RECORD 2 ]------------------------------------------------------------------------------
id | 942987014651211782
left | @leastfixedpoint The actual git database can represent empty dirs … no problem t
hashtags | {}
?column? | Moscow
-[ RECORD 3 ]------------------------------------------------------------------------------
id | 951035902003998720
left | Why Microsoft’s Cosmos DB may displace AWS’s cloud databases https://t.co/EZGYKa
hashtags | {}
?column? | Красногорск
Пока нет индекса, возможно только последовательное сканирование таблицы:
=> EXPLAIN (costs off,analyze) SELECT * FROM tweets WHERE coord <@ box '36.62, 54.75, 38.62, 56.75';
QUERY PLAN
-----------------------------------------------------------------
Seq Scan on tweets (actual time=17.683..202.791 rows=3 loops=1)
Filter: (coord <@ '(38.62,56.75),(36.62,54.75)'::box)
Rows Removed by Filter: 582640
Planning time: 0.143 ms
Execution time: 203.314 ms
(5 rows)
Создадим индекс GiST.
Тип индекса указывается в предложении USING (а по умолчанию - B-дерево).
=> CREATE INDEX tweets_coord_gist ON tweets USING gist(coord);
CREATE INDEX
=> EXPLAIN (costs off,analyze) SELECT * FROM tweets WHERE coord <@ box '36.62, 54.75, 38.62, 56.75';
QUERY PLAN
----------------------------------------------------------------------------------------
Bitmap Heap Scan on tweets (actual time=0.668..0.674 rows=3 loops=1)
Recheck Cond: (coord <@ '(38.62,56.75),(36.62,54.75)'::box)
Heap Blocks: exact=3
-> Bitmap Index Scan on tweets_coord_gist (actual time=0.658..0.658 rows=3 loops=1)
Index Cond: (coord <@ '(38.62,56.75),(36.62,54.75)'::box)
Planning time: 9.273 ms
Execution time: 1.924 ms
(7 rows)
Теперь запрос выполняется с помощью индекса.
Как найти 10 твитов, сделанных из точек, максимально близких к Москве?
Можно постепенно увеличивать радиус... но на сколько?
GiST дает лучшее решение с помощью упорядочивающего оператора <->.
=> SELECT tweet #>> '{place,name}' AS location, left(message,60) AS message FROM tweets ORDER BY coord <-> point(37.62, 55.75) LIMIT 10;
location | message
---------------------+--------------------------------------------------------------
Moscow | @leastfixedpoint The actual git database can represent empty
Moscow | #EU considers launching database of Bitcoin owners to crack
Красногорск | Why Microsoft’s Cosmos DB may displace AWS’s cloud databases
Коммунар | @_vpol_ нашел уже, graph database
Ukraine | @StayDashy but seriously you must go to pornhub and make a d
Ukraine | We are proud to announce that United States #Patent and Trad
Ukraine | #ArcelorMittalKryvyiRih will invest 1.5 bln USD for moderniz
Republic of Belarus | @abeisgreat @FirebaseRelease @Firebase i.e. all my database?
Joensuu | 1.4 Billion Clear Text #Credentials Discovered in a Single D
Joensuu | some of the last entries added to the leaked database. It’s
(10 rows)
Запрос выполняется с помощью индекса:
=> EXPLAIN (costs off,analyze) SELECT * FROM tweets ORDER BY coord <-> point(37.62, 55.75) LIMIT 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Limit (actual time=0.065..0.164 rows=10 loops=1)
-> Index Scan using tweets_coord_gist on tweets (actual time=0.064..0.158 rows=10 loops=1)
Order By: (coord <-> '(37.62,55.75)'::point)
Planning time: 0.079 ms
Execution time: 0.190 ms
(5 rows)
Далеко не так эффективно, но иногда полезно. Например, хотим показать твиты отсортированными по дате.
=> EXPLAIN (costs off,analyze) SELECT * FROM tweets WHERE coord <@ '-74.0, 40.75, 2'::circle ORDER BY created_at DESC LIMIT 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Limit (actual time=1.102..1.109 rows=10 loops=1)
-> Sort (actual time=1.101..1.103 rows=10 loops=1)
Sort Key: created_at DESC
Sort Method: top-N heapsort Memory: 23kB
-> Bitmap Heap Scan on tweets (actual time=0.223..0.844 rows=498 loops=1)
Recheck Cond: (coord <@ '<(-74,40.75),2>'::circle)
Heap Blocks: exact=460
-> Bitmap Index Scan on tweets_coord_gist (actual time=0.167..0.167 rows=498 loops=1)
Index Cond: (coord <@ '<(-74,40.75),2>'::circle)
Planning time: 0.103 ms
Execution time: 1.151 ms
(11 rows)
=> CREATE INDEX ON tweets USING gist(coord,created_at);
ERROR: data type timestamp with time zone has no default operator class for access method "gist" HINT: You must specify an operator class for the index or define a default operator class for the data type.
Встроенной поддержки обычных типов данных для GiST нет...
...но есть расширение:
=> CREATE EXTENSION btree_gist;
CREATE EXTENSION
=> CREATE INDEX tweets_coord_created_gist ON tweets USING gist(coord,created_at);
CREATE INDEX
Твиты в окрестности Нью-Йорка, "ближайшие к сегодня":
=> SELECT tweet #>> '{place,name}', created_at, left(message,40) FROM tweets WHERE coord <@ '-74.0, 40.75, 2'::circle ORDER BY created_at <-> now() LIMIT 10;
?column? | created_at | left ---------------+------------------------+------------------------------------------ Trenton | 2018-01-24 07:06:19+03 | S978 [NEW] "New Jersey All-Payer Claims Manhattan | 2018-01-24 05:28:13+03 | We are hopefully moving on past “build a Manhattan | 2018-01-24 00:35:29+03 | Want to work at Actus Consulting Group? New Jersey | 2018-01-23 23:32:32+03 | Want to work in #Fairfield, NJ? View our Manhattan | 2018-01-23 22:52:26+03 | Join the Accenture team! See our latest Philadelphia | 2018-01-23 22:04:23+03 | See our latest #Philadelphia, PA #job an Kearny | 2018-01-23 19:50:27+03 | We're #hiring! Read about our latest #jo Brooklyn | 2018-01-23 19:26:22+03 | This #job might be a great fit for you: Trenton | 2018-01-23 18:26:18+03 | A1098 [NEW] Requires newly manufactured New Brunswick | 2018-01-23 17:21:49+03 | Want to work at Kelly Services? We're #h (10 rows)
Запрос снова выполняется с помощью индекса.
=> EXPLAIN (costs off,analyze) SELECT * FROM tweets WHERE coord <@ '-74.0, 40.75, 2'::circle ORDER BY created_at <-> now() LIMIT 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Limit (actual time=0.161..0.442 rows=10 loops=1)
-> Index Scan using tweets_coord_created_gist on tweets (actual time=0.159..0.435 rows=10 loops=1)
Index Cond: (coord <@ '<(-74,40.75),2>'::circle)
Order By: (created_at <-> now())
Planning time: 0.124 ms
Execution time: 0.494 ms
(6 rows)
=> CREATE INDEX tweets_coord_quad_spgist ON tweets USING spgist(coord);
CREATE INDEX
Запрос использует созданный индекс:
=> EXPLAIN (costs off,analyze) SELECT * FROM tweets WHERE coord <@ box '36.62, 54.75, 38.62, 56.75';
QUERY PLAN
-----------------------------------------------------------------------------------------------
Bitmap Heap Scan on tweets (actual time=0.020..0.023 rows=3 loops=1)
Recheck Cond: (coord <@ '(38.62,56.75),(36.62,54.75)'::box)
Heap Blocks: exact=3
-> Bitmap Index Scan on tweets_coord_quad_spgist (actual time=0.014..0.014 rows=3 loops=1)
Index Cond: (coord <@ '(38.62,56.75),(36.62,54.75)'::box)
Planning time: 0.177 ms
Execution time: 0.046 ms
(7 rows)
Чтобы создать k-D-дерево с помощью SP-GiST, нужно явно указать класс операторов:
=> CREATE INDEX tweets_coord_kd_spgist ON tweets USING spgist(coord kd_point_ops);
CREATE INDEX
=> EXPLAIN (costs off,analyze) SELECT * FROM tweets WHERE coord <@ box '36.62, 54.75, 38.62, 56.75';
QUERY PLAN
---------------------------------------------------------------------------------------------
Bitmap Heap Scan on tweets (actual time=0.028..0.031 rows=3 loops=1)
Recheck Cond: (coord <@ '(38.62,56.75),(36.62,54.75)'::box)
Heap Blocks: exact=3
-> Bitmap Index Scan on tweets_coord_kd_spgist (actual time=0.022..0.022 rows=3 loops=1)
Index Cond: (coord <@ '(38.62,56.75),(36.62,54.75)'::box)
Planning time: 0.160 ms
Execution time: 0.059 ms
(7 rows)
В нашем случае разница между классами не ощущается.
Сравним размеры трех индексов:
=> SELECT pg_size_pretty(pg_total_relation_size('tweets_coord_gist')) as gist, pg_size_pretty(pg_total_relation_size('tweets_coord_quad_spgist')) as quad_spgist, pg_size_pretty(pg_total_relation_size('tweets_coord_kd_spgist')) as kd_spgist;
gist | quad_spgist | kd_spgist -------+-------------+----------- 20 MB | 14 MB | 14 MB (1 row)
Конец демонстрации.