Таблица:
=> \d tweets
Table "public.tweets" Column | Type | Collation | Nullable | Default ------------+--------------------------+-----------+----------+--------- id | bigint | | | coord | point | | | hashtags | text[] | | | message | text | | | created_at | timestamp with time zone | | | tweet | jsonb | | | Indexes: "tweets_tweet_gin" gin (tweet) "tweets_tweet_path" gin (tweet jsonb_path_ops)
Столбец tweet содержит исходный JSON, часть полей из которого вынесена в отдельные столбцы.
Размер таблицы:
=> SELECT count(*) FROM tweets;
count -------- 582643 (1 row)
=> SELECT pg_size_pretty(pg_table_size('tweets'));
pg_size_pretty ---------------- 1575 MB (1 row)
Вот наши точки:
=> SELECT id, coord FROM tweets WHERE coord IS NOT NULL LIMIT 10;
id | coord --------------------+--------------------------------- 952268627343888385 | (-84.3879824,33.7489954) 952268343175598080 | (-104.8771726,39.5807452) 953431814328107008 | (-117.94764155,33.71010495) 943747045000544256 | (145.053135344,-37.97256651425) 953420017630109697 | (-77.0143985,38.89860285) 951179952065425409 | (-73.9894807,40.7354199) 951159735033643009 | (-75.3599105,40.0462208) 951151295779414017 | (-77.4310992,38.8942786) 940564700500180993 | (-80.2781262,36.1341553) 940564668644450304 | (-77.7199932,39.6417629) (10 rows)
Есть ли кто из Москвы и окрестностей?
=> SELECT id, left(message,80), hashtags, tweet #>> '{place,name}' FROM tweets WHERE coord <@ box '36.62, 54.75, 38.62, 56.75' \gx
-[ RECORD 1 ]------------------------------------------------------------------------------ id | 943030276300705792 left | #EU considers launching database of Bitcoin owners to crack down on criminals' | hashtags | {EU} ?column? | Moscow -[ RECORD 2 ]------------------------------------------------------------------------------ id | 942987014651211782 left | @leastfixedpoint The actual git database can represent empty dirs … no problem t hashtags | {} ?column? | Moscow -[ RECORD 3 ]------------------------------------------------------------------------------ id | 951035902003998720 left | Why Microsoft’s Cosmos DB may displace AWS’s cloud databases https://t.co/EZGYKa hashtags | {} ?column? | Красногорск
Пока нет индекса, возможно только последовательное сканирование таблицы:
=> EXPLAIN (costs off,analyze) SELECT * FROM tweets WHERE coord <@ box '36.62, 54.75, 38.62, 56.75';
QUERY PLAN ----------------------------------------------------------------- Seq Scan on tweets (actual time=17.683..202.791 rows=3 loops=1) Filter: (coord <@ '(38.62,56.75),(36.62,54.75)'::box) Rows Removed by Filter: 582640 Planning time: 0.143 ms Execution time: 203.314 ms (5 rows)
Создадим индекс GiST.
Тип индекса указывается в предложении USING (а по умолчанию - B-дерево).
=> CREATE INDEX tweets_coord_gist ON tweets USING gist(coord);
CREATE INDEX
=> EXPLAIN (costs off,analyze) SELECT * FROM tweets WHERE coord <@ box '36.62, 54.75, 38.62, 56.75';
QUERY PLAN ---------------------------------------------------------------------------------------- Bitmap Heap Scan on tweets (actual time=0.668..0.674 rows=3 loops=1) Recheck Cond: (coord <@ '(38.62,56.75),(36.62,54.75)'::box) Heap Blocks: exact=3 -> Bitmap Index Scan on tweets_coord_gist (actual time=0.658..0.658 rows=3 loops=1) Index Cond: (coord <@ '(38.62,56.75),(36.62,54.75)'::box) Planning time: 9.273 ms Execution time: 1.924 ms (7 rows)
Теперь запрос выполняется с помощью индекса.
Как найти 10 твитов, сделанных из точек, максимально близких к Москве?
Можно постепенно увеличивать радиус... но на сколько?
GiST дает лучшее решение с помощью упорядочивающего оператора <->.
=> SELECT tweet #>> '{place,name}' AS location, left(message,60) AS message FROM tweets ORDER BY coord <-> point(37.62, 55.75) LIMIT 10;
location | message ---------------------+-------------------------------------------------------------- Moscow | @leastfixedpoint The actual git database can represent empty Moscow | #EU considers launching database of Bitcoin owners to crack Красногорск | Why Microsoft’s Cosmos DB may displace AWS’s cloud databases Коммунар | @_vpol_ нашел уже, graph database Ukraine | @StayDashy but seriously you must go to pornhub and make a d Ukraine | We are proud to announce that United States #Patent and Trad Ukraine | #ArcelorMittalKryvyiRih will invest 1.5 bln USD for moderniz Republic of Belarus | @abeisgreat @FirebaseRelease @Firebase i.e. all my database? Joensuu | 1.4 Billion Clear Text #Credentials Discovered in a Single D Joensuu | some of the last entries added to the leaked database. It’s (10 rows)
Запрос выполняется с помощью индекса:
=> EXPLAIN (costs off,analyze) SELECT * FROM tweets ORDER BY coord <-> point(37.62, 55.75) LIMIT 10;
QUERY PLAN ----------------------------------------------------------------------------------------------- Limit (actual time=0.065..0.164 rows=10 loops=1) -> Index Scan using tweets_coord_gist on tweets (actual time=0.064..0.158 rows=10 loops=1) Order By: (coord <-> '(37.62,55.75)'::point) Planning time: 0.079 ms Execution time: 0.190 ms (5 rows)
Далеко не так эффективно, но иногда полезно. Например, хотим показать твиты отсортированными по дате.
=> EXPLAIN (costs off,analyze) SELECT * FROM tweets WHERE coord <@ '-74.0, 40.75, 2'::circle ORDER BY created_at DESC LIMIT 10;
QUERY PLAN ------------------------------------------------------------------------------------------------------ Limit (actual time=1.102..1.109 rows=10 loops=1) -> Sort (actual time=1.101..1.103 rows=10 loops=1) Sort Key: created_at DESC Sort Method: top-N heapsort Memory: 23kB -> Bitmap Heap Scan on tweets (actual time=0.223..0.844 rows=498 loops=1) Recheck Cond: (coord <@ '<(-74,40.75),2>'::circle) Heap Blocks: exact=460 -> Bitmap Index Scan on tweets_coord_gist (actual time=0.167..0.167 rows=498 loops=1) Index Cond: (coord <@ '<(-74,40.75),2>'::circle) Planning time: 0.103 ms Execution time: 1.151 ms (11 rows)
=> CREATE INDEX ON tweets USING gist(coord,created_at);
ERROR: data type timestamp with time zone has no default operator class for access method "gist" HINT: You must specify an operator class for the index or define a default operator class for the data type.
Встроенной поддержки обычных типов данных для GiST нет...
...но есть расширение:
=> CREATE EXTENSION btree_gist;
CREATE EXTENSION
=> CREATE INDEX tweets_coord_created_gist ON tweets USING gist(coord,created_at);
CREATE INDEX
Твиты в окрестности Нью-Йорка, "ближайшие к сегодня":
=> SELECT tweet #>> '{place,name}', created_at, left(message,40) FROM tweets WHERE coord <@ '-74.0, 40.75, 2'::circle ORDER BY created_at <-> now() LIMIT 10;
?column? | created_at | left ---------------+------------------------+------------------------------------------ Trenton | 2018-01-24 07:06:19+03 | S978 [NEW] "New Jersey All-Payer Claims Manhattan | 2018-01-24 05:28:13+03 | We are hopefully moving on past “build a Manhattan | 2018-01-24 00:35:29+03 | Want to work at Actus Consulting Group? New Jersey | 2018-01-23 23:32:32+03 | Want to work in #Fairfield, NJ? View our Manhattan | 2018-01-23 22:52:26+03 | Join the Accenture team! See our latest Philadelphia | 2018-01-23 22:04:23+03 | See our latest #Philadelphia, PA #job an Kearny | 2018-01-23 19:50:27+03 | We're #hiring! Read about our latest #jo Brooklyn | 2018-01-23 19:26:22+03 | This #job might be a great fit for you: Trenton | 2018-01-23 18:26:18+03 | A1098 [NEW] Requires newly manufactured New Brunswick | 2018-01-23 17:21:49+03 | Want to work at Kelly Services? We're #h (10 rows)
Запрос снова выполняется с помощью индекса.
=> EXPLAIN (costs off,analyze) SELECT * FROM tweets WHERE coord <@ '-74.0, 40.75, 2'::circle ORDER BY created_at <-> now() LIMIT 10;
QUERY PLAN ------------------------------------------------------------------------------------------------------- Limit (actual time=0.161..0.442 rows=10 loops=1) -> Index Scan using tweets_coord_created_gist on tweets (actual time=0.159..0.435 rows=10 loops=1) Index Cond: (coord <@ '<(-74,40.75),2>'::circle) Order By: (created_at <-> now()) Planning time: 0.124 ms Execution time: 0.494 ms (6 rows)
=> CREATE INDEX tweets_coord_quad_spgist ON tweets USING spgist(coord);
CREATE INDEX
Запрос использует созданный индекс:
=> EXPLAIN (costs off,analyze) SELECT * FROM tweets WHERE coord <@ box '36.62, 54.75, 38.62, 56.75';
QUERY PLAN ----------------------------------------------------------------------------------------------- Bitmap Heap Scan on tweets (actual time=0.020..0.023 rows=3 loops=1) Recheck Cond: (coord <@ '(38.62,56.75),(36.62,54.75)'::box) Heap Blocks: exact=3 -> Bitmap Index Scan on tweets_coord_quad_spgist (actual time=0.014..0.014 rows=3 loops=1) Index Cond: (coord <@ '(38.62,56.75),(36.62,54.75)'::box) Planning time: 0.177 ms Execution time: 0.046 ms (7 rows)
Чтобы создать k-D-дерево с помощью SP-GiST, нужно явно указать класс операторов:
=> CREATE INDEX tweets_coord_kd_spgist ON tweets USING spgist(coord kd_point_ops);
CREATE INDEX
=> EXPLAIN (costs off,analyze) SELECT * FROM tweets WHERE coord <@ box '36.62, 54.75, 38.62, 56.75';
QUERY PLAN --------------------------------------------------------------------------------------------- Bitmap Heap Scan on tweets (actual time=0.028..0.031 rows=3 loops=1) Recheck Cond: (coord <@ '(38.62,56.75),(36.62,54.75)'::box) Heap Blocks: exact=3 -> Bitmap Index Scan on tweets_coord_kd_spgist (actual time=0.022..0.022 rows=3 loops=1) Index Cond: (coord <@ '(38.62,56.75),(36.62,54.75)'::box) Planning time: 0.160 ms Execution time: 0.059 ms (7 rows)
В нашем случае разница между классами не ощущается.
Сравним размеры трех индексов:
=> SELECT pg_size_pretty(pg_total_relation_size('tweets_coord_gist')) as gist, pg_size_pretty(pg_total_relation_size('tweets_coord_quad_spgist')) as quad_spgist, pg_size_pretty(pg_total_relation_size('tweets_coord_kd_spgist')) as kd_spgist;
gist | quad_spgist | kd_spgist -------+-------------+----------- 20 MB | 14 MB | 14 MB (1 row)
Конец демонстрации.