Talks and tutorials
In this talk, I will tell you the story of how a bunch of sysadmins got sick of having to resuscitate their petabyte-sized Elasticsearch cluster and decided to replace it with some tried technologies: PostgreSQL, Kafka, a bit of Redis, lots of glue, and the typical sysadmin stubbornness. The result is Bagger: the sysadmin answer to Big Data. A fast, fairly reliable, fault-tolerant store, used mostly for logging timestamped events for some amount of time. Bagger is named the Bagger series of bucket-wheel excavators, feats of German engineering and some of the largest land vehicles ever produced by man. Just like the excavators that dig through tons of material, our Bagger digs through tons data.
WAL-G is simple and effective disaster recovery tool for PostgreSQL using cloud storages. In its core functionality, WAL-G is the successor of WAL-E rewritten in Go. But there is one new neat feature - delate-backups. WAL-G delta-backups, whenever possible, stores only pages, changed since the previous backup. In this talk, I'm going to describe development process of this feature.
Surprisingly, most important and complicated question was the design of the interface: WAL-e is simple and comprehensive, keeping these properties was goal #1. Technical details of implementation were covering some underwater stones too. Besides these, I want to discuss the perspective of technological development and future coordination of recovery tools developers.
Last year we announced Skala-SR / Postgres Pro database machine with a hardware and software support for remote direct memory access (RDMA) as a key feature. The first appliances have been already installed at customer sites, and even with the first version customers got some constructions, that were impossible without RDMA and CPU offload (got with Mellanox networking). However, capabilities of this equipment are much wider, and this talk is dedicated to current works and prospective developments for this topic.
PostgreSQL looks very competitive with other mainstream databases on OLTP workload (execution of large number of simple queries). But on OLAP queries, requiring processing of larger volumes of data, DBMS-es oriented on analytic queries processing can provide an order of magnitude better speed. The following factors limit Postgres OLAP performance:
- Unpacking tuple overhead (tuple_deform)
- Interpretation overhead (Postgres executor has to interpret query execution plan)
- Abstraction penalty (support of abstract data types)
- Pull model overhead (operators are pulling tuples from heap page one-by-one, resulting numerous repeated accesses to the page)
- MVCC overhead (extra per-tuple storage + visibility check cost)
All this issues can be solved using vectorized executor, which proceed bulk of values at once. In this presentation I will show how vector operations can be implemented in Postgres as standard Postgres extension, not affecting Postgres core. The approach is based on introducing special types: tile types, which can be used instead of normal (scalar) types and implement vector operations. Postgres extension mechanism, such as UDT (user-defined type), FDW (foreign data wrappers), executor hooks are used to let users work with vectorized tables almost in the same way as with normal tables. But more than 10 times faster because of vector operations.
This talk explores the ways attackers with no authorized database access can steal Postgres passwords, see database queries and results, and even intercept database sessions and return false data. Postgres supports features to eliminate all of these threats, but administrators must understand the attack vulnerabilities to protect against them. This talk covers all known Postgres external attack methods.
It's not a secret for anyone that statistics can not be transferred with a major upgrade. For small and not heavily loaded databases this is not a problem, you can quickly collect new statistics. But we have databases with a volume of about 5TB and a load of about 100k rps, for which it became a big problem: taking off without statistics, the replicas could not even replay WAL. In my report I'll tell you what tricks we went to upgrade these databases with requirements of 100% read only availability, about what mistakes were made, and about how these errors were painfully corrected. The result of these errors was the extension called "pg_dirty_hands", in which we will collect various hacks, which can be last resort to repair data corruption.
In the last six months I have been working with a massive OLAP environment with 20TB shards, spanning around 400TB of data. Come to listen to how we make it all work, the challenges, and the skills involved. This talk has very little in common with the 10TB and Beyond talk because the data environments are very different.
We will cover analytics performance, data alignment, reasons for building extensions in C, and moving data around between servers in multiple data centers.
Jsonb is a popular data type in PostgreSQL, it provides the web developers an ability to work with ubiquitous json inside the database and use all the power of proven relational database. Fast querying of jsonb data is a challenge for database and PostgreSQL provides several options for indexing jsonb. We present the new way of efficient indexing of jsonb, based on improvement of indexing infrastructure.
It's known, that json is a greedy data type, it may contains many auxiliary data not interesting for searching and that affects the size of index. Partial index will not helps, since it filters the rows before indexing, while we are interested in extracting of parts of jsonb. Functional indexes on specific keys could introduce too big overhead. We present an improvement of indexing infrastructure, which allows to control the index behaviour by passing parameters to operator class at index creation. For example, to index a user-defined subset of jsonb it is possible to pass to operator class the powerful path expression (either jsonpath of upcoming sql/json or jspath from jsquery extension), which can be used to extract the parts of jsonb tree. That makes index more effective and reduces the overhead of its maintaining.
Another use of parameterized operator classes is to allow a user to specify parameters instead of hard coding them, for example, the GiST signature size is currently hard coded inside the implementations of several opclasses (tsvector, hstore, intarray, pg_trgm, ltree), while it is natural to use different signature length for different data to have optimal size of index and its performance.
Full text search on parts of document can be improved by passing labels to the operator class and letting him index only specified parts of document, that allow to avoid currently used recheck of the rows returned by the index.
I would like to share experience in runing PostgreSQL in dockerized environments, describe the specific issues and tools you will need to solve them.
- Which problems could be solved by Docker for PostgreSQL, e.g. PostgreSQLPro.9.6
Work of IT team with Docker in development, testing and production environments
- Using image repository and build servers for image testing
- Issues in production environment:
- With network activity
- With persistent repositories for Docker
- With additional services
- With load balancing and fail-safety
- Running PostgreSQL-base applications, such as:
- 1С platform
Avito is the biggest classified site of Russia, and the third largest classified site in the world (after Craigslist of USA and 58.com of China). In Avito, ads are stored in PostgreSQL databases. At the same time, for many years already the logical replication is actively used. With its help, the following issues are successfully solved: the growth of data volume and growth of number of requests to it, the scaling and distribution of the load, the delivery of data to the DWH and the search subsystems, inter-base and internetwork data synchronization etc. But nothing happens "for free" - at the output we have a complex distributed system. Hardware failures can happen - it is natural - you need to be always ready for it. There is plenty of samples of logical replication configuration and lots of success stories about using it. But with all this documentation there is nothing about samples of the recovery after crashes and data corruptions, moreover there are no ready-made tools for it. Over the years of constantly using PgQ replication, we have gained extensive experience, rethought a lot, implemented our own add-ins and extensions to restore and synchronize data after crashes in distributed data processing systems. In this report, we would like to show how our experience can be shifted to a new logical replication subsystem in 10th version of PostgreSQL. In the current implementation, these are only non-trivial solutions - there is a number of issues for the community, that come down to implementing simple recovery mechanisms - as simple as configuring the replication in 10th version.
In this talk I will share the experience of migration, NoSQL data to a relational view convert, and tell you how we managed to speed up the application in 2 times.
Initially, we used PosgtgreSQL and MongoDB to store all the data. In practice, we found out that the separation is extremely inconvenient. We spent a lot of time and attention on it.
My talk will describe how we moved the data from MongoDB to PostgreSQL using mosql. As a result, all the data can be received by one query now, and the table schema ensures data consistency.
The speaker will share his experience of transferring high-loaded game projects that were originally developed to work with MySQL into PostgreSQL. Problems that had been expected and ones that were actually faced. Tricky bugs that were found in production servers after the migration. Solutions that were used to deal with problems. PostgreSQL features that were very useful and desirable ones that were missing.
When one faces the issues with PostgreSQL, the main suspicion falls on vacuum. Experience of Data Egret team proves how many DBAs are attacking this rake. While there are tons of information, documentation and discussions on vacuum itself, the topic is still associated with a lot of myths, tales, horror stories and misconceptions. In my talk I will try to reveal the key points concerning the inner structure of vacuum, basic approaches to its adjustment and tuning, performance monitoring, and so on.
Tutorial on Full Text Seach in PostgreSQL, containing all recent improvemets. All recipies necessary for building an application will be given: dictionary and parser configuration, faceted search, fuzzy search, multilanguage search, ranking etc. Participants will be provided with a test database for exercises.
"And telling GIN from SP-GIST was quite beyond his wit, we found", said the classic. Can you? This masterclass is about not-so-often used index types (compared to conventional B-tree) which however can do a great job for you. We will look into internal mechanics of these indexes and discuss cases where they can be successfully applied. Also we will talk about some peculiarities of PostgreSQL index access. To spend time efficiently, listeners are required to have basic knowledge of PostgreSQL and should be used to read plans of simple queries.
Materials of the master class
Backup copy of the database with demo data can be downloaded here:
- Recovery with pg_restore (338 MB)
A database is one of the key components in any information system, requiring the monitoring of multiple metrics. The talk highlights examples and approaches of monitoring and analysis of PostgreSQL performance that allow to minimize the load on the database server from the monitoring and data collection system for the subsequent analysis of problem situations.
Storing binary data in database tables is sometimes a good solution for a particular project. But sometimes, due to changes in conditions or insufficient consideration of decisions, such storage is becoming a real nightmare. If there is an understanding of how and where to place these data, the transition to the new solutions are often very hard, often require modification in the application code and downtime the system for migration. The presentation is a particular solution of such problems. Our extension allows to move binary data from database to the storage Ceph and not only. And does it seamless for the applications.
MQTT is an effective data exchange protocol for IOT devices. Based on own modified EMQTT plug-in, the IoT project architecture uses PostgreSQL as a central processing and storing system for data coming from sensors in real time. The report will provide an example of the IoT hardware and software platform solution, based up on the MQTT protocol, where PostgreSQL is responsible for key functionality, providing processing, collecting and storage of data from a distributed network of IoT devices.
How to optimize query processing in PostgreSQL? What if we deal with hundreds of servers and thousands of instances? The company "Tensor" has developed a special tool - explain.sbis.ru, which enables synchronous collection and analysis of queries, visualization of implementation plans, and monitoring of bugs in database.
For the majority of System Administrators and DBAs performing an upgrade for RDBMS, let alone a major one, is a pain. That’s because one of the key factors that plays a role in a decision if and when to perform an upgrade is the downtime that it might come to during the process. This is true for any databases but especially important for those that are in production or under a high load.
Often, a major upgrade get’s cancelled and a DBA needs to go back to an older version due to the lack of experience or some basic errors that could have been easily avoided at the planning stage.
In our consultancy, we perform upgrades for our clients regularly and it allowed us to streamline the process and take some preventative measures that help us to perform it quickly, efficiently and with minimal or no downtime.
In this talk, I will share some key steps and tools that will help any DBA to become better at major upgrade performance. I will answer the following questions:
How to prepare for an upgrade of PostgreSQL? What one needs to do at the planning stage? How to plan your actions during the actual upgrade process? How to perform an upgrade successfully without going back to the older version? What actions one must perform following an upgrade?
I will also go through the two most popular processes of an upgrade: pg_upgrade и pg_dump/pg_restore, will compare some of the benefits and downfalls using each of these. I will also discuss some of the main issues one might face throughout the process and ways to avoid them.
This talk would be of interest to those who are new to PostgreSQL, as well as experienced DBAs who would like to learn more about upgrades or those who, in general, would like to understand why major upgrades should NOT be avoided like the plague.
Bringing the provability and immutability of blockchain to performance and efficiency of traditional DBMS.
Blockchain technology has several unique properties including provability and immutability. Every blockchain transaction is signed by its author, and it could be verified by any blockchain network member. Also, once data is stored in blockchain, it can't be altered in the future. Many databases operating traditional DBMS would also benefit from provability and immutability properties. However, inclusion of all the transaction data in the public blockchain is very expensive.
Credereum is the platform, which allows creation and maintaining of databases, whose contents and history are provable and immutable without sacrifice the performance and efficiency of traditional DBMS. Thanks to Credereum, database owner can prove the validity of query results, while users can verify them. Database owner don't have to reveal the whole database contents or full history of transactions to provide the proof of database query results. Therefore, Credereum database may contain private sensitive information. Credereum utilized bleeding-edge technologies including, but not limited to decentralized cloud, public blockchain with sharding. Credereum is an emerging technology of trusted and private databases.
We will explain why PostgreSQL is suitable database for Credereum and what we need to develop in Postgres to support signed transactions and cryptographic storage.
Azure Database for PostgreSQL is a managed database service built for developers using the community PostgreSQL database technology you love. Learn how you can leverage the managed service and the key capabilities that the service offers which enables you as a developer to focus on developing apps. Using customer stories and experiences including live demos, we will walk through best practices and demonstrate how this service integrates with the other Azure services.
One of advantages of document-oriented databases like MongoDB or Couchbase over RDBMSs is an ability to change the data scheme easily, fast and often. The traditional approach in RDBMS world involves doing an expensive ALTER TABLE operation, slow upgrade of an existing data, and stuff like this. This approach is often slow and inconvenient for application developers.
To solve this issue PostgreSQL provides JSON and JSONB datatypes. Also there are extensions like zson and pg_protobuf. From this talk you will learn how to work with these datatypes and extensions, their pros and cons and also related future work in this area.
It’s 3am. Your phone rings. PostgreSQL is down, you need to promote a replica to master. Why the h**l isn’t this automatic?
If you thought of this before, you want automatic High Availability (HA). Don’t miss this talk! We will enter the world of Modern PostgreSQL HA.
Good news, there are several new, “modern” solutions for PostgreSQL HA. However, there are several solutions and it's not easy to pick one. Most require non-trivial setups, and there are many small caveats about HA like how to provide entry points to the application, HA correctness, HA vs. read scaling, external dependencies, interaction with cloud environments, and so forth.
Join this talk to master PostgreSQL HA and how to deploy it on current times.
This report is about my PostgreSQL extension pg_profile. This extension all you need to create periodic statistics snapshots and to keep them. You can build a report on one or many serial snapshots. This report will contain statistics information about database workload in specified time period. It is very useful to start investigation on performance degragation or excessive resoure consumption in the past.
Lately, the PostgreSQL is more widely being used for Enterprise. Our company, Ingram Micro Cloud, is one of the first companies that did it. We have been using PostgreSQL for many years as the main DBMS for our products. In our report, we want to tell about the evolution of our High Availability (HA) PostgreSQL cluster. We will tell about how quickly we implemented the solution using pgpool-II, wrote failover scripts, tested Postgres-XL, and came up with unusual configurations of Stolon. We will also cover the problems of load balancing, pooling connections, and backups.
Pluggable storages is hot subject in PostgreSQL development. The period of heated debates about whether we need them is over. Skepticism about pluggable storages, based on concern that they may be source of inconsistent behavior, was weakened after criticism of PostgreSQL MVCC implementation from Uber side. It became widely understandable that pluggable storages are needed at least for an alternative MVCC implementation. And that is one of way-points for pluggable storages interface design.
At the moment, work on pluggable storages is in the practical stage. There is a thread is pgsql-hackers where few people are developing patchset and several people are doing review.
This talk will cover following subjects:
- overview of pluggable storages interface;
- changes in PostgreSQL core required to implement this interface;
- current and potential implementations of pluggable storages including heap with undo-log and in-memory OLTP engine;
- current state of patchset and prospective of its commit
- further development of interface allowing more possibilities in pluggable storages (columnar, index-organized, LSM and so on).
For over 15 years JetBrains have strived to make effective developer tools. IntelliJ IDEA for java is the most popular of them. IntelliJ supports databases as well, and finally we in JetBrains decided to bring this functionality and our experience in programming languages to the world of SQL. That’s how DataGrip appeared. In my talk I will tell how DataGrip helps developers by automating routine checks and corrections and speeds up developer’s production. I will also show how to extend IDE’s functionality and try to understand which actual problems we can solve in future.
DBeaver is the Universal Databases Manager. We creating DBeaver using Open Source Software model. We pay special attention to the PotgreSQL because of its capabilities, popularity and OSS nature. We will talk about DBeaver evolution and structure, will demonstrate the basic functionality. We are going to cover challenges of PostgreSQL client creation. Also we will discuss requirements management model and interaction with the PostgreSQL community. We will show in details how to debug the PL/pgSQL stored procedures interactively. We are going to introduce some new features of the nearest DBeaver release and to share our plans.
Patroni is a Python application to create high-availability PostgreSQL clusters based on the streaming replication. It is used by Red Hat, IBM Compose, Zalando and many other companies. This tutorial will highlight Patroni architecture, provide attendees with hands-on experience of configuring high-availability PostgreSQL clusters with Patroni, describe how to take advantage of numerous additional features and give an opportunity to learn more about common mistakes related to running Patroni and its troubleshooting.
In order to take most out of the Patroni tutorial one needs a laptop with git, vagrant and virtual box installed.
Alternatively, one can install your Linux distribution packages (or use homebrew on Mac).
Once Vagrant and Virtualbox are installed one can run the Patroni VM by issuing the following commands:
$ git clone https://github.com/alexeyklyukin/patroni-training $ cd patroni-training $ vagrant up
When the setup concludes Patroni box can be accessed via ssh using vagrant ssh command.
- Spatial and advanced spatial analysis with pure PostGIS (including cutting edge PostGIS functions available)
- How could we mix and tied efficiently PostgreSQL and Python data types (as NumPy ndarray, and Pandas DataFrames)
- Tools to improve our data manipulation environment (Jupyter tricks, easy dataviz...)
- How to go further throught GeoDataScience, with Python libs and framework tied with PostgreSQL/PostGIS (including Machine and DeepLearning)
Liquibase is a very convenient tool for sequential database migration, used both on our projects and in a large number of other projects and frameworks. It allows you to keep the code of the database together with the application code in VSC, track the attempts of repeated migrations and much, much more. But sooner or later the project grows, the data occupy terabytes, and liquibase still rolls the migration sequentially.
We could not afford to deploy migrations for 100 hours and came up with a framework for liquibase that expanded its capabilities and allowed to execute a whole series of scripts in parallel or to split one large migration into small partitions and migrate them in parallel.
This talk will compare architectural decisions that are made in PostgreSQL vs. ORACLE and will provide a closer look at the following components of both DBMSs:
- The ins and outs of the working DBMS, its processes and their function
- Structures that DBMS manages
- Durability mechanics of each respective DBMS
- MVCC design and database restoration options
- Storage of data on the physical media
Each architectural decision will be evaluated based on the experience with DBMS of choice, ease of administration and future improvement possibilities.
This review will demonstrate the notable strengths of PostgreSQL as an open-source DBMS compared to the commercial solution in many cases.
This talk will be interesting for:
- PostgreSQL users, as it will allow to take a closer look into an alternative DBMS;
- PostgreSQL administrators, that will be able to see huge administration possibilities that ORACLE offers and that could be adopted in PostgreSQL;
- PostgreSQL hackers, as Postgres is being actively developed and this talk will review new development segments;
- Those who are willing to migrate from ORACLE (or any other commercial DBMS) into an open-source project, as this talk will show the features of PostgreSQL compared to the commercial product.
Rambler & Co is a lot of publications, services and projects. Appear new and grow existing. This environment requires a reliable, fault-tolerant, scalable, automated system.
I'll tell you about the structure of our PostgreSQL SaaS, what tools and technologies we use. Quorum of 3 Data Centers. A single entry point for clients based on dynamic routing. Emergency switching of the primary server. Transparent scaling for reading. Create a replica without load on the cluster. Transparent transfer of PostgreSQL cluster to other servers. Update dev environment from prod for development. Backup with compression and the use of multiple CPUs on the side of the database, the restoration of one database from basebackup. Monitoring of sql queries.
Migration Technology of client-server applcations from Oracle to PostgreSQL: Principles, approaches and features.Talk
The report reviews approaches and implementattion options for the migration of Parus-Budget 8 client-server application from the Oracle Database platform to the PostgreSQL platform without changing the client application for Desktop and the Web. The proposed solution allows to make transparent transition of the existing users' workplaces.
PostGIS is well known and widely used since two decades, as the best OpenSource database solution for Spatial Analysis. This talk will focus on: spatial and advanced spatial analysis with pure PostGIS (including cutting edge PostGIS functions available); how to go further throught GeoDataScience, with Python libs and framework tied with PostgreSQL/PostGIS (including Machine and DeepLearning)
Postgres has a nice feature called Point-in-time Recovery (PITR) that would allow you to go back in time. In this talk, we will discuss what are the use-cases of PITR, how to prepare your database for PITR by setting good base backup and WAL shipping setups, with some examples. We will expand the discussion with how to achieve PITR if you have a distributed and sharded Postgres setup by mentioning challenges such as clock differences and ways to overcome them, such as two-phase commit and pg_create_restore_point.
If you honestly evaluate most of our business applications, you will see that they first collect and import the data into a database and then send the same data in the opposite direction.
What if we don't build an ORM wall between the application and the database, but try using the symbiosis of their strong points and special features instead?
I will tell you how we use PostgreSQL and Clojure for building data-intensive medical applications. We will cover the following topics:
- functional relational programming
- jsonb for modeling complex data domains
- functional indexes and json-knife extension for jsonb search
- graphql implementation on PostgreSQL
- logical replication for building reactive integrations
- asynchronous JDBC-free connector to PostgreSQL on netty
«KNOPKA» («The button»), the largest outsourcing of accounting, will tell you about your experience of transferring 500Gb of 1C Fresh databases to PostgreSQL. The report touches on the choice of DBMS and the fight against stereotypes, where we started and how developed the use of PostgreSQL, the approaches to backup and recovery of information security, our current performance of DBMS and 1C, a look into the future. Let us share why we believe that 1C + PostgreSQL is simple, reliable and fast.
Many of us know that it is MVCC that provides concurrency access to data in many relational databases that guarantee transactions consistency and isolation. But only deep understanding of the implementation of this mechanism in PostgreSQL allows us to better understand the processes in the database, to design the application logic and table structures to be the most effective in a high-load world. We'll take one of the processes in our product to understand how MVCC is implemented in PostgreSQL and we'll explain one of the peculiarity when seemingly unrelated activities can affect each other.
PostGIS is a spatial extension to PostgreSQL that enables spatial datatypes, access methods and a set of functions to perform geometric operations on them.
Typically PostGIS is used to select a small subset of a big static dataset. In this talk I'll cover issues that arise when working with big dynamic data flows, and ways to resolve them, on examples that we've met developing Juno ride sharing service backend.
Apart from its main purpose of scheduling tasks, pgpro_scheduler can also deal with chained transactions. It can be used in various scenarios of asynchronous data processing.
This tutorial demonstrates pgpro_scheduler features that ensure secure processing of chained transactions. We'll be using cryptocurrency transactions as an example.
pgpro_scheduler is included into Postgres Pro Enterprise as an extension.
Developers are often challenged to deliver results that are hard to implement using simple SQL queries. Fortunately, complex SQL capabilities exist in the SQL standards — common table expressions and window functions.
SQL is a declarative language, meaning the user submits an SQL command and the database determines the optimal execution. Common Table Expressions (CTEs) allow queries to be more imperative, allowing looping and processing hierarchical structures that are normally associated only with imperative languages.
Normal SQL queries return rows where each row is independent of the other returned rows. SQL window functions allow queries to return computed columns based on values in other rows in the result set.
This tutorial will help developers use CTE queries in their applications and allow operations that normally could only be done in application code to be done via SQL queries. It also explains the many window function facilities and how they can be used to produce useful SQL query results.