title

text

Yandong Yao
Yandong Yao Beijing Siwei Zongheng Data Technology Co., Ltd. CEO
11:20 23 March
40 min

YMatrix Domino: Design Considerations, Trade-offs, and Implementation of In-Database Stream Processing

YMatrix is a distributed, multi-model database built on PostgreSQL. Domino is its native in-database stream processing engine, enabling true unified batch and stream processing. This talk explores the design concepts, technology selection, and implementation of in-database stream processing in YMatrix, aiming to address the complexity, resource overhead, and consistency challenges of traditional external streaming architectures. Background and Motivation In conventional data architectures, data is extracted from OLTP systems and processed either through T+1 batch pipelines or external stream processing engines (such as Flink) before being loaded into data marts. This approach suffers from limited real-time capabilities, high system complexity, data redundancy, and potential consistency risks. In-database stream processing offers a more efficient alternative by simplifying system architecture, reducing resource consumption, strengthening data lineage, and improving flexibility. Core Design GoalsThe system is designed to support incremental processing (distinct from materialized views), near real-time latency (sub-second or configurable), multi-stage stream concatenation, minimal impact on OLTP workloads (better than trigger-based solutions), eventual consistency, and high-throughput processing. Meanwhile, it provides full SQL-level semantics, avoiding the need for user-defined code. Technical Implementation - SQL Syntax DesignThe SQL grammar is extended with CREATE STREAM ... AS SELECT ... STREAMING to define stream objects that subscribe to incremental changes from upstream tables. The semantics resemble materialized views but support continuous incremental updates. - Incremental Data CaptureLogical Decoding is used to create replication slots for capturing WAL changes. Combined with initial historical snapshots to initialize stream tables, background workers continuously execute incremental query plans to synchronize streaming tables. - Progress ManagementStream progress (restart_lsn / confirm_lsn) is maintained in shared memory and atomically updated on transaction commit. Integration with XLog and checkpoint mechanisms ensures reliable crash recovery. - Plugin FrameworkThe framework supports three primary SQL patterns: single-stream join with dimension tables (domino_one), single-stream aggregation (domino_agg), and dual-stream Inner Join (domino_join). Each plugin matches SQL query patterns and generates corresponding incremental execution plans. - Update HandlingLogical Decoding outputs DELETE/INSERT markers, and primary-key conflict handling (INSERT ... ON CONFLICT) enables incremental propagation of upstream updates and deletes. For aggregation, inverse transition functions (aggmtransfn / aggminvtransfn) are introduced to support retraction. Dual-stream join scenarios further support arbitrary update/delete operations on both input streams. BenefitsCompared with external streaming engines, this approach significantly simplifies system architecture, reduces resource costs (cutting down serialization and network transmission overhead), strengthens data consistency guarantees, and lowers the user adoption barrier through pure SQL semantics.