Implementing a CDC Performance Feature With the Help of ChatGPT

TiCDC is a change data capture (CDC) solution that replicates incremental data changes in real time for TiDB, a distributed SQL database. Previously, TiCDC communicated with downstream TiDB clusters using plain query commands, which led to higher CPU usage on SQL parsing and compiling. This issue was particularly significant in larger distributed database systems, where the increased workload could reduce performance and slow down data replication. Moreover, because plain queries never hit the plan cache, performance degraded even further.

Thanks to ChatGPT, a natural language processing and artificial intelligence tool from OpenAI, I was able to resolve these issues. I implemented a TiCDC performance feature that improves the communication between TiCDC and downstream TiDB clusters.

What Do We Need From the TiCDC Performance Feature?

The TiCDC performance feature has clear goals: lower CPU usage on SQL parsing and compiling, and a higher plan cache hit rate, which together improve the communication between TiCDC and downstream TiDB clusters. This feature would help optimize TiCDC performance and ensure faster, more reliable data replication across multiple TiDB clusters.
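
To make that concrete, here is a minimal sketch in Go (using the standard `database/sql` package and the go-sql-driver/mysql driver; the DSN and table are illustrative, not from TiCDC) of the difference between sending a plain query, which TiDB must parse and compile every time, and executing a prepared statement, which can hit the plan cache on repeated runs:

```go
package main

import (
	"database/sql"

	_ "github.com/go-sql-driver/mysql" // MySQL wire-protocol driver, also used for TiDB
)

func main() {
	// Illustrative DSN; 4000 is TiDB's default port.
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Plain query: sent as a text-protocol query command, so TiDB parses
	// and compiles the statement on every call; the plan cache is not hit.
	if _, err := db.Exec("UPDATE sbtest1 SET k = k + 1 WHERE id = 42"); err != nil {
		panic(err)
	}

	// Prepared statement: prepared once, then executed with only the
	// parameters sent (StmtExecute), so repeated runs can reuse the cached plan.
	stmt, err := db.Prepare("UPDATE sbtest1 SET k = k + 1 WHERE id = ?")
	if err != nil {
		panic(err)
	}
	defer stmt.Close()
	if _, err := stmt.Exec(42); err != nil {
		panic(err)
	}
}
```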

Seeking Help from ChatGPT

My role at PingCAP is QA engineer. Due to limited development resources, I took on the development work, despite not being a full-time developer. This presented additional challenges as I lacked the experience and knowledge to design and implement such a feature.

On the other hand, I recognized how important this feature could be to TiCDC’s performance and reliability, so I was eager to take on the challenge. To bridge my knowledge gap, I turned to ChatGPT, which proved to be an invaluable resource. By leveraging the power of ChatGPT’s natural language processing capabilities, I quickly gained an understanding of the requirements and challenges involved in developing the performance feature. This allowed me to work more effectively with other team members and stakeholders and ultimately deliver a high-quality performance feature that met the needs of the TiCDC project.

Initial Results and Challenges

The module that communicates with the downstream database in TiCDC is the MySQL sink. After reviewing TiCDC’s code, I identified the main logic of the MySQL sink in the `execDMLWithMaxRetries` function. I shared the code in mysql.go with ChatGPT and asked it to rewrite the code to implement a statement cache.

ChatGPT provided a helpful starting point for my analysis, and I was able to gain a deeper understanding of the function’s purpose and logic.

ChatGPT’s initial code prompt

It even reminded me to properly initialize and release the cache map. Nice!

```go
// Initialize the statement cache if it hasn't been initialized yet
if s.stmtCache == nil {
	s.stmtCache = make(map[string]*sql.Stmt)
}

// Release the map in the Close interface
stmts := make(map[string]*sql.Stmt)
defer func() {
	for _, stmt := range stmts {
		stmt.Close()
	}
}()
```

Making the Code Work

When started, the MySQL sink initializes a connection pool to communicate with downstream databases. The size of the connection pool is determined by the worker count configuration. The pool is represented by the `DB` type defined in Go’s `database/sql` package. To ensure transactional semantics when syncing data changes to a downstream database, the MySQL sink uses the database transaction type `Tx`.
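
As a rough sketch of that setup (the package, function, and variable names here are illustrative, not TiCDC’s actual code), the pool is sized from the worker count and each batch of changes runs inside a `sql.Tx`:

```go
package sink

import (
	"database/sql"

	_ "github.com/go-sql-driver/mysql"
)

// newPool sketches how the sink could size its connection pool from the
// worker count configuration.
func newPool(dsn string, workerCount int) (*sql.DB, error) {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(workerCount) // pool size follows the worker count
	db.SetMaxIdleConns(workerCount)
	return db, nil
}

// applyBatch applies one batch of DML inside a transaction to preserve
// transactional semantics downstream; retry logic is omitted.
func applyBatch(db *sql.DB, dmls []string) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	for _, dml := range dmls {
		if _, err := tx.Exec(dml); err != nil {
			_ = tx.Rollback()
			return err
		}
	}
	return tx.Commit()
}
```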

In ChatGPT’s first response, the cached statements were created by `DB.Prepare`. However, statements created by `DB.Prepare` cannot be executed directly by the `Tx` type. After reviewing Go’s `database/sql` package, I discovered that I could use the `Tx.Stmt` function to return a transaction-specific prepared statement from an existing statement. I then told ChatGPT about the `Tx.Stmt` function, and it provided the following example:

ChatGPT’s proposed code sample to integrate with Tx.Stmt
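
A minimal sketch of that integration, assuming a per-sink `stmtCache` map keyed by the SQL text (the type and field names are illustrative, not TiCDC’s actual ones): the statement is prepared once on the `*sql.DB` and cached, and `Tx.Stmt` then binds the cached statement to the current transaction before execution.

```go
package sink

import (
	"database/sql"
	"sync"
)

// mysqlSink sketches the cache-plus-Tx.Stmt pattern.
type mysqlSink struct {
	db        *sql.DB
	mu        sync.Mutex
	stmtCache map[string]*sql.Stmt // keyed by SQL text
}

// getStmt returns a cached prepared statement, preparing it on first use.
func (s *mysqlSink) getStmt(query string) (*sql.Stmt, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.stmtCache == nil {
		s.stmtCache = make(map[string]*sql.Stmt)
	}
	if stmt, ok := s.stmtCache[query]; ok {
		return stmt, nil
	}
	stmt, err := s.db.Prepare(query)
	if err != nil {
		return nil, err
	}
	s.stmtCache[query] = stmt
	return stmt, nil
}

// execInTx binds the cached statement to the transaction with Tx.Stmt and
// executes it. The transaction-specific statement is closed when the
// transaction ends, while the cached statement survives for reuse.
func (s *mysqlSink) execInTx(tx *sql.Tx, query string, args ...interface{}) error {
	stmt, err := s.getStmt(query)
	if err != nil {
		return err
	}
	_, err = tx.Stmt(stmt).Exec(args...)
	return err
}
```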

After I put the code together, the initial testing showed that the new code used a prepared statement interface as expected. You can see the code here.

A Happy Initial Result With 100% Plan Cache Hit

I used Sysbench, a popular open-source tool for benchmarking MySQL-compatible databases, to compare the baseline with the new code.

The performance overview dashboard shown below illustrates the impact of the new implementation on TiCDC’s downstream communication with TiDB. The updated code, shown on the right side of the dashboard, uses the StmtExecute command and achieves a plan cache hit rate of 100%. In contrast, the baseline implementation on the left side uses only the query command, resulting in zero plan cache hits.

These performance metrics demonstrate the effectiveness of the updated implementation. It not only improves plan cache hit rates but also optimizes downstream communication between TiCDC and TiDB.

The Performance Overview dashboard for the downstream TiDB cluster

Performance Degradation of the First Version

The chart below illustrates the performance metrics of TiCDC’s baseline implementation (left side) and the updated code (right side). Contrary to expectations, the new code has degraded performance. Queries per Second (QPS) have decreased, and query duration has increased.

Query duration for downstream TiDB Cluster

The new code cached the prepared statements as expected, but the total number of prepared statements per TiDB instance was more than 1 million. This resulted in high CPU usage and approximately 27 GB of memory usage.

TiDB dashboard for downstream TiDB clusters

The TiDB heap profile chart also revealed that the cached prepared statements used most of the memory.

TiDB heap profile

The TiDB CPU profile indicated significant CPU usage in the memory tracker and the Golang runtime, due to high memory pressure.

TiDB CPU profile

Solving the Performance Issue

Analyzing the Root Cause

In this particular scenario, the worker count is 128. The MySQL sink sizes its connection pool accordingly, so there are 128 connections. The Sysbench workload contains 32 tables, and for each table there are three types of update statements. Each statement is cached per worker and prepared across all the connections, which leads to an estimated total of 1,572,864 (128 × 128 × 32 × 3) prepared statements. As a result, the memory usage on each TiDB instance is approximately 27 GB.

The snippet below is a minimal sketch of how the statement cache is initialized for each worker (the type and field names are illustrative, not TiCDC’s actual ones); because every worker owns its own cache, the same statements end up prepared again and again.
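
```go
package sink

import "database/sql"

// mysqlWorker sketches the first version's layout: every worker owns its
// own statement cache, so the same SQL text is cached once per worker.
type mysqlWorker struct {
	db        *sql.DB
	stmtCache map[string]*sql.Stmt
}

func newWorker(db *sql.DB) *mysqlWorker {
	return &mysqlWorker{
		db:        db,
		stmtCache: make(map[string]*sql.Stmt), // one cache per worker
	}
}

// Because database/sql prepares a statement lazily on each connection it
// runs on, 128 workers x 128 connections x 32 tables x 3 statement shapes
// yields roughly 1,572,864 prepared statements on the downstream TiDB.
```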
