MySQL CDC Pipeline Connector

The MySQL CDC Pipeline Connector allows for reading snapshot data and incremental data from MySQL database and provides end-to-end full-database data synchronization capabilities. This document describes how to setup the MySQL CDC Pipeline connector.

How to create Pipeline

The pipeline for reading data from MySQL and sink to Doris can be defined as follows:

source:
   type: mysql
   name: MySQL Source
   hostname: 127.0.0.1
   port: 3306
   username: admin
   password: pass
   tables: adb.\.*, bdb.user_table_[0-9]+, [app|web].order_\.*
   server-id: 5401-5404

sink:
  type: doris
  name: Doris Sink
  fenodes: 127.0.0.1:8030
  username: root
  password: pass

pipeline:
   name: MySQL to Doris Pipeline
   parallelism: 4

Pipeline Connector Options

Option Required Default Type Description
hostname required (none) String IP address or hostname of the MySQL database server.
port optional 3306 Integer Integer port number of the MySQL database server.
username required (none) String Name of the MySQL database to use when connecting to the MySQL database server.
password required (none) String Password to use when connecting to the MySQL database server.
tables required (none) String Table name of the MySQL database to monitor. The table-name also supports regular expressions to monitor multiple tables that satisfy the regular expressions.
It is important to note that the dot (.) is treated as a delimiter for database and table names. If there is a need to use a dot (.) in a regular expression to match any character, it is necessary to escape the dot with a backslash.
eg. db0.\.*, db1.user_table_[0-9]+, db[1-2].[app|web]order_\.*
schema-change.enabled optional true Boolean Whether to send schema change events, so that downstream sinks can respond to schema changes and achieve table structure synchronization.
server-id optional (none) String A numeric ID or a numeric ID range of this database client, The numeric ID syntax is like '5400', the numeric ID range syntax is like '5400-5408', The numeric ID range syntax is recommended when 'scan.incremental.snapshot.enabled' enabled. Every ID must be unique across all currently-running database processes in the MySQL cluster. This connector joins the MySQL cluster as another server (with this unique ID) so it can read the binlog. By default, a random number is generated between 5400 and 6400, though we recommend setting an explicit value.
scan.incremental.snapshot.chunk.size optional 8096 Integer The chunk size (number of rows) of table snapshot, captured tables are split into multiple chunks when read the snapshot of table.
scan.snapshot.fetch.size optional 1024 Integer The maximum fetch size for per poll when read table snapshot.
scan.startup.mode optional initial String Optional startup mode for MySQL CDC consumer, valid enumerations are "initial", "earliest-offset", "latest-offset", "specific-offset" and "timestamp". Please see Startup Reading Position section for more detailed information.
scan.startup.specific-offset.file optional (none) String Optional binlog file name used in case of "specific-offset" startup mode
scan.startup.specific-offset.pos optional (none) Long Optional binlog file position used in case of "specific-offset" startup mode
scan.startup.specific-offset.gtid-set optional (none) String Optional GTID set used in case of "specific-offset" startup mode
scan.startup.specific-offset.skip-events optional (none) Long Optional number of events to skip after the specific starting offset
scan.startup.specific-offset.skip-rows optional (none) Long Optional number of rows to skip after the specific starting offset
connect.timeout optional 30s Duration The maximum time that the connector should wait after trying to connect to the MySQL database server before timing out.
connect.max-retries optional 3 Integer The max retry times that the connector should retry to build MySQL database server connection.
connection.pool.size optional 20 Integer The connection pool size.
jdbc.properties.* optional 20 String Option to pass custom JDBC URL properties. User can pass custom properties like 'jdbc.properties.useSSL' = 'false'.
heartbeat.interval optional 30s Duration The interval of sending heartbeat event for tracing the latest available binlog offsets.
debezium.* optional (none) String Pass-through Debezium's properties to Debezium Embedded Engine which is used to capture data changes from MySQL server. For example: 'debezium.snapshot.mode' = 'never'. See more about the Debezium's MySQL Connector properties
scan.incremental.close-idle-reader.enabled optional false Boolean Whether to close idle readers at the end of the snapshot phase.
The flink version is required to be greater than or equal to 1.14 when 'execution.checkpointing.checkpoints-after-tasks-finish.enabled' is set to true.
If the flink version is greater than or equal to 1.15, the default value of 'execution.checkpointing.checkpoints-after-tasks-finish.enabled' has been changed to true, so it does not need to be explicitly configured 'execution.checkpointing.checkpoints-after-tasks-finish.enabled' = 'true'

Startup Reading Position

The config option scan.startup.mode specifies the startup mode for MySQL CDC consumer. The valid enumerations are:

  • initial (default): Performs an initial snapshot on the monitored database tables upon first startup, and continue to read the latest binlog.

  • earliest-offset: Skip snapshot phase and start reading binlog events from the earliest accessible binlog offset.

  • latest-offset: Never to perform snapshot on the monitored database tables upon first startup, just read from the end of the binlog which means only have the changes since the connector was started.

  • specific-offset: Skip snapshot phase and start reading binlog events from a specific offset. The offset could be specified with binlog filename and position, or a GTID set if GTID is enabled on server.

  • timestamp: Skip snapshot phase and start reading binlog events from a specific timestamp.

Data Type Mapping

MySQL type CDC type NOTE
TINYINT(n) TINYINT
SMALLINT
TINYINT UNSIGNED
TINYINT UNSIGNED ZEROFILL
SMALLINT
INT
YEAR
MEDIUMINT
MEDIUMINT UNSIGNED
MEDIUMINT UNSIGNED ZEROFILL
SMALLINT UNSIGNED
SMALLINT UNSIGNED ZEROFILL
INT
BIGINT
INT UNSIGNED
INT UNSIGNED ZEROFILL
BIGINT
BIGINT UNSIGNED
BIGINT UNSIGNED ZEROFILL
SERIAL
DECIMAL(20, 0)
FLOAT
FLOAT UNSIGNED
FLOAT UNSIGNED ZEROFILL
FLOAT
REAL
REAL UNSIGNED
REAL UNSIGNED ZEROFILL
DOUBLE
DOUBLE UNSIGNED
DOUBLE UNSIGNED ZEROFILL
DOUBLE PRECISION
DOUBLE PRECISION UNSIGNED
DOUBLE PRECISION UNSIGNED ZEROFILL
DOUBLE
NUMERIC(p, s)
NUMERIC(p, s) UNSIGNED
NUMERIC(p, s) UNSIGNED ZEROFILL
DECIMAL(p, s)
DECIMAL(p, s) UNSIGNED
DECIMAL(p, s) UNSIGNED ZEROFILL
FIXED(p, s)
FIXED(p, s) UNSIGNED
FIXED(p, s) UNSIGNED ZEROFILL
where p <= 38
DECIMAL(p, s)
NUMERIC(p, s)
NUMERIC(p, s) UNSIGNED
NUMERIC(p, s) UNSIGNED ZEROFILL
DECIMAL(p, s)
DECIMAL(p, s) UNSIGNED
DECIMAL(p, s) UNSIGNED ZEROFILL
FIXED(p, s)
FIXED(p, s) UNSIGNED
FIXED(p, s) UNSIGNED ZEROFILL
where 38 < p <= 65
STRING The precision for DECIMAL data type is up to 65 in MySQL, but the precision for DECIMAL is limited to 38 in Flink. So if you define a decimal column whose precision is greater than 38, you should map it to STRING to avoid precision loss.
BOOLEAN
TINYINT(1)
BIT(1)
BOOLEAN
DATE DATE
TIME [(p)] TIME [(p)]
TIMESTAMP [(p)] TIMESTAMP_LTZ [(p)]
DATETIME [(p)] TIMESTAMP [(p)]
CHAR(n) CHAR(n)
VARCHAR(n) VARCHAR(n)
BIT(n) BINARY(⌈(n + 7) / 8⌉)
BINARY(n) BINARY(n)
VARBINARY(N) VARBINARY(N)
TINYTEXT
TEXT
MEDIUMTEXT
LONGTEXT
STRING
TINYBLOB
BLOB
MEDIUMBLOB
LONGBLOB
BYTES Currently, for BLOB data type in MySQL, only the blob whose length isn't greater than 2,147,483,647(2 ** 31 - 1) is supported.
ENUM STRING
JSON STRING The JSON data type will be converted into STRING with JSON format in Flink.
SET - Not supported yet.
GEOMETRY
POINT
LINESTRING
POLYGON
MULTIPOINT
MULTILINESTRING
MULTIPOLYGON
GEOMETRYCOLLECTION
STRING The spatial data types in MySQL will be converted into STRING with a fixed Json format. Please see MySQL Spatial Data Types Mapping section for more detailed information.

MySQL Spatial Data Types Mapping

The spatial data types except for GEOMETRYCOLLECTION in MySQL will be converted into Json String with a fixed format like:

{"srid": 0 , "type": "xxx", "coordinates": [0, 0]}

The field srid identifies the SRS in which the geometry is defined, SRID 0 is the default for new geometry values if no SRID is specified. As only MySQL 8+ support to specific SRID when define spatial data type, the field srid will always be 0 in MySQL with a lower version.

The field type identifies the spatial data type, such as POINT/LINESTRING/POLYGON.

The field coordinates represents the coordinates of the spatial data.

For GEOMETRYCOLLECTION, it will be converted into Json String with a fixed format like:

{"srid": 0 , "type": "GeometryCollection", "geometries": [{"type":"Point","coordinates":[10,10]}]}

The field geometries is an array contains all spatial data.

The example for different spatial data types mapping is as follows:

Spatial data in MySQL Json String converted in Flink
POINT(1 1) {"coordinates":[1,1],"type":"Point","srid":0}
LINESTRING(3 0, 3 3, 3 5) {"coordinates":[[3,0],[3,3],[3,5]],"type":"LineString","srid":0}
POLYGON((1 1, 2 1, 2 2, 1 2, 1 1)) {"coordinates":[[[1,1],[2,1],[2,2],[1,2],[1,1]]],"type":"Polygon","srid":0}
MULTIPOINT((1 1),(2 2)) {"coordinates":[[1,1],[2,2]],"type":"MultiPoint","srid":0}
MultiLineString((1 1,2 2,3 3),(4 4,5 5)) {"coordinates":[[[1,1],[2,2],[3,3]],[[4,4],[5,5]]],"type":"MultiLineString","srid":0}
MULTIPOLYGON(((0 0, 10 0, 10 10, 0 10, 0 0)), ((5 5, 7 5, 7 7, 5 7, 5 5))) {"coordinates":[[[[0,0],[10,0],[10,10],[0,10],[0,0]]],[[[5,5],[7,5],[7,7],[5,7],[5,5]]]],"type":"MultiPolygon","srid":0}
GEOMETRYCOLLECTION(POINT(10 10), POINT(30 30), LINESTRING(15 15, 20 20)) {"geometries":[{"type":"Point","coordinates":[10,10]},{"type":"Point","coordinates":[30,30]},{"type":"LineString","coordinates":[[15,15],[20,20]]}],"type":"GeometryCollection","srid":0}

FAQ