PostgreSQL13逻辑复制实战 1:N的主从数据库表的复制

从MySQL转投到PostgreSQL已经有5年了,去年开始用到逻辑复制功能, 当时从一台主数据库,通过Wireguard的网络,复制数据到从数据库。 一年的使用很平稳,现在需要增加一台从数据库,所以需要重新梳理一下逻辑复制的实现。

当我使用原来的从数据库的备份,作为新的从数据库时,发现无法正常复制数据,因此,进行了一面的实验,发现了一些问题,这里记录一下。

目前的架构

主数据库有replication_slot mysub, 有mypub,mypub2两个发布

 1\c mando
 2select * from pg_replication_slots;
 3 slot_name |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size 
 4-----------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------
 5 mysub     | pgoutput | logical   |  16385 | mando    | f         | t      |        862 |      |        21060 | 0/E5B50C0   | 0/E5B50F8           | reserved   |              
 6(1 row)
 7
 8
 9select * from pg_publication;
10  oid  | pubname | pubowner | puballtables | pubinsert | pubupdate | pubdelete | pubtruncate | pubviaroot 
11-------+---------+----------+--------------+-----------+-----------+-----------+-------------+------------
12 16656 | mypub   |       10 | f            | t         | t         | t         | t           | f
13 41305 | mypub2  |       10 | f            | t         | t         | t         | t           | f

实验1

建立一个已经存在的mysub上,这是第一个目前正在工作的从数据库的代码,看有什么情况出现

1CREATE SUBSCRIPTION mysub CONNECTION 'dbname=mando host=10.0.0.1 user=traffic password=password' PUBLICATION mypub WITH (connect = false, slot_name = 'mysub');
1
2//从机报错
32024-03-22 21:10:41.749 CST [22072] ERROR:  could not start WAL streaming: ERROR:  replication slot "mysub" is active for PID 862
42024-03-22 21:10:41.751 CST [457] LOG:  background worker "logical replication worker" (PID 22072) exited with exit code 1
5
6
7//主机日志报错
82024-03-22 13:11:06.926 UTC [14310] traffic@mando ERROR:  replication slot "mysub" is active for PID 862
92024-03-22 13:11:06.926 UTC [14310] traffic@mando STATEMENT:  START_REPLICATION SLOT "mysub" LOGICAL 0/0 (proto_version '1', publication_names '"mypub"')

结论: 不能建立同名的slot,因为原来的mysub已经被占用了

实验2

将从数据库的slot_name改为mysub2,看有什么情况出现, 这里mysub是从数据库的subscription, slot_name=mysub2,是主数据库的slot_name

1ALTER SUBSCRIPTION mysub SET (slot_name = "mysub2");
2ALTER SUBSCRIPTION mysub ENABLE;
1//从机报错
22024-03-22 20:55:43.453 CST [21784] ERROR:  could not start WAL streaming: ERROR:  replication slot "mysub2" does not exist
32024-03-22 20:55:43.455 CST [457] LOG:  background worker "logical replication worker" (PID 21784) exited with exit code 1
4
5//主机日志报错
62024-03-22 12:58:52.756 UTC [14105] traffic@mando ERROR:  replication slot "mysub2" does not exist
72024-03-22 12:58:52.756 UTC [14105] traffic@mando STATEMENT:  START_REPLICATION SLOT "mysub2" LOGICAL 0/0 (proto_version '1', publication_names '"mypub"')

结论: 主数据库的subscription不能通过alter来建立.

另外,下面的指令,也会提示上面一样的错误.

1CREATE SUBSCRIPTION mysub2 CONNECTION 'dbname=mando host=10.0.0.1 user=traffic password=password' PUBLICATION mypub WITH (connect = false, slot_name = 'mysub2');

删除原来的subscription, 再删除从数据库

为了干净的进行第三次实验,先删除原来的subscription,再重新建立数据库.

1\c mando2
2alter subscription mysub disable;
3alter subscription mysub set (slot_name = none);
4drop subscription mysub;
5\c postgres
6drop database mando2;
7create database mando2;

上面的步骤必须一步一步执行,否则会报错,因为有先后顺序的限制.

实验3

在从机上创建subscription

1CREATE SUBSCRIPTION sub_test
2   CONNECTION 'dbname=mando host=10.0.0.1 user=traffic password=password' PUBLICATION mypub 
3   WITH (create_slot = false);
1//主机日志报错
22024-03-22 23:48:03.200 UTC [45382] traffic@mando ERROR:  replication slot "sub_test" does not exist
32024-03-22 23:48:03.200 UTC [45382] traffic@mando STATEMENT:  START_REPLICATION SLOT "sub_test" LOGICAL 0/0 (proto_version '1', publication_names '"mypub"')
4
5//从机上的日志
62024-03-23 07:49:49.285 CST [9079] LOG:  logical replication apply worker for subscription "sub_test" has started
72024-03-23 07:49:50.590 CST [9081] LOG:  logical replication table synchronization worker for subscription "sub_test", table "login_devices" has started
82024-03-23 07:49:50.602 CST [9082] LOG:  logical replication table synchronization worker for subscription "sub_test", table "users" has started

实验4

在从机上运行, 下面的指令会在主机上创建replication_slot sub_testx

1\c mando2
2CREATE SUBSCRIPTION sub_testx
3   CONNECTION 'dbname=mando host=10.0.0.1 user=traffic password=password' PUBLICATION mypub 
4   WITH (create_slot = true);

在主机上,找到了sub_testx

1mando=# SELECT * FROM pg_replication_slots;
2         slot_name          |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size 
3----------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------
4 mysub                      | pgoutput | logical   |  16385 | mando    | f         | t      |        862 |      |        21073 | 0/E5D7778   | 0/E5DF068           | reserved   |              
5 sub_test                   | pgoutput | logical   |  16385 | mando    | f         | f      |            |      |        21073 | 0/E5D76D0   | 0/E5D7708           | reserved   |              
6 sub_testx                  | pgoutput | logical   |  16385 | mando    | f         | t      |      46668 |      |        21073 | 0/E5D7778   | 0/E5DF068           | reserved   |  

但是,从机的users表格,没有数据.

 1//从机上的日志
 22024-03-23 08:09:29.609 CST [9415] LOG:  logical replication apply worker for subscription "sub_testx" has started
 32024-03-23 08:09:30.912 CST [9416] LOG:  logical replication table synchronization worker for subscription "sub_testx", table "login_devices" has started
 42024-03-23 08:09:30.923 CST [9417] LOG:  logical replication table synchronization worker for subscription "sub_testx", table "users" has started
 5
 62024-03-23 08:18:56.249 CST [9414] postgres@mando2 ERROR:  subscription "sub_testx" already exists
 72024-03-23 08:18:56.249 CST [9414] postgres@mando2 STATEMENT:  CREATE SUBSCRIPTION sub_testx
 8	   CONNECTION 'dbname=mando host=10.0.0.1 user=traffic password=password' PUBLICATION mypub 
 9	   WITH (create_slot = true);
102024-03-23 08:30:30.764 CST [9416] ERROR:  could not start initial contents copy for table "public.login_devices": server closed the connection unexpectedly
11		This probably means the server terminated abnormally
12		before or while processing the request.
13	server closed the connection unexpectedly
 1//主机上的日志
 22024-03-23 00:09:30.832 UTC [46668] traffic@mando STATEMENT:  START_REPLICATION SLOT "sub_testx" LOGICAL 0/0 (proto_version '1', publication_names '"mypub"')
 32024-03-23 00:09:30.832 UTC [46668] traffic@mando LOG:  logical decoding found consistent point at 0/E5D7708
 42024-03-23 00:09:30.832 UTC [46668] traffic@mando DETAIL:  There are no running transactions.
 52024-03-23 00:09:30.832 UTC [46668] traffic@mando STATEMENT:  START_REPLICATION SLOT "sub_testx" LOGICAL 0/0 (proto_version '1', publication_names '"mypub"')
 62024-03-23 00:09:32.164 UTC [46671] traffic@mando LOG:  logical decoding found consistent point at 0/E5D7740
 72024-03-23 00:09:32.164 UTC [46671] traffic@mando DETAIL:  There are no running transactions.
 82024-03-23 00:09:32.164 UTC [46671] traffic@mando STATEMENT:  CREATE_REPLICATION_SLOT "sub_testx_16749_sync_16690" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
 92024-03-23 00:09:32.189 UTC [46672] traffic@mando LOG:  logical decoding found consistent point at 0/E5D7778
102024-03-23 00:09:32.189 UTC [46672] traffic@mando DETAIL:  There are no running transactions.
112024-03-23 00:09:32.189 UTC [46672] traffic@mando STATEMENT:  CREATE_REPLICATION_SLOT "sub_testx_16749_sync_16710" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
12
132024-03-23 00:26:39.078 UTC [46672] traffic@mando STATEMENT:  COPY public.users TO STDOUT
142024-03-23 00:26:45.222 UTC [46671] traffic@mando LOG:  could not receive data from client: Connection timed out
152024-03-23 00:26:45.222 UTC [46671] traffic@mando LOG:  unexpected EOF on client connection with an open transaction

结论: 实验4创建了一个新的slot,但是,从机上的数据没有同步过来,原因是从机上的连接超时了.经过一段时间分析, 估计是主机的配质问题.

解决问题

从日志可以看出,实验4创建了一个新的slot,经过一段时间的分析,可能是发送的进程资源不够. 修改主数据库 /etc/postgresql/13/main/postgresql.conf 设置如下参数,重启数据库,问题解决.

1max_wal_senders = 20           # max number of walsender processes
2
3max_replication_slots = 20     # max number of replication slots

默认的设置都是10,显然是不够第二个从数据库使用的.

其他有用的设置

Decrease wal_sender_timeout from 30 seconds (default) to 15 seconds to ensure that idle WAL sender processes are terminated more quickly, which frees up resources for active replication.

Decrease wal_receiver_timeout from 30 seconds (default) to 15 seconds to ensure that idle WAL receiver processes are terminated more quickly, which frees up resources for active replication.

Increase max_logical_replication_workers from 4 (default) to 8 to ensure that there are enough logical replication worker processes to keep up with replication demand.

创建发布指令

 1
 2CREATE PUBLICATION mypub2 WITH (publish = 'insert, update, delete, truncate');
 3
 4
 5ALTER PUBLICATION mypub2 OWNER TO postgres;
 6
 7--
 8-- Name: mypub login_devices; Type: PUBLICATION TABLE; Schema: public; Owner: postgres
 9--
10
11ALTER PUBLICATION mypub ADD TABLE ONLY public.login_devices;
12
13
14--
15-- Name: mypub2 login_devices; Type: PUBLICATION TABLE; Schema: public; Owner: postgres
16--
17
18ALTER PUBLICATION mypub2 ADD TABLE ONLY public.login_devices;

用户权限

1GRANT ALL ON TABLE public.login_devices TO traffic;
2
3GRANT ALL ON TABLE public.users TO traffic;