博文

PG中JSONB存储和访问效率（PG JSON系列5）

已有 9593 次阅读 2016-10-18 04:12 |个人分类:postgresql|系统分类:科研笔记| postgresql, jsonb

前4篇博文，大概写了一下如何使用PG的JSONB类型：

这些数据量都还没上规模，这一篇写写上规模之后的JSONB的存储和访问效率。

测试环境：CENTOS 7 64bit, PG 9.6.0，三星SSD 840，CPU 酷睿G640,2.8GHz, VirtualBox 5.1.6

我们的第1个问题是：

相对于普通表，JSONB的存储大小是否过大？还是变化不大。

Alexander Korotkov在 https://www.pgcon.org/2015/schedule/track/Hacking/784.en.html 说：

The first problem of jsonb is its size overhead (4-5 times) in comparison with storing decomposed json data in multiple plain tables, whereas binary format of jsonb has very low storage overhead with regard to a plain text (<4%). The overhead comes mainly from redundant storing of keys names, which can be quite long, in each document.

可以这么理解：二进制形式存储的jsonb的存储比普通表大4-5倍，事实如何？我们来做个实验。

回到组合使用PostgreSQL的ARRAY和JSONB数据类型（PG JSON系列3）的例子，即照片中的每个人的信息，如下图：

如果用普通表设计为（我们略作修改，把attendanceYear，graduationYear设置为数值型）：

-- Table: R_PIC_PERSON_PLAINTABLE

-- 照片中的人的信息表,用普通关联表实现

DROP TABLE IF EXISTS R_PIC_PERSON_PLAINTABLE CASCADE;

CREATE TABLE R_PIC_PERSON_PLAINTABLE (

id serial,-- 照片id

position text, -- 如"前1排左1"

name text,--如"张三"

phone text,

city text,

organization text,

attendanceYear numeric,

graduationYear numeric

)WITH (OIDS=FALSE);

我们使用两种JSONB的存储形式，第1种：

-- Table: R_PIC_PERSON_JSONB_NOARRAY

-- 照片中的人的信息表，用JSONB实现，非JSONB数组形式，即一行对应普通表的一行

DROP TABLE IF EXISTS R_PIC_PERSON_JSONB_NOARRAY CASCADE;

CREATE TABLE R_PIC_PERSON_JSONB_NOARRAY (

id serial,-- 照片id

person jsonb NOT NULL -- 格式为：{"position":"前1排左1","name":"张晓陆","phone":"18910180100","city":"济南","organization":"中科院计算所","attendanceYear":2012,"graduationYear":2016}

)WITH (OIDS=FALSE);

第2种：

-- Table: R_PIC_PERSON_JSONB_USEARRAY

-- 照片中的人的信息表，用JSONB实现，JSONB数组形式，即一行对应普通表的多行

DROP TABLE IF EXISTS R_PIC_PERSON_JSONB_USEARRAY CASCADE;

CREATE TABLE R_PIC_PERSON_JSONB_USEARRAY (

id serial,-- 照片id

persons jsonb NOT NULL -- 格式为：[{"position":"前1排左1","name":"张晓陆","phone":"18910180100","city":"济南","organization":"中科院计算所","attendanceYear":2012,"graduationYear":2016},{"position":"前1排左2","name":"张晓陆","phone":"18910180100","organization":"中科院计算所","attendanceYear":2012,"graduationYear":2016}]

)WITH (OIDS=FALSE);

第1种JSONB的形式，一行对应一行普通表，好比1张照片中有20个人，则分别有20行。第2种JSONB的形式，即一张照片只有一行。

然后我们灌测试数据（参考francs的 PostgreSQL9.4: jsonb 性能测试）：

20万条，普通表R_PIC_PERSON_PLAINTABLE（假设平均1张照片20个人）：

insert into R_PIC_PERSON_PLAINTABLE

select r/20,

round(random()*200000) || '_position',

round(random()*200000) || '_name',

round(random()*200000) || '_phone',

round(random()*200000) || '_city',

round(random()*200000) || '_organization',

round(random()*200000) ,

round(random()*200000)

from generate_series(1,200000) as r;

20万条，第1类JSONB表R_PIC_PERSON_JSONB_NOARRAY：

insert into R_PIC_PERSON_JSONB_NOARRAY

SELECT id, row_to_json(R_PIC_PERSON_PLAINTABLE)::jsonb - 'id' from R_PIC_PERSON_PLAINTABLE;

假设平均一张照片20个人，插入20万/20=1万条JSONB表R_PIC_PERSON_JSONB_USEARRAY：

insert into R_PIC_PERSON_JSONB_USEARRAY

select id,array_to_json(array_agg(person))

from R_PIC_PERSON_JSONB_NOARRAY

group by id;

上面是基于20万条，我们分别就2000，2万，20万，200万，2000万，分别做几组实验，分别用：

jsontestdb=> dt+ R_PIC_PERSON_PLAINTABLE;

jsontestdb=> dt+ R_PIC_PERSON_JSONB_NOARRAY;

jsontestdb=> dt+ R_PIC_PERSON_JSONB_USEARRAY;

获得各自的存储大小，得出如下表：

	2K	2万	20万	200万	2000万
PLAINTABLE	248	2264	23	240	2521
JSONB_NOARRAY	496	4856	49	488	5042
JSONB_USEARRAY	136	1176	13	156	异常
单位	KB	KB	MB	MB	MB

最后2000万那列的R_PIC_PERSON_JSONB_USEARRAY表，生成时出现了异常（回头查一下原因，大家也测试）：

LOG: redo starts at 2/5918E7B8

invalid record length at 2/591902D8: wanted 24, got 0

redo done at 2/5918E7B8

结论：总体上来讲本例中，在没有建索引的情况下，R_PIC_PERSON_JSONB_NOARRAY表大概是普通的R_PIC_PERSON_PLAINTABLE表的2倍大的空间，R_PIC_PERSON_JSONB_USEARRAY表则为R_PIC_PERSON_JSONB_NOARRAY表的50-65%左右。

那我们看看建立索引的情况下的大小，由于JSONB可建立Btree，Hash和GIN索引，我们先看看在三张表上均建立Btree索引的情况：

CREATE INDEX idx_btree_plaintbl ON R_PIC_PERSON_PLAINTABLE USING BTREE (name);

CREATE INDEX idx_btree_jsonb_noarray ON R_PIC_PERSON_JSONB_NOARRAY USING BTREE ((person->'name'));

CREATE INDEX idx_btree_jsonb_userarray ON R_PIC_PERSON_JSONB_USEARRAY USING BTREE ((persons));

获得各自的索引存储大小，得出如下表：

	2K	2万	20万	200万
PLAINTABLE	0.078	0.617	6.039	60.17
JSONB_NOARRAY	0.093	0.789	7.773	77.48
JSONB_USEARRAY	0.156	1.609	19.539	244.11
单位	MB	MB	MB	MB

Hash索引就不测了，然后在JSONB上测一下GIN：

CREATE INDEX idx_btree_plaintbl ON R_PIC_PERSON_PLAINTABLE USING BTREE (name);

CREATE INDEX idx_gin_jsonb_noarray ON R_PIC_PERSON_JSONB_NOARRAY USING GIN ((person->'name'));

CREATE INDEX idx_gin_jsonb_userarray ON R_PIC_PERSON_JSONB_USEARRAY USING GIN ((persons));

获得各自的索引存储大小，得出如下表：

	2K	2万	20万	200万
PLAINTABLE	0.078	0.617	6.039	60.17
JSONB_NOARRAY	0.109	0.960	9.578	62.58
JSONB_USEARRAY	0.617	6.117	44.97	476.8
单位	MB	MB	MB	MB

综上，在单一字段上建立的Btree和GIN索引，R_PIC_PERSON_PLAINTABLE表和R_PIC_PERSON_JSONB_NOARRAY表建立索引占用的存储空间略大一些（如上表中标红色的部分），但差别不大。

然后我们比较一下GIN索引在R_PIC_PERSON_JSONB_NOARRAY和R_PIC_PERSON_JSONB_USEARRAY表上的所有字段上建立的索引，所占空间有何区别。由于2K,2万，20万，200万呈现出良好的线性，所以这回我们仅对200万行记录进行测试。

CREATE INDEX idx_gin_jsonb_all_noarray ON R_PIC_PERSON_JSONB_NOARRAY USING GIN ((person));

CREATE INDEX idx_gin_jsonb_all_userarray ON R_PIC_PERSON_JSONB_USEARRAY USING GIN ((persons));

结果如下：

				200万
JSONB_NOARRAY				481.3906
JSONB_USEARRAY				475.4844
单位				MB

发现二者区别不大，JSONB_NOARRAY全字段索引所占空间略大些。

我们的第2个问题是：

2.JSONB的建立索引后，访问效率提升是多少？

在上面200万行的记录里，创建了如下全字段索引后:

CREATE INDEX idx_gin_jsonb_all_noarray ON R_PIC_PERSON_JSONB_NOARRAY USING GIN ((person));

CREATE INDEX idx_gin_jsonb_all_userarray ON R_PIC_PERSON_JSONB_USEARRAY USING GIN ((persons));

在R_PIC_PERSON_JSONB_NOARRAY 上面执行：

explain analyze select * from R_PIC_PERSON_JSONB_NOARRAY where person @> '{"name":"1349018_name"}';

QUERY PLAN

---------------------------------------

Bitmap Heap Scan on r_pic_person_jsonb_noarray (cost=59.50..6912.05 rows=2000 width=222

) (actual time=0.167..0.170 rows=2 loops=1)

Recheck Cond: (person @> '{"name": "1349018_name"}'::jsonb)

Heap Blocks: exact=2

-> Bitmap Index Scan on idx_gin_jsonb_all_noarray (cost=0.00..59.00 rows=2000 width=

0) (actual time=0.155..0.155 rows=2 loops=1)

Index Cond: (person @> '{"name": "1349018_name"}'::jsonb)

Planning time: 0.065 ms

Execution time: 0.196 ms

(7 rows)

Time: 0.593 ms

在R_PIC_PERSON_JSONB_USEARRAY 上面执行：

explain analyze select * from R_PIC_PERSON_JSONB_USEARRAY where personS @> '[{"name":"1349018_name"}]';

QUERY PLAN

-----------------------------

Bitmap Heap Scan on r_pic_person_jsonb_usearray (cost=44.78..424.81 rows=100 width=1365

) (actual time=0.099..0.114 rows=2 loops=1)

Recheck Cond: (persons @> '[{"name": "1349018_name"}]'::jsonb)

Heap Blocks: exact=2

-> Bitmap Index Scan on idx_gin_jsonb_all_userarray (cost=0.00..44.75 rows=100 width

=0) (actual time=0.077..0.077 rows=2 loops=1)

Index Cond: (persons @> '[{"name": "1349018_name"}]'::jsonb)

Planning time: 0.066 ms

Execution time: 0.140 ms

(7 rows)

Time: 0.571 ms

二者均走了索引。

下面这个SQL语句，没走索引：

针对R_PIC_PERSON_JSONB_NOARRAY ：

explain analyze SELECT count(*) FROM R_PIC_PERSON_JSONB_NOARRAY a WHERE (a.person->>'attendanceyear')::integer <1998;

QUERY PLAN

--------------

Aggregate (cost=104167.56..104167.57 rows=1 width=8) (actual time=1356.895..1356.895 ro

ws=1 loops=1)

-> Seq Scan on r_pic_person_jsonb_noarray a (cost=0.00..102500.90 rows=666665 width=

0) (actual time=0.224..1355.766 rows=1991 loops=1)

Filter: (((person ->> 'attendanceyear'::text))::integer < 1998)

Rows Removed by Filter: 1998009

Planning time: 0.052 ms

Execution time: 1356.922 ms

(6 rows)

Time: 1357.409 ms

针对R_PIC_PERSON_JSONB_USEARRAY：

explain analyze WITH rows_filtered AS (

SELECT DISTINCT id FROM (

select a.id, b.* from R_PIC_PERSON_JSONB_USEARRAY a, jsonb_to_recordset(a.persons) as b(position text,name text, phone text, city text, organization text, attendanceYear numeric, graduationYear numeric)

) c WHERE c.attendanceYear < 1998)

SELECT count(*) FROM R_PIC_PERSON_JSONB_USEARRAY a WHERE a.id IN (SELECT * FROM rows_filtered);

QUERY PLAN

------------------------

Aggregate (cost=666790.68..666790.69 rows=1 width=8) (actual time=5306.279..5306.279 ro

ws=1 loops=1)

CTE rows_filtered

-> Unique (cost=626533.97..643034.14 rows=100001 width=4) (actual time=5214.361..5

215.161 rows=1975 loops=1)

-> Sort (cost=626533.97..634784.06 rows=3300033 width=4) (actual time=5214.3

59..5214.633 rows=1991 loops=1)

Sort Key: a_1.id

Sort Method: quicksort Memory: 142kB

-> Nested Loop (cost=0.00..179001.59 rows=3300033 width=4) (actual tim

e=1.558..5212.525 rows=1991 loops=1)

-> Seq Scan on r_pic_person_jsonb_usearray a_1 (cost=0.00..21000

.01 rows=100001 width=1365) (actual time=0.002..140.555 rows=100001 loops=1)

-> Function Scan on jsonb_to_recordset b (cost=0.00..1.25 rows=3

3 width=0) (actual time=0.050..0.050 rows=0 loops=100001)

Filter: (attendanceyear < '1998'::numeric)

Rows Removed by Filter: 20

-> Hash Join (cost=2254.52..23631.54 rows=50000 width=0) (actual time=5217.060..5305

.864 rows=1975 loops=1)

Hash Cond: (a.id = rows_filtered.id)

-> Seq Scan on r_pic_person_jsonb_usearray a (cost=0.00..21000.01 rows=100001

width=4) (actual time=0.028..68.248 rows=100001 loops=1)

-> Hash (cost=2252.02..2252.02 rows=200 width=4) (actual time=5216.963..5216.9

63 rows=1975 loops=1)

Buckets: 2048 (originally 1024) Batches: 1 (originally 1) Memory Usage:

86kB

-> HashAggregate (cost=2250.02..2252.02 rows=200 width=4) (actual time=5

216.258..5216.586 rows=1975 loops=1)

Group Key: rows_filtered.id

-> CTE Scan on rows_filtered (cost=0.00..2000.02 rows=100001 width

=4) (actual time=5214.363..5215.643 rows=1975 loops=1)

Planning time: 0.171 ms

Execution time: 5306.357 ms

(21 rows)

Time: 5307.197 ms

结论：建立JSONB的索引后，对字符串相等匹配会有效，数值类操作无效，这是因为就算取到数值型的jsonb ，还得要做转换（如上面的(a.person->>'attendanceyear')::integer <1998)。在上面的例子里，从200万行的数据里查询，使用索引后，从全表扫描的Execution time为1356.922 ms降低到0.196 ms，提升还是巨大的。

参考：

码字码了半天，手指头都麻了，如果觉得对您还有用的话，赞助一下辛苦费吧：

转载本文请联系原作者获取授权，同时请注明本文来自孙鹏科学网博客。
链接地址：https://blog.sciencenet.cn/blog-419883-1009363.html

上一篇：PG中验证JSON Schema（PG JSON系列4）
下一篇：Node.js如何方便的使用PostgreSQL中的JSONB数据（PG JSON系列6）

收藏 IP: 61.135.169.*| 热度|

当前推荐数：0

该博文允许注册用户评论请点击登录评论 (0 个评论)

数据加载中...

返回顶部

孙鹏

扫一扫，分享此博文

hillpig的个人博客分享 http://blog.sciencenet.cn/u/hillpig 畅想ing,思考ing,前行ing Email:bluevaley@gmail.com

博文

PG中JSONB存储和访问效率（PG JSON系列5）

当前推荐数：0

该博文允许注册用户评论请点击登录评论 (0 个评论)

孙鹏

全部作者的其他最新博文

全部精选博文导读

相关博文

hillpig的个人博客分享 http://blog.sciencenet.cn/u/hillpig 畅想ing,思考ing,前行ing Email:bluevaley@gmail.com

博文

PG中JSONB存储和访问效率（PG JSON系列5）

当前推荐数：0

该博文允许注册用户评论 请点击登录 评论 (0 个评论)

孙鹏

全部作者的其他最新博文

全部精选博文导读

相关博文

该博文允许注册用户评论请点击登录评论 (0 个评论)