|||
前4篇博文,大概写了一下如何使用PG的JSONB类型:
这些数据量都还没上规模,这一篇写写上规模之后的JSONB的存储和访问效率。
测试环境:CENTOS 7 64bit, PG 9.6.0,三星SSD 840,CPU 酷睿G640,2.8GHz, VirtualBox 5.1.6
我们的第1个问题是:
相对于普通表,JSONB的存储大小是否过大?还是变化不大。
Alexander Korotkov在 https://www.pgcon.org/2015/schedule/track/Hacking/784.en.html 说:
The first problem of jsonb is its size overhead (4-5 times) in comparison with storing decomposed json data in multiple plain tables, whereas binary format of jsonb has very low storage overhead with regard to a plain text (<4%). The overhead comes mainly from redundant storing of keys names, which can be quite long, in each document.
可以这么理解:二进制形式存储的jsonb的存储比普通表大4-5倍,事实如何?我们来做个实验。
回到组合使用PostgreSQL的ARRAY和JSONB数据类型(PG JSON系列3)的例子,即照片中的每个人的信息,如下图:
如果用普通表设计为(我们略作修改,把attendanceYear,graduationYear设置为数值型):
-- Table: R_PIC_PERSON_PLAINTABLE -- 照片中的人的信息表,用普通关联表实现 DROP TABLE IF EXISTS R_PIC_PERSON_PLAINTABLE CASCADE; CREATE TABLE R_PIC_PERSON_PLAINTABLE ( id serial,-- 照片id position text, -- 如"前1排左1" name text,--如"张三" phone text, city text, organization text, attendanceYear numeric, graduationYear numeric )WITH (OIDS=FALSE); |
我们使用两种JSONB的存储形式,第1种:
-- Table: R_PIC_PERSON_JSONB_NOARRAY -- 照片中的人的信息表,用JSONB实现,非JSONB数组形式,即一行对应普通表的一行 DROP TABLE IF EXISTS R_PIC_PERSON_JSONB_NOARRAY CASCADE; CREATE TABLE R_PIC_PERSON_JSONB_NOARRAY ( id serial,-- 照片id person jsonb NOT NULL -- 格式为:{"position":"前1排左1","name":"张晓陆","phone":"18910180100","city":"济南","organization":"中科院计算所","attendanceYear":2012,"graduationYear":2016} )WITH (OIDS=FALSE); |
第2种:
-- Table: R_PIC_PERSON_JSONB_USEARRAY -- 照片中的人的信息表,用JSONB实现,JSONB数组形式,即一行对应普通表的多行 DROP TABLE IF EXISTS R_PIC_PERSON_JSONB_USEARRAY CASCADE; CREATE TABLE R_PIC_PERSON_JSONB_USEARRAY ( id serial,-- 照片id persons jsonb NOT NULL -- 格式为:[{"position":"前1排左1","name":"张晓陆","phone":"18910180100","city":"济南","organization":"中科院计算所","attendanceYear":2012,"graduationYear":2016},{"position":"前1排左2","name":"张晓陆","phone":"18910180100","organization":"中科院计算所","attendanceYear":2012,"graduationYear":2016}] )WITH (OIDS=FALSE); |
第1种JSONB的形式,一行对应一行普通表,好比1张照片中有20个人,则分别有20行。第2种JSONB的形式,即一张照片只有一行。
然后我们灌测试数据(参考francs的 PostgreSQL9.4: jsonb 性能测试 ):
20万条,普通表R_PIC_PERSON_PLAINTABLE(假设平均1张照片20个人):
insert into R_PIC_PERSON_PLAINTABLE select r/20, round(random()*200000) || '_position', round(random()*200000) || '_name', round(random()*200000) || '_phone', round(random()*200000) || '_city', round(random()*200000) || '_organization', round(random()*200000) , round(random()*200000) from generate_series(1,200000) as r; |
20万条,第1类JSONB表R_PIC_PERSON_JSONB_NOARRAY:
insert into R_PIC_PERSON_JSONB_NOARRAY SELECT id, row_to_json(R_PIC_PERSON_PLAINTABLE)::jsonb - 'id' from R_PIC_PERSON_PLAINTABLE; |
假设平均一张照片20个人,插入20万/20=1万条JSONB表R_PIC_PERSON_JSONB_USEARRAY:
insert into R_PIC_PERSON_JSONB_USEARRAY select id,array_to_json(array_agg(person)) from R_PIC_PERSON_JSONB_NOARRAY group by id; |
上面是基于20万条,我们分别就2000,2万,20万,200万,2000万,分别做几组实验,分别用:
jsontestdb=> dt+ R_PIC_PERSON_PLAINTABLE;
jsontestdb=> dt+ R_PIC_PERSON_JSONB_NOARRAY;
jsontestdb=> dt+ R_PIC_PERSON_JSONB_USEARRAY;
获得各自的存储大小,得出如下表:
2K | 2万 | 20万 | 200万 | 2000万 | |
PLAINTABLE | 248 | 2264 | 23 | 240 | 2521 |
JSONB_NOARRAY | 496 | 4856 | 49 | 488 | 5042 |
JSONB_USEARRAY | 136 | 1176 | 13 | 156 | 异常 |
单位 | KB | KB | MB | MB | MB |
最后2000万那列的R_PIC_PERSON_JSONB_USEARRAY表,生成时出现了异常(回头查一下原因,大家也测试):
LOG: redo starts at 2/5918E7B8
invalid record length at 2/591902D8: wanted 24, got 0
redo done at 2/5918E7B8
结论:总体上来讲本例中,在没有建索引的情况下,R_PIC_PERSON_JSONB_NOARRAY表大概是普通的R_PIC_PERSON_PLAINTABLE表的2倍大的空间,R_PIC_PERSON_JSONB_USEARRAY表则为R_PIC_PERSON_JSONB_NOARRAY表的50-65%左右。
那我们看看建立索引的情况下的大小,由于JSONB可建立Btree,Hash和GIN索引,我们先看看在三张表上均建立Btree索引的情况:
CREATE INDEX idx_btree_plaintbl ON R_PIC_PERSON_PLAINTABLE USING BTREE (name); CREATE INDEX idx_btree_jsonb_noarray ON R_PIC_PERSON_JSONB_NOARRAY USING BTREE ((person->'name')); CREATE INDEX idx_btree_jsonb_userarray ON R_PIC_PERSON_JSONB_USEARRAY USING BTREE ((persons)); |
获得各自的索引存储大小,得出如下表:
2K | 2万 | 20万 | 200万 | |
PLAINTABLE | 0.078 | 0.617 | 6.039 | 60.17 |
JSONB_NOARRAY | 0.093 | 0.789 | 7.773 | 77.48 |
JSONB_USEARRAY | 0.156 | 1.609 | 19.539 | 244.11 |
单位 | MB | MB | MB | MB |
Hash索引就不测了,然后在JSONB上测一下GIN:
CREATE INDEX idx_btree_plaintbl ON R_PIC_PERSON_PLAINTABLE USING BTREE (name); CREATE INDEX idx_gin_jsonb_noarray ON R_PIC_PERSON_JSONB_NOARRAY USING GIN ((person->'name')); CREATE INDEX idx_gin_jsonb_userarray ON R_PIC_PERSON_JSONB_USEARRAY USING GIN ((persons)); |
获得各自的索引存储大小,得出如下表:
2K | 2万 | 20万 | 200万 | |
PLAINTABLE | 0.078 | 0.617 | 6.039 | 60.17 |
JSONB_NOARRAY | 0.109 | 0.960 | 9.578 | 62.58 |
JSONB_USEARRAY | 0.617 | 6.117 | 44.97 | 476.8 |
单位 | MB | MB | MB | MB |
综上,在单一字段上建立的Btree和GIN索引,R_PIC_PERSON_PLAINTABLE表和R_PIC_PERSON_JSONB_NOARRAY表建立索引占用的存储空间略大一些(如上表中标红色的部分),但差别不大。
然后我们比较一下GIN索引在R_PIC_PERSON_JSONB_NOARRAY和R_PIC_PERSON_JSONB_USEARRAY表上的所有字段上建立的索引,所占空间有何区别。由于2K,2万,20万,200万呈现出良好的线性,所以这回我们仅对200万行记录进行测试。
CREATE INDEX idx_gin_jsonb_all_noarray ON R_PIC_PERSON_JSONB_NOARRAY USING GIN ((person)); CREATE INDEX idx_gin_jsonb_all_userarray ON R_PIC_PERSON_JSONB_USEARRAY USING GIN ((persons)); |
结果如下:
200万 | ||||
JSONB_NOARRAY | 481.3906 | |||
JSONB_USEARRAY | 475.4844 | |||
单位 | MB |
发现二者区别不大,JSONB_NOARRAY全字段索引所占空间略大些。
我们的第2个问题是:
2.JSONB的建立索引后,访问效率提升是多少?
在上面200万行的记录里,创建了如下全字段索引后:
CREATE INDEX idx_gin_jsonb_all_noarray ON R_PIC_PERSON_JSONB_NOARRAY USING GIN ((person)); CREATE INDEX idx_gin_jsonb_all_userarray ON R_PIC_PERSON_JSONB_USEARRAY USING GIN ((persons)); |
在R_PIC_PERSON_JSONB_NOARRAY 上面执行:
explain analyze select * from R_PIC_PERSON_JSONB_NOARRAY where person @> '{"name":"1349018_name"}'; QUERY PLAN --------------------------------------- Bitmap Heap Scan on r_pic_person_jsonb_noarray (cost=59.50..6912.05 rows=2000 width=222 ) (actual time=0.167..0.170 rows=2 loops=1) Recheck Cond: (person @> '{"name": "1349018_name"}'::jsonb) Heap Blocks: exact=2 -> Bitmap Index Scan on idx_gin_jsonb_all_noarray (cost=0.00..59.00 rows=2000 width= 0) (actual time=0.155..0.155 rows=2 loops=1) Index Cond: (person @> '{"name": "1349018_name"}'::jsonb) Planning time: 0.065 ms Execution time: 0.196 ms (7 rows) Time: 0.593 ms |
在R_PIC_PERSON_JSONB_USEARRAY 上面执行:
explain analyze select * from R_PIC_PERSON_JSONB_USEARRAY where personS @> '[{"name":"1349018_name"}]'; QUERY PLAN ----------------------------- Bitmap Heap Scan on r_pic_person_jsonb_usearray (cost=44.78..424.81 rows=100 width=1365 ) (actual time=0.099..0.114 rows=2 loops=1) Recheck Cond: (persons @> '[{"name": "1349018_name"}]'::jsonb) Heap Blocks: exact=2 -> Bitmap Index Scan on idx_gin_jsonb_all_userarray (cost=0.00..44.75 rows=100 width =0) (actual time=0.077..0.077 rows=2 loops=1) Index Cond: (persons @> '[{"name": "1349018_name"}]'::jsonb) Planning time: 0.066 ms Execution time: 0.140 ms (7 rows) Time: 0.571 ms |
二者均走了索引。
下面这个SQL语句,没走索引:
针对R_PIC_PERSON_JSONB_NOARRAY :
explain analyze SELECT count(*) FROM R_PIC_PERSON_JSONB_NOARRAY a WHERE (a.person->>'attendanceyear')::integer <1998; QUERY PLAN -------------- Aggregate (cost=104167.56..104167.57 rows=1 width=8) (actual time=1356.895..1356.895 ro ws=1 loops=1) -> Seq Scan on r_pic_person_jsonb_noarray a (cost=0.00..102500.90 rows=666665 width= 0) (actual time=0.224..1355.766 rows=1991 loops=1) Filter: (((person ->> 'attendanceyear'::text))::integer < 1998) Rows Removed by Filter: 1998009 Planning time: 0.052 ms Execution time: 1356.922 ms (6 rows) Time: 1357.409 ms |
针对R_PIC_PERSON_JSONB_USEARRAY:
explain analyze WITH rows_filtered AS ( SELECT DISTINCT id FROM ( select a.id, b.* from R_PIC_PERSON_JSONB_USEARRAY a, jsonb_to_recordset(a.persons) as b(position text,name text, phone text, city text, organization text, attendanceYear numeric, graduationYear numeric) ) c WHERE c.attendanceYear < 1998) SELECT count(*) FROM R_PIC_PERSON_JSONB_USEARRAY a WHERE a.id IN (SELECT * FROM rows_filtered); QUERY PLAN ------------------------ Aggregate (cost=666790.68..666790.69 rows=1 width=8) (actual time=5306.279..5306.279 ro ws=1 loops=1) CTE rows_filtered -> Unique (cost=626533.97..643034.14 rows=100001 width=4) (actual time=5214.361..5 215.161 rows=1975 loops=1) -> Sort (cost=626533.97..634784.06 rows=3300033 width=4) (actual time=5214.3 59..5214.633 rows=1991 loops=1) Sort Key: a_1.id Sort Method: quicksort Memory: 142kB -> Nested Loop (cost=0.00..179001.59 rows=3300033 width=4) (actual tim e=1.558..5212.525 rows=1991 loops=1) -> Seq Scan on r_pic_person_jsonb_usearray a_1 (cost=0.00..21000 .01 rows=100001 width=1365) (actual time=0.002..140.555 rows=100001 loops=1) -> Function Scan on jsonb_to_recordset b (cost=0.00..1.25 rows=3 3 width=0) (actual time=0.050..0.050 rows=0 loops=100001) Filter: (attendanceyear < '1998'::numeric) Rows Removed by Filter: 20 -> Hash Join (cost=2254.52..23631.54 rows=50000 width=0) (actual time=5217.060..5305 .864 rows=1975 loops=1) Hash Cond: (a.id = rows_filtered.id) -> Seq Scan on r_pic_person_jsonb_usearray a (cost=0.00..21000.01 rows=100001 width=4) (actual time=0.028..68.248 rows=100001 loops=1) -> Hash (cost=2252.02..2252.02 rows=200 width=4) (actual time=5216.963..5216.9 63 rows=1975 loops=1) Buckets: 2048 (originally 1024) Batches: 1 (originally 1) Memory Usage: 86kB -> HashAggregate (cost=2250.02..2252.02 rows=200 width=4) (actual time=5 216.258..5216.586 rows=1975 loops=1) Group Key: rows_filtered.id -> CTE Scan on rows_filtered (cost=0.00..2000.02 rows=100001 width =4) (actual time=5214.363..5215.643 rows=1975 loops=1) Planning time: 0.171 ms Execution time: 5306.357 ms (21 rows) Time: 5307.197 ms |
结论:建立JSONB的索引后,对字符串相等匹配会有效,数值类操作无效,这是因为就算取到数值型的jsonb ,还得要做转换(如上面的(a.person->>'attendanceyear')::integer <1998)。在上面的例子里,从200万行的数据里查询,使用索引后,从全表扫描的Execution time为1356.922 ms降低到0.196 ms,提升还是巨大的。
参考:
francs的 PostgreSQL9.4: jsonb 性能测试
https://hashrocket.com/blog/posts/faster-json-generation-with-postgresql
http://bitnine.net/blog-postgresql/postgresql-internals-jsonb-type-and-its-indexes/?ckattempt=1
https://blog.codeship.com/unleash-the-power-of-storing-json-in-postgres/
码字码了半天,手指头都麻了,如果觉得对您还有用的话,赞助一下辛苦费吧:
Archiver|手机版|科学网 ( 京ICP备07017567号-12 )
GMT+8, 2024-11-23 23:09
Powered by ScienceNet.cn
Copyright © 2007- 中国科学报社