MemSQL简介

# 为什么要看MemSQL

因为 MemSQL 自称突出一个快!而我们团队的数据平台正需要一个ms级别查询的数据库,因此花了一点时间来调研 MemSQL 。

# 初体验结果

MemSQL与我们期望的ms级别查询相差甚远,最简单的查询也通常是500ms返回,并不能应对线上的高并发实时查询系统。秒级查询已经有presto等各种大数据组件,对于我们团队已经没有理由再来深入使用MemSQL。

MemSQL 其实有很多优点,有完善的文档,精致的页面管理工具,方便的安装,甚至一键导入大量数据来方便用户体验 (就差把sql的耗时贴出来了,要是贴出来就省了我调研的时间了),可惜在同一梯队还有太多免费、开源、有活跃社区的竞争对手。

# 特点

  • MemSQL属于商业公司,免费4 units. (每8核32G 为1 unit),不开源
  • 自带portal界面管理,MemSQL Studio,自带安装工具
  • 完全支持mysql语法,jdbc,增删改查
  • 支持 in-memory rowstore && on-disk columnstore
  • 大吞吐量数据写入(loading TPC-H SF100 (approximately 100 GBs of row files) will take around four minutes)
  • exactly-once
  • aggregator节点运行sql,聚合结果;leaf节点存储&处理数据

# 数据导入

  1. file , 指定分隔符
  2. Streaming (Kafka , S3 , Azure Blob , Filesystem) 通过Pipelines
  3. MySQL (sql文件)

# 选择 shard key (类似ES的routing)

  • Using a column or set of columns unique enough to minimize skew.

    独特的key,使数据均匀分布

  • Sharding on columns which you expect to filter or join on often. This allows the optimizer to minimize network traffic during the execution of the query (see [Distributed DML] for more details).

    使用你经常要join或过滤的字段,减小网络传输。比如订单相关,使用user_id

# 查询调优

  1. 加index 优化过滤 groupby sort
  2. shard key : Gather partitions:all 优化为 Gather partitions:single
  3. Reference Tables 小表,不常更新 ,每个节点都有复制

profile

# 表结构优化

# Rowstore vs. Columnstore

https://docs.memsql.com/tutorials/v6.8/optimizing-table-data-structures/

Rowstore : 查询指定列,并发更新。使用无锁索引,支持多索引。频繁更新的事务场景。unique constraints 。同样会往disk写一份 用于恢复。

Columnstore: 顺序扫描,单索引。适用场景:聚合仅很少列(10列内),扫描大量行,很少频繁单行删改,更新应该是大批量的修改

# 分布式SQL

# DDL

  • 每一个并行的查询都在每个leaf的每个partition上分开执行。
  • 默认每个leaf的partition数量=cpu数量
  • 每个partition都是个单独的库
  • reference tables在每个aggregator和leaf都有一份复制,适合小表
  • shard key 决定数据落到哪一partition
  • 主键和unique key 需要包含shard key ,例如:
1
2
3
4
5
6
7
CREATE TABLE clicks (
        click_id BIGINT AUTO_INCREMENT,
        user_id INT,
        page_id INT,
        SHARD KEY (user_id),
        PRIMARY KEY (click_id, user_id)
    );

# DML

  • 分布式join 相同的shard key来join可以提升性能
  • 分布式事务,等待每个partition都ready,再一起提交
  • 引用表必须具有显式主键。 分布式表中的AUTO_INCREMENT列必须是BIGINT。 自动增量值在每个聚合器上单调递增,但在整个群集中不连续。 分片表不支持唯一键(除非唯一键是分片键的超集)。 分片表不支持修改分片键的UPDATE查询。 分片表不支持UPDATE … LIMIT。

# 遇到问题

  1. mysql front连接不上

    换用其它客户端可以,比如idea

  2. https://docs.memsql.com/tutorials/v6.8/build-stock-trade-database/ 导入csv文件提示找不到文件

    docker cp /home/hxy/download/companylist.csv 62b8df2d8761://usr/share

# 性能体验

# 测试机器

AWS 8c32g

# 测试数据

官方S3的数据,我们使用 columnstore 和 rowstore 各测一次sql查询。

https://docs.memsql.com/guides/latest/load-data/pipelines/step-1/

# columnstore数据

columnstore.png

# rowstore 数据

rowstore.png

由于150W行数据存disk 占1G,存memory需要占5G,发现机器内存不足,只导入了一张150W行和一张25行的表。

# 执行sql耗时

1
select * from nation limit 20;

columnstore 770ms

rowstore 700ms

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
select l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from lineitem
where l_shipdate <= date('1998-12-01' - interval '90' day)
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;

columnstore 19s 5s 5s

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
select
l_orderkey,
sum(l_extendedprice * (1 - l_discount)) as revenue,
o_orderdate,
o_shippriority
from
customer,
orders,
lineitem
where
c_mktsegment = 'BUILDING'
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < date('1995-03-15')
and l_shipdate > date('1995-03-15')
group by
l_orderkey,
o_orderdate,
o_shippriority
order by
revenue desc,
o_orderdate
limit 10;

columnstore 10s 5s 5s

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
select
c_custkey,
c_name,
sum(l_extendedprice * (1 - l_discount)) as revenue,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment
from
customer,
orders,
lineitem,
nation
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate >= date('1993-10-01')
and o_orderdate < date('1993-10-01') + interval '3' month
and l_returnflag = 'R'
and c_nationkey = n_nationkey
group by
c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_comment
order by
revenue desc
limit 20;

columnstore 12s 9s 9s

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
select
nation,
o_year,
sum(amount) as sum_profit
from
(
select
n_name as nation,
extract(year from o_orderdate) as o_year,
l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
from
part,
supplier,
lineitem,
partsupp,
orders,
nation
where
s_suppkey = l_suppkey
and ps_suppkey = l_suppkey
and ps_partkey = l_partkey
and p_partkey = l_partkey
and o_orderkey = l_orderkey
and s_nationkey = n_nationkey
and p_name like '%green%'
) as profit
group by
nation,
o_year
order by
nation,
o_year desc;

columnstore 32s 19s 19s

1
2
3
4
5
6
select *
from customer c
left join orders o on c.c_custkey = o.o_custkey
left join nation n on c.c_nationkey = n.n_nationkey
where c.c_custkey = '8367'
limit 10;

columnstore 32s 4s 4s

1
2
3
4
5
6
select *
from customer c
left join orders o on c.c_custkey = o.o_custkey and o_custkey = '2141'
left join nation n on c.c_nationkey = n.n_nationkey
where c.c_custkey = '2141'
limit 10;

columnstore 2s 1s 1s

1
2
3
4
select *
from customer c
where c.c_custkey = '2141'
limit 10;

rowstore 500ms