MemSQL简介

# 为什么要看MemSQL

因为 MemSQL 自称突出一个快！而我们团队的数据平台正需要一个ms级别查询的数据库，因此花了一点时间来调研 MemSQL 。

# 初体验结果

MemSQL与我们期望的ms级别查询相差甚远，最简单的查询也通常是500ms返回，并不能应对线上的高并发实时查询系统。秒级查询已经有presto等各种大数据组件，对于我们团队已经没有理由再来深入使用MemSQL。

MemSQL 其实有很多优点，有完善的文档，精致的页面管理工具，方便的安装，甚至一键导入大量数据来方便用户体验（就差把sql的耗时贴出来了，要是贴出来就省了我调研的时间了），可惜在同一梯队还有太多免费、开源、有活跃社区的竞争对手。

# 特点

MemSQL属于商业公司，免费4 units. (每8核32G 为1 unit)，不开源
自带portal界面管理，MemSQL Studio，自带安装工具
完全支持mysql语法，jdbc，增删改查
支持 in-memory rowstore && on-disk columnstore
大吞吐量数据写入(loading TPC-H SF100 (approximately 100 GBs of row files) will take around four minutes)
exactly-once
aggregator节点运行sql，聚合结果；leaf节点存储&处理数据

# 数据导入

file , 指定分隔符
Streaming (Kafka , S3 , Azure Blob , Filesystem) 通过Pipelines
MySQL (sql文件)

# 选择 shard key (类似ES的routing)

Using a column or set of columns unique enough to minimize skew.
独特的key，使数据均匀分布
Sharding on columns which you expect to filter or join on often. This allows the optimizer to minimize network traffic during the execution of the query (see [Distributed DML] for more details).
使用你经常要join或过滤的字段，减小网络传输。比如订单相关，使用user_id

# 查询调优

加index 优化过滤 groupby sort
shard key : Gather partitions:all 优化为 Gather partitions:single
Reference Tables 小表，不常更新，每个节点都有复制

# 表结构优化

# Rowstore vs. Columnstore

https://docs.memsql.com/tutorials/v6.8/optimizing-table-data-structures/

Rowstore : 查询指定列，并发更新。使用无锁索引，支持多索引。频繁更新的事务场景。unique constraints 。同样会往disk写一份用于恢复。

Columnstore：顺序扫描，单索引。适用场景：聚合仅很少列（10列内）,扫描大量行，很少频繁单行删改，更新应该是大批量的修改

# 分布式SQL

# DDL

每一个并行的查询都在每个leaf的每个partition上分开执行。
默认每个leaf的partition数量=cpu数量
每个partition都是个单独的库
reference tables在每个aggregator和leaf都有一份复制，适合小表
shard key 决定数据落到哪一partition
主键和unique key 需要包含shard key ，例如：

1
2
3
4
5
6
7
CREATE TABLE clicks (
        click_id BIGINT AUTO_INCREMENT,
        user_id INT,
        page_id INT,
        SHARD KEY (user_id),
        PRIMARY KEY (click_id, user_id)
    );

# DML

分布式join 相同的shard key来join可以提升性能
分布式事务，等待每个partition都ready，再一起提交
引用表必须具有显式主键。分布式表中的AUTO_INCREMENT列必须是BIGINT。自动增量值在每个聚合器上单调递增，但在整个群集中不连续。分片表不支持唯一键（除非唯一键是分片键的超集）。分片表不支持修改分片键的UPDATE查询。分片表不支持UPDATE … LIMIT。

# 遇到问题

mysql front连接不上
换用其它客户端可以，比如idea
https://docs.memsql.com/tutorials/v6.8/build-stock-trade-database/ 导入csv文件提示找不到文件
docker cp /home/hxy/download/companylist.csv 62b8df2d8761://usr/share

# 性能体验

# 测试机器

AWS 8c32g

# 测试数据

官方S3的数据，我们使用 columnstore 和 rowstore 各测一次sql查询。

https://docs.memsql.com/guides/latest/load-data/pipelines/step-1/

# columnstore数据

# rowstore 数据

由于150W行数据存disk 占1G，存memory需要占5G，发现机器内存不足，只导入了一张150W行和一张25行的表。

# 执行sql耗时

1
select * from nation limit 20;

columnstore 770ms

rowstore 700ms

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
select l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from lineitem
where l_shipdate <= date('1998-12-01' - interval '90' day)
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;

columnstore 19s 5s 5s

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
select
l_orderkey,
sum(l_extendedprice * (1 - l_discount)) as revenue,
o_orderdate,
o_shippriority
from
customer,
orders,
lineitem
where
c_mktsegment = 'BUILDING'
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < date('1995-03-15')
and l_shipdate > date('1995-03-15')
group by
l_orderkey,
o_orderdate,
o_shippriority
order by
revenue desc,
o_orderdate
limit 10;

columnstore 10s 5s 5s

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
select
c_custkey,
c_name,
sum(l_extendedprice * (1 - l_discount)) as revenue,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment
from
customer,
orders,
lineitem,
nation
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate >= date('1993-10-01')
and o_orderdate < date('1993-10-01') + interval '3' month
and l_returnflag = 'R'
and c_nationkey = n_nationkey
group by
c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_comment
order by
revenue desc
limit 20;

columnstore 12s 9s 9s

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
select
nation,
o_year,
sum(amount) as sum_profit
from
(
select
n_name as nation,
extract(year from o_orderdate) as o_year,
l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
from
part,
supplier,
lineitem,
partsupp,
orders,
nation
where
s_suppkey = l_suppkey
and ps_suppkey = l_suppkey
and ps_partkey = l_partkey
and p_partkey = l_partkey
and o_orderkey = l_orderkey
and s_nationkey = n_nationkey
and p_name like '%green%'
) as profit
group by
nation,
o_year
order by
nation,
o_year desc;

columnstore 32s 19s 19s

1
2
3
4
5
6
select *
from customer c
left join orders o on c.c_custkey = o.o_custkey
left join nation n on c.c_nationkey = n.n_nationkey
where c.c_custkey = '8367'
limit 10;

columnstore 32s 4s 4s

1
2
3
4
5
6
select *
from customer c
left join orders o on c.c_custkey = o.o_custkey and o_custkey = '2141'
left join nation n on c.c_nationkey = n.n_nationkey
where c.c_custkey = '2141'
limit 10;

columnstore 2s 1s 1s

1
2
3
4
select *
from customer c
where c.c_custkey = '2141'
limit 10;

rowstore 500ms