ClickHouse Tech Sharing, Round 2 (English Handout)

Preface

Below is the handout for today's ClickHouse tech sharing session with colleagues at the company. Making slides is too much of a hassle, so I wrote the material directly in Markdown and rendered it with the Markdown Preview Plus extension for Chrome, which works very well.

The full text follows. It is all condensed essentials: it covers everything from my two earlier articles, 《物化視圖簡介與ClickHouse中的應用示例》 and 《ClickHouse Better Practices》, and also adds some new material, such as:

  • Combinator suffixes of ClickHouse aggregate functions
  • Read amplification in distributed join/in and the GLOBAL keyword
  • Workarounds for the lack of windowed analytical functions in ClickHouse SQL
    • Examples: ranking lists and YoY/MoM calculations
    • plus the usage of array join
  • The LowCardinality data type
  • The MergeTree index structure

There was no time to cover arrays and higher-order functions; I will get to them later.

Let's begin~


Advanced Usage & Better Practice of ClickHouse

Part I - Materialized View

Intro

  • Materialized view (MV): a persisted copy of a query's result set

  • MVs ≠ normal views, but ≈ tables

  • Trades space for time

  • Exists in various DBMSs (Oracle/SQL Server/PostgreSQL/...)

  • MV in ClickHouse = Precomputation + Incremental refreshing + Explicit data cache

  • Usage: relieves the load of frequent, patterned aggregation queries

Engines

  • MaterializedView: the implicit engine of the view itself

  • (Replicated)AggregatingMergeTree: automatically aggregates data upon insertion, according to user-defined logic

  • Distributed: just like the distributed tables we have seen before

Creation

  • Scenario: best-selling merchandise per point, tracked by PV / UV / first visit time / last visit time

[Image removed: it contained business data]

CREATE MATERIALIZED VIEW IF NOT EXISTS dw.merchandise_point_pvuv_agg
ON CLUSTER sht_ck_cluster_1
ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/tables/{shard}/dw/merchandise_point_pvuv_agg','{replica}')
PARTITION BY ts_date
ORDER BY (ts_date,site_id,point_index,merchandise_id)
SETTINGS index_granularity = 8192
[POPULATE] AS SELECT
  ts_date,
  site_id,
  site_name,
  point_index,
  merchandise_id,
  merchandise_abbr,
  sumState(1) AS pv,
  uniqState(user_id) AS uv,
  maxState(ts_date_time) AS last_time,
  minState(ts_date_time) AS first_time
FROM ods.analytics_access_log
WHERE event_type = 'shtOpenGoodsDetail'
AND active_id = 0
AND site_id >= 0 AND merchandise_id >= 0 AND point_index >= 0
GROUP BY ts_date,site_id,site_name,point_index,merchandise_id,merchandise_abbr;
  • MVs can have partition keys, order (primary) keys and setting parameters (again, like tables)

  • The POPULATE keyword:

    • Without POPULATE: only the data inserted into the source table after the MV is created will be computed

    • With POPULATE: all historical data is computed while the MV is being created, but new data ingested during this period is ignored

  • sum/uniq/max/minState() ???

Under the Hood
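
  • Under the hood, an MV created without a TO clause stores its data in an implicit table. A minimal way to peek at it (a sketch, assuming the then-default Ordinary database engine, where this hidden table is named .inner.<mv_name>):

-- The hidden table backing the MV shows up alongside the other tables in the database
SHOW TABLES FROM dw LIKE '%merchandise_point_pvuv_agg%';

-- Its definition carries the (Replicated)AggregatingMergeTree engine declared above
SHOW CREATE TABLE dw.`.inner.merchandise_point_pvuv_agg`;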

Distributed MV

CREATE TABLE IF NOT EXISTS dw.merchandise_point_pvuv_agg_all
ON CLUSTER sht_ck_cluster_1
AS dw.merchandise_point_pvuv_agg
ENGINE = Distributed(sht_ck_cluster_1,dw,merchandise_point_pvuv_agg,rand());

Query

SELECT
  merchandise_id,
  merchandise_abbr,
  sumMerge(pv) AS pv,
  uniqMerge(uv) AS uv,
  maxMerge(last_time) AS last_time,
  minMerge(first_time) AS first_time,
  arrayStringConcat(groupUniqArray(site_name),'|') AS site_names
FROM dw.merchandise_point_pvuv_agg_all
WHERE ts_date = today()
AND site_id IN (10030,10031,10036,10037,10038)
AND point_index = 2
GROUP BY merchandise_id,merchandise_abbr
ORDER BY pv DESC LIMIT 10;

[Image removed: it contained business data]

  • sum/uniq/max/minMerge() ???

Part II - Aggregate Function Combinators

-State

  • Does not return the aggregation result directly, but keeps an intermediate result (a "state") of the aggregation process

  • e.g. uniqState() keeps the hash table for cardinality approximation

  • Aggregate functions combined with -State will produce a column of type AggregateFunction(func,type)

  • AggregateFunction columns cannot be queried directly

[Image removed: it contained business data]
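
  • A minimal illustration on synthetic data (numbers() rather than the business tables above), showing that the -State result is an opaque intermediate state whose type is AggregateFunction:

SELECT toTypeName(uniqState(number)) AS state_type
FROM numbers(10);
-- state_type = 'AggregateFunction(uniq, UInt64)'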

-Merge

  • Aggregates the intermediate results and gives out the final value

  • A variant, -MergeState, aggregates intermediate results into a new intermediate result (but what's the point?)
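
  • A self-contained sketch of the -State / -Merge round trip (again on numbers(), not real data): partial states produced per group are merged into the final value:

SELECT uniqMerge(u) AS final_uniq
FROM (
  SELECT number % 3 AS grp, uniqState(number) AS u
  FROM numbers(100)
  GROUP BY grp
);
-- The three per-group partial states merge into a single result; final_uniq comes out as 100 here
-- (uniq is approximate in general, but exact at such small cardinalities)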

-If

  • Conditional aggregation

  • Perform multi-condition processing within one statement

SELECT
  sumIf(quantity, merchandise_abbr LIKE '%蘋果%') AS apple_quantity,
  countIf(toStartOfHour(ts_date_time) = '2020-06-09 20:00:00') AS eight_oclock_sub_order_num,
  maxIf(quantity * price, coupon_money > 0) AS couponed_max_gmv
FROM ods.ms_order_done
WHERE ts_date = toDate('2020-06-09');
┌─apple_quantity─┬─eight_oclock_sub_order_num─┬─couponed_max_gmv─┐
│           1365 │                      19979 │           318000 │
└────────────────┴────────────────────────────┴──────────────────┘

-Array

  • Array aggregation
SELECT avgArray([33, 44, 99, 110, 220]);
┌─avgArray([33, 44, 99, 110, 220])─┐
│                            101.2 │
└──────────────────────────────────┘

-ForEach

  • Aggregates arrays element-wise, by position (index), across rows
SELECT sumForEach(arr)
FROM (
  SELECT 1 AS id, [3, 6, 12] AS arr
  UNION ALL
  SELECT 2 AS id, [7, 14, 7, 5] AS arr
);
┌─sumForEach(arr)─┐
│ [10,20,19,5]    │
└─────────────────┘

Part III - Using JOIN Correctly

Only two-table equi-joins are considered here

Use IN When Possible

  • Prefer IN over JOIN when we only want to fetch data from the left table
SELECT sec_category_name,count()
FROM ods.analytics_access_log
WHERE ts_date = today() - 1
AND site_name like '長沙%'
AND merchandise_id IN (
  SELECT merchandise_id
  FROM ods.ms_order_done
  WHERE price > 10000
)
GROUP BY sec_category_name;

Put Small Table at Right

  • ClickHouse uses the hash-join algorithm whenever memory is sufficient
  • The right table is always the build table (resident in memory), while the left table is always the probe table

  • It falls back to an on-disk merge join when memory runs out (not as efficient as hash join)
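
  • A minimal sketch of the rule (big_fact_table and small_dim_table are hypothetical names): keep the smaller table on the right so that it becomes the in-memory build side:

-- Hypothetical example: the large fact table is the probe side (left),
-- the small dimension table is the build side held in memory (right)
SELECT l.merchandise_id, l.pv, r.category_name
FROM big_fact_table AS l
INNER JOIN small_dim_table AS r ON l.category_id = r.category_id;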

No Predicate Pushdown

  • Predicate pushdown is a common query optimization approach. e.g. in MySQL:
SELECT l.col1,r.col2 FROM left_table l
INNER JOIN right_table r ON l.key = r.key
WHERE l.col3 > 123 AND r.col4 = '...';
  • The WHERE predicates are evaluated early, during the scan phase, thus reducing the data size in the join phase

  • But the ClickHouse optimizer is fairly weak and does not support this, so we should manually push the predicates "inside" the subqueries

SELECT l.col1,r.col2 FROM (
  SELECT col1,key FROM left_table
  WHERE col3 > 123
) l INNER JOIN (
  SELECT col2,key FROM right_table
  WHERE col4 = '...'
) r ON l.key = r.key;

Distributed JOIN/IN with GLOBAL

  • When joining or doing IN on two distributed tables/MVs, the GLOBAL keyword is crucial
SELECT
  t1.merchandise_id,t1.merchandise_abbr,t1.pv,t1.uv,
  t2.total_quantity,t2.total_gmv
FROM (
  SELECT
    merchandise_id,merchandise_abbr,
    sumMerge(pv) AS pv,
    uniqMerge(uv) AS uv
  FROM dw.merchandise_point_pvuv_agg_all  -- Distributed
  WHERE ts_date = today()
  AND site_id IN (10030,10031,10036,10037,10038)
  AND point_index = 1
  GROUP BY merchandise_id,merchandise_abbr
) t1
GLOBAL LEFT JOIN (  -- GLOBAL
  SELECT
    merchandise_id,
    sumMerge(total_quantity) AS total_quantity,
    sumMerge(total_gmv) AS total_gmv
  FROM dw.merchandise_gmv_agg_all  -- Distributed
  WHERE ts_date = today()
  AND site_id IN (10030,10031,10036,10037,10038)
  GROUP BY merchandise_id
) t2
ON t1.merchandise_id = t2.merchandise_id;
  • Distributed joining without GLOBAL causes read amplification: the right table will be read M × N times (or N² when the shard counts are equal), which is very wasteful

  • Distributed joining with GLOBAL is fine: the right subquery runs only once, and its result is cached as an intermediate temporary table that is shipped to every shard (the same applies to IN, as sketched below)
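
  • A sketch reusing the distributed MVs above: with GLOBAL IN, the subquery is executed once and its result is shipped to every shard:

SELECT count()
FROM dw.merchandise_point_pvuv_agg_all  -- Distributed
WHERE ts_date = today()
AND merchandise_id GLOBAL IN (
  SELECT merchandise_id
  FROM dw.merchandise_gmv_agg_all  -- Distributed
  WHERE ts_date = today()
);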

ARRAY JOIN

  • Special. Not related to table joining, but arrays

  • Used to expand a row containing an array into multiple rows, adding extra column(s)

  • Seems like LATERAL VIEW EXPLODE in Hive?

  • An example in the next section

Part IV - Alternative to Windowed Analytical Functions

Drawback

  • ClickHouse lacks basic windowed analytical functions, such as (in Hive):
row_number() OVER (PARTITION BY col1 ORDER BY col2)
rank() OVER (PARTITION BY col1 ORDER BY col2)
dense_rank() OVER (PARTITION BY col1 ORDER BY col2)
lag(col,num) OVER (PARTITION BY col1 ORDER BY col2)
lead(col,num) OVER (PARTITION BY col1 ORDER BY col2)
  • Any other way around?

arrayEnumerate*()

  • arrayEnumerate(): Returns index array [1, 2, 3, …, length(array)]
SELECT arrayEnumerate([99, 88, 77, 66, 88, 99, 88, 55]);
┌─arrayEnumerate([99, 88, 77, 66, 88, 99, 88, 55])─┐
│ [1,2,3,4,5,6,7,8]                                │
└──────────────────────────────────────────────────┘
  • arrayEnumerateDense(): Returns an array of the same size as the source array, indicating where each element first appears in the source array
SELECT arrayEnumerateDense([99, 88, 77, 66, 88, 99, 88, 55]);
┌─arrayEnumerateDense([99, 88, 77, 66, 88, 99, 88, 55])─┐
│ [1,2,3,4,2,1,2,5]                                     │
└───────────────────────────────────────────────────────┘
  • arrayEnumerateUniq(): Returns an array the same size as the source array, indicating for each element what its position is among elements with the same value
SELECT arrayEnumerateUniq([99, 88, 77, 66, 88, 99, 88, 55]);
┌─arrayEnumerateUniq([99, 88, 77, 66, 88, 99, 88, 55])─┐
│ [1,1,1,1,2,2,3,1]                                    │
└──────────────────────────────────────────────────────┘

Ranking List

  • When the array is ordered, arrayEnumerate() = row_number(), arrayEnumerateDense() = dense_rank()

  • Pay attention to the usage of ARRAY JOIN --- it 'flattens' the result of arrays into human-readable columns

SELECT main_site_id,merchandise_id,gmv,row_number,dense_rank
FROM (
  SELECT main_site_id,
    groupArray(merchandise_id) AS merchandise_arr,
    groupArray(gmv) AS gmv_arr,
    arrayEnumerate(gmv_arr) AS gmv_row_number_arr,
    arrayEnumerateDense(gmv_arr) AS gmv_dense_rank_arr
  FROM (
    SELECT main_site_id,
      merchandise_id,
      sum(price * quantity) AS gmv
    FROM ods.ms_order_done
    WHERE ts_date = toDate('2020-06-01')
    GROUP BY main_site_id,merchandise_id
    ORDER BY gmv DESC
  )
  GROUP BY main_site_id
) ARRAY JOIN
  merchandise_arr AS merchandise_id,
  gmv_arr AS gmv,
  gmv_row_number_arr AS row_number,
  gmv_dense_rank_arr AS dense_rank
ORDER BY main_site_id ASC,row_number ASC;
┌─main_site_id─┬─merchandise_id─┬────gmv─┬─row_number─┬─dense_rank─┐
│          162 │         379263 │ 136740 │          1 │          1 │
│          162 │         360845 │  63600 │          2 │          2 │
│          162 │         400103 │  54110 │          3 │          3 │
│          162 │         404763 │  52440 │          4 │          4 │
│          162 │          93214 │  46230 │          5 │          5 │
│          162 │         304336 │  45770 │          6 │          6 │
│          162 │         392607 │  45540 │          7 │          7 │
│          162 │         182121 │  45088 │          8 │          8 │
│          162 │         383729 │  44550 │          9 │          9 │
│          162 │         404698 │  43750 │         10 │         10 │
│          162 │         102725 │  33284 │         11 │         11 │
│          162 │         404161 │  29700 │         12 │         12 │
│          162 │         391821 │  28160 │         13 │         13 │
│          162 │         339499 │  26069 │         14 │         14 │
│          162 │         404548 │  25600 │         15 │         15 │
│          162 │         167303 │  25520 │         16 │         16 │
│          162 │         209754 │  23940 │         17 │         17 │
│          162 │         317795 │  22950 │         18 │         18 │
│          162 │         404158 │  21780 │         19 │         19 │
│          162 │         326096 │  21540 │         20 │         20 │
│          162 │         404493 │  20950 │         21 │         21 │
│          162 │         389508 │  20790 │         22 │         22 │
│          162 │         301524 │  19900 │         23 │         23 │
│          162 │         404506 │  19900 │         24 │         23 │
│          162 │         404160 │  18130 │         25 │         24 │
........................
  • Use WHERE row_number <= N (or dense_rank <= N) to extract a grouped top-N, as in the toy sketch below
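
  • A toy sketch of that pattern on synthetic data (numbers() instead of order data), keeping only the top 2 values per group:

SELECT grp, val, rn
FROM (
  SELECT grp,
    groupArray(val) AS val_arr,
    arrayEnumerate(val_arr) AS rn_arr
  FROM (
    SELECT number % 3 AS grp, number * 10 AS val
    FROM numbers(12)
    ORDER BY val DESC
  )
  GROUP BY grp
) ARRAY JOIN
  val_arr AS val,
  rn_arr AS rn
WHERE rn <= 2
ORDER BY grp ASC, rn ASC;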

neighbor()

  • neighbor() is actually the combination of lag & lead
neighbor(column,offset[,default_value])
-- offset > 0 = lead
-- offset < 0 = lag
-- default_value is used when the offset is out of bounds
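
  • A toy illustration on a plain number sequence (note that neighbor() looks at neighboring rows within the processed data block, so real queries usually need the rows pre-sorted, as in the next example):

SELECT
  toInt64(number) AS val,
  neighbor(val, -1, -1) AS lag_1,  -- previous row's value, -1 when out of bounds
  neighbor(val, 2) AS lead_2       -- value two rows ahead, 0 (type default) when out of bounds
FROM numbers(5);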

Baseline (YoY/MoM)

  • 同比: YoY (year-over-year) rate = (value[month, year] - value[month, year - 1]) / value[month, year - 1]

  • 環(huán)比: MoM (month-over-month) rate = (value[month] - value[month - 1]) / value[month - 1]

  • Let's make up some fake data and try it out

WITH toDate('2019-01-01') AS start_date
SELECT
  toStartOfMonth(start_date + number * 32) AS dt,
  rand(number) AS val,
  neighbor(val,-12) AS prev_year_val,
  neighbor(val,-1) AS prev_month_val,
  if (prev_year_val = 0,-32768,round((val - prev_year_val) / prev_year_val, 4) * 100) AS yoy_percent,
  if (prev_month_val = 0,-32768,round((val - prev_month_val) / prev_month_val, 4) * 100) AS mom_percent
FROM numbers(18);
┌─────────dt─┬────────val─┬─prev_year_val─┬─prev_month_val─┬─yoy_percent─┬─────────mom_percent─┐
│ 2019-01-01 │  344308231 │             0 │              0 │      -32768 │              -32768 │
│ 2019-02-01 │ 2125630486 │             0 │      344308231 │      -32768 │              517.36 │
│ 2019-03-01 │  799858939 │             0 │     2125630486 │      -32768 │ -62.370000000000005 │
│ 2019-04-01 │ 1899653667 │             0 │      799858939 │      -32768 │               137.5 │
│ 2019-05-01 │ 3073278541 │             0 │     1899653667 │      -32768 │               61.78 │
│ 2019-06-01 │  882031881 │             0 │     3073278541 │      -32768 │               -71.3 │
│ 2019-07-01 │ 3888311917 │             0 │      882031881 │      -32768 │              340.84 │
│ 2019-08-01 │ 3791703268 │             0 │     3888311917 │      -32768 │               -2.48 │
│ 2019-09-01 │ 3472517572 │             0 │     3791703268 │      -32768 │               -8.42 │
│ 2019-10-01 │ 1010491656 │             0 │     3472517572 │      -32768 │  -70.89999999999999 │
│ 2019-11-01 │ 2841992923 │             0 │     1010491656 │      -32768 │              181.25 │
│ 2019-12-01 │ 1783039500 │             0 │     2841992923 │      -32768 │              -37.26 │
│ 2020-01-01 │ 2724427263 │     344308231 │     1783039500 │      691.28 │  52.800000000000004 │
│ 2020-02-01 │ 2472851287 │    2125630486 │     2724427263 │       16.33 │  -9.229999999999999 │
│ 2020-03-01 │ 1699617807 │     799858939 │     2472851287 │      112.49 │ -31.269999999999996 │
│ 2020-04-01 │  873033696 │    1899653667 │     1699617807 │      -54.04 │              -48.63 │
│ 2020-05-01 │ 3524933462 │    3073278541 │      873033696 │        14.7 │              303.76 │
│ 2020-06-01 │   85437434 │     882031881 │     3524933462 │      -90.31 │              -97.58 │
└────────────┴────────────┴───────────────┴────────────────┴─────────────┴─────────────────────┘

Part V - More on Data Types

Date/DateTime

  • Do not use String for Date/DateTime columns (the same rule applies to other types as well)

    • ClickHouse is strongly typed; there are no implicit conversions

    • All-String tables (a common habit carried over from Hive) do not play well with ClickHouse

  • Do not use integer timestamps in place of Date/DateTime

    • Date: stored as the number of days since 1970-01-01

    • DateTime: stored directly as a Unix timestamp (fast)

  • Very flexible date/time functions are available
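
  • For instance, a small self-contained sample (all standard built-in functions; the literal date is arbitrary):

SELECT
  now() AS current_ts,                 -- DateTime
  toDate(now()) AS current_date,       -- Date
  toStartOfHour(now()) AS hour_bucket,
  toYYYYMM(today()) AS year_month,
  dateDiff('day', toDate('2020-06-01'), today()) AS days_elapsed;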

Nullable

  • ClickHouse doesn't provide NULL by default, but if you want to...
merchandise_id Nullable(Int64)
  • But try to stay away from Nullable

    • A separate file is needed to store the NULL flags (the null map)

    • Nullable columns cannot be indexed

  • The default value itself can stand in for NULL (0 for Int, '' for String, etc.), or an explicit default can be defined when creating tables

merchandise_id Int64 DEFAULT -1

LowCardinality

  • ClickHouse applies dictionary encoding to LowCardinality columns. Operating on such data significantly increases the performance of SELECT queries for many applications

  • LowCardinality is almost always used with String columns that have few distinct values (cardinality < 10,000)

-- event_type in access logs is quite suitable
event_type LowCardinality(String)
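
  • An existing String column can also be converted in place. A sketch (the ALTER rewrites the column's data, and support for this conversion depends on the ClickHouse version):

ALTER TABLE ods.analytics_access_log ON CLUSTER sht_ck_cluster_1
MODIFY COLUMN event_type LowCardinality(String);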

Arrays & Higher-order Functions [TBD]

  • TBD...

Part VI - MergeTree Indices & Table Settings

Index Structure

  • Not B-Tree style, but rather like Kafka log indices (sparse)

  • .bin (data) and .mrk (mark) files for each column on disk

  • primary.idx stores one primary-key entry per index granularity (i.e. per granule)
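
  • A rough sanity check against the system.parts table (assuming access to the system database): each data part keeps roughly one mark per index_granularity rows

SELECT table, name AS part_name, rows, marks
FROM system.parts
WHERE database = 'ods' AND table = 'analytics_access_log' AND active
LIMIT 5;
-- For each part, marks is approximately rows / index_granularity (plus a trailing mark)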

Index Settings

  • The primary key must include the columns that frequently occur as predicates (in the WHERE clause)

  • Date/DateTime columns should come first (when partitioning by date/time)

  • Columns with very high cardinality (very distinctive values) are not suitable for indexing

  • Do not use too many columns, and do not change the index_granularity = 8192 setting when everything works fine

Table TTL

  • Determines the lifetime of rows, thus enabling automatic expiration of historical data

  • When creating a table

PARTITION BY ...
ORDER BY (...)
TTL ts_date + INTERVAL 6 MONTH
  • Or modify an existing table (only affects the data inserted after modification)
ALTER TABLE ods.analytics_access_log ON CLUSTER sht_ck_cluster_1
MODIFY TTL ts_date + INTERVAL 6 MONTH;
  • The setting that controls how frequently TTL-expired parts are merged (cleaned up)
SETTINGS merge_with_ttl_timeout = 86400  -- 1 day

ZooKeeper

  • ClickHouse utilizes ZooKeeper as: Coordination service + Mini log service + Metadata storage

  • Quite heavy, so try to keep the ZooKeeper cluster happy, e.g. with proper autopurge settings:

autopurge.purgeInterval = 1
autopurge.snapRetainCount = 5
  • Also, replicated tables can store the headers of the data parts compactly using a single znode by defining:
SETTINGS use_minimalistic_part_header_in_zookeeper = 1

Review CREATE TABLE statement

CREATE TABLE IF NOT EXISTS ods.analytics_access_log
ON CLUSTER sht_ck_cluster_1 (
  ts_date Date,
  ts_date_time DateTime,
  user_id Int64,
  event_type String,
  column_type String,
  groupon_id Int64,
  site_id Int64,
  site_name String,
  -- ...
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/ods/analytics_access_log','{replica}')
PARTITION BY ts_date
ORDER BY (ts_date,toStartOfHour(ts_date_time),main_site_id,site_id,event_type,column_type)
TTL ts_date + INTERVAL 6 MONTH
SETTINGS index_granularity = 8192,
use_minimalistic_part_header_in_zookeeper = 1,
merge_with_ttl_timeout = 86400;

The End

Good night, everyone.
