I tried running Druid on an Ubuntu VM under Windows, covering installation, configuration, writing data, and querying data. The outline is as follows:
0. Overview
- Architecture and node types
- Data format
- Ingestion process
- Query process
1. Version
2. Install
3. Index data
4. Query data
- Built-in Queries
- SQL Queries
- Sql TopN
- Timeseries
- GroupBy
- Scan
- Explain
5. Data migration
6. Delete data
7. Reference
Overview
Apache Druid is an OLAP query engine that provides sub-second queries over both historical and real-time data, along with low-latency data ingestion, flexible data exploration and analysis, and high-performance aggregation.
Architecture and node types
- Historical: loads segment files from deep storage and serves them for queries.
- MiddleManager: reads data from external data sources and writes it into Druid to produce segments.
- Broker: receives query requests from external clients and forwards them to the historicals and middleManagers. When the broker receives the results, it merges the partial results from the historicals and middleManagers and returns them to the caller. To know the topology, the broker uses Zookeeper to determine which historicals and middleManagers are alive.
- Coordinator: monitors the historicals to ensure that data is available and the configuration is optimal. It reads metadata from the metadata store to decide which segments should be loaded into the cluster, uses Zookeeper to determine which historical nodes are alive, and creates task entries in Zookeeper telling historical nodes to load or drop segments.
- Overlord: monitors the MiddleManagers and is responsible for accepting tasks, coordinating and assigning them, creating locks for them, and returning task status to the submitter.
- Router: when the cluster is large, routes query requests to the appropriate Broker nodes.
- Indexing Service: a cluster of workers responsible for building indexes for batch and real-time data loads, and for allowing modification of already-ingested data.
- Realtime (deprecated): realtime nodes load real-time data into the system; they are easier to set up than the indexing service, at the cost of several limitations for production use.
Data format
- DataSource: Druid's basic data structure; logically it can be thought of as a table in a relational database. It contains three kinds of columns: timestamp, dimensions, and metrics (a sample event is sketched after this list).
  - Timestamp column: the timestamp is treated separately because all of our queries are centered on time.
  - Dimension columns: dimensions correspond to the attributes of an event and are usually used to filter the data. The data in our example has four dimensions: publisher, advertiser, gender, and country. Each of them can be seen as a subject of the selected data.
  - Metric columns: metrics are the columns used for aggregations and computations. In our example, click and price are metrics. Metrics are usually numeric and support operations such as count, sum, and mean. In OLAP terminology they are also called measures.
- Segment: the format Druid uses to store indexed data. Segments are partitioned by time range, which can be configured via segmentGranularity (the time granularity used to partition the index).
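As a concrete illustration of the layout above, here is a made-up event for that example schema (a sketch; the field values are invented): publisher, advertiser, gender, and country are dimensions, click and price are metrics, and timestamp drives all time filtering.
# write a single sample event to a file, mirroring the example schema described above
cat <<'EOF' > sample-event.json
{"timestamp": "2015-09-12T00:46:58Z", "publisher": "nytimes.com", "advertiser": "acme", "gender": "female", "country": "US", "click": 1, "price": 0.45}
EOF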
Ingestion process
- Committing Data (before the data is committed)
A datasource can consist of anywhere from one to many thousands of segments. Each segment starts out mutable and uncommitted when it is created by a middleManager. To build a compact segment that supports fast (inverted-index) queries, the following steps are applied:
- conversion to a columnar format
- bitmap indexes with bitmap compression
- run-length compression (RLE)
- dictionary encoding of strings (mapping dict)
- type-aware compression
- Committed Data (the commit process)
Segments are periodically flushed to deep storage (once they grow too old or too large). After the flush a segment becomes immutable and is handed over from the middleManager to a historical. An entry describing the flushed segment is written to the metadata store; this entry is the segment's self-description, including its schema, its size, and its location in deep storage. The coordinator reads it from the metadata store (MySQL, PostgreSQL) to locate the actual data.
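To see what these metadata entries look like, the druid_segments table can be queried directly; a minimal sketch, assuming a MySQL metadata store with a database named druid (as configured later in the migration section; the quickstart defaults to Derby):
# `start`/`end` are backquoted because they are reserved words in MySQL
mysql -u druid -p druid -e 'SELECT id, dataSource, `start`, `end`, used FROM druid_segments LIMIT 5;'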
Query process
Query Basic Flow
A query first arrives at the broker. The broker identifies which segments on the historicals and middleManagers contain data relevant to the query, then sends a rewritten subquery to those historicals and middleManagers. Both kinds of nodes execute the subquery and return their partial results to the broker, which merges them and returns the final result to the caller.
Optimization Method
- Fetch only the segments relevant to the query
- Within each segment, use indexes to identify which rows are needed
- Once the rows are known, read only the relevant columns from the columnar storage instead of reading whole rows (see the segmentMetadata sketch after this list)
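To inspect the per-segment column information that this pruning relies on, Druid's built-in segmentMetadata query type can be sent to the broker; a minimal sketch (it reuses the wikipedia datasource and interval ingested later in this post):
curl -X 'POST' -H 'Content-Type:application/json' http://localhost:8082/druid/v2?pretty -d '{
  "queryType" : "segmentMetadata",
  "dataSource" : "wikipedia",
  "intervals" : ["2015-09-12/2015-09-13"]
}'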
Version
- Ubuntu 16.04.5 LTS
- druid-0.12.3-bin.tar.gz
- zookeeper-3.4.10.tar.gz
- tutorial-examples.tar.gz
Install
# install druid
tar -xzf druid-0.12.3-bin.tar.gz
tar -xzf tutorial-examples.tar.gz
# install zk
tar -xzf zookeeper-3.4.10.tar.gz
# configure
cd zookeeper-3.4.10
cp conf/zoo_sample.cfg conf/zoo.cfg
./bin/zkServer.sh start
cd druid-0.12.3
bin/init
# Instead of the official java -cp commands for 0.12.3, the ones from the 0.9.0 quickstart are used here.
java `cat conf-quickstart/druid/coordinator/jvm.config | xargs` -cp conf-quickstart/druid/_common:conf-quickstart/druid/coordinator:lib/* io.druid.cli.Main server coordinator
java `cat conf-quickstart/druid/overlord/jvm.config | xargs` -cp conf-quickstart/druid/_common:conf-quickstart/druid/overlord:lib/* io.druid.cli.Main server overlord
java `cat conf-quickstart/druid/historical/jvm.config | xargs` -cp conf-quickstart/druid/_common:conf-quickstart/druid/historical:lib/* io.druid.cli.Main server historical
java `cat conf-quickstart/druid/middleManager/jvm.config | xargs` -cp conf-quickstart/druid/_common:conf-quickstart/druid/middleManager:lib/* io.druid.cli.Main server middleManager
java `cat conf-quickstart/druid/broker/jvm.config | xargs` -cp conf-quickstart/druid/_common:conf-quickstart/druid/broker:lib/* io.druid.cli.Main server broker
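Once the five processes are up, a quick sanity check (a sketch; it assumes the default quickstart ports: coordinator 8081, broker 8082, historical 8083, overlord 8090, middleManager 8091, each of which answers GET /status):
# print the first bytes of each node's /status response; a connection error means the node is not up yet
for p in 8081 8082 8083 8090 8091; do
  echo -n "port $p: "; curl -s http://localhost:$p/status | head -c 120; echo
done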
Index data
Try to ingest some data into Druid.
curl -X 'POST' -H 'Content-Type:application/json' -d @examples/wikipedia-index.json http://localhost:8090/druid/indexer/v1/task
examples/wikipedia-index.json
{
"type" : "index",
"spec" : {
"dataSchema" : {
"dataSource" : "wikipedia",
"parser" : {
"type" : "string",
"parseSpec" : {
"format" : "json",
"dimensionsSpec" : {
"dimensions" : [
"channel",
"cityName",
"comment",
"countryIsoCode",
"countryName",
"isAnonymous",
"isMinor",
"isNew",
"isRobot",
"isUnpatrolled",
"metroCode",
"namespace",
"page",
"regionIsoCode",
"regionName",
"user",
{ "name": "added", "type": "long" },
{ "name": "deleted", "type": "long" },
{ "name": "delta", "type": "long" }
]
},
"timestampSpec": {
"column": "time",
"format": "iso"
}
}
},
"metricsSpec" : [],
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "day",
"queryGranularity" : "none",
"intervals" : ["2015-09-12/2015-09-13"],
"rollup" : false
}
},
"ioConfig" : {
"type" : "index",
"firehose" : {
"type" : "local",
"baseDir" : "quickstart/",
"filter" : "wikiticker-2015-09-12-sampled.json.gz"
},
"appendToExisting" : false
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 5000000,
"maxRowsInMemory" : 25000,
"forceExtendableShardSpecs" : true
}
}
}
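After submitting the task, its status can be polled on the overlord; a small sketch, where TASK_ID is a placeholder for the "task" field returned by the POST above:
# paste the task id returned by the indexing POST into TASK_ID (placeholder shown)
TASK_ID=put-the-returned-task-id-here
curl "http://localhost:8090/druid/indexer/v1/task/$TASK_ID/status"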
Query data
Built-in Queries
Try to query the data from Druid.
curl -X 'POST' -H 'Content-Type:application/json' -d @examples/wikipedia-top-pages.json http://localhost:8082/druid/v2?pretty
wikipedia-top-pages.json
{
"queryType" : "topN",
"dataSource" : "wikipedia",
"intervals" : ["2015-09-12/2015-09-13"],
"granularity" : "all",
"dimension" : "page",
"metric" : "count",
"threshold" : 10,
"aggregations" : [
{
"type" : "count",
"name" : "count"
}
]
}
SQL Queries
上面topN是druid內置的TopN queries。而通過上述配置,sql是默認沒有開啟的。如果需要開啟sql,需要在broker/common下面添加druid.sql.enable=true
。官網是在examples/conf/druid/_common/common.runtime.properties
下面默認開啟的。而我們這邊是通過conf-quickstart/druid/broker/runtime.properties
啟動的,兩種方式皆可,取決于druid.sql.enable=true
是否被識別而已。
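One minimal way to set it with the quickstart layout used here (restart the broker afterwards so the property takes effect):
echo "druid.sql.enable=true" >> conf-quickstart/druid/broker/runtime.properties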
Sql TopN
curl -X 'POST' -H 'Content-Type:application/json' -d @examples/wikipedia-top-pages-sql.json http://localhost:8082/druid/v2/sql
wikipedia-top-pages-sql.json
{
"query":"SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE \"__time\" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10"
}
Timeseries
curl -X 'POST' -H 'Content-Type:application/json' -d @examples/wikipedia-timeseries-sql.json http://localhost:8082/druid/v2/sql
wikipedia-timeseries-sql.json
{
"query":"SELECT FLOOR(__time to HOUR) AS HourTime, SUM(deleted) AS LinesDeleted FROM wikipedia WHERE \"__time\" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY FLOOR(__time to HOUR)"
}
GroupBy
curl -X 'POST' -H 'Content-Type:application/json' -d @examples/wikipedia-groupby-sql.json http://localhost:8082/druid/v2/sql
wikipedia-groupby-sql.json
{
"query":"SELECT channel, SUM(added) FROM wikipedia WHERE \"__time\" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY channel ORDER BY SUM(added) DESC LIMIT 5"
}
Scan
curl -X 'POST' -H 'Content-Type:application/json' -d @examples/wikipedia-scan-sql.json http://localhost:8082/druid/v2/sql
wikipedia-scan-sql.json
{
"query":"SELECT user, page FROM wikipedia WHERE \"__time\" BETWEEN TIMESTAMP '2015-09-12 02:00:00' AND TIMESTAMP '2015-09-12 03:00:00' LIMIT 5"
}
Explain
curl -X 'POST' -H 'Content-Type:application/json' -d @examples/wikipedia-explain-top-pages-sql.json http://localhost:8082/druid/v2/sql
wikipedia-explain-top-pages-sql.json
{
"query":"EXPLAIN PLAN FOR SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE \"__time\" BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 10"
}
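If the distribution ships the dsql client (included in 0.12.x tarballs), the same statements can also be run interactively instead of via curl; a sketch:
bin/dsql
# dsql> SELECT page, COUNT(*) AS Edits FROM wikipedia GROUP BY page ORDER BY Edits DESC LIMIT 10;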
Data migration
Sometimes the old cluster can no longer hold new data. One option is to scale out the old cluster; another is to bring up a new one. With a new cluster, the historical data has to be carried over, otherwise searches would miss the old data. The general steps are:
- Switch traffic: route new data to the new cluster
- Move data: migrate the old data to the new cluster
Moving old data to the new cluster
The concrete steps in Druid are:
- Copy the segments directly from the srcPath to the destPath:
hadoop distcp hdfs://nn1:8020/src/path/to/segment/file hdfs://nn1:8020/dest/path/to/segment/file
- Create new metadata entries for the segments under the new destPath:
java \
-cp "/home/chenfh5/project/druid/druid-0.12.3/lib/*" \
-Ddruid.metadata.storage.type=mysql \
-Ddruid.metadata.storage.connector.connectURI=jdbc:mysql://localhost:3306/druid2 \
-Ddruid.metadata.storage.connector.user=yourname \
-Ddruid.metadata.storage.connector.password=yourpwd \
-Ddruid.extensions.loadList=[\"mysql-metadata-storage\",\"druid-hdfs-storage\"] \
-Ddruid.storage.type=hdfs \
io.druid.cli.Main tools insert-segment-to-db --workingDir hdfs://nn1:8020/dest/path/to/segment/ --updateDescriptor true
If mysql-metadata-storage is not under the extensions folder, download it from the official site and extract it there.
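For example, assuming the mysql-metadata-storage tarball matching this Druid version has already been downloaded (a sketch; the exact file name may differ):
# the tarball is expected to unpack into a mysql-metadata-storage/ directory
tar -xzf mysql-metadata-storage-0.12.3.tar.gz -C extensions/
ls extensions/mysql-metadata-storage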
Dual-cluster migration test
To deploy two clusters on the local VM while using the insert-segment-to-db tool, there are two options:
- change the -Ddruid.metadata.storage.* options to the default DB (Derby), or
- change the default metadata storage to MySQL so that it matches the -Ddruid.metadata.storage.* options (the approach taken below)
In the Install section, we saw that
java `cat conf-quickstart/druid/coordinator/jvm.config | xargs` -cp conf-quickstart/druid/_common:conf-quickstart/druid/coordinator:lib/* io.druid.cli.Main server coordinator
is started with conf-quickstart/druid/_common, whose properties are:
# For Derby server on your Druid Coordinator (only viable in a cluster with a single Coordinator, no fail-over):
druid.metadata.storage.type=derby
druid.metadata.storage.connector.connectURI=jdbc:derby://localhost:1527/var/druid/metadata.db;create=true
druid.metadata.storage.connector.host=localhost
druid.metadata.storage.connector.port=1527
Change this to the MySQL settings:
druid.extensions.loadList=["mysql-metadata-storage"]
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://db.example.com:3306/druid
druid.metadata.storage.connector.user=...
druid.metadata.storage.connector.password=...
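Before starting against MySQL, the database itself has to exist; a minimal sketch along the lines of the MySQL metadata storage docs (database name, user, and password are placeholders that must match the properties above):
# create the metadata database and a user for Druid
mysql -u root -p <<'EOF'
CREATE DATABASE druid DEFAULT CHARACTER SET utf8;
GRANT ALL PRIVILEGES ON druid.* TO 'druid'@'localhost' IDENTIFIED BY 'diurd';
EOF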
In addition, after switching to MySQL you may run into a Table doesn't exist error; workarounds can be found online. After that, the cluster can be started.
cluster1
Start the cluster and ingest the wikipedia dataset.
cluster2
cat conf-quickstart/druid/*/runtime.properties | grep -C2 port
This shows the ports already occupied by the five roles; change them here so that the second cluster does not conflict with the first.
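One way to do this for the copied second cluster (a sketch: it assumes the copy lives in druid-0.12.3.bak as in the cp command below, bumps each default port by 1000 to match the 9081 used in the check step, and leaves other shared settings such as the Zookeeper base path untouched):
cd /home/chenfh5/project/druid/druid-0.12.3.bak
# rewrite druid.port for each of the five roles (GNU sed, in-place)
for f in conf-quickstart/druid/*/runtime.properties; do
  sed -i 's/^druid\.port=8081/druid.port=9081/; s/^druid\.port=8082/druid.port=9082/; s/^druid\.port=8083/druid.port=9083/; s/^druid\.port=8090/druid.port=9090/; s/^druid\.port=8091/druid.port=9091/' "$f"
done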
cp -r /home/chenfh5/project/druid/druid-0.12.3/var/druid/segments/wikipedia/ /home/chenfh5/project/druid/druid-0.12.3.bak/var/druid/segments/
insert-segment-to-db
check
curl -XGET http://localhost:9081/druid/coordinator/v1/metadata/datasources?full
Delete data
# mark the datasource (or the segments in an interval) as "unused"
curl -XDELETE http://localhost:8081/druid/coordinator/v1/datasources/{dataSourceName}
curl -XDELETE http://localhost:8081/druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval}
Note: the interval in the URL uses `_` as the separator, e.g. `interval = 2016-06-27_2016-06-28`.
# running a kill task permanently deletes any "unused" segments
curl -X 'POST' -H 'Content-Type:application/json' http://localhost:8090/druid/indexer/v1/task -d'{
"type": "kill",
"dataSource": "deletion-tutorial",
"interval" : "2015-09-12/2015-09-13"
}'
Note: the interval in the kill task spec uses `/` as the separator, e.g. `interval = 2015-09-12/2015-09-13`.
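To confirm that the kill task actually removed the segments, re-check the coordinator metadata endpoint used earlier (port 8081 on the first cluster); the datasource should no longer be listed:
curl -XGET http://localhost:8081/druid/coordinator/v1/metadata/datasources?full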