Prometheus

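What is a TSDB? (Time Series Database)

Put simply, a TSDB is software optimized for handling time series data, in which the arrays of data are indexed by time. Such workloads typically have the following characteristics: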

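- Most operations are writes.

- Writes are almost always sequential appends; most of the time data arrives already ordered by time.

- Writes rarely target data from long ago, and data is rarely updated. In most cases data is written to the database within seconds or minutes of being collected.

- Deletes are generally done in blocks: choose a starting point in history and remove the contiguous blocks that follow. Deleting the data for a single point in time, or for scattered random times, is rare.

- The data usually far exceeds the size of memory, so caching is of little use and the system is generally I/O-bound.

- Reads are typically sequential scans in ascending or descending time order.

- Highly concurrent reads are very common.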
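The access pattern described above can be sketched with a toy in-memory series. This is only an illustration under simplifying assumptions (a single series, everything in memory); the class name `TinySeries` and its methods are hypothetical and do not reflect how Prometheus stores data internally.

```python
import bisect


class TinySeries:
    """Toy in-memory series: timestamps kept sorted, values stored alongside."""

    def __init__(self):
        self.timestamps = []  # sorted ascending
        self.values = []

    def append(self, ts, value):
        # Fast path: data almost always arrives in time order.
        if not self.timestamps or ts >= self.timestamps[-1]:
            self.timestamps.append(ts)
            self.values.append(value)
            return
        # Rare path: a late sample is inserted at its sorted position.
        i = bisect.bisect_right(self.timestamps, ts)
        self.timestamps.insert(i, ts)
        self.values.insert(i, value)

    def range(self, start, end):
        """Sequential read of all samples with start <= ts <= end."""
        lo = bisect.bisect_left(self.timestamps, start)
        hi = bisect.bisect_right(self.timestamps, end)
        return list(zip(self.timestamps[lo:hi], self.values[lo:hi]))

    def delete_before(self, cutoff):
        """Block deletion: drop every sample older than cutoff in one step."""
        i = bisect.bisect_left(self.timestamps, cutoff)
        del self.timestamps[:i]
        del self.values[:i]


series = TinySeries()
for ts, v in [(10, 1.0), (20, 2.0), (30, 3.0), (25, 2.5)]:
    series.append(ts, v)
series.delete_before(20)
print(series.range(0, 100))  # [(20, 2.0), (25, 2.5), (30, 3.0)]
```

Appends are cheap when samples arrive in order, range reads are a contiguous slice, and old data is dropped as one block rather than sample by sample.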

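What is Prometheus?

Prometheus is an open-source monitoring and alerting system and time series database (TSDB), originally developed at SoundCloud.

Prometheus joined the CNCF (Cloud Native Computing Foundation) in 2016 as the second project hosted by the foundation, after Kubernetes.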

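Features of Prometheus

- A multi-dimensional data model (a time series is identified by a metric name and a set of key/value pairs)

- A flexible query language (PromQL) that works across those dimensions

- No dependency on distributed storage; each single server node works autonomously

- Time series are collected with a pull model over HTTP

- Pushing time series is supported through an intermediary gateway

- Targets are found through service discovery or static configuration

- Support for multiple kinds of graphing and dashboards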

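The Prometheus ecosystem

- The main Prometheus server, which scrapes and stores time series data

- Client libraries for instrumenting application or exporter code (Go, Java, Python, Ruby)

- A push gateway to support short-lived jobs

- Visual dashboards (two options, PromDash and Grafana; Grafana is the mainstream choice today)

- Special-purpose exporters (for services such as HAProxy, StatsD, and Graphite)

- An experimental alert manager (Alertmanager), which separately handles alert aggregation, dispatch, and silencing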


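Deployment and configuration

Download

URL: https://prometheus.io/download/

Deployment

Download prometheus-*.tar.gz

Extract the archive

Configuration

The prometheus directory contains the main configuration file, prometheus.yml. It holds most of the standard settings as well as the configuration Prometheus uses to monitor itself. The configuration file looks like this: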

# my global config
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute. (interval between scrapes)
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute. (interval between rule evaluations)
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - '172.17.20.231:20507'  # address of the Alertmanager to connect to

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "alert-rule.yml"  # two kinds of rules can go here: recording rules and alerting rules

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'  # the scrape target
    # metrics_path defaults to '/metrics' (the exporter built into Prometheus)
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:20504']  # port Prometheus listens on

  - job_name: 'spring-boot'
    metrics_path: '/prometheus'  # path of our own Spring Boot exporter
    static_configs:
      - targets: ['localhost:20506']  # port the Spring Boot application listens on

Startup

Write a startup script:

nohup ./prometheus --config.file=prometheus.yml --web.enable-admin-api --web.listen-address=:20504 >/dev/null 2>&1 &

This starts Prometheus silently in the background; --web.listen-address sets the listening port.

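Data types

- Counter: a cumulative value that only ever increases (or resets to zero when the process restarts), for example the number of requests handled.

- Gauge: an instantaneous value that can rise and fall arbitrarily.

- Histogram: can be thought of as a bar chart of a distribution. It samples observations over a period of time (usually request durations or response sizes), counts them in configurable buckets, and also records the total count and sum of the observations.

- Summary: very similar to a Histogram. It also samples observations over a period of time (usually request durations or response sizes), but it stores quantile values directly instead of deriving them from bucket counts.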
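To make the Histogram type more concrete, here is a minimal sketch of cumulative buckets with a running sum and count. The class `TinyHistogram` and the bucket bounds are hypothetical, chosen only for illustration; a real application would use an official client library instead.

```python
import math


class TinyHistogram:
    """Toy histogram with cumulative buckets, plus running sum and count."""

    def __init__(self, name, upper_bounds):
        self.name = name
        # Always end with +Inf so every observation lands in some bucket.
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        self.total += value
        self.count += 1
        # Buckets are cumulative: a value counts toward every bound >= value.
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.bucket_counts[i] += 1

    def exposition(self):
        """Render lines in the style of the Prometheus text format."""
        lines = []
        for bound, n in zip(self.bounds, self.bucket_counts):
            le = "+Inf" if bound == math.inf else repr(bound)
            lines.append(f'{self.name}_bucket{{le="{le}"}} {n}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {self.count}")
        return lines


# Bucket bounds (in seconds) are assumed values for this example.
h = TinyHistogram("rpc_invoke_time_h", [0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.7, 2.0):
    h.observe(seconds)
print("\n".join(h.exposition()))
```

The `_sum` and `_count` series produced this way are what make an average computable on the server side, while the `_bucket` series allow quantiles to be estimated at query time. A Summary would instead compute the quantiles inside the application.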

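Time series data: instrumentation and querying

Every time series sample consists of a metric (the metric name), one or more labels, and a float64 value.

The standard format is <metric name>{<label name>=<label value>,...}

For example: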

rpc_invoke_cnt_c{code="0",method="Session.GenToken",job="Center"} 5

rpc_invoke_cnt_c{code="0",method="Relation.GetUserInfo",job="Center"} 12

rpc_invoke_cnt_c{code="0",method="Message.SendGroupMsg",job="Center"} 12

rpc_invoke_cnt_c{code="4",method="Message.SendGroupMsg",job="Center"} 3

rpc_invoke_cnt_c{code="0",method="Tracker.Tracker.Get",job="Center"} 70

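This is a set of monitoring data that counts how many times each RPC interface has been handled.

Here rpc_invoke_cnt_c is the metric name, and each sample carries three labels: code is the error code, job is the service the metric belongs to, and method is the method the metric belongs to. The number at the end is the sample value.

In this example we have four dimensions in total (one metric name and three labels), which lets us use PromQL, Prometheus's powerful query language, to run quite complex queries.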
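The sample format above can be taken apart with a few lines of code. The following is a simplified sketch: it ignores optional timestamps, comment lines, and escape sequences inside label values, and the function name `parse_sample` is hypothetical.

```python
import re

# Metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
# One label pair: name="value".
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one sample line into (metric name, labels dict, float value)."""
    m = SAMPLE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(m.group("labels") or ""))
    return m.group("name"), labels, float(m.group("value"))


name, labels, value = parse_sample(
    'rpc_invoke_cnt_c{code="4",method="Message.SendGroupMsg",job="Center"} 3'
)
print(name, labels, value)
```

Each label becomes a key in a dictionary, which is what makes it possible to filter or aggregate along any dimension, for example keeping only samples whose `code` is not `"0"`.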

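PromQL

PromQL (Prometheus Query Language) is the query DSL developed by Prometheus itself. It is highly expressive, supports conditional queries and operators, and ships with a large number of built-in functions for querying monitoring data along its various dimensions.

To find the call rate of Relation.GetUserInfo in the Center component, we can use the following query: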

rate(rpc_invoke_cnt_c{method="Relation.GetUserInfo",job="Center"}[1m])

Or, to get Center's overall RPC error rate broken down by method and error code:

sum by (method, code)(rate(rpc_invoke_cnt_c{job="Center",code!="0"}[1m]))

To get the average latency of each of Center's methods, the following query does the job:

rate(rpc_invoke_time_h_sum{job="Center"}[1m]) / rate(rpc_invoke_time_h_count{job="Center"}[1m])
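Queries like these can also be sent programmatically through the Prometheus HTTP API, whose instant-query endpoint is `/api/v1/query`. The sketch below assumes the server from the startup script is reachable at `http://localhost:20504`; the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def run_instant_query(base_url, promql, timeout=5):
    """Run the query and return the list of result series (needs a live server)."""
    with urlopen(instant_query_url(base_url, promql), timeout=timeout) as resp:
        return json.load(resp)["data"]["result"]


url = instant_query_url(
    "http://localhost:20504",  # assumed address, taken from the startup script
    'rate(rpc_invoke_cnt_c{method="Relation.GetUserInfo",job="Center"}[1m])',
)
print(url)
```

Only the URL is built here; calling `run_instant_query` requires a running Prometheus server at that address.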

rate(http_requests_total[5m])

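Returns the per-second rate of HTTP requests, measured over the last 5 minutes, for each time series in the range vector.

increase(http_requests_total[5m])

Returns the number of HTTP requests measured over the last 5 minutes for each time series in the range vector.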
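The idea behind rate() and increase() can be sketched in a few lines. This is a simplified approximation, not Prometheus's actual algorithm: the real functions also extrapolate to the edges of the time window, which this sketch leaves out. The sample data is made up.

```python
def increase(samples):
    """Total growth of a counter over the window, tolerating resets.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so count cur itself.
        total += (cur - prev) if cur >= prev else cur
    return total


def rate(samples):
    """Average per-second growth between the first and last sample."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed if elapsed > 0 else None


# The counter resets between t=60 and t=120 (for example, a process restart).
window = [(0, 100), (60, 160), (120, 10), (180, 70)]
print(increase(window))  # 130.0, i.e. 60 + 10 + 60
print(rate(window))      # 130 / 180 per second
```

Handling resets is the reason these functions are meant for counters: a raw difference between the last and first value would go negative after a restart.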

Official function reference: https://prometheus.io/docs/querying/functions/

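In addition, to make querying easier, there is some craft to choosing metric and label names when instrumenting.

rpc_invoke_cnt_c: RPC call counts

api_req_num_cv: HTTP API call counts

msg_queue_cnt_c: queue length statistics

Official naming guide: https://prometheus.io/docs/practices/naming/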
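Beyond readability, names also have to be syntactically valid. The sketch below checks names against the character sets documented for the classic Prometheus data model; the helper function names are hypothetical.

```python
import re

# Character sets from the classic Prometheus data model documentation.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_metric_name(name):
    return METRIC_NAME_RE.fullmatch(name) is not None


def is_valid_label_name(name):
    # Label names starting with "__" are reserved for internal use.
    return LABEL_NAME_RE.fullmatch(name) is not None and not name.startswith("__")


for candidate in ("rpc_invoke_cnt_c", "api-req-num", "2xx_total"):
    print(candidate, is_valid_metric_name(candidate))
```

A check like this could run in a unit test so that an invalid name is caught before the metric is ever scraped.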

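Alerting

Installation

Download URL: https://prometheus.io/download/

Create a startup script:

nohup ./alertmanager --web.listen-address=:20507 >/dev/null 2>&1 &

Adjust the configuration file

The alertmanager.yml file

Defining alert rules

First define the alert rules by configuring the rule files in Prometheus:

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "alert-rule.yml"  # two kinds of rules can go here: recording rules and alerting rules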

Then write the corresponding alert rules yourself:

groups:
  - name: example
    interval: 1s
    rules:
      # Alert for any instance that is unreachable for longer than the `for` duration below (1s here).
      - alert: InstanceDown
        expr: up == 0
        for: 1s
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down"

The rule above alerts when an instance is down.
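The effect of the `for` clause can be illustrated with a small state machine: while the expression holds but the `for` duration has not yet elapsed, the alert is pending, and after that it is firing. This is a simplified sketch of that behavior, with a hypothetical function name and made-up input values.

```python
def alert_states(up_values, interval_s, for_s):
    """Return 'inactive', 'pending', or 'firing' for each evaluation of `up == 0`."""
    states = []
    active_since = None  # evaluation time at which the condition first held
    for i, up in enumerate(up_values):
        now = i * interval_s
        if up != 0:
            # Condition no longer holds: the alert resets.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# Evaluate every 15s, require the condition to hold for 30s before firing.
print(alert_states([1, 0, 0, 0, 1], interval_s=15, for_s=30))
# ['inactive', 'pending', 'pending', 'firing', 'inactive']
```

With the values used in the rule above (`interval: 1s`, `for: 1s`), an instance that goes down would be pending on the first evaluation and firing on the next one. A longer `for` reduces noise from brief blips at the cost of slower notification.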

Configuring alert notifications

The following is a simple configuration:

global:
  smtp_smarthost: 'smtp.exmail.qq.com:25'  # SMTP server used to send mail
  smtp_from: 'xxx@ulopay.com'
  smtp_auth_username: 'xxx@ulopay.com'
  smtp_auth_password: 'xxx'

# The directory from which notification templates are read.
templates:
  - '/etc/alertmanager/template/*.tmpl'

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname', 'cluster', 'service']  # grouping labels, used by the settings below

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification. (How long a newly created group waits before sending, so related alerts can ride along.)
  group_wait: 5s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 1m  # interval between notifications for a group

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3h  # interval before resending

  # A default receiver
  receiver: zhangm  # default recipient

receivers:  # all recipients
  - name: 'zhangm'
    email_configs:
      - to: 'zhangm@ulopay.com'
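The grouping controlled by `group_by` can be pictured as bucketing alerts by the values of the listed labels. The following sketch shows only that bucketing step, not Alertmanager's timing logic; the function name and the sample alerts are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels (missing label -> '')."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Made-up alerts for illustration.
alerts = [
    {"alertname": "InstanceDown", "cluster": "A", "service": "api", "instance": "host1:20506"},
    {"alertname": "InstanceDown", "cluster": "A", "service": "api", "instance": "host2:20506"},
    {"alertname": "LatencyHigh", "cluster": "A", "service": "api", "instance": "host1:20506"},
]
for key, members in group_alerts(alerts, ["alertname", "cluster", "service"]).items():
    print(key, len(members))
```

The two InstanceDown alerts share all three grouping labels, so they would be delivered in one notification, while LatencyHigh forms its own group. `group_wait`, `group_interval`, and `repeat_interval` then decide when each group's notification goes out.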

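Dashboards and visualization

Startup

Install Grafana: https://grafana.com/

Download the grafana.tar.gz package

Extract the archive

Enter the bin directory

nohup ./grafana-server >/dev/null 2>&1 &

This starts Grafana in the background.

Configuration

To change the port, edit the http_port parameter in defaults.ini under the conf directory.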

Interface

[Screenshot: Grafana interface]

Account and password

Default account: admin; password: admin

Adding a data source

[Screenshots: adding a data source in Grafana]

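Integration

For integration, see the references: [Official Prometheus examples] [Integrating Prometheus with Play] [Integrating Prometheus with Spring]

References

Prometheus official website

[Getting started with Prometheus] (http://www.10tiao.com/html/357/201705/2247485232/1.html)

[Advanced Prometheus] (http://www.10tiao.com/html/357/201705/2247485249/1.html)

360's practice of monitoring online services with Prometheus

Official Prometheus examples

Integrating Prometheus with Play

Integrating Prometheus with Spring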
