kube-prometheus support for Kubernetes versions

kube-prometheus version | Kubernetes version |
---|---|
release-0.4 | 1.16, 1.17 |
release-0.5 | 1.18 |
release-0.6 | 1.18, 1.19 |
release-0.7 | 1.19, 1.20 |
release-0.8 | 1.20, 1.21 |
release-0.9 | 1.21, 1.22 |
release-0.10 | 1.22, 1.23 |
release-0.11 | 1.23, 1.24 |
main | 1.24 |

Reference: https://github.com/prometheus-operator/kube-prometheus#compatibility
Installing kube-prometheus

kube-prometheus repository
https://github.com/prometheus-operator/kube-prometheus

kube-prometheus components
- prometheus-operator
- prometheus
- alertmanager
- prometheus-adapter
- node-exporter
- kube-state-metrics
- grafana
- blackbox-exporter

All of the above components come from the kube-prometheus release-0.8 branch.

Download kube-prometheus
git clone -b release-0.8 https://github.com/prometheus-operator/kube-prometheus.git
[root@k8s-master kube-prometheus-release-0.8]# ll
total 184
-rwxr-xr-x 1 root root 679 Mar 21 2022 build.sh
-rw-r--r-- 1 root root 3039 Mar 21 2022 code-of-conduct.md
-rw-r--r-- 1 root root 1422 Mar 21 2022 DCO
drwxr-xr-x 2 root root 4096 Mar 21 2022 docs
-rw-r--r-- 1 root root 2051 Mar 21 2022 example.jsonnet
drwxr-xr-x 7 root root 4096 Mar 21 2022 examples
drwxr-xr-x 3 root root 28 Mar 21 2022 experimental
-rw-r--r-- 1 root root 237 Mar 21 2022 go.mod
-rw-r--r-- 1 root root 59996 Mar 21 2022 go.sum
drwxr-xr-x 3 root root 68 Mar 21 2022 hack
drwxr-xr-x 3 root root 29 Mar 21 2022 jsonnet
-rw-r--r-- 1 root root 206 Mar 21 2022 jsonnetfile.json
-rw-r--r-- 1 root root 4857 Mar 21 2022 jsonnetfile.lock.json
-rw-r--r-- 1 root root 4495 Mar 21 2022 kustomization.yaml
-rw-r--r-- 1 root root 11325 Mar 21 2022 LICENSE
-rw-r--r-- 1 root root 2153 Mar 21 2022 Makefile
drwxr-xr-x 3 root root 4096 Sep 19 19:39 manifests
-rw-r--r-- 1 root root 126 Mar 21 2022 NOTICE
-rw-r--r-- 1 root root 38246 Mar 21 2022 README.md
drwxr-xr-x 2 root root 187 Mar 21 2022 scripts
-rw-r--r-- 1 root root 928 Mar 21 2022 sync-to-internal-registry.jsonnet
drwxr-xr-x 3 root root 17 Mar 21 2022 tests
-rwxr-xr-x 1 root root 808 Mar 21 2022 test.sh
Manifest listing
[root@k8s-master kube-prometheus-release-0.8]# tree manifests/
manifests/
├── alertmanager-alertmanager.yaml
├── alertmanager-podDisruptionBudget.yaml
├── alertmanager-prometheusRule.yaml
├── alertmanager-secret.yaml
├── alertmanager-serviceAccount.yaml
├── alertmanager-serviceMonitor.yaml
├── alertmanager-service.yaml
├── blackbox-exporter-clusterRoleBinding.yaml
├── blackbox-exporter-clusterRole.yaml
├── blackbox-exporter-configuration.yaml
├── blackbox-exporter-deployment.yaml
├── blackbox-exporter-serviceAccount.yaml
├── blackbox-exporter-serviceMonitor.yaml
├── blackbox-exporter-service.yaml
├── grafana-dashboardDatasources.yaml
├── grafana-dashboardDefinitions.yaml
├── grafana-dashboardSources.yaml
├── grafana-deployment.yaml
├── grafana-serviceAccount.yaml
├── grafana-serviceMonitor.yaml
├── grafana-service.yaml
├── istio-servicemonitor.yaml
├── kube-prometheus-prometheusRule.yaml
├── kubernetes-prometheusRule.yaml
├── kubernetes-serviceMonitorApiserver.yaml
├── kubernetes-serviceMonitorCoreDNS.yaml
├── kubernetes-serviceMonitorKubeControllerManager.yaml
├── kubernetes-serviceMonitorKubelet.yaml
├── kubernetes-serviceMonitorKubeScheduler.yaml
├── kube-state-metrics-clusterRoleBinding.yaml
├── kube-state-metrics-clusterRole.yaml
├── kube-state-metrics-deployment.yaml
├── kube-state-metrics-prometheusRule.yaml
├── kube-state-metrics-serviceAccount.yaml
├── kube-state-metrics-serviceMonitor.yaml
├── kube-state-metrics-service.yaml
├── node-exporter-clusterRoleBinding.yaml
├── node-exporter-clusterRole.yaml
├── node-exporter-daemonset.yaml
├── node-exporter-prometheusRule.yaml
├── node-exporter-serviceAccount.yaml
├── node-exporter-serviceMonitor.yaml
├── node-exporter-service.yaml
├── prometheus-adapter-apiService.yaml
├── prometheus-adapter-clusterRoleAggregatedMetricsReader.yaml
├── prometheus-adapter-clusterRoleBindingDelegator.yaml
├── prometheus-adapter-clusterRoleBinding.yaml
├── prometheus-adapter-clusterRoleServerResources.yaml
├── prometheus-adapter-clusterRole.yaml
├── prometheus-adapter-configMap.yaml
├── prometheus-adapter-deployment.yaml
├── prometheus-adapter-podDisruptionBudget.yaml
├── prometheus-adapter-roleBindingAuthReader.yaml
├── prometheus-adapter-serviceAccount.yaml
├── prometheus-adapter-serviceMonitor.yaml
├── prometheus-adapter-service.yaml
├── prometheus-clusterRoleBinding.yaml
├── prometheus-clusterRole.yaml
├── prometheus-operator-prometheusRule.yaml
├── prometheus-operator-serviceMonitor.yaml
├── prometheus-operator.yaml
├── prometheus-podDisruptionBudget.yaml
├── prometheus-prometheusRule.yaml
├── prometheus-prometheus.yaml
├── prometheus-roleBindingConfig.yaml
├── prometheus-roleBindingSpecificNamespaces.yaml
├── prometheus-roleConfig.yaml
├── prometheus-roleSpecificNamespaces.yaml
├── prometheus-serviceAccount.yaml
├── prometheus-serviceMonitor.yaml
├── prometheus-service.yaml
└── setup
├── 0namespace-namespace.yaml
├── prometheus-operator-0alertmanagerConfigCustomResourceDefinition.yaml
├── prometheus-operator-0alertmanagerCustomResourceDefinition.yaml
├── prometheus-operator-0podmonitorCustomResourceDefinition.yaml
├── prometheus-operator-0probeCustomResourceDefinition.yaml
├── prometheus-operator-0prometheusCustomResourceDefinition.yaml
├── prometheus-operator-0prometheusruleCustomResourceDefinition.yaml
├── prometheus-operator-0servicemonitorCustomResourceDefinition.yaml
├── prometheus-operator-0thanosrulerCustomResourceDefinition.yaml
├── prometheus-operator-clusterRoleBinding.yaml
├── prometheus-operator-clusterRole.yaml
├── prometheus-operator-deployment.yaml
├── prometheus-operator-serviceAccount.yaml
└── prometheus-operator-service.yaml
Deploy
kubectl apply -f manifests/setup/
kubectl apply -f manifests/
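If the second apply runs before the CRDs from manifests/setup/ are registered, it can fail with "no matches for kind" errors. The upstream README suggests waiting for the CRDs between the two applies, roughly like this (a hedged sketch):

```sh
# Run between the two applies: wait until the ServiceMonitor CRD created by manifests/setup/ is served
until kubectl get servicemonitors --all-namespaces; do date; sleep 1; echo ""; done
```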
Verify
[root@k8s-master kube-prometheus-release-0.8]# kubectl get all -n monitoring
NAME READY STATUS RESTARTS AGE
pod/alertmanager-main-0 2/2 Running 6 115d
pod/alertmanager-main-1 2/2 Running 4 115d
pod/alertmanager-main-2 2/2 Running 6 115d
pod/blackbox-exporter-55c457d5fb-swjqc 3/3 Running 6 115d
pod/grafana-9df57cdc4-wc9hn 1/1 Running 2 115d
pod/kube-state-metrics-76f6cb7996-8x7pj 2/3 ImagePullBackOff 4 115d
pod/kube-state-metrics-7749b7b647-4mzsq 2/3 ImagePullBackOff 2 10d
pod/node-exporter-9tj5z 2/2 Running 4 113d
pod/node-exporter-hsxf7 2/2 Running 4 115d
pod/node-exporter-q8g6m 2/2 Running 4 115d
pod/node-exporter-zngtl 2/2 Running 4 115d
pod/prometheus-adapter-59df95d9f5-hjb7l 1/1 Running 3 115d
pod/prometheus-adapter-59df95d9f5-kdx7n 1/1 Running 4 115d
pod/prometheus-k8s-0 2/2 Running 5 115d
pod/prometheus-k8s-1 2/2 Running 5 115d
pod/prometheus-operator-7775c66ccf-2r99w 2/2 Running 5 115d
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/alertmanager-main ClusterIP 100.101.252.112 <none> 9093/TCP 115d
service/alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 115d
service/blackbox-exporter ClusterIP 100.111.30.55 <none> 9115/TCP,19115/TCP 115d
service/grafana ClusterIP 100.97.190.206 <none> 3000/TCP 115d
service/kube-state-metrics ClusterIP None <none> 8443/TCP,9443/TCP 115d
service/node-exporter ClusterIP None <none> 9100/TCP 115d
service/prometheus-adapter ClusterIP 100.109.111.30 <none> 443/TCP 115d
service/prometheus-k8s NodePort 100.101.111.146 <none> 9090:32101/TCP 115d
service/prometheus-operated ClusterIP None <none> 9090/TCP 115d
service/prometheus-operator ClusterIP None <none> 8443/TCP 115d
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/node-exporter 4 4 4 4 4 kubernetes.io/os=linux 115d
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/blackbox-exporter 1/1 1 1 115d
deployment.apps/grafana 1/1 1 1 115d
deployment.apps/kube-state-metrics 0/1 1 0 115d
deployment.apps/prometheus-adapter 2/2 2 2 115d
deployment.apps/prometheus-operator 1/1 1 1 115d
NAME DESIRED CURRENT READY AGE
replicaset.apps/blackbox-exporter-55c457d5fb 1 1 1 115d
replicaset.apps/grafana-9df57cdc4 1 1 1 115d
replicaset.apps/kube-state-metrics-76f6cb7996 1 1 0 115d
replicaset.apps/kube-state-metrics-7749b7b647 1 1 0 67d
replicaset.apps/prometheus-adapter-59df95d9f5 2 2 2 115d
replicaset.apps/prometheus-operator-7775c66ccf 1 1 1 115d
NAME READY AGE
statefulset.apps/alertmanager-main 3/3 115d
statefulset.apps/prometheus-k8s 2/2 115d
Note: the prometheus Service type shown in the output above has already been changed to NodePort.
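For reference, a minimal sketch of the edit that produces that NodePort (assuming the nodePort 32101 seen in the output above; metadata and selector are abridged from the release-0.8 prometheus-service.yaml):

```yaml
# prometheus-service.yaml (abridged sketch)
apiVersion: v1
kind: Service
metadata:
  name: prometheus-k8s
  namespace: monitoring
spec:
  type: NodePort            # changed from the default ClusterIP
  ports:
  - name: web
    port: 9090
    targetPort: web
    nodePort: 32101         # matches 9090:32101/TCP in the output above
  selector:
    app: prometheus
    prometheus: k8s
```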
Change the grafana Service type to NodePort
- grafana-service.yaml before the change
```yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: grafana
    app.kubernetes.io/name: grafana
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 7.5.4
  name: grafana
  namespace: monitoring
spec:
  ports:
  - name: http
    port: 3000
    targetPort: http
  selector:
    app.kubernetes.io/component: grafana
    app.kubernetes.io/name: grafana
    app.kubernetes.io/part-of: kube-prometheus
```
- grafana-service.yaml after the change
```yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: grafana
    app.kubernetes.io/name: grafana
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 7.5.4
  name: grafana
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - name: http
    port: 3000
    targetPort: http
    nodePort: 32009
  selector:
    app.kubernetes.io/component: grafana
    app.kubernetes.io/name: grafana
    app.kubernetes.io/part-of: kube-prometheus
```
- Check the result
[root@k8s-master manifests]# kubectl get svc -n monitoring | grep grafana
grafana NodePort 100.97.190.206 <none> 3000:32009/TCP 115d
Persisting Prometheus data
- prometheus-prometheus.yaml
```yaml
......
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: 2.26.0
  storage:                      # persistence lives under spec.storage in the Prometheus CRD
    volumeClaimTemplate:
      metadata:
        name: db-volume
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: mysql
        resources:
          requests:
            storage: 8Gi
```
For creating the StorageClass itself, see: https://note.youdao.com/s/T71lEiDh
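Once this is applied, the Operator should create one PVC per Prometheus replica from the claim template. A quick sanity check:

```sh
kubectl apply -f prometheus-prometheus.yaml
kubectl get pvc -n monitoring    # expect one claim per prometheus-k8s replica
```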
kube-prometheus architecture
- Operator
  The operator runs in the cluster as a Deployment. It owns the custom resources (CRDs), uses them to deploy and manage Prometheus servers, and watches those CRD resources for changes so it can react to them. It is the control centre of the whole architecture.
- Prometheus
  - This CRD declares the desired Prometheus deployment in the Kubernetes cluster and provides options for replicas, persistence, alerting and so on.
  - For every Prometheus resource, the Operator deploys a correspondingly configured StatefulSet in the same namespace; the Prometheus pods mount their configuration from a Secret named prometheus-<name> that contains the Prometheus configuration.
  - The CRD uses label selectors to specify which ServiceMonitors the Prometheus instance should cover; the Operator then generates the scrape configuration from the selected ServiceMonitors and keeps the Secret holding that configuration up to date.
- Alertmanager
  - This CRD defines an Alertmanager deployment running in the cluster and likewise provides many options, including persistent storage.
  - For every Alertmanager resource, the Operator deploys a correspondingly configured StatefulSet in the same namespace; the Alertmanager pods are configured with a Secret named alertmanager-<name> that stores the configuration file under the key alertmanager.yaml.
- ThanosRuler
  - This CRD defines the configuration of a Thanos Ruler component so it can run in the Kubernetes cluster; with Thanos Ruler, recording and alerting rules can be processed across multiple Prometheus instances.
  - A ThanosRuler instance needs at least one queryEndpoint pointing to a Thanos Querier or a Prometheus instance; the queryEndpoints are used to configure the --query argument passed to Thanos at runtime.
- ServiceMonitor
  - This CRD defines how to monitor a dynamic set of services, using labels to select which Services are monitored.
  - For Prometheus to monitor any application inside Kubernetes, an Endpoints object must exist. An Endpoints object is essentially a list of IP addresses, and it is usually populated automatically by a Service: the Service matches Pods with a label selector and adds them to the Endpoints object. A Service can expose one or more ports, backed by one or more Endpoints lists, and these endpoints generally point to a Pod.
  - Note: endpoints (lowercase) is a field of the ServiceMonitor CRD, while Endpoints is a Kubernetes object (a quick way to look at one is shown below).
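To make that Service/Endpoints relationship concrete, you can inspect the Endpoints object behind any Service already running in this stack (a hedged example; grafana is used only because it exists in the monitoring namespace):

```sh
kubectl get endpoints grafana -n monitoring -o wide
```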
- PodMonitor
  - This CRD defines how to monitor a dynamic set of Pods, using labels to select which Pods are monitored.
- Probe
  - This CRD defines how to monitor a set of Ingresses and static targets. Besides the targets, a Probe object also needs a prober, the service that probes the targets and exposes metrics for Prometheus; blackbox-exporter, for example, can provide this service.
- PrometheusRule
  - Configures Prometheus rule files, including recording and alerting rules, which Prometheus loads automatically.
- AlertmanagerConfig
  - Declarative Alertmanager routing and receiver configuration; it is used in the custom alert channel section later in this document.
kube-prometheus custom monitoring

- Create the Pod for the service to be monitored (via a Deployment)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: check-secret-tls
  namespace: kube-ops
  labels:
    app: check-secret-tls
    release: prod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: check-secret-tls
      release: prod
  strategy:
    rollingUpdate:
      maxSurge: 70%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: check-secret-tls
        release: prod
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - image: xxxxxx/kube-ops/check-secret-tls:v1.0
        imagePullPolicy: Always
        name: check-secret-tls
        readinessProbe:
          httpGet:
            port: 8090
            path: /health
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 30
          failureThreshold: 10
        livenessProbe:
          httpGet:
            port: 8090
            path: /health
          initialDelaySeconds: 330
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3
        resources:
          requests:
            cpu: 0.5
            memory: 500Mi
          limits:
            cpu: 0.5
            memory: 500Mi
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "echo 1"]
      imagePullSecrets:
      - name: cn-beijing-ali-tope365
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: check-secret-tls
    release: prod
  name: check-secret-tls
  namespace: kube-ops
spec:
  ports:
  - name: check-secret-tls
    port: 8090
    protocol: TCP
    targetPort: 8090
  selector:
    app: check-secret-tls
    release: prod
  type: ClusterIP
```
- Create the ServiceMonitor
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: chekc-secret-tls
  namespace: monitoring
spec:
  endpoints:
  - interval: 15s
    path: /metrics
    port: check-secret-tls
  namespaceSelector:
    any: true
  selector:
    matchLabels:
      app: 'check-secret-tls'
```
- metadata.name: the name of this ServiceMonitor
- metadata.namespace: the namespace this ServiceMonitor lives in
- spec.endpoints: the scrape configuration used by Prometheus; endpoints is an array, so several can be defined, each with the fields interval, path and port
- spec.endpoints.interval: the scrape interval
- spec.endpoints.path: the path to scrape metrics from
- spec.endpoints.port: the port to scrape; this is a port name, matched against the Service selected via spec.selector
- spec.namespaceSelector: the namespaces in which to discover Services
- spec.namespaceSelector.any: the only allowed value is true; when set, Services matching the selector are picked up in all namespaces
- When using matchNames instead:
```yaml
......
  namespaceSelector:
    matchNames:
    - default
    - kube-ops
......
```
matchNames is an array listing the namespaces to watch; the YAML above monitors the default and kube-ops namespaces.
Main kube-prometheus YAML files
- alertmanager-prometheusRule.yaml
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.21.0
    prometheus: k8s
    role: alert-rules
  name: alertmanager-main-rules
  namespace: monitoring
spec:
  groups:
  - name: alertmanager.rules
    rules:
    - alert: AlertmanagerFailedReload
      annotations:
        description: Configuration has failed to load for {{ $labels.namespace }}/{{ $labels.pod}}.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/alertmanagerfailedreload
        summary: Reloading an Alertmanager configuration has failed.
      expr: |
        # Without max_over_time, failed scrapes could create false negatives, see
        # https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
        max_over_time(alertmanager_config_last_reload_successful{job="alertmanager-main",namespace="monitoring"}[5m]) == 0
      for: 10m
      labels:
        severity: critical
# Only part of the file is shown; the rest is just more alerting rules.
# Everything from "- alert:" through "severity: critical" is one alerting rule; to define your own,
# copy such a block and adjust it as needed.
```
Important labels: `prometheus: k8s` and `role: alert-rules`. These two labels are what prometheus-prometheus.yaml relies on: it selects the matching rules through its ruleSelector. So to define a custom alerting rule, all you need to do is create a PrometheusRule object that carries the labels `prometheus=k8s` and `role=alert-rules` (see the excerpt below).
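For reference, this is roughly the selector in prometheus-prometheus.yaml that does that matching in release-0.8 (an abridged excerpt):

```yaml
# prometheus-prometheus.yaml (excerpt)
spec:
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
```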
- kube-prometheus-prometheusRule.yaml
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-prometheus
    app.kubernetes.io/part-of: kube-prometheus
    prometheus: k8s
    role: alert-rules
  name: kube-prometheus-rules
  namespace: monitoring
spec:
  groups:
  - name: general.rules
    rules:
    - alert: TargetDown
      annotations:
        description: '{{ printf "%.4g" $value }}% of the {{ $labels.job }}/{{ $labels.service }} targets in {{ $labels.namespace }} namespace are down.'
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/targetdown
        summary: One or more targets are unreachable.
      expr: 100 * (count(up == 0) BY (job, namespace, service) / count(up) BY (job, namespace, service)) > 10
      for: 10m
      labels:
        severity: warning
    - alert: Watchdog
      annotations:
        description: |
          This is an alert meant to ensure that the entire alerting pipeline is functional.
          This alert is always firing, therefore it should always be firing in Alertmanager
          and always fire against a receiver. There are integrations with various notification
          mechanisms that send a notification when this alert is not firing. For example the
          "DeadMansSnitch" integration in PagerDuty.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/watchdog
        summary: An alert that should always be firing to certify that Alertmanager is working properly.
      expr: vector(1)
      labels:
        severity: none
  - name: node-network
    rules:
    - alert: NodeNetworkInterfaceFlapping
      annotations:
        message: Network interface "{{ $labels.device }}" changing it's up status often on node-exporter {{ $labels.namespace }}/{{ $labels.pod }}
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/nodenetworkinterfaceflapping
      expr: |
        changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m]) > 2
      for: 2m
      labels:
        severity: warning
  - name: kube-prometheus-node-recording.rules
    rules:
    - expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[3m])) BY (instance)
      record: instance:node_cpu:rate:sum
    - expr: sum(rate(node_network_receive_bytes_total[3m])) BY (instance)
      record: instance:node_network_receive_bytes:rate:sum
    - expr: sum(rate(node_network_transmit_bytes_total[3m])) BY (instance)
      record: instance:node_network_transmit_bytes:rate:sum
    - expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[5m])) WITHOUT (cpu, mode) / ON(instance) GROUP_LEFT() count(sum(node_cpu_seconds_total) BY (instance, cpu)) BY (instance)
      record: instance:node_cpu:ratio
    - expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[5m]))
      record: cluster:node_cpu:sum_rate5m
    - expr: cluster:node_cpu_seconds_total:rate5m / count(sum(node_cpu_seconds_total) BY (instance, cpu))
      record: cluster:node_cpu:ratio
  - name: kube-prometheus-general.rules
    rules:
    - expr: count without(instance, pod, node) (up == 1)
      record: count:up1
    - expr: count without(instance, pod, node) (up == 0)
      record: count:up0
```
This file is the same sort of definition as alertmanager-prometheusRule.yaml, which is easy to see from its kind and its `prometheus: k8s` / `role: alert-rules` labels.
- kubernetes-prometheusRule.yaml
This is the same sort of definition as the two files above, so it needs no further explanation.
- kubernetes-serviceMonitorCoreDNS.yaml
ServiceMonitors were already covered above; if anything is unclear, see the kube-prometheus custom monitoring section of this document.
- prometheus-adapter-*.yaml
  - Only a brief note on the adapter here; look up the details yourself if you are interested (a quick check is shown below).
  - The metrics Prometheus collects cannot be consumed by Kubernetes directly because the data formats are incompatible, so a component (prometheus-adapter) is needed to convert the data Prometheus collects into a format the Kubernetes API can understand.
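A quick way to confirm the adapter is answering (a hedged check; it assumes the stock kube-prometheus setup where prometheus-adapter serves the metrics.k8s.io API):

```sh
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | head -c 300; echo
kubectl top nodes
```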
kube-prometheus alerting flow
The alerting data flow in brief: exporter --> prometheus --> alertmanager --> alert receiving channel.
For exporters, see https://note.youdao.com/s/EQ3Ra7MD
- How Prometheus connects to Alertmanager
```yaml
global:
  scrape_interval: 30s
  scrape_timeout: 10s
  evaluation_interval: 30s
  external_labels:
    prometheus: monitoring/k8s
    prometheus_replica: prometheus-k8s-0
alerting:
  alert_relabel_configs:
  - separator: ;
    regex: prometheus_replica
    replacement: $1
    action: labeldrop
  alertmanagers:
  - follow_redirects: true
    scheme: http
    path_prefix: /
    timeout: 10s
    api_version: v2
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: alertmanager-main
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: web
      replacement: $1
      action: keep
    kubernetes_sd_configs:
    - role: endpoints
      follow_redirects: true
      namespaces:
        names:
        - monitoring
rule_files:
- /etc/prometheus/rules/prometheus-k8s-rulefiles-0/*.yaml
```
This configuration comes from the Configuration page of the Prometheus web UI.
`regex: alertmanager-main` and `regex: web` keep only the Service named alertmanager-main and its port named web.
Check the alertmanager-main Service:
[root@k8s-master manifests]# kubectl get svc -n monitoring | grep alertmanager-main
alertmanager-main   ClusterIP   100.101.252.112   <none>   9093/TCP   117d
[root@k8s-master manifests]# kubectl describe svc alertmanager-main -n monitoring
Name:              alertmanager-main
Namespace:         monitoring
Labels:            alertmanager=main
                   app.kubernetes.io/component=alert-router
                   app.kubernetes.io/name=alertmanager
                   app.kubernetes.io/part-of=kube-prometheus
                   app.kubernetes.io/version=0.21.0
Annotations:       <none>
Selector:          alertmanager=main,app.kubernetes.io/component=alert-router,app.kubernetes.io/name=alertmanager,app.kubernetes.io/part-of=kube-prometheus,app=alertmanager
Type:              ClusterIP
IP Families:       <none>
IP:                100.101.252.112
IPs:               100.101.252.112
Port:              web  9093/TCP
TargetPort:        web/TCP
Endpoints:         10.244.129.125:9093,10.244.32.198:9093,10.244.32.254:9093
Session Affinity:  ClientIP
Events:            <none>
That Service is exactly the one generated by alertmanager-service.yaml.
kube-prometheus auto-discovery

- Configure auto-discovery
```yaml
- job_name: "endpoints"
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:                  # relabel before/while scraping
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep                    # keep only Services annotated with prometheus.io/scrape=true
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)   # RE2: + is one or more, ? is zero or one, (?: ) is a non-capturing group
    replacement: $1:$2
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
    replacement: $1
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_service
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod
  - source_labels: [__meta_kubernetes_node_name]
    action: replace
    target_label: kubernetes_node
```
Save the content above as prometheus-additional.yaml.
- Create the Secret
kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
Here `--from-file=prometheus-additional.yaml` is the file generated above.
- Modify prometheus-prometheus.yaml
```yaml
......
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
```
Add this at the end of the file.
- Update prometheus-prometheus.yaml
kubectl apply -f prometheus-prometheus.yaml
Note: on older versions, `kubectl logs -f prometheus-k8s-0 prometheus -n monitoring` may show `forbidden` errors. This is an RBAC permissions problem. Prometheus runs under the ServiceAccount set by `serviceAccountName: prometheus-k8s` in prometheus-prometheus.yaml; tracing that ServiceAccount shows that the role it is bound to is defined in prometheus-clusterRole.yaml.
prometheus-clusterRole.yaml
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.26.0
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
- nonResourceURLs:
  - /metrics
  verbs:
  - get
```
Its resources already include services and pods, so no changes are needed.
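If you still suspect RBAC, a hedged way to test exactly what that ServiceAccount may do in a target namespace:

```sh
kubectl auth can-i list pods -n kube-ops \
  --as=system:serviceaccount:monitoring:prometheus-k8s
kubectl auth can-i list services -n kube-ops \
  --as=system:serviceaccount:monitoring:prometheus-k8s
```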
- Check the targets in the Prometheus web UI
- Auto-discovering a custom Service
```yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: "true"
  labels:
    app: check-secret-tls
    release: prod
  name: check-secret-tls
  namespace: kube-ops
spec:
  ports:
  - name: check-secret-tls
    port: 8090
    protocol: TCP
    targetPort: 8090
  selector:
    app: check-secret-tls
    release: prod
  type: ClusterIP
```
Just add the annotation `prometheus.io/scrape: "true"`.
- Verify that auto-discovery works
Ignore that the newly added target shows as down; that is a problem in my environment that keeps the application from starting.
Note: for developing a custom exporter, see https://note.youdao.com/s/EQ3Ra7MD
- About the metrics path
Some projects expose metrics under a non-default path, which must be specified explicitly.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: chekc-secret-tls
  namespace: monitoring
spec:
  endpoints:
  - interval: 15s
    path: /api/metrics
    port: check-secret-tls
  namespaceSelector:
    any: true
  selector:
    matchLabels:
      app: 'check-secret-tls'
```
- Create a ServiceMonitor that matches the Service through `selector.matchLabels`; a Service matched this way does not need the `prometheus.io/scrape: "true"` annotation.
- The `spec.endpoints.path` field is what sets the metrics path.
- Result
Ignore the down status; this only exercises the Prometheus side.
If you want auto-discovery itself to use a different metrics path for all future projects:
- Modify prometheus-additional.yaml
```yaml
- job_name: "endpoints"
  metrics_path: /api/v2/metrics
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:                  # relabel before/while scraping
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep                    # keep only Services annotated with prometheus.io/scrape=true
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)   # RE2: + is one or more, ? is zero or one, (?: ) is a non-capturing group
    replacement: $1:$2
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
    replacement: $1
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_service
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod
  - source_labels: [__meta_kubernetes_node_name]
    action: replace
    target_label: kubernetes_node
```
The added field is `metrics_path: /api/v2/metrics`; more fields are documented at https://prometheus.io/docs/prometheus/latest/configuration/configuration/. Then repeat the earlier steps of this section: create the Secret --> modify prometheus-prometheus.yaml --> update prometheus-prometheus.yaml.
kube-prometheus custom alerting rules

As noted in the main YAML files section, prometheus-prometheusRule.yaml actually defines alerting rules: its kind is `PrometheusRule` and its metadata.labels are `prometheus: k8s` and `role: alert-rules`. Those two labels are required because prometheus-prometheus.yaml matches rules through `spec.ruleSelector.matchLabels`.
- Create the alerting rule file: customize_rule.yaml
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: customize-rules
  namespace: monitoring
spec:
  groups:
  - name: test
    rules:
    - alert: CustomizeRule
      annotations:
        summary: CustomizeRule summary
        description: Customize Rule use for test
      expr: |
        coredns_forward_request_duration_seconds_bucket >=3000
      for: 3m
      labels:
        severity: warning
```
Create the rule:
kubectl apply -f customize_rule.yaml
Note: you can also simply keep adding rules to prometheus-prometheusRule.yaml.
- Check that the alert was created
At this point the alerting rule is in place; create more according to your own needs.
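Besides the Prometheus web UI, the rule object itself can be checked from the command line:

```sh
kubectl get prometheusrule -n monitoring
kubectl get prometheusrule customize-rules -n monitoring -o yaml
```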
kube-prometheus custom alert channels

- A few commonly used alert channels
  - email
  - webhook
- Look at the Alertmanager web UI
The config shown there is actually what `alertmanager-secret.yaml` defines, stored base64-encoded; for the details, look at the `alertmanager-main` Secret.
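One way to read that live configuration back out (a hedged example; it assumes the default secret name and key used by release-0.8):

```sh
kubectl -n monitoring get secret alertmanager-main \
  -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d
```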
- Custom webhook alert channel
- Define the alert configuration: alertmanager-config.yaml
```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: config-example
  namespace: monitoring
  labels:
    alertmanagerConfig: example
spec:
  route:
    groupBy: ['job']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    receiver: 'webhook'
  receivers:
  - name: 'webhook'
    webhookConfigs:
    - url: 'http://192.168.10.70:8008/api/v1/ping'
```
- Modify alertmanager-alertmanager.yaml
```yaml
......
  configSecret:
  alertmanagerConfigSelector:   # match the labels on the AlertmanagerConfig
    matchLabels:
      alertmanagerConfig: example
```
Add this label selector at the end of the file.
- Apply the modified files
kubectl apply -f alertmanager-config.yaml
kubectl apply -f alertmanager-alertmanager.yaml
- Check the Alertmanager web UI again
- Check the webhook endpoint (/api/v1/ping)
This endpoint is only something I used for testing; the test confirmed that the custom webhook alert channel works (a rough way to poke it yourself is shown below).
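If you do not want to wait for a real alert, a crude smoke test is to POST a minimal Alertmanager-style payload at the receiver yourself (a hedged sketch; the real payload Alertmanager sends contains more fields, and the URL is just the test endpoint from the config above):

```sh
curl -X POST http://192.168.10.70:8008/api/v1/ping \
  -H 'Content-Type: application/json' \
  -d '{"status":"firing","alerts":[{"labels":{"alertname":"CustomizeRule","severity":"warning"}}]}'
```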
- Recommended webhook alerting tools
- More alert channels
- References
Thanos
Advantages
- Global query view
- Long-term storage
- Compatible with Prometheus
Architecture
- Sidecar: connects to Prometheus and reads its data for queries and/or uploads it to cloud object storage
- Store Gateway: serves the metric data stored in object storage
- Compact: downsamples, compacts and cleans up the data in object storage
- Ruler: evaluates recording and alerting rules against the data in Thanos, for exposure and upload
- Query: implements the Prometheus API v1 and aggregates results
- Object Storage: the object store for the metric data
- Frontend: query caching
Integrating Thanos + MinIO + kube-prometheus
MinIO must be deployed in advance; see https://note.youdao.com/s/KyC0xeI3
- Modify prometheus-prometheus.yaml
```yaml
  thanos:
    baseImage: quay.io/thanos/thanos
    version: v0.29.0
    objectStorageConfig:
      key: thanos.yaml
      name: thanos-objstore-config
```
Add this at the end of the file; its main purpose is to attach the Thanos sidecar to the Prometheus pods.
objectStorageConfig: the configuration is provided through a Secret.
- Create the Secret referenced by objectStorageConfig in prometheus-prometheus.yaml
- The thanos-config.yaml file
```yaml
type: s3
config:
  bucket: thanos
  endpoint: 192.168.70.201:9090
  access_key: admin
  secret_key: adminadmin
  insecure: true
  signature_version2: false
  http_config:
    idle_conn_timeout: 90s
    insecure_skip_verify: true
```
- bucket: the bucket name
- access_key: the MinIO login account
- secret_key: the MinIO login password
- insecure: true: use HTTP
- idle_conn_timeout: idle connection timeout
- insecure_skip_verify: true: skip TLS certificate verification
More configuration parameters: https://thanos.io/tip/thanos/storage.md/#s3
- Create the Secret
kubectl -n monitoring create secret generic thanos-objstore-config --from-file=thanos.yaml=./thanos-config.yaml
- After updating prometheus-prometheus.yaml, check the pods
[root@k8s-master manifests]# kubectl get pod -n monitoring | grep prometheus-k8s
prometheus-k8s-0   3/3   Running   1   22h
prometheus-k8s-1   3/3   Running   1   22h
Both pods show 3/3 containers Running; the extra container is the Thanos sidecar:
[root@k8s-master manifests]# kubectl describe pod prometheus-k8s-0 -n monitoring
  thanos-sidecar:
    Container ID:  docker://b8ee9554fc2f4cb37480987a5a65da2c26c8aa16395103fed79c2ea1cdf043b9
    Image:         quay.io/thanos/thanos:v0.29.0
    Image ID:      docker-pullable://quay.io/thanos/thanos@sha256:4766a6caef0d834280fed2d8d059e922bc8781e054ca11f62de058222669d9dd
    Ports:         10902/TCP, 10901/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      sidecar
      --prometheus.url=http://localhost:9090/
      --grpc-address=[$(POD_IP)]:10901
      --http-address=[$(POD_IP)]:10902
      --objstore.config=$(OBJSTORE_CONFIG)
      --tsdb.path=/prometheus
- Download the YAML files needed by kube-thanos
git clone https://github.com/thanos-io/kube-thanos
The required files are in the manifests directory.
- kube-thanos manifest listing
[root@k8s-master manifests]# ll
total 72
-rw-r--r-- 1 root root 2604 Jan 31 09:52 thanos-query-deployment.yaml
-rw-r--r-- 1 root root 285 Nov 3 23:20 thanos-query-serviceAccount.yaml
-rw-r--r-- 1 root root 603 Nov 3 23:20 thanos-query-serviceMonitor.yaml
-rw-r--r-- 1 root root 539 Jan 30 11:19 thanos-query-service.yaml
-rw-r--r-- 1 root root 790 Nov 3 23:20 thanos-receive-ingestor-default-service.yaml
-rw-r--r-- 1 root root 4779 Nov 3 23:20 thanos-receive-ingestor-default-statefulSet.yaml
-rw-r--r-- 1 root root 321 Nov 3 23:20 thanos-receive-ingestor-serviceAccount.yaml
-rw-r--r-- 1 root root 729 Nov 3 23:20 thanos-receive-ingestor-serviceMonitor.yaml
-rw-r--r-- 1 root root 268 Nov 3 23:20 thanos-receive-router-configmap.yaml
-rw-r--r-- 1 root root 2676 Nov 3 23:20 thanos-receive-router-deployment.yaml
-rw-r--r-- 1 root root 308 Nov 3 23:20 thanos-receive-router-serviceAccount.yaml
-rw-r--r-- 1 root root 661 Nov 3 23:20 thanos-receive-router-service.yaml
-rw-r--r-- 1 root root 294 Nov 3 23:20 thanos-store-serviceAccount.yaml
-rw-r--r-- 1 root root 621 Nov 3 23:20 thanos-store-serviceMonitor.yaml
-rw-r--r-- 1 root root 560 Nov 3 23:20 thanos-store-service.yaml
-rw-r--r-- 1 root root 3331 Jan 30 18:10 thanos-store-statefulSet.yaml
- Modify thanos-query-deployment.yaml
```yaml
      containers:
      - args:
        - query
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:9090
        - --log.level=info
        - --log.format=logfmt
        - --query.replica-label=prometheus_replica
        - --query.replica-label=rule_replica
        - --store=dnssrv+prometheus-operated.monitoring.svc.cluster.local:10901
        - --query.auto-downsampling
```
The added line is `--store=dnssrv+prometheus-operated.monitoring.svc.cluster.local:10901`, which reaches the Prometheus Service across namespaces.
- Create the thanos namespace
kubectl create ns thanos
- Apply the query-related YAML
kubectl apply -f thanos-query-deployment.yaml -f thanos-query-serviceAccount.yaml -f thanos-query-serviceMonitor.yaml -f thanos-query-service.yaml
- Change the query Service type: thanos-query-service.yaml
```yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-query
    app.kubernetes.io/name: thanos-query
    app.kubernetes.io/version: v0.29.0
  name: thanos-query
  namespace: thanos
spec:
  type: NodePort
  ports:
  - name: grpc
    port: 10901
    targetPort: 10901
  - name: http
    port: 9090
    targetPort: 9090
  selector:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-query
    app.kubernetes.io/name: thanos-query
```
- Access the Query UI
- Modify thanos-store-statefulSet.yaml
```yaml
  volumeClaimTemplates:
  - metadata:
      labels:
        app.kubernetes.io/component: object-store-gateway
        app.kubernetes.io/instance: thanos-store
        app.kubernetes.io/name: thanos-store
      name: data
    spec:
      storageClassName: 'nfs-storage'
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
```
The change is at the end of the file: `storageClassName: 'nfs-storage'`.
The StorageClass must be created in advance; see https://note.youdao.com/s/IGwRd3gk
Note: because this lab runs Kubernetes v1.20.15, using the NFS-backed StorageClass also requires editing /etc/kubernetes/manifests/kube-apiserver.yaml, otherwise the volume fails to provision; newer Kubernetes releases removed some fields. Add:
`- --feature-gates=RemoveSelfLink=false`
- Apply the store-related YAML
kubectl apply -f thanos-store-serviceAccount.yaml -f thanos-store-serviceMonitor.yaml -f thanos-store-service.yaml -f thanos-store-statefulSet.yaml
- Modify thanos-query-deployment.yaml to connect it to the Store
```yaml
......
      containers:
      - args:
        - query
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:9090
        - --log.level=info
        - --log.format=logfmt
        - --query.replica-label=prometheus_replica
        - --query.replica-label=rule_replica
        - --store=dnssrv+prometheus-operated.monitoring.svc.cluster.local:10901
        - --store=dnssrv+_grpc._tcp.thanos-store.thanos.svc.cluster.local:10901
        - --query.auto-downsampling
        env:
        - name: HOST_IP_ADDRESS
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        image: quay.io/thanos/thanos:v0.29.0
        imagePullPolicy: IfNotPresent
......
```
The added line is `--store=dnssrv+_grpc._tcp.thanos-store.thanos.svc.cluster.local:10901`.
- Apply thanos-query-deployment.yaml
kubectl apply -f thanos-query-deployment.yaml
- Access the Query UI again
- Notes
  - Hooking up additional clusters follows steps similar to the above; note that a Prometheus deployed by kube-prometheus already carries the external_labels label `prometheus_replica`.
Thanos Query
![thanos-querier](https://gitee.com/root_007/md_file_image/raw/master/202302071244144.svg)
When thanos-query serves a metric query, it fans the query out to the stores over the StoreAPI (gRPC).
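Besides the Stores page in the UI, a hedged way to see which stores the Query has discovered is its HTTP API (replace <node-ip>:<node-port> with the NodePort assigned to the thanos-query http port):

```sh
curl -s http://<node-ip>:<node-port>/api/v1/stores
```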
Query and Sidecar
The Sidecar uploads data to object storage
Get in touch
Blog: https://kubesre.com/
WeChat official account: 云原生運維圈