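Introduction
Ganglia: an open-source cluster monitoring project started at UC Berkeley, designed to measure thousands of nodes. Its core consists of gmond, gmetad and a web frontend. It is mainly used to monitor system performance, such as CPU, memory and disk utilization, I/O load, network traffic and system load. The graphs make it easy to see the working state of each node, which helps with tuning and allocating system resources and improving overall system performance.
More importantly, HDFS, YARN, HBase and others already support sending the resource metrics of their daemons to Ganglia for monitoring.
Nagios: an open-source monitoring tool for computer systems and networks. It can monitor the host status of Windows, Linux and Unix machines, network devices such as switches and routers, printers, and so on. Especially useful is that it sends an email or SMS alert to the operations staff as soon as a system or service becomes abnormal, and sends a recovery notification by email or SMS once the state returns to normal.
Our architecture design this time:
- 1, Ganglia's strengths are real-time monitoring data and a rich graphical interface, with good support for mobile clients, but its alerting when problems occur is comparatively weak.
- 2, Nagios's strengths are powerful alerting when a problem occurs and when it recovers, but its real-time monitoring and graphical display are weak, and it handles large clusters poorly.
- 3, We need to monitor the resource usage of the Hadoop cluster (HDFS, YARN) that backs the data platform.
So we combine the three. The architecture is as follows:
Versions used: Ubuntu 16.04 LTS, Ganglia 3.6.1, Nagios 4.1.1, Hadoop 2.7.3
1, Deploy Ganglia:
On the node that will serve the web UI, install: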
sudo apt-get update
sudo apt install apache2 php libapache2-mod-php
sudo apt-get install rrdtool
sudo apt-get install gmetad ganglia-webfrontend
#If a dialog asks whether to restart apache2 during installation, choose yes
On each node to be monitored, install:
sudo apt-get update
sudo apt install php libapache2-mod-php
sudo apt-get install ganglia-monitor
#If a dialog asks whether to restart apache2 during installation, choose yes
The following steps are performed on the master node:
#Copy the Ganglia webfrontend Apache config:
sudo cp /etc/ganglia-webfrontend/apache.conf /etc/apache2/sites-enabled/ganglia.conf
#Edit the gmetad config file
sudo vi /etc/ganglia/gmetad.conf
#Change the data source line data_source "my cluster" localhost to:
data_source "bigdata cluster" 10 wl1:8649 wl2:8649 wl3:8649
setuid_username "nobody"
gridname "bigdata cluster"
case_sensitive_hostnames 1
all_trusted on
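To make the layout of the data_source line explicit, here is a minimal Python sketch that splits it into the cluster name, the polling interval in seconds and the list of host:port sources. The function name parse_data_source is hypothetical and not part of Ganglia. The fallbacks of 15 seconds and port 8649 are the defaults described in the stock gmetad.conf comments, to the best of my knowledge.

```python
import shlex


def parse_data_source(line):
    """Split a gmetad data_source line into (name, interval, hosts)."""
    tokens = shlex.split(line)  # shlex keeps the quoted cluster name together
    if len(tokens) < 3 or tokens[0] != "data_source":
        raise ValueError("not a data_source line: %r" % line)
    name = tokens[1]
    rest = tokens[2:]
    interval = 15  # assumed default when no interval is given
    if rest and rest[0].isdigit():
        interval = int(rest[0])
        rest = rest[1:]
    hosts = []
    for item in rest:
        host, _, port = item.partition(":")
        hosts.append((host, int(port) if port else 8649))
    return name, interval, hosts


if __name__ == "__main__":
    print(parse_data_source(
        'data_source "bigdata cluster" 10 wl1:8649 wl2:8649 wl3:8649'))
```

As far as I understand gmetad, the addresses listed after the interval are alternative sources for the same cluster, so listing all three nodes mainly provides failover.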
#On the master node, run:
sudo ln -s /usr/share/ganglia-webfrontend/ /var/www/ganglia
The following steps are performed on all monitored nodes:
#Edit the gmond config file
sudo vi /etc/ganglia/gmond.conf
globals {
daemonize = yes
setuid = yes
user = ganglia
debug_level = 0
max_udp_msg_len = 1472
mute = no
deaf = no
host_dmax = 0 /*secs */
cleanup_threshold = 300 /*secs */
gexec = no
send_metadata_interval = 10
}
/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
* of a <CLUSTER> tag. If you do not specify a cluster tag, then all <HOSTS> will
* NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
name = "bigdata cluster"
owner = "ganglia"
latlong = "unspecified"
url = "unspecified"
}
/* The host section describes attributes of the host, like the location */
host {
location = "wl1" #each node uses its own hostname
}
/* Feel free to specify as many udp_send_channels as you like. Gmond
used to only support having a single channel */
udp_send_channel {
#mcast_join = 239.2.11.71
host = wl1 #every node points to the gmetad host
port = 8649
ttl = 1
}
/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
#mcast_join = 239.2.11.71
port = 8649
#bind = 239.2.11.71
}
/* You can specify as many tcp_accept_channels as you like to share
an xml description of the state of the cluster */
tcp_accept_channel {
port = 8649
}
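The tcp_accept_channel above is what lets other programs read the cluster state: anything that connects to port 8649 receives an XML dump. As a rough sketch of how that dump can be consumed, the Python below groups metric values by host. The sample XML is invented for illustration and much shorter than real gmond output; the function names are hypothetical.

```python
import socket
import xml.etree.ElementTree as ET

# Invented, heavily shortened sample of the XML that gmond serves.
SAMPLE_XML = """<GANGLIA_XML VERSION="3.6.1" SOURCE="gmond">
<CLUSTER NAME="bigdata cluster" OWNER="ganglia">
<HOST NAME="wl1" IP="10.0.0.1">
<METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=""/>
<METRIC NAME="mem_free" VAL="812345" TYPE="float" UNITS="KB"/>
</HOST>
<HOST NAME="wl2" IP="10.0.0.2">
<METRIC NAME="load_one" VAL="1.10" TYPE="float" UNITS=""/>
</HOST>
</CLUSTER>
</GANGLIA_XML>"""


def metrics_by_host(xml_data):
    """Return {host_name: {metric_name: value_string}} from a gmond dump."""
    root = ET.fromstring(xml_data)
    result = {}
    for host in root.iter("HOST"):
        result[host.get("NAME")] = {
            m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")
        }
    return result


def fetch_xml(server, port=8649, timeout=5.0):
    """Read the full XML dump from a gmond tcp_accept_channel as bytes."""
    chunks = []
    with socket.create_connection((server, port), timeout=timeout) as s:
        while True:
            data = s.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


if __name__ == "__main__":
    # Against a live node this would be: metrics_by_host(fetch_xml("wl1"))
    for name, metrics in sorted(metrics_by_host(SAMPLE_XML).items()):
        print(name, metrics)
```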
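2, Collect HDFS and YARN metrics from the Hadoop cluster:
The following steps are performed on all Hadoop cluster nodes:
#Edit hadoop-metrics2.properties
vi hadoop-2.7.3/etc/hadoop/hadoop-metrics2.properties
#Comment out all of the existing content and add the following: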
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10
*.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both
*.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40
namenode.sink.ganglia.servers=wl1:8649
resourcemanager.sink.ganglia.servers=wl1:8649
datanode.sink.ganglia.servers=wl1:8649
nodemanager.sink.ganglia.servers=wl1:8649
jobhistoryserver.sink.ganglia.servers=wl1:8649
maptask.sink.ganglia.servers=wl1:8649
reducetask.sink.ganglia.servers=wl1:8649
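The per-daemon lines above all follow the same pattern, prefix.sink.ganglia.servers=host:port. If the list of daemons or the gmond address changes often, the lines could be generated rather than typed. The following is only a sketch; ganglia_sink_lines is a hypothetical helper, and it emits just the class, period and servers lines, not the slope and dmax settings.

```python
def ganglia_sink_lines(daemons, server, port=8649, period=10):
    """Build hadoop-metrics2.properties lines for a Ganglia sink."""
    lines = [
        "*.sink.ganglia.class="
        "org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31",
        "*.sink.ganglia.period=%d" % period,
    ]
    # One servers line per Hadoop daemon prefix.
    for daemon in daemons:
        lines.append("%s.sink.ganglia.servers=%s:%d" % (daemon, server, port))
    return lines


if __name__ == "__main__":
    daemons = ["namenode", "resourcemanager", "datanode", "nodemanager",
               "jobhistoryserver", "maptask", "reducetask"]
    print("\n".join(ganglia_sink_lines(daemons, "wl1")))
```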
Restart the Hadoop cluster, then restart gmond, gmetad and gweb:
hadoop-2.7.3/sbin/stop-all.sh
hadoop-2.7.3/sbin/start-all.sh
sudo /etc/init.d/ganglia-monitor restart #all nodes: gmond service
sudo /etc/init.d/gmetad restart #gmetad node: gmetad service
sudo /etc/init.d/apache2 restart #gweb node: web service (includes gweb)
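Then, on the node where gweb is installed, open http://host-IP/ganglia to reach the web UI:
Select a specific node to inspect it; the HDFS and YARN metrics have already been collected:
Select the Mobile tab; the display on mobile devices is well supported:
3, Deploy Nagios:
1, So that Nagios can send alert emails, first install sendmail: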
sudo apt-get install sendmail
sudo apt-get install sendmail-cf
sudo apt-get install mailutils
sudo apt-get install sharutils
#Run this in a terminal:
ps aux |grep sendmail
#Output like the following shows that sendmail is installed and running
root 20978 0.0 0.3 8300 1940 ? Ss 06:34 0:00 sendmail: MTA: accepting connections
root 21711 0.0 0.1 3008 776 pts/0 S+ 06:51 0:00 grep sendmail
Configure sendmail:
#Open the sendmail config file /etc/mail/sendmail.mc
vi /etc/mail/sendmail.mc
#Find the following line:
DAEMON_OPTIONS(`Family=inet, Name=MTA-v4, Port=smtp, Addr=127.0.0.1')dnl
#Change Addr=127.0.0.1 to Addr=0.0.0.0 so that the daemon listens on all interfaces rather than only on loopback.
DAEMON_OPTIONS(`Family=inet, Name=MTA-v4, Port=smtp, Addr=0.0.0.0')dnl
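#Generate the new config file:
cd /etc/mail
mv sendmail.cf sendmail.cf~ #make a backup
m4 sendmail.mc > sendmail.cf #note the spaces on both sides of >
#Edit sendmail.cf
vi /etc/mail/sendmail.cf
#Add the following line (note the trailing dot):
Dj$w.
#Edit hosts; otherwise sending mail is very slow, because sendmail
#appends wl1 as the domain to the hostname wl1 to form the full name wl1.wl1,
#and then reports that the domain cannot be found
vi /etc/hosts
x.x.x.x wl1 wl1.localdomain wl1.wl1
#Restart the sendmail service:
service sendmail restart
#Send a test mail and check that it arrives: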
echo "test" | mail -s test xxx@xxx.com
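The same smoke test can be scripted, which is handy when checking several nodes. The sketch below builds a test message and hands it to the local MTA over SMTP. It assumes sendmail is listening on localhost port 25 as configured above; the function names and example addresses are made up for illustration.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender, recipient, subject="test", body="test"):
    """Create a plain-text message equivalent to the mail command above."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_via_local_mta(msg, host="localhost", port=25):
    """Hand the message to the local sendmail daemon (assumed to be running)."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    # Placeholder addresses; replace them before running.
    message = build_test_message("nagios@wl1.localdomain", "ops@example.com")
    send_via_local_mta(message)
```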
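2, Install Nagios:
Follow the guide "Installing Nagios Core on Ubuntu 16.04".
At the step that downloads the Nagios plugins, nagios-plugins-2.1.1 is very slow to download from the official site, so download it from the link below first and then compile and install it:
nagios-plugins-2.1.1 download
Log in at http://host-IP/nagios/ with the username nagiosadmin and the password set during installation; the home page then appears:
3, The main point: how to have Nagios monitor Ganglia data and raise alerts based on thresholds:
#Create a new plugin, check_ganglia.py, for monitoring Ganglia
cd /usr/local/nagios/libexec
vi check_ganglia.py #with the following content: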
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
# Nagios plugin: read one metric of one host from gmetad's XML port and
# compare it against warning/critical thresholds.
import sys
import getopt
import socket
import xml.parsers.expat


class GParser:
    def __init__(self, host, metric):
        self.inhost = 0
        self.inmetric = 0
        self.value = None
        self.host = host
        self.metric = metric

    def parse(self, file):
        p = xml.parsers.expat.ParserCreate()
        # Bind the handlers to this instance instead of a global object.
        p.StartElementHandler = self.start_element
        p.EndElementHandler = self.end_element
        p.ParseFile(file)
        if self.value is None:
            raise Exception('Host/value not found')
        return float(self.value)

    def start_element(self, name, attrs):
        if name == "HOST":
            if attrs["NAME"] == self.host:
                self.inhost = 1
        elif self.inhost == 1 and name == "METRIC" and attrs["NAME"] == self.metric:
            self.value = attrs["VAL"]

    def end_element(self, name):
        if name == "HOST" and self.inhost == 1:
            self.inhost = 0


def usage():
    print("""Usage: check_ganglia \
-h|--host= -m|--metric= -w|--warning= \
-c|--critical= [-o|--opposite=] [-s|--server=] [-p|--port=] """)
    sys.exit(3)


if __name__ == "__main__":
    ##############################################################
    ganglia_host = 'x.x.x.x'  # change this to the IP of your gmetad host
    ganglia_port = 8651
    host = None
    metric = None
    warning = None
    critical = None
    # Extra parameter: invert the comparison, i.e. alert when the actual
    # value is less than or equal to the threshold.
    opposite = 0
    try:
        options, args = getopt.getopt(
            sys.argv[1:],
            "h:m:w:c:o:s:p:",
            ["host=", "metric=", "warning=", "critical=", "opposite=", "server=", "port="],
        )
    except getopt.GetoptError as err:
        print("check_ganglia: %s" % err)
        usage()
        sys.exit(3)
    for o, a in options:
        if o in ("-h", "--host"):
            host = a
        elif o in ("-m", "--metric"):
            metric = a
        elif o in ("-w", "--warning"):
            warning = float(a)
        elif o in ("-c", "--critical"):
            critical = float(a)
        elif o in ("-o", "--opposite"):
            opposite = int(a)
        elif o in ("-p", "--port"):
            ganglia_port = int(a)
        elif o in ("-s", "--server"):
            ganglia_host = a
    if critical is None or warning is None or metric is None or host is None:
        usage()
        sys.exit(3)
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((ganglia_host, ganglia_port))
        parser = GParser(host, metric)
        # Binary mode so that expat receives bytes on Python 2 and Python 3.
        value = parser.parse(s.makefile("rb"))
        s.close()
    except Exception as err:
        #import pdb
        #pdb.set_trace()
        print("CHECKGANGLIA UNKNOWN: Error while getting value \"%s\"" % (err))
        sys.exit(3)
    if opposite == 1:  # 1 means inverted comparison, 0 means normal comparison
        if value <= critical:
            print("CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value))
            sys.exit(2)
        elif value <= warning:
            print("CHECKGANGLIA WARNING: %s is %.2f" % (metric, value))
            sys.exit(1)
        else:
            print("CHECKGANGLIA OK: %s is %.2f" % (metric, value))
            sys.exit(0)
    else:
        if value >= critical:
            print("CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value))
            sys.exit(2)
        elif value >= warning:
            print("CHECKGANGLIA WARNING: %s is %.2f" % (metric, value))
            sys.exit(1)
        else:
            print("CHECKGANGLIA OK: %s is %.2f" % (metric, value))
            sys.exit(0)
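The threshold logic at the end of the plugin is easier to reason about when isolated from the socket and option handling. The sketch below restates it as a pure function; evaluate is a hypothetical name, and the return codes follow the Nagios plugin convention the script already uses (0 OK, 1 WARNING, 2 CRITICAL).

```python
def evaluate(value, warning, critical, opposite=False):
    """Return (exit_code, label) for a metric value.

    opposite=False: alert when the value is at or above a threshold
    (for example load_one). opposite=True: alert when the value is at or
    below a threshold (for example mem_free).
    """
    if opposite:
        if value <= critical:
            return 2, "CRITICAL"
        if value <= warning:
            return 1, "WARNING"
    else:
        if value >= critical:
            return 2, "CRITICAL"
        if value >= warning:
            return 1, "WARNING"
    return 0, "OK"


if __name__ == "__main__":
    print(evaluate(4.5, warning=4, critical=5))                  # load-style check
    print(evaluate(30, warning=200, critical=50, opposite=True)) # free-memory-style check
```

Note that with opposite set, the warning threshold must be larger than the critical one, otherwise the WARNING state can never be reached.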
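Make the script readable and executable:
chmod 755 check_ganglia.py
Create a new file in the following directory. (Note: it is best not to put comments in it; they may break it, and I have not had time to work out why.)
#Create services.cfg under /usr/local/nagios/etc/objects/
cd /usr/local/nagios/etc/objects/
vi services.cfg #with the following content: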
define host {
use linux-server
host_name wl1
address x.x.x.1
}
define host {
use linux-server
host_name wl2
address x.x.x.2
}
define host {
use linux-server
host_name wl3
address x.x.x.3
}
define hostgroup {
hostgroup_name ganglia-servers
alias nagios server
members *
}
define servicegroup {
servicegroup_name ganglia-metrics
alias Ganglia Metrics
}
define command {
command_name check_ganglia
command_line $USER1$/check_ganglia.py -h $HOSTNAME$ -m $ARG1$ -w $ARG2$ -c $ARG3$ -o $ARG4$
}
define service {
use generic-service
name ganglia-service
hostgroup_name ganglia-servers
service_groups ganglia-metrics
notifications_enabled 1
notification_interval 10
register 0
}
define service{
use ganglia-service
service_description mem_free
check_command check_ganglia!mem_free!200!50!1
contact_groups admins
}
define service{
use ganglia-service
service_description load_one
check_command check_ganglia!load_one!4!5!0
contact_groups admins
}
define service{
use ganglia-service
service_description disk_free
check_command check_ganglia!disk_free!50!40!1
contact_groups admins
}
define service{
use ganglia-service
service_description yarn.NodeManagerMetrics.AvailableGB
check_command check_ganglia!yarn.NodeManagerMetrics.AvailableGB!8!4!1
contact_groups admins
}
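Since every service definition above differs only in its description and its check_ganglia arguments, the blocks can be produced from a small table. This is a sketch only: service_block is a hypothetical helper, and it reuses the ganglia-service template and admins contact group defined in the file above.

```python
def service_block(description, metric, warning, critical, opposite):
    """Render one Nagios service definition that calls check_ganglia."""
    check = "check_ganglia!%s!%s!%s!%d" % (
        metric, warning, critical, 1 if opposite else 0)
    return "\n".join([
        "define service{",
        "use ganglia-service",
        "service_description %s" % description,
        "check_command %s" % check,
        "contact_groups admins",
        "}",
    ])


if __name__ == "__main__":
    # (description, metric, warning, critical, opposite)
    checks = [
        ("mem_free", "mem_free", 200, 50, True),
        ("load_one", "load_one", 4, 5, False),
    ]
    print("\n".join(service_block(*c) for c in checks))
```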
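Note that this services.cfg file is what makes Nagios pull data from Ganglia automatically. The more Ganglia metrics you define in it, the more Nagios displays. What I show here is only a template with a few simple metrics; add more as needed.
Set the ownership and permissions of the config file: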
chown nagios:nagios services.cfg
chmod 664 services.cfg
Edit the main Nagios config file:
vi /usr/local/nagios/etc/nagios.cfg
#cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
#add by wangliang for ganglia
cfg_file=/usr/local/nagios/etc/objects/services.cfg
Edit the configuration related to sending alert emails:
vi /usr/local/nagios/etc/objects/commands.cfg
#Replace /bin/mail with mail
# 'notify-host-by-email' command definition
define command{
command_name notify-host-by-email
command_line /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
}
# 'notify-service-by-email' command definition
define command{
command_name notify-service-by-email
command_line /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" | mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
}
#Set the email addresses of the recipients:
vi /usr/local/nagios/etc/objects/contacts.cfg
###############################################################################
# CONTACTS.CFG - SAMPLE CONTACT/CONTACTGROUP DEFINITIONS
#
#
# NOTES: This config file provides you with some example contact and contact
# group definitions that you can reference in host and service
# definitions.
#
# You don't need to keep these definitions in a separate file from your
# other object definitions. This has been done just to make things
# easier to understand.
#
###############################################################################
###############################################################################
###############################################################################
#
# CONTACTS
#
###############################################################################
###############################################################################
# Just one contact defined by default - the Nagios admin (that's you)
# This contact definition inherits a lot of default values from the 'generic-contact'
# template which is defined elsewhere.
define contact{
contact_name nagiosadmin ; Short name of user
use generic-contact ; Inherit default values from generic-contact template (defined above)
alias Nagios Admin ; Full name of user
email xxx1@xxx.com ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
}
define contact{
contact_name nagiosadmin2 ; Short name of user
use generic-contact ; Inherit default values from generic-contact template (defined above)
alias Nagios Admin2 ; Full name of user
email xxx2@xxx.com ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
}
define contact{
contact_name nagiosadmin3 ; Short name of user
use generic-contact ; Inherit default values from generic-contact template (defined above)
alias Nagios Admin3 ; Full name of user
email xxx3@xxx.com ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
}
###############################################################################
###############################################################################
#
# CONTACT GROUPS
#
###############################################################################
###############################################################################
# We only have one contact in this simple configuration file, so there is
# no need to create more than one contact group.
define contactgroup{
contactgroup_name admins
alias Nagios Administrators
members *
}
Use the following command to verify that the changes are valid:
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
Total Warnings: 0
Total Errors: 0
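When the verification runs from a deployment script, the two summary lines can be checked automatically before Nagios is restarted. The sketch below only parses text shaped like the summary shown above; parse_verify_summary is a hypothetical name, and the commented-out call uses the binary and config paths from this article.

```python
import re
import subprocess


def parse_verify_summary(output):
    """Extract the warning and error counts from `nagios -v` output."""
    counts = {}
    for label in ("Warnings", "Errors"):
        match = re.search(r"Total %s:\s*(\d+)" % label, output)
        if match is None:
            raise ValueError("summary line for %s not found" % label)
        counts[label.lower()] = int(match.group(1))
    return counts


def verify_config(nagios_bin="/usr/local/nagios/bin/nagios",
                  cfg="/usr/local/nagios/etc/nagios.cfg"):
    """Run the verification and return the parsed counts."""
    proc = subprocess.run([nagios_bin, "-v", cfg],
                          stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    return parse_verify_summary(proc.stdout.decode("utf-8", "replace"))


if __name__ == "__main__":
    print(parse_verify_summary("Total Warnings: 0\nTotal Errors: 0\n"))
    # On the Nagios node: print(verify_config())
```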
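Restart the services in order:
sudo /etc/init.d/ganglia-monitor restart #all nodes
sudo /etc/init.d/gmetad restart #gmetad node
sudo /etc/init.d/apache2 restart #gweb node
service nagios restart #nagios node
service sendmail restart #nagios node
The final result looks like this:
(Nagios collects data somewhat slowly, so the service status sometimes shows unknown or pending for a short while. It clears up after a moment, so there is no need to worry.)
A sample of the alert email received:
The next step is to build these components into Docker images and schedule them with k8s. I won't go into the details here; my earlier articles cover what is needed.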
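Notes:
- 1, If you want the Ganglia web UI to show each node's hostname, configure the IP-to-hostname mappings in /etc/hosts on the gmetad node beforehand. When Ganglia receives data from a node, it first looks up the hostname for that IP in hosts. If there is no entry, the data is stored in rrd under the IP; if there is one, it is stored under the name found. The web UI displays whichever name or IP is recorded in rrd.
- 2, If you previously displayed nodes by IP and later want hostnames, clear the rrd contents first; the same applies in the other direction.
- 3, Remember that the rrd directory permissions should be:
drwxr-xr-x nobody nogroup rrds/
Otherwise the web page reports that the connection was refused.
- 4, Some corporate mail servers only accept authenticated ESMTP submissions, so alert mail sent by the sendmail setup described in this article may never reach the mailbox. I later switched to the sendEmail tool, which can authenticate against such a server, using the following format:
sendEmail -f xxx@xxx.com -t xxx@xxx.com -s smtp.exmail.qq.com -xu xxx@xxxx.com -xp xxx -m "test"
Testing with this command, the mail is sent successfully.
Installation is simple; see here.
The notification commands in Nagios's /usr/local/nagios/etc/objects/commands.cfg need to be changed to match: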
define command{
command_name notify-host-by-email
command_line /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /usr/local/bin/sendEmail -f xxx@xxx.com -t $CONTACTEMAIL$ -s smtp.exmail.qq.com -u "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" -xu xxxx@xxx.com -xp xxxx
}
# 'notify-service-by-email' command definition
define command{
command_name notify-service-by-email
command_line /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" | /usr/local/bin/sendEmail -f xxx@xxx.com -t $CONTACTEMAIL$ -s smtp.exmail.qq.com -u "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" -xu xxx@xxx.com -xp xxx
}
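The two notification commands pass the same set of sendEmail options and pipe the message body in on standard input. As an illustration of that structure, the sketch below assembles the same argument list in Python; send_email_argv and send_alert are hypothetical helpers, the flags are the ones used in the commands above, and the addresses and credentials in the example are placeholders.

```python
import subprocess


def send_email_argv(sender, recipient, smtp_server, subject, user, password,
                    binary="/usr/local/bin/sendEmail"):
    """Build the sendEmail argument list used by the Nagios commands above."""
    return [
        binary,
        "-f", sender,        # envelope sender
        "-t", recipient,     # recipient ($CONTACTEMAIL$ in Nagios)
        "-s", smtp_server,   # SMTP server to submit to
        "-u", subject,       # subject line
        "-xu", user,         # SMTP auth username
        "-xp", password,     # SMTP auth password
    ]


def send_alert(argv, body):
    """Run sendEmail with the message body on stdin, as the printf pipe does."""
    subprocess.run(argv, input=body.encode("utf-8"), check=True)


if __name__ == "__main__":
    argv = send_email_argv("alerts@example.com", "ops@example.com",
                           "smtp.exmail.qq.com", "** PROBLEM Host Alert **",
                           "alerts@example.com", "secret")
    print(argv)
    # send_alert(argv, "***** Nagios *****\n")  # requires sendEmail installed
```

Passing the arguments as a list avoids shell quoting problems with subjects that contain spaces or asterisks.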