Learning to Add Hosts & Services for Monitoring

Hello everyone. Today we'll look at how to add hosts and services so that we can use Nagios to monitor our network. First, we need to understand what Nagios can do and what its main objects, such as host, service and command, actually are, because the steps I describe below only cover a simple setup. For the finer details, I recommend reading the Nagios documentation.

After installing Nagios and the Nagios plugins, check that /usr/local/nagios contains the following subdirectories:
bin etc libexec sbin share var

We start with the configuration in /usr/local/nagios/etc. The first important file is nagios.cfg, the main configuration file. Among other things, it tells Nagios where to find the object definitions: for Nagios to read the definitions stored in a given file, add that file's path to nagios.cfg with a cfg_file line, for example:

cfg_file=/usr/local/nagios/etc/hosts.cfg
cfg_file=/usr/local/nagios/etc/services.cfg
cfg_file=/usr/local/nagios/etc/checkcommands.cfg
cfg_file=/usr/local/nagios/etc/misccommands.cfg
cfg_file=/usr/local/nagios/etc/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/contacts.cfg
cfg_file=/usr/local/nagios/etc/contactgroups.cfg

If the objects are not defined completely, running the verify command
$ /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

prints error or warning messages, and you need to go back and fix each one. I recommend reading Main Configuration File Options in the Nagios documentation ( http://nagios.sourceforge.net/docs/2_0/configmain.html ) to get a rough idea of the principles before you edit anything. Strictly speaking, all definitions could live in a single file, but it is better to split them across several files so they are easier to maintain. You can put further definitions in additional files as you see fit. Whenever you add an object definition to a file, check the template for that object type first.
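
Registering an additional definition file might look like the sketch below. The file name hostgroups.cfg and the directory name conf.d are made up for illustration. cfg_dir is a nagios.cfg option that makes Nagios read every file ending in .cfg inside a directory.

```cfg
# nagios.cfg (excerpt)

# Hypothetical extra file holding hostgroup definitions
cfg_file=/usr/local/nagios/etc/hostgroups.cfg

# Alternatively, read every *.cfg file found in one directory.
# The directory name "conf.d" is an assumption, not a Nagios default.
cfg_dir=/usr/local/nagios/etc/conf.d
```

After a change like this, it is worth running the verify command again before restarting Nagios.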

Example 1
Suppose we want to monitor Google's web server. We can add a host object definition to hosts.cfg, for example:

# 'Google' host definition
define host{
host_name Google
alias Google Server
address www.google.com
max_check_attempts 1
check_period 24x7
notification_interval 120
notification_period 24x7
notification_options d,u,r
contact_groups admins
}

The definition above shows the required directives. The other directives listed in the object definition format can be added as needed (see Template-Based Object Configuration in the Nagios documentation, http://nagios.sourceforge.net/docs/2_0/xodtemplate.html ). Once we have named the host to monitor, the next step is to decide which of its services to monitor, such as HTTP, FTP, SMTP, SNMP and so on. Continuing the example, if we want to know whether Google can be reached over HTTP, we add a service object definition to services.cfg that sets up an HTTP check for the host Google:

define service{
host_name Google
service_description HTTP
check_command check_http
max_check_attempts 5
normal_check_interval 5
retry_check_interval 3
check_period 24x7
notification_interval 30
notification_period 24x7
notification_options w,c,r
contact_groups admins
}

In this example, the check_command directive uses the command check_http. That command object is in turn defined in checkcommands.cfg:

# 'check_http' command definition
define command{
command_name check_http
command_line $USER1$/check_http -H $HOSTADDRESS$
}

The check_http command simply calls the plugin named check_http, which is installed automatically into /usr/local/nagios/libexec when we install the Nagios plugins. For a host object definition, the check_command directive is optional. When it is present, it is usually a command that checks whether the host can still be reached (see Example 2 below).
Some of the other required directives set the timing of service checks (the default unit is minutes), for example:

max_check_attempts is the number of times Nagios checks the service after a check returns a non-OK state before it accepts that state as real. With a value of 1, Nagios alerts without retrying.

normal_check_interval is how long to wait before the next check when the previous check left the service in an OK state, or in a non-OK state that has already been checked max_check_attempts times.

retry_check_interval is how long to wait before re-checking the service when the previous check returned a non-OK state and max_check_attempts has not been reached yet.

notification_interval is how long to wait before notifying the administrators again when the previous check was non-OK and the current check is still non-OK. This value should be greater than or equal to normal_check_interval.
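
To make the interplay of these directives concrete, here are the timing values from the Example 1 service with a rough timeline written out as comments. The walk-through assumes the default interval_length of 60 seconds (so one unit is one minute) and a service that stays non-OK after its first failed check. The exact minute of the repeat notification depends on when checks actually run, so treat the figures as approximate.

```cfg
# Timing directives from the Example 1 service
max_check_attempts       5
normal_check_interval    5
retry_check_interval     3
notification_interval    30

# Approximate timeline for a service that fails and stays non-OK:
#   t = 0 min        first non-OK result (attempt 1 of 5), no notification yet
#   t = 3, 6, 9 min  attempts 2, 3 and 4, spaced by retry_check_interval
#   t = 12 min       attempt 5 is still non-OK: problem confirmed, first notification
#   t = 17, 22, ...  checks continue every normal_check_interval (5 min)
#   t = 42 min       still non-OK: notification repeated after notification_interval
```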

Example 2
Now let's look at another example of a command object definition, this time adding some options to an existing plugin. Suppose we want to check the status of a router whose host definition looks like this:

define host{
host_name Router1
alias My Router #1
address 192.168.1.254
parents server-backbone
check_command check-host-alive
max_check_attempts 5
check_period 24x7
process_perf_data 0
retain_nonstatus_information 0
contact_groups admins
notification_interval 30
notification_period 24x7
notification_options d,u,r
}

Suppose we want to define a command called check-host-alive that checks whether the host is UP or DOWN by pinging it, with a few specific values, for example:

define command{
command_name check-host-alive
command_line $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
}

As you can see, this command calls the plugin named check_ping with the extra options we want: the warning level (-w switch), the critical level (-c switch), and the number of ICMP ECHO packets to send (-p switch). You should read the usage of each plugin in the libexec directory so that you can set the options to suit your purpose. To see the usage, run the plugin with the -h switch, for example:

$ ./check_ping -h
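
For check_ping, each threshold is a pair: the round-trip average in milliseconds, then the packet loss in percent. So -w 3000.0,80% warns at a 3000 ms average or 80% loss. As a sketch of how the same plugin could back an ordinary service check with per-service thresholds, the definitions below pass the thresholds in as arguments: the values after each "!" in check_command fill $ARG1$ and $ARG2$. The command name check_ping, the threshold values and the service description are illustrative choices, not taken from the setup above.

```cfg
# checkcommands.cfg: thresholds are supplied by the service via $ARG1$ and $ARG2$
define command{
    command_name    check_ping
    command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
    }

# services.cfg: warn at 100 ms or 20% loss, critical at 500 ms or 60% loss
define service{
    host_name                Router1
    service_description      PING
    check_command            check_ping!100.0,20%!500.0,60%
    max_check_attempts       5
    normal_check_interval    5
    retry_check_interval     1
    check_period             24x7
    notification_interval    30
    notification_period      24x7
    notification_options     w,c,r
    contact_groups           admins
    }
```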

Example 3
Let's look at another example, this time using a plugin-based command in a service object definition. We start with the command check_local_disk, which calls the plugin check_disk to check a hard disk:

# 'check_local_disk' command definition
define command{
command_name check_local_disk
command_line $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
}

If we look at this plugin's usage (type ./check_disk -h), we find that the -w and -c switches play the same role as in check_ping: they are the warning and critical thresholds. The last switch, -p, is the path or partition to check. Suppose it is the partition /dev/hda1 on localhost, defined as follows:

# Service definition: check /dev/hda1

define service{
use generic-service
host_name localhost
service_description hda1 Free Space
check_command check_local_disk!20%!10%!/dev/hda1
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
contact_groups admins
notification_interval 120
notification_period 24x7
notification_options w,u,c,r
}

As you can see, the check_command directive calls check_local_disk with its arguments separated by "!": the values 20%, 10% and /dev/hda1 fill $ARG1$, $ARG2$ and $ARG3$ in the command definition. The service goes to WARNING when free space on /dev/hda1 drops below 20%, and to CRITICAL when it drops below 10%.
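
Because the thresholds and the partition are arguments, the same command can serve other partitions without a new command definition. Here is a sketch for the root filesystem. The path /, the threshold values and the service description are assumptions chosen for illustration.

```cfg
# Reuses check_local_disk with different arguments (values are illustrative)
define service{
    use                      generic-service
    host_name                localhost
    service_description      Root Partition Free Space
    check_command            check_local_disk!15%!5%!/
    check_period             24x7
    max_check_attempts       3
    normal_check_interval    5
    retry_check_interval     1
    contact_groups           admins
    notification_interval    120
    notification_period      24x7
    notification_options     w,u,c,r
    }
```
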
Objects that do similar jobs or are related can be grouped. Hosts or services that belong together can be combined into an object called a hostgroup or a servicegroup, and the CGI display then lets us view the status of each group.

For example:

define servicegroup{
servicegroup_name dbservices
alias Database Services
members dbserver,SQL Server,dbserver,SQL Server Agent,dbserver,SQL DTC ; host,service pairs ("dbserver" is a placeholder host name)
}

define hostgroup{
hostgroup_name 3Com-servers
alias 3Com Servers
members Server1, Server2, Server3
}

Contact groups, which group the administrators responsible for the system, can be built in a similar way (details follow below). Once we know what to monitor (by defining hosts or services) and how (by calling existing plugins or other commands), the next step is to set the time values used for checking. We begin with time periods, which specify when host or service checks should run, when host or service notifications may be sent, and when notifications may be sent to the people responsible (covered under Contacts below).

For example, in timeperiods.cfg we can add definitions for the time periods 24x7 and workhours as follows:

define timeperiod{
timeperiod_name 24x7
alias 24 Hours A Day, 7 Days A Week
sunday 00:00-24:00
monday 00:00-24:00
tuesday 00:00-24:00
wednesday 00:00-24:00
thursday 00:00-24:00
friday 00:00-24:00
saturday 00:00-24:00
}

define timeperiod{
timeperiod_name workhours
alias Work Hours
monday 09:00-17:00
tuesday 09:00-17:00
wednesday 09:00-17:00
thursday 09:00-17:00
friday 09:00-17:00
}

To use these time periods, name them in the check_period directive of a host or service object definition. We define time periods other than 24x7 because not everything needs to be checked around the clock (checking around the clock is the Nagios default when no period is specified), so we can state exactly the period we want. For example, there is no need to monitor a printer outside working hours or on Saturdays and Sundays. Important equipment such as servers and services, however, should be checked 24 hours a day. If something happens while we are not checking, say a mail server goes DOWN and later comes back UP, we would never know, because no notification about the host recovery would ever reach us.
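
As a sketch of the printer case, the host below is checked and notified on only during the workhours period defined earlier. The host name, alias and IP address are placeholders made up for this illustration. The check command is the check-host-alive command from Example 2.

```cfg
# hosts.cfg: a printer that is only monitored during working hours
define host{
    host_name                Printer1          ; hypothetical host name
    alias                    Office Printer
    address                  192.168.1.50      ; placeholder address
    check_command            check-host-alive
    max_check_attempts       5
    check_period             workhours
    contact_groups           admins
    notification_interval    120
    notification_period      workhours
    notification_options     d,u,r
    }
```
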
Once the time periods are defined, the next question is how to handle notifications, that is, how the administrators are told when something happens. The following example defines an administrator, stored in contacts.cfg:

define contact{
contact_name Jdoe
alias John Doe
contactgroups admins
host_notification_period 24x7
service_notification_period 24x7
host_notification_options d,u,r
service_notification_options w,u,c,r
host_notification_commands host-notify-by-email
service_notification_commands service-notify-by-email
email jdoe@xyz.com
pager 555-5555@pagergateway.localhost.localdomain
address1 jdoe@yahoo.com
address2 jdoe@hotmail.com
address3 08-989-898-89
}

In the example above, some directives such as email, pager and address1 to address6 (up to six additional addresses) are optional. The contact group named in the contactgroups directive can be defined as shown below. If the system has several administrators or responsible people, we can add further contact_name values to the group.
Suppose we create a file called contactgroups.cfg to hold the groups of administrators:

define contactgroup{
contactgroup_name admins
alias Nagios Administrators
members Jdoe, Bobby, Rose
}

Once we know who receives the notifications, the next question is how to send them. For example, the host_notification_commands directive uses the command host-notify-by-email to send email to the administrator. Suppose we create misccommands.cfg to hold the definitions of notification commands and other commands that are not the basic plugin-based check commands (those live in checkcommands.cfg). The command host-notify-by-email can then be defined like this:

define command{
command_name host-notify-by-email
command_line /usr/bin/printf "%b" "*** Nagios ***\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /bin/mail -s "Host $HOSTSTATE$ alert for $HOSTNAME$!" $CONTACTEMAIL$
}
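
The contact definition earlier also names service-notify-by-email, which is not shown in this article. A sketch modelled on the host command follows. The message layout and subject line are my own choices, and the service macros used ($SERVICEDESC$, $SERVICESTATE$, $SERVICEOUTPUT$) are the service counterparts of the host macros above. The whole command_line has to stay on one line.

```cfg
# misccommands.cfg: service counterpart of host-notify-by-email (layout is illustrative)
define command{
    command_name    service-notify-by-email
    command_line    /usr/bin/printf "%b" "*** Nagios ***\n\nNotification Type: $NOTIFICATIONTYPE$\nService: $SERVICEDESC$\nHost: $HOSTNAME$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\nInfo: $SERVICEOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /bin/mail -s "Service $SERVICESTATE$ alert for $HOSTNAME$/$SERVICEDESC$!" $CONTACTEMAIL$
    }
```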

The idea is simple: printf builds a text containing the key alert information, and mail sends it to the email address given in the contact.

To configure and interpret notifications effectively, you should understand the states of hosts and services. In the examples above, the host notification_options select which host states trigger a notification: d, u, r and n stand for DOWN, UNREACHABLE, host recovery and no notifications. That part is straightforward. For services, notification_options takes w, u, c, r, f and n, which stand for WARNING, UNKNOWN, CRITICAL, service recovery, flapping start/stop and no notifications. The thresholds that decide when a service is WARNING and when it is CRITICAL come from the values passed to the plugin (the -w and -c switches). So before you call a command, check the details of the plugin to see which values must be set and what they should be to match your needs.
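
As a sketch of how these option letters narrow what a person receives, the contact below is told about hosts only when they go DOWN or recover, and about services only when they go CRITICAL or recover, and then only during working hours. The contact name and email address are placeholders invented for this illustration.

```cfg
# contacts.cfg: a contact with narrowed notification options
define contact{
    contact_name                     oncall               ; hypothetical contact
    alias                            On-Call Engineer
    contactgroups                    admins
    host_notification_period         24x7
    service_notification_period      workhours
    host_notification_options        d,r                  ; DOWN and recovery only
    service_notification_options     c,r                  ; CRITICAL and recovery only
    host_notification_commands       host-notify-by-email
    service_notification_commands    service-notify-by-email
    email                            oncall@example.com   ; placeholder address
    }
```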

Sources
· Website: http://wiki.nectec.or.th/ntl/Project/Nagios_Configuration
· Website: http://www.nagios.org/docs
· Website: http://nagios.sourceforge.net/docs/2_0/configmain.html
