OIT System Monitoring
Scope: OIT
Type: Guideline
Version: 2022-legacy
Goal
TBD
Ownership
Direct questions to the Owner: TBD email redacted
Requests for resources to comply with this standard should be directed to the Executive sponsor: TBD email redacted
Timeline & Enforcement
TBD
Exception Process
TBD
Terminology
- The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.
Requirements
Original text
- OIT business applications, services, processes, and infrastructure should be continually monitored at the discretion of the service owner.
- This includes all environments (staging, integration, pre-production, etc.) that are required for ongoing development.
- Conditions that might lead to outages (e.g., high CPU or disk utilization, hardware failure) should alert support staff as soon as possible during the application’s normal operating hours.
- Thresholds for alerting should be maintained and adjusted as necessary so that every alert can be trusted to be important and to require a response from support staff.
- OIT standard “health check” tools that test availability and basic functionality should be employed to monitor web application performance over time.
- Monitoring notification and response times must be coordinated with application/service owners to ensure that service uptime requirements are met.
- Server resources will be monitored by the responsible system administrators to ensure that capacity and performance are within set limits for vital resources including, but not limited to:
- CPU, disk, I/O, memory, network, and any other critical component specific to the server.
- Only OIT-supported applications, services, processes, and infrastructure will be monitored by OIT monitoring solutions.
- Automated notifications sent through email, SMS, Slack, or another appropriate mechanism will alert support staff when conditions fall outside defined thresholds.
- ITIL framework event management notification categories should be used for classification of events:
- Informational (INFO): the event does not require any immediate action and does not represent an exception.
- Informational events are recorded in the log files and retained for a predetermined period.
- This type of event is used to check the status of a device or service, to confirm the state of an activity, or to generate statistics (e.g., user login, batch job completed, device power-up, number of users logged into an application).
- Warning (WARN/ALERT): the event is generated when a device or service (application or utility) is approaching an agreed threshold (KPI).
- Warnings notify the responsible group, process, or tool so that it can take the necessary action to prevent an exception from occurring.
- Exception (ERROR): the service or device is currently operating below its predefined normal parameters or indicators.
- This means that the business service is impacted and the device or service presents a failure, performance degradation, or loss of functionality (e.g., web server down, CS coverage lost for several sites).
- A device failure is an error.
- Automated alerting should be disabled during scheduled maintenance periods, for example by using “maintenance windows” or a similar mechanism that ensures alerts are re-enabled automatically, without manual intervention.
- Where possible, significant events should generate a ServiceNow incident for the initial event notification and for escalation notifications.
- The OIT Help Desk will be informed of any disruption in services to our clients and updated periodically on resolution progress.
- The OIT Help Desk should be included in automated reporting and email notifications that indicate applications or services are no longer usable by our clients.
- The email redacted email list will be used to disseminate information on system/service outages and resolution progress
- The OIT Change Management Policy should be followed for planned outages and upgrades where clients might be impacted.
- Campus-wide services will have client-facing health and uptime reporting via the OIT website.
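As an illustration only, the event categories and the maintenance-window requirement above could be sketched as follows. This is not an OIT tool or a prescribed implementation: the names `Threshold`, `MaintenanceWindow`, `classify`, and `should_notify` are hypothetical, and the 80/95 percent limits are placeholder values that a service owner would set. The sketch assumes a metric where higher values are worse (e.g., disk utilization).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Iterable


class Severity(Enum):
    """The three event categories listed in the requirements."""
    INFO = "INFO"
    WARN = "WARN"
    ERROR = "ERROR"


@dataclass(frozen=True)
class Threshold:
    """Hypothetical per-metric limits, in the metric's own unit."""
    warn: float   # approaching the agreed threshold
    error: float  # operating outside normal parameters


@dataclass(frozen=True)
class MaintenanceWindow:
    """A scheduled period with a fixed end, so alerting resumes on its own."""
    start: datetime
    end: datetime

    def covers(self, moment: datetime) -> bool:
        return self.start <= moment < self.end


def classify(value: float, threshold: Threshold) -> Severity:
    """Map a metric reading to an event category."""
    if value >= threshold.error:
        return Severity.ERROR
    if value >= threshold.warn:
        return Severity.WARN
    return Severity.INFO


def should_notify(severity: Severity, moment: datetime,
                  windows: Iterable[MaintenanceWindow]) -> bool:
    """INFO events are only logged; others alert unless a window is active."""
    if severity is Severity.INFO:
        return False
    return not any(w.covers(moment) for w in windows)


# Placeholder limits and an example two-hour window (assumed values).
disk = Threshold(warn=80.0, error=95.0)
window = MaintenanceWindow(
    start=datetime(2022, 1, 8, 2, 0, tzinfo=timezone.utc),
    end=datetime(2022, 1, 8, 4, 0, tzinfo=timezone.utc),
)
during = datetime(2022, 1, 8, 3, 0, tzinfo=timezone.utc)
after = datetime(2022, 1, 8, 4, 30, tzinfo=timezone.utc)

severity = classify(97.0, disk)
print(severity.value, should_notify(severity, during, [window]))  # ERROR False
print(severity.value, should_notify(severity, after, [window]))   # ERROR True
```

Because the window carries its own end time, no one has to remember to turn alerting back on: once the clock passes `end`, the same check starts returning `True` again.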