Event storm

Revision as of 23:54, 14 August 2017 by Nigel

1 Overview

An event storm is a large number of informational, warning and exception events of the same type from one or more nodes over a relatively short period of time. The events will appear in your Veloopti organisation either as individual events or as duplicates of events.

The main factors in detecting an event storm are the number of events that match the rule and the time window within which they are received.

Storm events can increase and decrease in severity.

2 Where event storms come from

Each minute, the events received that match the path of the storm rule are added together. It does not matter whether they are new events or duplicates; both count towards the total.
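
As a rough illustration of the per-minute counting described above, the Python sketch below adds up the events whose path matches a rule's event path, counting new events and duplicates alike. The `Event` fields, the exact-match comparison of paths and the function name are assumptions made for this example; they are not taken from Veloopti.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Event:
    path: str            # hypothetical: the path the storm rule matches on
    received: datetime   # when the event (or duplicate) arrived
    is_duplicate: bool   # not used when counting; both kinds are included


def count_per_minute(events, rule_path):
    """Return {minute: total} for events whose path equals rule_path."""
    totals = Counter()
    for event in events:
        if event.path == rule_path:  # assumption: exact path match
            minute = event.received.replace(second=0, microsecond=0)
            totals[minute] += 1      # new events and duplicates both count
    return totals


events = [
    Event("/app/db", datetime(2017, 8, 14, 23, 54, 5), False),
    Event("/app/db", datetime(2017, 8, 14, 23, 54, 40), True),
    Event("/app/web", datetime(2017, 8, 14, 23, 54, 41), False),
    Event("/app/db", datetime(2017, 8, 14, 23, 55, 1), True),
]
totals = count_per_minute(events, "/app/db")
```

Here the event on `/app/web` is ignored, and the duplicate received at 23:54 is counted together with the new event from the same minute.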

3 How event storms are detected

The total number of events actually received during each specified time period is compared with the informational, warning and exception thresholds. If more than one threshold is breached, the breach with the highest severity either creates a new storm event or increments the duplicate count of the existing one.
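
The comparison can be sketched in Python as follows. This is only an illustration under stated assumptions: the severity names are taken from this page, but the function name, the shape of `thresholds` and the use of minutes for the duration are hypothetical.

```python
SEVERITY_ORDER = ["information", "warning", "exception"]  # lowest to highest


def detect_breach(minute_totals, thresholds):
    """Return the highest breached severity, or None if nothing is breached.

    minute_totals: per-minute event counts, most recent last.
    thresholds: {severity: (breach_count, breach_duration_minutes)},
                with each duration assumed to be at least 1.
    """
    breached = None
    for severity in SEVERITY_ORDER:
        if severity not in thresholds:
            continue
        breach_count, duration = thresholds[severity]
        window_total = sum(minute_totals[-duration:])
        if window_total > breach_count:  # the breach count must be exceeded
            breached = severity          # later entries have higher severity
    return breached


# Hypothetical thresholds: (breach count, breach duration in minutes)
thresholds = {
    "information": (10, 5),
    "warning": (50, 5),
    "exception": (200, 10),
}
result = detect_breach([3, 20, 15, 10, 12], thresholds)
```

With these sample values the last five minutes total 60 events, which exceeds the information and warning breach counts but not the exception one, so `result` is `"warning"`.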

3.1 Changing severity while the storm is still active

Once a storm rule is breached it will continue to monitor the number of events to determine whether the severity of the event needs to change. If a new breach of the rule is detected then a new storm event will be created.

3.2 Ending the storm

When the total number of events for the path no longer exceeds any of the informational, warning or exception thresholds, the reset conditions are checked to confirm that the event can be closed.
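
A minimal sketch of this closing check, assuming the reset count and reset duration come from the information settings and that the duration is a whole number of minutes (at least 1). The function and parameter names are hypothetical.

```python
def can_close(minute_totals, breached_severity, reset_count, reset_duration):
    """Return True when an open storm event may be closed.

    minute_totals: per-minute event counts, most recent last.
    breached_severity: the severity currently in breach, or None.
    """
    if breached_severity is not None:
        return False  # a threshold is still exceeded, so the storm is active
    window_total = sum(minute_totals[-reset_duration:])
    # The total must be equal to or less than the reset count.
    return window_total <= reset_count


quiet = can_close([4, 1, 0, 2], None, reset_count=5, reset_duration=3)
still_breaching = can_close([4, 1, 0, 2], "information", reset_count=5, reset_duration=3)
too_busy = can_close([4, 3, 2, 2], None, reset_count=5, reset_duration=3)
```

Only the first call returns `True`: no threshold is breached and the last three minutes total 3 events, which is within the reset count of 5.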

3.3 Notifications

A notification is sent only when a storm event is first opened or when its severity level increases. For example, an event that is opened with a severity of information will send a notification when the severity is increased to warning, and again when it is increased to exception. In contrast, a storm event that is opened with a severity of warning will not send a notification if its severity decreases to information, but it will send one if its severity increases to exception.
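
The notification behaviour can be approximated with the sketch below. It assumes that each severity level notifies at most once per storm event, tracked here as the highest level already notified; the class and attribute names are hypothetical, and the real engine may handle repeated increases differently.

```python
SEVERITY_RANK = {"information": 1, "warning": 2, "exception": 3}


class StormNotifier:
    """Tracks one open storm event and decides when to notify."""

    def __init__(self):
        self.highest_notified = 0  # 0 means no notification sent yet
        self.sent = []             # severities that triggered a notification

    def update(self, severity):
        """Record the event's current severity; return True if notifying."""
        rank = SEVERITY_RANK[severity]
        if rank > self.highest_notified:
            self.highest_notified = rank
            self.sent.append(severity)
            return True
        return False  # unchanged or decreased severity: stay silent


# Opened as information, then raised twice: three notifications.
rising = StormNotifier()
for level in ["information", "warning", "exception"]:
    rising.update(level)

# Opened as warning, reduced, then raised: no notification on the reduction.
mixed = StormNotifier()
for level in ["warning", "information", "exception"]:
    mixed.update(level)
```

After these loops, `rising.sent` holds all three severities, while `mixed.sent` holds only warning and exception.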

4 Event storm properties

A storm rule is opened by clicking on its name on the Storm rules web page, which is found under the Events main menu item. These pages are protected by permissions, so not everyone is able to see them.

When a change is made that does not abide by the logic of the storm rules engine then the text is marked in red indicating that it needs to be changed.

4.1 Overview

The overview tab contains the properties of the storm rule that are common to every severity setting.

Name: The name of the storm rule. This is used in the short description of the event and in the email subject line.
Description: This is used to describe the storm rule to other people who are editing it.
Enabled: This enables or disables the storm rule. A disabled storm rule will not detect or notify of any threshold breaches.
Event path: This is the path that is used when counting the events each minute.
Help Text: This text appears in the event that is created and also in the body of the email.

4.2 Information settings

The parameters in the information settings define the number of events or duplicates, and the time period over which they should be received, in order to create a storm event with the severity of information or to reduce a higher-severity storm event to information. They also include the parameters for closing an open event.

Breach Count: This is the minimum number of events or duplicates per breach duration that must be exceeded in order to create a storm event with the status of information. As long as a storm event is open and this condition is met, the storm rule will continue to be in breach with an information severity level. When this condition is not being met then the reset conditions are checked to see whether the open event can be closed.
Breach Duration: This is the time period over which the breach count of events or duplicates must be exceeded in order to trigger a storm event with the status of information.
Reset Count: If there is no breach condition being met then this is the number that the events or duplicates must be equal to or less than during the reset duration before the event is closed. The event that is being closed can have a severity of information, warning or exception.
Reset Duration: This is the time period over which the number of events or duplicates must remain equal to or less than the reset count in order to close the event.
Notify Users: This is a list of users who will be notified when a storm event breaches the breach count for the first time. If the storm event is first created with a warning or exception severity level then the users in those settings will be notified.
Notify User Groups: This is a list of user groups that will be notified when a storm event breaches the breach count for the first time. If the storm event is first created with a warning or exception severity level then the user groups in those settings will be notified.

4.3 Warning settings

The parameters in the warning settings define the number of events or duplicates, and the time period over which they should be received, in order to create a storm event with the severity of warning. They also include the details needed to raise or lower the severity of an existing open event to warning. If the conditions are no longer being met for a storm event with the severity of exception then the warning conditions are checked next to see whether the event can be reduced to warning. Additionally, if a storm event with the severity of information now meets the warning requirements then its severity level is increased to warning.

Breach Count: This is the minimum number of events or duplicates per breach duration that must be exceeded in order to create or change a storm event with the status of warning. While the storm event with a severity of warning is open, as long as this condition is met the storm rule will continue to be in breach. When this condition is not being met then the reset values are checked.
Breach Duration: This is the time period over which the breach count of events or duplicates must be exceeded in order to trigger a storm event with the status of warning.
Reset Count: If there is no breach condition being met then this is the number that the events or duplicates must be equal to or less than during the reset duration before the event can be reduced to warning.
Reset Duration: This is the time period over which the number of events or duplicates must remain equal to or less than the reset count in order to reduce the event to warning.
Notify Users: This is a list of users who will be notified when a storm event breaches the breach count for the first time. If the storm event is first created with a severity level of exception then these users will not be notified when the severity is reduced to warning.
Notify User Groups: This is a list of user groups that will be notified when a storm event breaches the breach count for the first time. If the storm event is first created with a severity level of exception then these user groups will not be notified when the severity is reduced to warning.

4.4 Exception settings

The parameters in the exception settings define the number of events or duplicates, and the time period over which they should be received, in order to create a storm event with the severity of exception. They also include the details needed to increase an existing open event to a severity level of exception. If the conditions are no longer being met for a storm event with the severity of exception then the warning and information levels are checked to see what it should be changed to.

Breach Count: This is the minimum number of events or duplicates per breach duration that must be exceeded in order to create or change a storm event with the status of exception. While the storm event with a severity of exception is open, as long as this condition is met the storm rule will continue to be in breach. When this condition is not being met then the reset values are checked.
Breach Duration: This is the time period over which the breach count of events or duplicates must be exceeded in order to trigger a storm event with the status of exception.
Reset Count: If there is no breach condition being met then this is the number that the events or duplicates must be equal to or less than during the reset duration before the event may be reduced to warning or information.
Reset Duration: This is the time period over which the number of events or duplicates must remain equal to or less than the reset count in order to reduce the event to warning or information.
Notify Users: This is a list of users who will be notified when a storm event breaches the breach count for the first time.
Notify User Groups: This is a list of user groups that will be notified when a storm event breaches the breach count for the first time.
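
To summarise how the properties on these tabs fit together, the following Python sketch models a storm rule as plain data. The field names mirror the property labels on this page, but the class names, the types and the choice of minutes for durations are assumptions made for illustration and do not describe how Veloopti stores rules.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SeveritySettings:
    breach_count: int
    breach_duration: int   # assumption: expressed in minutes
    reset_count: int
    reset_duration: int    # assumption: expressed in minutes
    notify_users: List[str] = field(default_factory=list)
    notify_user_groups: List[str] = field(default_factory=list)


@dataclass
class StormRule:
    # Properties from the overview tab
    name: str
    description: str
    enabled: bool
    event_path: str
    help_text: str
    # One block of settings per severity tab
    information: SeveritySettings
    warning: SeveritySettings
    exception: SeveritySettings


# Hypothetical example values
rule = StormRule(
    name="Database errors",
    description="Detects bursts of database error events",
    enabled=True,
    event_path="/app/db",
    help_text="Check the database server logs.",
    information=SeveritySettings(10, 5, 2, 10),
    warning=SeveritySettings(50, 5, 10, 10, notify_users=["operator"]),
    exception=SeveritySettings(200, 10, 50, 10, notify_user_groups=["on-call"]),
)
```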

4.5 Explain

The explain tab will attempt to describe the storm rule in plain English.

When a change is made that does not abide by the logic of the storm rules engine then the text is marked in red indicating that it needs to be changed.

4.6 History

The history tab contains a graph of the per-minute total count for the storm rule path over the last hour. The graph will be populated with a full hour of data only if the storm rule has existed for that long.
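
One simple way to keep a rolling hour of per-minute totals like this is a fixed-length queue. The sketch below is illustrative only; the names are hypothetical and it does not describe how the history tab is implemented.

```python
from collections import deque

# Holds at most 60 per-minute totals; the oldest is dropped automatically.
history = deque(maxlen=60)


def record_minute(total):
    """Append the total for the minute that has just ended."""
    history.append(total)


# Simulate 75 minutes of data, using the minute index as the total.
for minute in range(75):
    record_minute(minute)
```

After 75 minutes only the most recent 60 totals remain. A rule that has existed for less than an hour would simply hold fewer entries.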