Difference between revisions of "Event storm"

From Veloopti Help
m (Event storm properties)
(Changed the words around and checked their consistency)
Line 1: Line 1:
 
== Overview ==
 
== Overview ==
An event storm is ''a large number of informational, warning and exception events of the same type from one or more nodes over a relatively short period of time''. The events will appear in your Veloopti organisation either as individual events or as duplicates of events.
+
An event storm is ''a large number of informational, warning and exception events of the same type from one or more nodes over a relatively short period of time''. The events will appear in your Veloopti organisation either as individual events or as duplicates of events. The factors with detecting an event storm are the number of events that match the rule, called the ''Breach count'', and the time window that they are detected within, called the ''Breach duration''.  
  
The main factors with detecting an event storm are the number of events that match the rule and the time window that they are detected within.
+
Storm events have increasing levels of severity that come into effect by having an increasing breach count relative to the time window (breach duration) that they are received in.
  
Storm events can increase and decrease in severity.
+
== Starting and ending the event storm ==
 +
=== How event storms are raised ===
 +
For the events that match the path of the event storm rule, each minute the following is performed. It does not matter whether they are new events or [[Event#Previous events and duplicates|duplicate]] events.
 +
# If the number of events that that are received over the exception breach duration are added together and exceed the exception breach count then an event with the severity level of exception is raised.
 +
# If the number of events that that are received over the warning breach duration are added together and exceed the warning breach count then an event with the severity level of warning is raised.
 +
# If the number of events that that are received over the information breach duration are added together and exceed the exception breach count then an event with the severity level of information is raised.
  
== Where event storms come from ==
+
==== Changing severity while the event storm is still active ====
Each minute the number of events that are received that match the path of the storm rule are added together. When adding the events together it does not matter whether they are new events or [[Event#Previous events and duplicates|duplicate]].  
+
Once a storm rule is breached the event storm will continue to be monitored in the same manner as above to see whether there is an increase in severity. If there is an increase in severity of the event the event is increased in severity and the relivent notifications are sent out. Storm events do not decrease in severity.
  
== How event storms are detected ==
+
=== How event storms are ended ===
The informational, warning or exception breach duration windows are compared with the total number of actual events that have been received during each specified breach duration time window. If a breach is detected then the one with the highest severity either creates a new storm event or increments the duplicate count of the pre-existing one.
+
When the current event count for the path no longer exceed any of the informational, warning or exception breach thresholds over the breach durations the reset conditions can be evaluated. The reset breach count and durations are then checked in the same manner as the initial breach condition. If none of the reset conditions are met then the storm event can be closed.  If one of the reset conditions are still being met then the storm event remains open with the pre-existing severity.
  
==== Changing severity while the storm is still active ====
+
== Notifications ==
Once a storm rule is breached it will continue to be monitored for the current number of events during its breach duration window to determine whether the severity of the event needs to change.
+
Notifications will occur once with any new storm event or once with an increase of severity level. Therefore a storm event that is opened with a severity of information will initially send a notification when it is first opened. It will also send a notification if the severity is increased to warning and again if the severity is increased to exception.  Whereas a  storm event that is opened with a severity level of warning would not send a notification if the event decreased severity to information. However it would send a notification if the event severity increased to exception. A storm event that is opened as a exception would not notify if it reduces in severity to either warning or normal.
  
=== Ending the storm ===
+
== Event storm properties ==
When the current event count for the path no longer exceed any of the informational, warning or exception breach duration thresholds the reset conditions are then compared to the current event count to determine if the event can be closed. If no reset condition is met then the storm event remains open with the pre-existing severity.
+
A storm rule is opened by clicking on its name on the [https://ap1.veloopti.com.au/storm-rules Storm rules] web page which is found under Events on the main menu. These pages are protected using permissions so not everyone may be able to see them.
 
 
=== Notifications ===
 
Notifications will occur once with any new storm event or increase of severity level of a current storm event. Therefore a storm event that is opened with a severity of information will initially send a notification when it is first opened. It will also send a notification if the severity is increased to warning and again if the severity is increased to exception. Whereas a  storm event that is opened with a severity level of warning would not send a notification if the event decreased severity to information however it would send a notification if the event severity increased to exception. A storm event that is opened as a exception would not notify if it reduces in severity to either warning or normal.
 
  
== Event storm properties ==
+
When creating or modifying a rule that does not abide by the logic of the storm rules engine then the text is marked in red indicating that it needs to be changed. The rule is that storm events with an increasing level of severity are required to have an increasing breach count relative to the time window (breach duration) that they are received in. So for the same time period, the most number of events should be received for an exception, with a lesser amount for a warning and the least for information.  
A storm rule is opened by clicking on its name on the [https://ap1.veloopti.com.au/storm-rules Storm rules] web page which is found under the Events main menu item. These pages are protected using permissions meaning not everyone is able to see them.
 
  
When a change is made that does not abide by the logic of the storm rules engine then the text is marked in red indicating that it needs to be changed.
+
# If the number of events that that are received over the exception breach duration are added together and exceed the exception breach count then an event with the severity level of exception is raised.
 +
# If the number of events that that are received over the warning breach duration are added together and exceed the warning breach count then an event with the severity level of warning is raised.
 +
# If the number of events that that are received over the information breach duration are added together and exceed the exception breach count then an event with the severity level of information is raised.
  
 
=== Overview ===
 
=== Overview ===
 
The overview tab contains the properties of the storm rule that are common to every severity setting.
 
The overview tab contains the properties of the storm rule that are common to every severity setting.
:'''Name''': The name of the storm rule. This is used in the short description of the event that will appear in the event view and also in the email subject line.  
+
:'''Name''': The name of the storm rule. This is used in the description of the event that appears in the event view and also in the email subject line.  
  
 
:'''Description''': This is used to describe the storm rule to other people who are editing it.
 
:'''Description''': This is used to describe the storm rule to other people who are editing it.
Line 34: Line 37:
 
:'''Enabled''': This enables or disables the storm rule. A disabled storm rule will not detect or notify of any threshold breaches. It also will not collect any metrics for displaying in a dashboard.
 
:'''Enabled''': This enables or disables the storm rule. A disabled storm rule will not detect or notify of any threshold breaches. It also will not collect any metrics for displaying in a dashboard.
  
:'''Event path''': This is the path that is used when counting the events each minute.
+
:'''Event [[Path|path]]''': This is the path that is used when detecting an event storm.
  
:'''Help Text''': This text appears in the event that is created and also in the body of the email.
+
:'''Help Text''': This text appears in the event that is displayed in the event view and also in the body of any notification email.
  
 
=== Information settings ===
 
=== Information settings ===
The parameters in the information setting define the number of events or duplicates and the time period over which they should be received in order to create or reduce a higher level severity storm event with the status of information. It also contains the parameters for closing an open event.
+
The parameters in the information setting define the number of events and duplicates and the time period over which they should be received in order to create or reduce an storm event with the status of information. It also contains one of the parameters for closing an open event.
 +
 
 +
Information Storm events have the lowest level of severity of all of the storm events. Informational severity storm events should be used to inform your users that an event storm has occurred but there is no impact yet to your monitored IT infrastructure.
  
 
'''Breach Condition
 
'''Breach Condition
:'''Breach Count''': This is the minimum number of events or duplicates per '''''breach duration''''' that must be exceeded in order to create a storm event with the status of information. As long as a storm event is open and this condition is met, the storm rule will continue to be in breach with an information severity level.  When this condition is not being met then the reset conditions are checked to see whether the open event can be closed.
+
:'''Breach Count''': This is the minimum number of events and duplicates per '''''breach duration''''' that must be exceeded in order to create a storm event with the status of information. As long as a storm event is open and this condition is met, the storm rule will continue to be in breach with an severity level of infomration.  When this condition is not being met then the reset conditions are checked to see whether the open event can be closed.
  
:'''Breach Duration''': This is the time period over which the minimum number of '''''breach count''''' events or duplicates in order to trigger a storm event with the status of information.
+
:'''Breach Duration''': This is the time period over which the minimum number of '''''breach count''''' events and duplicates in order to trigger a storm event with the status of information.
 
'''Reset Condition
 
'''Reset Condition
:'''Reset Count''': If there is no breach condition being met then this is the number that the events or duplicates must be equal to or less than during the '''''reset duration''''' before the event is closed. This will close an event. The event that is being closed can have a severity of either information, warning or exception.
+
:'''Reset Count''': If there is no breach condition being met then this is the number that the event and duplicate count must be equal to or less than during the '''''reset duration''''' before the event may be closed. As this is the smallest reset count condition if it is met then this will signify that the event can be closed. The event that is being closed can have a severity of either information, warning or exception.
  
:'''Reset Duration''': This is the time period over which the minimum number of '''''reset count''''' events must appear in order to close the event.
+
:'''Reset Duration''': This is the time period over which the minimum number of '''''reset count''''' events must appear in order to close the storm event.
 
'''Notifications
 
'''Notifications
:'''Notify Users''': This is a list of users who will be notified when a storm event breaches the '''''breach count''''' for the first time. If the storm event is first created with a warning or critical severity level then the users in those settings will be notified.
+
:'''Notify Users''': This is a list of users who will be notified when a storm event with the severity of information breaches the '''''breach count''''' for the first time. If the storm event is first created with a warning or critical severity level then the users in those settings will be notified.
  
:'''Notify User Groups''': This is a list of user groups that will be notified when a storm event breaches the '''''breach count''''' for the first time. If the storm event is first created with a warning or critical severity event the user groups in those settings will be notified.
+
:'''Notify User Groups''': This is a list of user groups that will be notified when a storm event with the severity of information breaches the '''''breach count''''' for the first time. If the storm event is first created with a warning or critical severity event the user groups in those settings will be notified.
  
 
=== Warning settings ===
 
=== Warning settings ===
The parameters in the warning settings define the number of events or duplicates and the time period over which they should be received in order to create a storm event with the severity of warning.  It also contains the details to reduce or increase an existing open storm event to a higher or lower level severity level. If the breach conditions are no longer being met for a storm event with the severity of exception then the warning breach conditions are next checked to see whether the event can be reduced to warning. Additionally if a storm event with the severity of normal now meets the warning breach conditions the then its severity level is increased to warning.
+
The parameters in the warning settings define the number of events and duplicates and the time period over which they should be received in order to create a storm event with the severity of warning.  It also contains some details to reduce or increase an existing open storm event to a higher or lower level severity level. If the breach conditions are no longer being met for a storm event with the severity of exception then the warning breach conditions are next checked to see whether the event can be reduced to warning. Additionally if a storm event with the severity of normal now meets the warning breach conditions the then its severity level is increased to warning.
 +
 
 +
Storm events with a severity of warning come between the information and exception severity events. They come into effect by having a larger breach count relative to the time window (breach duration) of informational severity events but are not as large in number as exception level severity events. Warning level storm events should be used to inform your users that an event storm has occurred that is impacting your IT infrastructure in a negative way and something should be done to resolve it before it increases in severity to exception.
 +
 
 +
It also contains one of the three conditions for closing an open storm event.
  
 
'''Breach Condition
 
'''Breach Condition
:'''Breach Count''': This is the minimum number of events or duplicates per '''''breach duration''''' that must be exceeded in order to create or change a storm event with the status of warning. While the storm event with a severity of warning is open, as long as this condition is met the storm rule will continue to be in breach. When this condition is not being met then the reset values are checked.
+
:'''Breach Count''': This is the minimum number of events and duplicates per '''''breach duration''''' that must be exceeded in order to create or change a storm event with the status of warning. While the storm event with a severity of warning is open, as long as this condition is met the storm rule will continue to be in breach. When this condition is not being met then the reset values are checked.
  
:'''Breach Duration''': This is the time period over which the minimum number of '''''breach count''''' events or duplicates in order to trigger a storm event with the status of warning.
+
:'''Breach Duration''': This is the time period over which the minimum number of '''''breach count''''' events and duplicates in order to trigger a storm event with the status of warning.
  
:'''Reset Count''': If there is no breach condition being met then this is the number that the events or duplicates must be equal to or less than during the '''''reset duration''''' before the event can be reduced to warning.
+
:'''Reset Count''': If there is no breach condition being met then this is the number that the events and duplicates must be equal to or less than during the '''''reset duration''''' before the event can be reduced to warning.
 
'''Reset Condition
 
'''Reset Condition
 
:'''Reset Duration''': This is the time period over which the minimum number of '''''reset count''''' events must appear in order to reduce the event to warning.
 
:'''Reset Duration''': This is the time period over which the minimum number of '''''reset count''''' events must appear in order to reduce the event to warning.
Line 71: Line 80:
  
 
=== Exception settings ===
 
=== Exception settings ===
The parameters in the exception settings define the number of events or duplicates and the time period over which they should be received in order to create a storm event with the severity of exception.  It also contains the details to increase an existing open storm event to a severity level of exception. If the breach conditions are no longer being met for a storm event with the severity of exception then the warning first followed by normal breach conditions are then checked to see the event should be changed to them.
+
The parameters in the exception settings define the number of events and duplicates and the time period over which they should be received in order to create a storm event with the severity of exception.  It also contains the details to increase an existing open storm event to a severity level of exception. If the breach conditions are no longer being met for a storm event with the severity of exception then the warning first followed by normal breach conditions are then checked to see the event should be changed to them.
 +
 
 +
Exception storm events have the highest level of severity and come into effect by having the highest level of event count relative to the time window (breach duration) that they are received in. Exception level storm events should be used to inform your users that an event storm has occurred and it is impacting one or more component of your IT infrastructure and it has impacted it with a significant performance degradation or loss of service.
 +
 
 +
It also contains one of the parameters for closing an open event.
  
 
'''Breach Condition
 
'''Breach Condition
:'''Breach Count''': This is the minimum number of events or duplicates per '''''breach duration''''' that must be exceeded in order to create or change a storm event with the status of exception. While the storm event with a severity of exception is open, as long as this condition is met the storm rule will continue to be in breach. When this condition is not being met then the reset values are checked.
+
:'''Breach Count''': This is the minimum number of events and duplicates per '''''breach duration''''' that must be exceeded in order to create or change a storm event with the status of exception. While the storm event with a severity of exception is open, as long as this condition is met the storm rule will continue to be in breach. When this condition is not being met then the reset values are checked.
  
:'''Breach Duration''': This is the time period over which the minimum number of '''''breach count''''' events or duplicates in order to trigger a storm event with the status of exception.
+
:'''Breach Duration''': This is the time period over which the minimum number of '''''breach count''''' events and duplicates in order to trigger a storm event with the status of exception.
 
'''Reset Condition
 
'''Reset Condition
:'''Reset Count''': If there is no breach condition being met then this is the number that the events or duplicates must be equal to or less than during the '''''reset duration''''' before the event may be reduced to warning or information.
+
:'''Reset Count''': If there is no breach condition being met then this is the number that the events and duplicates must be equal to or less than during the '''''reset duration''''' before the event may be reduced to warning or information.
  
 
:'''Reset Duration''': This is the time period over which the minimum number of '''''reset count''''' events must appear in order to reduce the event to warning or information.
 
:'''Reset Duration''': This is the time period over which the minimum number of '''''reset count''''' events must appear in order to reduce the event to warning or information.

Revision as of 21:55, 16 August 2017

1 Overview

An event storm is a large number of informational, warning and exception events of the same type from one or more nodes over a relatively short period of time. The events will appear in your Veloopti organisation either as individual events or as duplicates of events. Two factors determine whether an event storm is detected: the number of events that match the rule, called the Breach count, and the time window within which they are detected, called the Breach duration.

Storm events have increasing levels of severity that come into effect by having an increasing breach count relative to the time window (breach duration) that they are received in.

2 Starting and ending the event storm

2.1 How event storms are raised

Each minute, the following checks are performed on the events that match the path of the event storm rule. It does not matter whether they are new events or duplicate events.

  1. If the events received over the exception breach duration are added together and exceed the exception breach count, then an event with the severity level of exception is raised.
  2. If the events received over the warning breach duration are added together and exceed the warning breach count, then an event with the severity level of warning is raised.
  3. If the events received over the information breach duration are added together and exceed the information breach count, then an event with the severity level of information is raised.
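
The per-minute check above can be sketched as follows. This is a minimal illustration, not Veloopti's implementation: the `Threshold` type, the `highest_breach` function, the use of whole minutes for durations, and the choice to stop at the first check that succeeds (in the order listed) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Threshold:
    """Hypothetical breach settings for one severity level."""
    severity: str
    breach_count: int     # total that must be exceeded
    breach_duration: int  # window length in minutes (assumed unit, must be >= 1)


def highest_breach(minute_counts: Sequence[int],
                   thresholds: Sequence[Threshold]) -> Optional[str]:
    """Return the first severity whose breach count is exceeded, or None.

    minute_counts holds per-minute totals of matching events, new and
    duplicate alike, oldest first. thresholds are checked in the order
    given, so list them from exception down to information.
    """
    for t in thresholds:
        window_total = sum(minute_counts[-t.breach_duration:])
        if window_total > t.breach_count:
            return t.severity
    return None


# Illustrative values only; real rules are configured per organisation.
rules = [
    Threshold("exception", breach_count=100, breach_duration=5),
    Threshold("warning", breach_count=50, breach_duration=5),
    Threshold("information", breach_count=10, breach_duration=5),
]
print(highest_breach([0, 3, 20, 25, 10], rules))
```

With these illustrative values the last five minutes total 58 events, which exceeds the warning breach count but not the exception breach count.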

2.1.1 Changing severity while the event storm is still active

Once a storm rule is breached, the event storm continues to be monitored in the same manner as above to see whether its severity should increase. If it should, the severity of the event is increased and the relevant notifications are sent out. Storm events do not decrease in severity.

2.2 How event storms are ended

When the current event count for the path no longer exceeds any of the informational, warning or exception breach thresholds over the breach durations, the reset conditions can be evaluated. The reset counts and durations are then checked in the same manner as the initial breach condition. If none of the reset conditions are met then the storm event can be closed. If one of the reset conditions is still being met then the storm event remains open with the pre-existing severity.

3 Notifications

Notifications occur once for any new storm event and once for each increase in severity level. A storm event that is opened with a severity of information therefore sends a notification when it is first opened. It also sends a notification if the severity is increased to warning, and again if the severity is increased to exception. By contrast, a storm event that is opened with a severity level of warning does not send a notification if its severity decreases to information, but it does send a notification if its severity increases to exception. A storm event that is opened as an exception does not notify if it reduces in severity to either warning or normal.
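
The notification rule can be expressed as a small decision function. This is a sketch under assumptions: `SEVERITY_RANK` and `should_notify` are hypothetical names, and the function compares only against the immediately preceding severity. The text does not say whether an event that drops and then rises again to a level it has already reached notifies a second time.

```python
from typing import Optional

# Assumed ordering of the three storm severities, lowest to highest.
SEVERITY_RANK = {"information": 1, "warning": 2, "exception": 3}


def should_notify(previous: Optional[str], current: str) -> bool:
    """Decide whether a severity change should send a notification.

    previous is None when the storm event has just been opened.
    """
    if previous is None:
        return True  # a new storm event always notifies once
    # Only an increase in severity notifies; a decrease stays silent.
    return SEVERITY_RANK[current] > SEVERITY_RANK[previous]


print(should_notify(None, "information"))       # newly opened
print(should_notify("information", "warning"))  # increase
print(should_notify("warning", "information"))  # decrease
```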

4 Event storm properties

A storm rule is opened by clicking on its name on the Storm rules web page which is found under Events on the main menu. These pages are protected using permissions so not everyone may be able to see them.

When a rule is created or modified in a way that does not abide by the logic of the storm rules engine, the text is marked in red to indicate that it needs to be changed. The rule is that storm events with an increasing level of severity are required to have an increasing breach count relative to the time window (breach duration) that they are received in. So for the same time period, the greatest number of events should be required for an exception, fewer for a warning and the fewest for information.

  1. If the events received over the exception breach duration are added together and exceed the exception breach count, then an event with the severity level of exception is raised.
  2. If the events received over the warning breach duration are added together and exceed the warning breach count, then an event with the severity level of warning is raised.
  3. If the events received over the information breach duration are added together and exceed the information breach count, then an event with the severity level of information is raised.
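
The ordering requirement on breach counts described above can be checked with a few lines of code. This is only a sketch: `ordering_errors` is a hypothetical helper, it assumes all three severities share the same breach duration, and it treats "increasing" as strictly greater. How the storm rules engine compares rules with different breach durations is not described here.

```python
from typing import Dict, List

# Severities from lowest to highest (assumed ordering).
ORDER = ["information", "warning", "exception"]


def ordering_errors(breach_counts: Dict[str, int]) -> List[str]:
    """Return one message per pair of severities that is out of order."""
    errors = []
    for lower, higher in zip(ORDER, ORDER[1:]):
        if breach_counts[higher] <= breach_counts[lower]:
            errors.append(
                f"{higher} breach count must be greater than "
                f"the {lower} breach count"
            )
    return errors


print(ordering_errors({"information": 10, "warning": 50, "exception": 100}))
print(ordering_errors({"information": 10, "warning": 5, "exception": 100}))
```

An empty list corresponds to a rule that would not be marked in red under these assumptions.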

4.1 Overview

The overview tab contains the properties of the storm rule that are common to every severity setting.

Name: The name of the storm rule. This is used in the description of the event that appears in the event view and also in the email subject line.
Description: This is used to describe the storm rule to other people who are editing it.
Enabled: This enables or disables the storm rule. A disabled storm rule will not detect or notify of any threshold breaches. It also will not collect any metrics for displaying in a dashboard.
Event path: This is the path that is used when detecting an event storm.
Help Text: This text appears in the event that is displayed in the event view and also in the body of any notification email.

4.2 Information settings

The parameters in the information settings define the number of events and duplicates, and the time period over which they should be received, in order to create a storm event with the status of information or reduce a storm event to that status. They also contain one of the parameters for closing an open event.

Information Storm events have the lowest level of severity of all of the storm events. Informational severity storm events should be used to inform your users that an event storm has occurred but there is no impact yet to your monitored IT infrastructure.

Breach Condition

Breach Count: This is the minimum number of events and duplicates per breach duration that must be exceeded in order to create a storm event with the status of information. As long as a storm event is open and this condition is met, the storm rule will continue to be in breach with a severity level of information. When this condition is not being met then the reset conditions are checked to see whether the open event can be closed.
Breach Duration: This is the time period over which the number of events and duplicates must exceed the breach count in order to trigger a storm event with the status of information.

Reset Condition

Reset Count: If there is no breach condition being met then this is the number that the event and duplicate count must be equal to or less than during the reset duration before the event may be closed. As this is the smallest reset count condition, meeting it signifies that the event can be closed. The event that is being closed can have a severity of either information, warning or exception.
Reset Duration: This is the time period over which the minimum number of reset count events must appear in order to close the storm event.
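
One possible reading of the reset condition is sketched below. It follows the Reset Count definition above (the count must be equal to or less than the reset count during the reset duration, with no breach condition being met). The function name `can_close`, the use of whole minutes, and the summing of per-minute totals over the window are assumptions for illustration.

```python
from typing import Sequence


def can_close(minute_counts: Sequence[int],
              reset_count: int,
              reset_duration: int,
              breach_active: bool) -> bool:
    """Return True when an open storm event may be closed.

    minute_counts holds per-minute totals of matching events and
    duplicates, oldest first; reset_duration is in minutes (>= 1).
    """
    if breach_active:
        return False  # a breach condition is still being met
    window_total = sum(minute_counts[-reset_duration:])
    return window_total <= reset_count


print(can_close([5, 1, 0, 1], reset_count=2, reset_duration=3, breach_active=False))
print(can_close([5, 1, 0, 1], reset_count=2, reset_duration=4, breach_active=False))
```

In the first call the last three minutes total 2, which is equal to the reset count, so the event may be closed. In the second call the four-minute window still includes the earlier burst.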

Notifications

Notify Users: This is a list of users who will be notified when a storm event with the severity of information breaches the breach count for the first time. If the storm event is first created with a warning or exception severity level then the users in those settings will be notified.
Notify User Groups: This is a list of user groups that will be notified when a storm event with the severity of information breaches the breach count for the first time. If the storm event is first created with a warning or exception severity level then the user groups in those settings will be notified.

4.3 Warning settings

The parameters in the warning settings define the number of events and duplicates, and the time period over which they should be received, in order to create a storm event with the severity of warning. They also contain some details for increasing or reducing the severity level of an existing open storm event. If the breach conditions are no longer being met for a storm event with the severity of exception then the warning breach conditions are checked next to see whether the event can be reduced to warning. Additionally, if a storm event with the severity of normal now meets the warning breach conditions then its severity level is increased to warning.

Storm events with a severity of warning come between the information and exception severity events. They come into effect by having a larger breach count relative to the time window (breach duration) of informational severity events but are not as large in number as exception level severity events. Warning level storm events should be used to inform your users that an event storm has occurred that is impacting your IT infrastructure in a negative way and something should be done to resolve it before it increases in severity to exception.

It also contains one of the three conditions for closing an open storm event.

Breach Condition

Breach Count: This is the minimum number of events and duplicates per breach duration that must be exceeded in order to create or change a storm event with the status of warning. While the storm event with a severity of warning is open, as long as this condition is met the storm rule will continue to be in breach. When this condition is not being met then the reset values are checked.
Breach Duration: This is the time period over which the number of events and duplicates must exceed the breach count in order to trigger a storm event with the status of warning.

Reset Condition

Reset Count: If there is no breach condition being met then this is the number that the events and duplicates must be equal to or less than during the reset duration before the event can be reduced to warning.
Reset Duration: This is the time period over which the minimum number of reset count events must appear in order to reduce the event to warning.

Notifications

Notify Users: This is a list of users who will be notified when a storm event breaches the breach count for the first time. If the storm event is first created with a severity level of exception then these users will not be notified when the severity is reduced to warning.
Notify User Groups: This is a list of user groups that will be notified when a storm event breaches the breach count for the first time. If the storm event is first created with a severity level of exception then these user groups will not be notified when the severity is reduced to warning.

4.4 Exception settings

The parameters in the exception settings define the number of events and duplicates, and the time period over which they should be received, in order to create a storm event with the severity of exception. They also contain the details for increasing an existing open storm event to a severity level of exception. If the breach conditions are no longer being met for a storm event with the severity of exception then the warning breach conditions, followed by the normal breach conditions, are checked to see whether the event should be changed to one of them.

Exception storm events have the highest level of severity and come into effect by having the highest event count relative to the time window (breach duration) that they are received in. Exception level storm events should be used to inform your users that an event storm has occurred and that it is causing significant performance degradation or loss of service in one or more components of your IT infrastructure.

It also contains one of the parameters for closing an open event.

Breach Condition

Breach Count: This is the minimum number of events and duplicates per breach duration that must be exceeded in order to create or change a storm event with the status of exception. While the storm event with a severity of exception is open, as long as this condition is met the storm rule will continue to be in breach. When this condition is not being met then the reset values are checked.
Breach Duration: This is the time period over which the number of events and duplicates must exceed the breach count in order to trigger a storm event with the status of exception.

Reset Condition

Reset Count: If there is no breach condition being met then this is the number that the events and duplicates must be equal to or less than during the reset duration before the event may be reduced to warning or information.
Reset Duration: This is the time period over which the minimum number of reset count events must appear in order to reduce the event to warning or information.

Notifications

Notify Users: This is a list of users who will be notified when a storm event breaches the breach count for the first time.
Notify User Groups: This is a list of user groups that will be notified when a storm event breaches the breach count for the first time.

4.5 Explain

The explain tab will attempt to describe the storm rule in plain English.

When a change is made that does not abide by the logic of the storm rules engine then the text is marked in red indicating that it needs to be changed.

4.6 History

The history tab contains a graph with the minute total count for the storm rule path for the last hour. The graph will be populated with the last hour of data if the storm rule has existed for that long.
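
A rolling one-hour series of per-minute totals, like the one the history graph displays, can be modelled with a fixed-length queue. This is an illustrative sketch only; the `MinuteHistory` class and its methods are hypothetical and say nothing about how Veloopti stores these metrics.

```python
from collections import deque
from typing import List


class MinuteHistory:
    """Keep the most recent per-minute totals for one storm rule path."""

    def __init__(self, minutes: int = 60) -> None:
        # A deque with maxlen drops the oldest entry automatically.
        self._counts = deque(maxlen=minutes)

    def record(self, count: int) -> None:
        """Store the total number of matching events for one minute."""
        self._counts.append(count)

    def series(self) -> List[int]:
        """Return the stored totals, oldest first."""
        return list(self._counts)


history = MinuteHistory(minutes=3)  # short window to keep the example small
for total in [1, 2, 3, 4]:
    history.record(total)
print(history.series())
```

A rule that has existed for less than the full window simply yields a shorter series.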