The Anomalies module in Experda uses a machine-learning model (Isolation Forest) to automatically detect abnormal behavior in your SQL Server and host performance metrics.
It is designed to:
- Learn what is “normal” for each metric on each server.
- Continuously scan for unusual patterns and sudden changes.
- Present anomalies visually in the AI Insights tab.
- Notify DBAs and ops teams via email (individual contacts and groups).
Processing schedule
- Detection runs: every 2 hours.
- Learning phase: baselines are retrained once every 24 hours so the model adapts to new workloads.
Use the Anomalies module when you want Experda to act like an always-on “early warning system” for performance issues—catching unusual behavior before users complain and highlighting incidents that are easy to miss in daily monitoring. Instead of relying on fixed thresholds (which often create noise), the module learns what is normal for each server and metric (including hour-of-day and weekend patterns) and then flags meaningful deviations such as sudden CPU jumps, unexpected disk queue spikes, abnormal SQL activity changes, or new usage patterns.
Practically, you use it for proactive troubleshooting, spotting regressions after deployments, detecting emerging bottlenecks, and ensuring the right people get notified through alerts and 12-hour digests—without needing to watch dashboards all day.
Real Case Example
In this example, Experda detected Disk Queue Length anomalies (OS-level) on server 192.168.15.50 for three different volumes: F: (1 anomaly), G: (4 anomalies), and M: (27 anomalies). The blue line shows the queue length over time, and the red markers highlight the specific time points that were statistically unusual compared to the server’s normal behavior. Notice how M: has both many more anomaly events and much larger spikes, which strongly suggests that this specific disk/volume experienced repeated I/O pressure during the period (a backlog of requests waiting to be served).
This matters because Disk Queue Length is an early indicator of storage bottlenecks: when a drive’s queue builds up, applications (including SQL Server) can slow down even if CPU looks fine. Seeing the anomalies per drive helps you immediately narrow the investigation to the right volume (here, M:), and then correlate it with related metrics (Disk Bytes/sec, Read/Write Latency) and real events (backups, ETL loads, index rebuilds, antivirus scans, file copies) to prevent performance incidents from repeating.
1. How the anomaly engine works
1.1 Data collection
Experda collects performance metrics both from:
- Host / OS – g., CPU, Disk Bytes/sec, Disk Queue Length, Disk Latency.
- SQL Server – e.g., Batches/sec, Compilations/sec, Distinct client computers, etc.
Each metric is stored as a time series per server and, where relevant, per instance (for example, drive letters C:, E:, F:). This information is available in the UI under the performance tabs.
1.2 Isolation Forest model
Experda uses an Isolation Forest model per metric:
- The model learns what “normal” looks like from the last 24 hours of data.
- Every 2 hours, it scores the latest samples.
- Points that are statistically isolated from normal behavior are labeled as anomalies.
The model doesn’t rely only on raw metric values; it also uses context features such as:
- Hour of day
- Weekend/holiday vs weekday
- Recent averages and standard deviation
- Distance from previous peaks
- Large changes compared to previous samples
- Threshold violations (above configured baseline or percentile)
These features are visible in the Data Analysis tab for each anomaly.
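As an illustrative sketch (not Experda's actual implementation; the feature names and heuristics below are assumptions), context features like these could be derived from a raw metric time series:

```python
from datetime import datetime
from statistics import mean, stdev

def context_features(samples, threshold, window=5):
    """Derive context features for the latest sample of a metric.

    samples: chronological list of (datetime, value) tuples.
    threshold: configured operational threshold for the metric.
    Feature names are illustrative, not Experda's internal names.
    """
    ts, value = samples[-1]
    values = [v for _, v in samples]
    recent = values[-window:]                     # last N, incl. current
    prior = values[-(window + 1):-1] or [value]   # window before current
    prev_value = values[-2] if len(values) > 1 else value
    peak = max(values[:-1], default=value)        # last recorded high
    return {
        "hour_of_day": ts.hour,
        "is_weekend": ts.weekday() >= 5,          # Sat/Sun (holidays omitted)
        "avg_last_values": mean(recent),
        "std_last_values": stdev(recent) if len(recent) > 1 else 0.0,
        # crude jump detector: change > 50% of the recent prior average
        "has_large_change": abs(value - prev_value) > 0.5 * max(mean(prior), 1.0),
        "distance_from_last_high": value - peak,
        "is_above_threshold": value > threshold,
        "last_values_all_above": all(v > threshold for v in recent),
    }
```

Feeding a feature vector like this to the model, rather than the raw value alone, is what lets it treat a 3 AM spike differently from the same value at 2 PM.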
1.2.1 Isolation Forest – How It Works
Isolation Forest is an outlier-detection algorithm built on a very simple idea:
Anomalies are few and different, so they are easier to isolate than normal points.
Instead of modeling “normal” behavior directly (like many other algorithms), it repeatedly splits the data at random and measures how many splits are needed to isolate each data point:
- Normal points live in dense regions ⇒ need many splits to isolate.
- Anomalies sit far away or in sparse regions ⇒ often isolated with very few splits.
The fewer splits a point needs on average, the more anomalous it is.
1.2.2 Building the forest
Random subsampling
To make the algorithm fast and robust, each tree is built on a small random subset of your data (for example, 256 points taken from the full time series of a metric).
This has several benefits:
- Reduces the cost of training.
- Increases the chance that outliers stand out clearly in at least some trees.
- Makes the method scalable to very large datasets.
Growing one isolation tree
For each isolation tree:
1. Start at the root with all points in the subsample.
2. Randomly choose a feature (e.g., CPU, Disk Queue Length, or a derived feature such as “Avg last values”).
3. Randomly choose a split value between the min and max of that feature.
4. Split the data into left/right child nodes.
5. Repeat steps 2–4 on each child node until:
– The node contains just 1 point, or
– You reach a maximum depth.
This creates a random tree where every leaf node contains a tiny region of the feature space (and usually one point).
Many trees = a forest
You repeat this process dozens or hundreds of times, each time with:
- A different random subsample.
- Different random features and thresholds.
The result is a forest of isolation trees.
Each tree provides a noisy estimate of how easy it is to isolate each point; the forest average smooths out the randomness.
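As a toy illustration of this process (one-dimensional and heavily simplified; real implementations use subsampling per tree and multiple features), the path-length idea can be coded as:

```python
import random

def avg_path_length(x, data, n_trees=100, max_depth=12, seed=42):
    """Average number of random splits needed to isolate value `x`
    among `data`, over many random isolation trees (1-D toy version)."""
    rng = random.Random(seed)

    def isolate(points, depth):
        # stop when the point is alone or the tree is deep enough
        if len(points) <= 1 or depth >= max_depth:
            return depth
        lo, hi = min(points), max(points)
        if lo >= hi:                      # all remaining values identical
            return depth
        split = rng.uniform(lo, hi)
        # keep only the branch that contains x
        branch = [p for p in points if (p < split) == (x < split)]
        return isolate(branch, depth + 1)

    return sum(isolate(data + [x], 0) for _ in range(n_trees)) / n_trees
```

An obvious outlier is typically isolated in one or two splits, while a point inside the dense cluster survives many more.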
1.2.3 Scoring anomalies
Path length
For a given data point (for example, “Disk Queue Length = 17 at 10:01 AM, weekday, large change from previous, above threshold”):
- You drop it down each tree.
- Count the path length = how many splits from root to leaf.
If a point is an outlier, many trees will isolate it very high up in the tree → short path.
Normal points require more splits to isolate → long path.
Normalizing the score
Path length depends on sample size, so Isolation Forest uses a normalization function to convert it into an anomaly score between 0 and 1, roughly:
- Score ≈ 1 → very anomalous (isolated quickly).
- Score ≈ 0.5 → borderline.
- Score ≈ 0 → very normal.
In Experda’s context, this score is then combined with your rule configuration (thresholds, consecutive events, weekend/hour of day effects, etc.) to decide whether to register an anomaly event and possibly trigger an alert.
Note that the anomaly score displayed in Experda is non-intuitive: unlike the normalized score above, a lower displayed score means stronger anomaly indicators (the inverted convention used by some implementations).
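For reference, the normalization from the original Isolation Forest paper (Liu et al., 2008) can be written out as follows; Experda's displayed score may use a different convention:

```python
import math

def c(n):
    """Average path length of an unsuccessful search in a binary search
    tree over n points; used to normalize isolation path lengths."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + 0.5772156649   # approximates H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_len, n):
    """Map an average path length over n samples to a score in (0, 1):
    close to 1 = anomalous, around 0.5 = borderline, close to 0 = normal."""
    return 2.0 ** (-avg_path_len / c(n))
```

A point whose average path length equals the expected path length c(n) lands exactly at 0.5, which is why 0.5 marks the borderline.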
1.2.4 Why Isolation Forest is a good fit for performance metrics
- No need for labeled data
You don’t need thousands of labeled “bad” incidents. The algorithm simply learns “normal” structure and then exposes what doesn’t fit.
- Handles high-dimensional context
It can work on multiple features at once:
- Raw metric value (e.g., Disk Queue Length).
- Time features (hour of day, weekend).
- Statistical features (running average, standard deviation).
- Change features (large jump from previous value, distance from last peak).
- Robust to non-linear patterns
Random splits can approximate complex shapes in the data without any explicit modeling.
- Fast and scalable
Because it uses small subsamples and simple trees, training every 24 hours and scoring every 2 hours is computationally efficient, even for many servers and metrics.
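To make this concrete, here is a minimal sketch using scikit-learn's `IsolationForest` on a hypothetical multi-feature matrix (an assumption for illustration only; Experda's internals are not necessarily scikit-learn, and the feature columns are invented):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical training matrix: one row per sample, columns =
# [raw metric value, hour of day, weekend flag, running avg, delta]
rng = np.random.default_rng(0)
normal = np.column_stack([
    rng.normal(50, 5, 500),      # raw metric value
    rng.integers(0, 24, 500),    # hour of day
    rng.integers(0, 2, 500),     # weekend flag
    rng.normal(50, 2, 500),      # running average
    rng.normal(0, 3, 500),       # change vs. previous sample
])
model = IsolationForest(n_estimators=100, random_state=0).fit(normal)

# scikit-learn convention: lower score_samples = more anomalous
outlier = np.array([[400.0, 3.0, 0.0, 60.0, 350.0]])
typical = np.array([[51.0, 14.0, 0.0, 50.0, 1.0]])
```

Retraining daily and scoring a handful of new rows every couple of hours is cheap at this scale, which is what makes the 24-hour/2-hour schedule feasible across many servers.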
1.3 Severity and status
Each anomaly rule has a Severity level:
- Information – minor deviations, useful for awareness and trend analysis.
- Warning – likely performance or stability risk, should be investigated.
- Critical (if configured) – high risk of downtime or severe degradation.
An anomaly can also have status:
- Active – anomaly is still relevant for the current period.
- Dismissed – hidden from normal views but kept for history.
- Excluded time – the specific time range is excluded from future training (useful for planned maintenance). Use this when an anomaly occurs in a timeframe whose behavior you already know is normal. For example, if a backup runs every day at 1 AM, you can expect a heavier load on the disks at that time each day.
2. Working with AI Insights
2.1 AI Insights overview
Open the AI Insights tab to see a dashboard of anomalies for the selected time range. Make sure to select a proper date range from the timeframe control.
Each tile represents a metric on a specific server/instance:
- Metric name + Instance (e.g., Disk Queue Length – E:).
- Server and anomaly master type (OS / SQL).
- A line chart with red points marking detected anomalies.
- A numeric badge that indicates the number of anomalies for that metric in the selected time window.
- Show details button to open the Enhanced anomaly analysis.
Use this view to quickly see which servers and which metrics have been behaving abnormally.
2.2 Enhanced anomaly analysis – Details tab
Click Show details from the Anomaly overview tile to open the Enhanced anomaly analysis window.
Here you’ll see:
- Summary header
– Metric name (e.g., Disk Queue Length).
– Severity chip (e.g., Warning).
– Status (e.g., Active).
– Timestamp when the anomaly was detected.
– Server and instance (e.g., Server: 192.168.15.50, Disk: F:).
– Current anomalous value.
- Simple explanation banner
A short text listing the top contributing features that led the model to mark this as an anomaly, for example:
“Top contributing features: Has Large Change From Previous, Time since last peak, Current value”. See Anomaly dimensions for a deeper understanding.
- Hover over an anomaly point to see its score and timestamp.
- Metric timeline
A detailed time series chart with normal samples and highlighted anomaly points. You can visually see how the anomaly compares to the rest of the period.
- Other related metrics
Small charts for metrics that often correlate with the current anomaly metric (e.g., Disk Bytes/sec, Disk Read/Write Latency when looking at Disk Queue Length). This helps you understand cause and effect.
- Actions
– Compare graphs – compare this metric with other metrics on the same server.
– Exclude time – mark this time range as “normal” (for example, maintenance), and this anomaly will not be detected for this time frame in the future.
– Dismiss – hide this anomaly.
– Delete – permanently remove the anomaly record.
– Add note – attach comments for your team (e.g., “Linked to deployment X”).
2.3 Enhanced anomaly analysis – Data Analysis tab
The Data Analysis tab exposes how the model “saw” this anomaly.
These are the feature dimensions used by the Isolation Forest; they can be adjusted in the anomaly settings/configuration.
Hour of day
- What it is: The hour when the anomalous sample occurred.
- Why it matters: Workloads are different at 03:00 vs 14:00.
The model learns typical behavior per hour and uses this as context.
- Example: Batches/sec might be high at 10:00 (peak business hours) but low at 01:00. A value of 5,000 batches/sec at 01:00 is far more suspicious.
Is above threshold
- What it is: Boolean indicator: is the metric currently above the configured threshold or percentile.
- Why it matters: Ensures the anomaly is not only “different” statistically, but also above a practical operational threshold.
- Example: CPU may spike from 5% to 40% (large change) but still be below the “CPU above 80th percentile” rule, so it might not be considered severe.
Last values all above threshold
- What it is: Boolean indicator: have the last N data points all exceeded the threshold used by “Is above threshold”.
- Why it matters: Helps detect persistent problems rather than one-off spikes.
- Example: Disk Queue Length is above the threshold for five consecutive samples; the feature becomes TRUE, and the anomaly is more important.
Avg last values
- What it is: Running average of the last N values before and including the anomaly.
- Why it matters: Captures sustained high load rather than a single spike.
- Example: Average CPU for the last 10 samples is 85%. Even if the exact anomaly point is 90%, the average indicates prolonged stress.
Std last values
- What it is: Standard deviation of the last N values.
- Why it matters: Shows how stable or noisy the metric has been.
- Example: If Disk Read Latency fluctuates a lot, a single high value might be less suspicious than a spike in a normally stable metric.
Has large change from previous
- What it is: Boolean – the value jumped sharply compared to the previous sample.
- Why it matters: Detects sudden incidents even when absolute levels are not extremely high yet.
- Example: CPU jumps from 20% to 80% within 2 minutes → flagged as a large change.
Distance from last high value
- What it is: Distance between the current value and the last recorded peak.
- Why it matters: Helps distinguish new extreme peaks from normal oscillations around a previous high.
- Example: Current Disk Bytes/sec is far above any previous high of the day, which strongly suggests an anomaly.
Is weekend
- What it is: Boolean – is this sample on a weekend or holiday (according to your configuration).
- Why it matters: Baselines may be very different on weekends or holidays, especially for business workloads.
- Example: On Sunday the system is usually idle; high activity may be suspicious or a special event.
3. Configuring anomalies
Configuration is done under Settings → Anomalies.
3.1 Templates vs Custom setup
There are two levels of configuration:
3.1.1 Templates
Under Templates, Experda provides default anomaly configurations, such as:
- Host
– Anomaly OS Default – covers CPU, Disk Bytes/sec, Disk Queue Length, etc.
- SQL
– Anomaly SQL Default – covers Batches/sec, Compilations/sec, Distinct client computers, and other SQL metrics.
You can:
- Use them as-is for quick onboarding.
- Edit thresholds and dimensions.
- Attach templates to groups of servers via Affiliated server(s).
3.1.2 Custom server setup
Under Custom setup, you can override template settings per server:
- Select a specific server (e.g., 192.168.15.50).
- Configure Host and SQL anomaly rules just for that server.
- Enable/disable individual metrics (CPU, Disk Queue Length, etc.).
This is useful when:
- Some servers handle batch workloads with different patterns.
- Test/dev servers should be monitored differently from production.
- Certain metrics are irrelevant for a given server.
4. Rule details – dimensions & examples
Each metric rule (CPU, Batches Per Sec, Disk Queue Length, etc.) has a Rule Details page.
Below are the main configuration dimensions and how to use them, illustrated with the CPU anomaly detection rule as an example.
4.1 Severity
Choose Information, Warning, or Critical.
- Information: used for non-urgent deviations (e.g., temporary increases in Batches/sec).
- Warning: used for cases that may impact performance.
- Critical: used for conditions that may lead to downtime or data issues.
Example:
CPU above 80th percentile for 10 minutes → Warning
CPU above 95th percentile for 30 minutes + disk queue anomalies → Critical
4.2 Metric threshold: “X above”
For CPU
- CPU above – a percentile-based threshold (e.g., 80).
- Meaning: the current CPU value must be higher than 80% of historical values to be considered for an anomaly.
Example:
CPU above = 80
If the 80th percentile is 65%, then current CPU must exceed 65% to qualify.
For Batches Per Sec (and similar metrics)
- Batches per sec above – a percentage compared to the learned baseline.
- For example, 50% means “current value must be at least 50% higher than the learned baseline” or above the 50th percentile (depending on how you calibrated it).
Example:
Normal Batches/sec ≈ 1,000.
Batches per sec above = 50% → anomalies start around 1,500+.
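A hedged sketch of how these two “above” checks could be evaluated (illustrative heuristics, not Experda's exact calibration):

```python
import math

def above_percentile(value, history, pct=80):
    """True if `value` exceeds the pct-th percentile of `history`
    (nearest-rank method)."""
    ranked = sorted(history)
    k = max(0, math.ceil(pct * len(ranked) / 100) - 1)
    return value > ranked[k]

def above_baseline(value, baseline, pct_above=50):
    """True if `value` is at least pct_above percent over the baseline."""
    return value > baseline * (1 + pct_above / 100)
```

With a normal Batches/sec baseline of 1,000 and `pct_above=50`, values start qualifying above 1,500, matching the example above.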
4.3 Running average window (“Running X avg of”)
- What it is: number of samples used to calculate the running average.
- Example fields:
– Running CPU avg of: 5
– Running Batches per sec avg of: 10
Example:
If samples are collected every 2 minutes and Running CPU avg of = 5, the model uses the last 10 minutes of CPU data to compute the context average.
Use smaller values for fast-changing metrics; larger values for smoother, long-term trends.
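For instance, the running window described above could be computed like this (a sketch; the exact window semantics are an assumption):

```python
def running_avg(values, n):
    """Average of the last n samples (or fewer, early in the series)."""
    window = values[-n:]
    return sum(window) / len(window)
```

With 2-minute sampling and n = 5, this covers the last 10 minutes, matching the example above.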
4.4 Consecutive high events
- Field:
– Consecutive high Batches per sec of
– Consecutive high CPU events
- What it is: how many samples in a row must be “high” before an anomaly is raised.
Example:
Consecutive high Batches per sec of = 5
With 2-minute sampling, Experda requires 10 consecutive minutes of high values.
This prevents false positives from one-off spikes.
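The consecutive-high logic can be sketched as follows (hypothetical helpers, not Experda's code):

```python
def consecutive_high(values, threshold):
    """How many of the most recent samples exceed the threshold, in a row."""
    count = 0
    for v in reversed(values):
        if v <= threshold:
            break
        count += 1
    return count

def should_raise_anomaly(values, threshold, required=5):
    """Raise only after `required` consecutive high samples,
    filtering out one-off spikes."""
    return consecutive_high(values, threshold) >= required
```

A single spike resets nothing for the future, but it never satisfies the `required` streak on its own, which is exactly the false-positive filtering described above.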
4.5 Weekend / holiday effect enabled
- Toggle: Weekend / holiday effect enabled
- What it does: tells the model to treat weekends and configured holidays differently when learning normal ranges.
Example:
On weekends, CPU usage is usually below 10%. A spike to 60% on Sunday is much more suspicious than the same value on Monday. Enabling this effect helps the model reflect that difference.
Turn this off if the server has a 24/7 uniform workload with no weekday/weekend difference.
4.6 Large change from previous chronological sample
- Toggle: Large change from previous chronological sample
- What it does: emphasizes sudden jumps or drops in the metric value.
Example:
Disk Queue Length jumps from 1 to 40 within a single interval. Even if 40 is not “extreme” compared to historical peaks, the sudden jump is itself anomalous.
Useful for catching incidents at their onset, before long-term averages adapt.
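A simple sketch of such a jump detector (the thresholds here are invented for illustration):

```python
def has_large_change(prev, curr, ratio=3.0, floor=5.0):
    """Flag a sudden jump or drop: the change must exceed both an
    absolute floor and `ratio` times the previous magnitude."""
    delta = abs(curr - prev)
    return delta >= floor and delta >= ratio * max(abs(prev), 1.0)
```

The absolute floor keeps tiny values (e.g., a queue going from 0 to 2) from being flagged just because the relative change is large.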
4.7 Hour of day effect enabled
- Toggle: Hour of day effect enabled
- What it does: allows baselines to be different per hour of the day.
Example:
Nighttime jobs may run at 02:00 every day with very high Disk Bytes/sec. With the hour-of-day effect on, the model learns 02:00 separately and won’t flag those jobs as anomalies.
Turn this off for servers with very flat usage across the day, where hour-by-hour patterns are not relevant.
4.8 Use distance from last high
- Toggle: Use distance from last high.
- What it does: adds the “Distance from last high value” feature into the anomaly calculation. The longer a high value has been absent, the larger its impact when it reappears.
Example:
Disk Read Latency reaches 10 ms while previous maximum for the day was 2 ms – distance is large, so the anomaly score increases.
4.9 Standard deviation over last N samples
- Field: Std dev over last [N] samples (with a toggle)
- What it does: tells the model to explicitly consider short-term variability.
Example:
Std dev over last 20 samples enabled.
If Batches/sec is normally stable (std dev small), any sudden increase in variability itself may be anomalous.
Use this when you care about noise and instability, not just raw levels.
4.10 Absolute CPU filter (CPU >)
- Field: CPU > (e.g., 100)
- What it does: an absolute guardrail: only consider anomalies when CPU exceeds a given percentage.
Example:
CPU > 60 ensures no anomaly is raised when CPU is at 40%, even if the model thinks it’s unusual for that specific time.
This is useful when you want to avoid being alerted about deviations that are still operationally harmless.
5. Alerts and email digests
Per-rule email alerts
At the bottom of each Rule Details screen:
- Email toggle – enable or disable email alerts for this rule.
- Send to contacts – choose one or more named contacts from your Experda contacts list.
- Send to groups – choose one or more groups (e.g., “DBA On-Call”, “Ops Team”).
When enabled, these recipients will receive alerts whenever a new anomaly is detected for that metric, server, and severity combination.
Example:
For the CPU rule on a production server, send Critical anomalies to the “DBA On-Call” group and Warning anomalies to both “DBA Team” and “Ops Team”.
5.1 12-hour anomaly email digest
In addition to per-rule alerts, Experda sends a periodic email digest:
- Covers the last 12 hours.
- Includes all anomalies that:
– Occurred in that period, and
– The recipient is registered to (via the rule’s contacts or groups).
The digest typically contains:
- Summary counts per server and per metric.
- Top critical anomalies with direct links to AI insights.
- Short explanations (top contributing features) so you can quickly triage.
This helps busy teams avoid alert fatigue while still maintaining situational awareness across the environment.
You can set the anomaly digest interval in the global settings, under the anomaly category (see image below).
6. Best practices
6.1 Start with templates
- Begin with Anomaly OS Default and Anomaly SQL Default.
- Run them for a few days to allow the models to stabilize their baselines.
- Review anomalies in AI Insights to see which are useful and which are noise.
6.2 Tune thresholds and dimensions
- If you see too many minor alerts:
– Increase CPU above percentile.
– Increase Consecutive high events.
– Enable Weekend/holiday and Hour of day effects to add context.
- If you are missing incidents:
– Decrease thresholds.
– Enable Large change from previous sample and Use distance from last high.
6.3 Use custom setup for special servers
- Batch servers / data warehouses often have nightly spikes – tune their rules separately.
- Dev/test servers might need Information-only severity or even disabled rules.











