Retracted Analysis Tutorial
In this tutorial we will use the Waste sample network, which is included with Bayes Server, to demonstrate how to use the Retracted Analysis tool.
The Retracted Analysis tool helps us understand which variable(s) are driving unusual behavior in a data set. For example, we could use it to investigate anomalies.
Bayes Server supports retracted-analysis with both discrete and continuous variables.
A related analysis tool is the use of Log Likelihoods over a data set, as shown in the Log Likelihood tutorial.
- Open the Waste sample network included with Bayes Server, either from the Start Page or by clicking Open on the File menu.
The Waste network with no evidence set should look like this...
- Download the data for this part of the tutorial from Waste Anomaly Detection data
The file contains 400 data points across 4 variables. The first 300 data points are 'normal', after which the system starts to degrade over the remaining 100.
Open the file just downloaded in a spreadsheet application.
Create a line chart for each of the following columns:
- CO2 Concentration
- Dust emission
- Metals emission
- Metals in waste
The line chart for CO2 Concentration, for example, should look similar to the following:
Verify that for each line chart, while there is variability, there is no obvious indicator that the system is degrading over the last 100 points.
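The same check can be approximated numerically rather than by eye. The following is a minimal sketch, not part of Bayes Server: it assumes a column has already been loaded into a Python list, uses synthetic values in place of the downloaded data, and takes the split at row 300 from the description above. The function name `segment_summary` is hypothetical.

```python
import random
import statistics

def segment_summary(values, split=300):
    """Return ((mean, stdev), (mean, stdev)) for the rows before and after `split`."""
    head, tail = values[:split], values[split:]
    return (
        (statistics.mean(head), statistics.stdev(head)),
        (statistics.mean(tail), statistics.stdev(tail)),
    )

# Synthetic stand-in for one column (NOT the tutorial data): 400 draws from
# the same distribution, so neither segment should stand out on its own.
rng = random.Random(0)
column = [rng.gauss(0.0, 1.0) for _ in range(400)]

(normal_mean, normal_sd), (late_mean, late_sd) = segment_summary(column)

# Shift of the late mean, measured in units of the early standard deviation.
shift = abs(late_mean - normal_mean) / normal_sd

print(f"first 300: mean={normal_mean:.3f} sd={normal_sd:.3f}")
print(f"last 100:  mean={late_mean:.3f} sd={late_sd:.3f}")
print(f"shift in sd units: {shift:.3f}")
```

A small shift for a column is consistent with a line chart that shows variability but no obvious degradation. To use real data, replace `column` with the values read from the downloaded file.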
With the Waste network still open, click the Analyze menu and then Retracted.
The Data Tables dialog will launch as shown below:
Click the ellipsis (...) button to the right of the Data Connection drop down.
Click Add
With the Load Excel file into memory tab selected, click Open File and select the file you just downloaded.
The New Data Connection dialog should look as follows:
Click OK to close the dialog.
Click Close to close the Data Connection Manager.
In the Data Tables dialog, choose the data connection just created, from the Data Connection drop down.
Then choose Sheet1 in the Data drop down.
The dialog should look as follows:
- Click Ok. This will launch the Data Map dialog shown below.
Check that the following 4 variables have been automatically mapped.
- CO2 Concentration
- Dust emission
- Metals emission
- Metals in waste
Click Ok. This will launch the Retracted Analysis dialog.
Choose the following variables as Target variables:
- CO2 Concentration
- Dust emission
- Metals emission
- Metals in waste
The dialog should look as follows:
- Click Run.
For CO2 Concentration, the results should look like the following 2 charts:
The first chart plots, for each row in the dataset, the predicted value of CO2 Concentration based on all the evidence except CO2 Concentration itself.
The second chart plots, for each row in the dataset, the log-likelihood based on all the evidence compared to the log-likelihood based on all the evidence except CO2 Concentration.
When a retracted value is significantly different from the actual value and the difference in log-likelihood is also clear, this could indicate that the variable is responsible (or partially responsible) for the anomalous data. This can help us pinpoint the cause of anomalies.
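To make the two quantities concrete, here is a minimal sketch using a toy model with two jointly Gaussian variables X and Y. It is an illustration only and is not how Bayes Server computes its results; the function names and parameter values are hypothetical. It relies on the identity log p(x, y) - log p(y) = log p(x | y), so the gap between the two log-likelihoods shows how well X is explained by the remaining evidence.

```python
import math

def normal_logpdf(value, mean, variance):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * variance) + (value - mean) ** 2 / variance)

def retract_x(x, y, mean_x, mean_y, sd_x, sd_y, rho):
    """Toy retracted analysis for X in a two-variable Gaussian model.

    Returns (predicted_x, loglik_all, loglik_without_x).
    """
    # Prediction of X from the remaining evidence (here, Y only).
    predicted_x = mean_x + rho * sd_x / sd_y * (y - mean_y)
    conditional_variance = sd_x ** 2 * (1.0 - rho ** 2)
    # Log-likelihood with X retracted: log p(y).
    loglik_without_x = normal_logpdf(y, mean_y, sd_y ** 2)
    # Log-likelihood of all evidence: log p(x, y) = log p(y) + log p(x | y).
    loglik_all = loglik_without_x + normal_logpdf(x, predicted_x, conditional_variance)
    return predicted_x, loglik_all, loglik_without_x

# Hypothetical model parameters: two strongly correlated variables.
model = dict(mean_x=0.0, mean_y=0.0, sd_x=1.0, sd_y=1.0, rho=0.9)

typical = retract_x(0.9, 1.0, **model)     # X agrees with what Y suggests
anomalous = retract_x(-1.0, 1.0, **model)  # X contradicts what Y suggests

for label, (pred, ll_all, ll_retracted) in [("typical", typical), ("anomalous", anomalous)]:
    print(f"{label}: predicted={pred:.2f} gap={ll_all - ll_retracted:.2f}")
```

In this toy model the anomalous row has both a prediction far from the actual value and a much larger drop in log-likelihood, which is the combination described above.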
In the CO2 Concentration plots, the actual and predicted values do appear to differ consistently towards the end of the dataset. However, there is little difference in log-likelihood, which is why the second chart looks like a single trace rather than the area plot that we will see for other variables.
- Inspect both plots for the other variables.
For the other 3 variables (Dust emission, Metals emission, Metals in waste), both the predicted values and the log-likelihood appear to differ from their counterparts.
This is an example of when univariate anomaly detection does not suffice. It is the combination of certain variables that must be analyzed.
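The point can be illustrated with a small sketch, again on a hypothetical two-variable Gaussian model rather than the Waste network. Each value is unremarkable on its own (a modest z-score), but the pair is highly unusual because it contradicts the correlation between the variables. The function names and numbers are assumptions chosen for illustration.

```python
import math

def z_score(value, mean, sd):
    """Univariate distance from the mean, in standard deviations."""
    return (value - mean) / sd

def mahalanobis_2d(x, y, mean_x, mean_y, sd_x, sd_y, rho):
    """Mahalanobis distance of (x, y) under a bivariate Gaussian with correlation rho."""
    zx = z_score(x, mean_x, sd_x)
    zy = z_score(y, mean_y, sd_y)
    squared = (zx * zx - 2.0 * rho * zx * zy + zy * zy) / (1.0 - rho * rho)
    return math.sqrt(squared)

# Hypothetical model: two variables that normally move together (rho = 0.9).
params = dict(mean_x=0.0, mean_y=0.0, sd_x=1.0, sd_y=1.0, rho=0.9)

# A point where the variables move in opposite directions.
x, y = 1.5, -1.5

univariate_x = abs(z_score(x, params["mean_x"], params["sd_x"]))
univariate_y = abs(z_score(y, params["mean_y"], params["sd_y"]))
joint = mahalanobis_2d(x, y, **params)

print(f"|z| for x: {univariate_x:.2f}, |z| for y: {univariate_y:.2f}")
print(f"joint Mahalanobis distance: {joint:.2f}")
```

A per-variable threshold of, say, 3 standard deviations would flag neither value, while the joint distance is far larger than either z-score.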
Retracted analysis can also be performed on discrete variables. Note that there is no discrete data in the dataset we downloaded earlier, so to test on discrete data, you would need a different dataset (e.g. you could generate some sample data from the Data Connection Manager dialog).