In applications of regression analysis, the data set often contains unusual observations that are either outliers (noise) or influential observations. Such observations may have large residuals, affect the estimated regression coefficients and the whole regression analysis, and become a source of misleading results and interpretations. It is therefore important to examine these suspect observations carefully and decide whether they should be included in or removed from the analysis.
In regression analysis, a basic step is to determine whether one or more observations can influence the results and interpretation of the analysis. If the regression has one independent variable, unusual observations in the dependent and independent variables are easy to detect using a scatter plot, box plot, residual plot, etc. However, graphical identification of outliers and/or influential observations is a subjective approach. It is also well known that multiple outliers can produce a masking or swamping effect. Masking (a false negative) occurs when an outlying subset remains undetected due to the presence of another, usually adjacent, subset. Swamping (a false positive) occurs when a usual observation is incorrectly identified as an outlier in the presence of another, usually remote, subset of observations.
In the present study, some well-known diagnostics are compared for identifying multiple influential observations. First, robust regression methods are used to identify influential observations in Poisson regression. Then, to confirm that the observations identified by the robust regression methods are genuinely influential, diagnostic measures based on the single case deletion approach are considered: Pearson chi-square, deviance residual, hat matrix, likelihood residual test, Cook’s distance, difference of fits, and squared difference in beta. In the presence of masking and swamping, however, diagnostics based on single case deletion fail to identify outliers and influential observations. Therefore, to remove or minimize the masking and swamping phenomena, some group deletion approaches are used: generalized standardized Pearson residual, generalized difference of fits, and generalized squared difference in beta.
3.2 Diagnostic measures based on single case deletion
This section presents the details of the single case deletion measures used to identify multiple influential observations in the Poisson regression model. These measures are the change in Pearson chi-square, change in deviance, hat matrix, likelihood residual test, Cook’s distance, difference of fits (DFFITS), and squared difference in beta (SDFBETA).
- Pearson chi-square
To show the amount of change in the Poisson regression estimates that would occur if the kth observation were deleted, the Pearson $\chi^2$ statistic is proposed for detecting outliers. Diagnostic statistics of this kind examine the effect of deleting a single case on the overall summary measures of fit.
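Let $\chi^2$ denote the Pearson chi-square statistic and $\chi^2_{(-k)}$ the statistic after case k is deleted. Using the one-step linear approximation given by Pregibon (1981), the decrease in the value of the statistic due to deletion of the kth case is
$$\Delta\chi^2_{(-k)} = \chi^2 - \chi^2_{(-k)}, \quad k = 1, 2, 3, \ldots, n \tag{3.1}$$
$\chi^2$ is defined as:
$$\chi^2 = \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i} \tag{3.2}$$
and for the kth deleted case it is:
$$\chi^2_{(-k)} = \sum_{i \neq k} \frac{(y_i - \hat{\mu}_{i(-k)})^2}{\hat{\mu}_{i(-k)}} \tag{3.3}$$
where $\hat{\mu}_{i(-k)}$ is the fitted mean of case i when the model is estimated without case k.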
- Deviance residual
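The one-step linear approximation for the change in deviance when the kth case is deleted is:
$$\Delta D_{(-k)} = D - D_{(-k)} \tag{3.4}$$
Because the deviance measures the goodness of fit of a model, a substantial decrease in the deviance after deletion of the kth observation indicates that this observation is a misfit. The deviance of the Poisson regression including the kth observation is:
$$D = 2 \sum_{i=1}^{n} \left[ y_i \log\!\left(\frac{y_i}{\hat{\mu}_i}\right) - (y_i - \hat{\mu}_i) \right] \tag{3.5}$$
where $\hat{\mu}_i = \exp(x_i^T \hat{\beta})$, and after deletion of the kth case it is:
$$D_{(-k)} = 2 \sum_{i \neq k} \left[ y_i \log\!\left(\frac{y_i}{\hat{\mu}_{i(-k)}}\right) - (y_i - \hat{\mu}_{i(-k)}) \right] \tag{3.6}$$
A larger value of $\Delta D_{(-k)}$ indicates that the kth observation is an outlier.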
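The two deletion statistics above can be checked numerically. The following is a minimal sketch, assuming a log-link Poisson model fitted by iteratively reweighted least squares. It refits the model without each case instead of using Pregibon's one-step approximation, so its values will differ somewhat from the approximated ones. The helper names (`fit_poisson`, `pearson_chi2`, `poisson_deviance`) and the simulated data with one planted outlier are illustrative assumptions, not part of the study.

```python
import numpy as np

def fit_poisson(X, y, n_iter=100, tol=1e-10):
    """Fit a log-link Poisson regression by IRLS (hypothetical helper)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())          # assumes column 0 of X is the intercept
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu         # working response
        XtV = X.T * mu                  # X^T V with V = diag(mu)
        new_beta = np.linalg.solve(XtV @ X, XtV @ z)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

def pearson_chi2(y, mu):
    return np.sum((y - mu) ** 2 / mu)

def poisson_deviance(y, mu):
    ratio = np.where(y > 0, y / mu, 1.0)   # y*log(y/mu) is taken as 0 when y = 0
    return 2.0 * np.sum(y * np.log(ratio) - (y - mu))

# Simulated data (assumption): 40 cases, case 10 replaced by a gross outlier.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 40)
X = np.column_stack([np.ones_like(x), x])
y = rng.poisson(np.exp(0.5 + 0.8 * x)).astype(float)
y[10] = 40.0

mu_full = np.exp(X @ fit_poisson(X, y))
chi2_full = pearson_chi2(y, mu_full)
dev_full = poisson_deviance(y, mu_full)

delta_chi2 = np.empty(len(y))
delta_dev = np.empty(len(y))
for k in range(len(y)):
    keep = np.arange(len(y)) != k       # drop case k and refit exactly
    mu_k = np.exp(X[keep] @ fit_poisson(X[keep], y[keep]))
    delta_chi2[k] = chi2_full - pearson_chi2(y[keep], mu_k)
    delta_dev[k] = dev_full - poisson_deviance(y[keep], mu_k)

print("largest change in deviance at case", int(np.argmax(delta_dev)))
```

Exact refitting costs n model fits, which is why one-step approximations are attractive for large data sets; for a small example the exact version is easier to verify.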
- Hat matrix:
The hat matrix is used in residual diagnostics to measure the influence of each observation. The hat values, $h_{ii}$, are the diagonal entries of the hat matrix, which is calculated using
$$H = V^{1/2} X (X^T V X)^{-1} X^T V^{1/2}$$
In the Poisson regression model,
$g(\mu_i) = \eta_i = x_i^T \beta$, where the function g is usually called the link function and V is the diagonal weight matrix. With the log link in Poisson regression, $V = \mathrm{diag}(\hat{\mu}_1, \ldots, \hat{\mu}_n)$.
$(X^T V X)^{-1}$ is the estimated covariance matrix of $\hat{\beta}$ and $h_{ii}$ is the ith diagonal element of the hat matrix H. The diagonal elements of the hat matrix, i.e. the leverage values, have the properties
$$0 \le h_{ii} \le 1, \qquad \sum_{i=1}^{n} h_{ii} = k$$
where k is the number of parameters of the regression model, including the intercept term. An observation is said to be influential if $h_{ii} > ck/n$, where c is a suitably chosen constant such as 2, 3 or more. Using the twice-the-mean rule of thumb suggested by Hoaglin and Welsch (1978), an observation with $h_{ii} > 2k/n$ is considered influential.
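As an illustration of the leverage rule, the sketch below computes the diagonal of $H = V^{1/2} X (X^T V X)^{-1} X^T V^{1/2}$ for a log-link Poisson fit and flags cases with $h_{ii} > 2k/n$. The IRLS helper `fit_poisson` and the small data set, in which the last case is placed far from the others in x, are assumptions made only for this example.

```python
import numpy as np

def fit_poisson(X, y, n_iter=100, tol=1e-10):
    """Fit a log-link Poisson regression by IRLS (hypothetical helper)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())          # assumes column 0 of X is the intercept
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu         # working response
        XtV = X.T * mu                  # X^T V with V = diag(mu)
        new_beta = np.linalg.solve(XtV @ X, XtV @ z)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

# Illustrative data (assumption): 29 cases with x in [0, 2] and one remote case at x = 6.
x = np.append(np.linspace(0.0, 2.0, 29), 6.0)
X = np.column_stack([np.ones_like(x), x])
y = np.round(np.exp(0.5 + 0.3 * x))

beta = fit_poisson(X, y)
mu = np.exp(X @ beta)

A = np.sqrt(mu)[:, None] * X                 # V^{1/2} X
H = A @ np.linalg.solve(A.T @ A, A.T)        # V^{1/2} X (X^T V X)^{-1} X^T V^{1/2}
h = np.diag(H)                               # leverage values h_ii

n, k = X.shape
flagged = np.flatnonzero(h > 2 * k / n)      # twice-the-mean rule
print("sum of leverages:", h.sum())
print("cases with h_ii > 2k/n:", flagged)
```

Because V depends on the fitted means, leverage in Poisson regression depends on the response through $\hat{\mu}_i$ as well as on the position of $x_i$, unlike in ordinary least squares.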
- Likelihood residual test
For the detection of outliers, Williams (1987) introduced the likelihood residual. The squared likelihood residual is a weighted average of the squared standardized deviance and Pearson residuals and is defined as:
$$t_i^2 = h_{ii}\, r_{Pi}^2 + (1 - h_{ii})\, r_{Di}^2$$
It is approximately equal to the likelihood ratio statistic for testing whether an observation is an outlier, and $t_i$ is also called the approximate studentized residual. $r_{Pi}$ is the standardized Pearson residual, defined as:
$$r_{Pi} = \frac{y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i (1 - h_{ii})}}$$
$r_{Di}$ is the standardized deviance residual, defined as:
$$r_{Di} = \frac{d_i}{\sqrt{1 - h_{ii}}}$$
where $d_i = \mathrm{sign}(y_i - \hat{\mu}_i) \sqrt{2\left[ y_i \log(y_i/\hat{\mu}_i) - (y_i - \hat{\mu}_i) \right]}$ is called the deviance residual; it is another popular residual because the sum of squares of these residuals is the deviance statistic.
Because the average value, k/n, of $h_{ii}$ is small, $t_i$ is much closer to $r_{Di}$ than to $r_{Pi}$, and is therefore also approximately normally distributed. An observation is considered to be influential if |t(1, n
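The weighted-average structure of the likelihood residual can be made concrete with a short computation. This is a sketch under stated assumptions: a log-link Poisson model fitted with a hypothetical IRLS helper `fit_poisson`, the standard Poisson forms of the standardized Pearson and deviance residuals, and a small constructed data set in which case 5 is an outlier. No cut-off is applied here.

```python
import numpy as np

def fit_poisson(X, y, n_iter=100, tol=1e-10):
    """Fit a log-link Poisson regression by IRLS (hypothetical helper)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())          # assumes column 0 of X is the intercept
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu         # working response
        XtV = X.T * mu                  # X^T V with V = diag(mu)
        new_beta = np.linalg.solve(XtV @ X, XtV @ z)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

# Illustrative data (assumption): 30 well-behaved cases, case 5 made an outlier.
x = np.linspace(0.0, 2.0, 30)
X = np.column_stack([np.ones_like(x), x])
y = np.round(np.exp(0.5 + 0.8 * x))
y[5] = 25.0

mu = np.exp(X @ fit_poisson(X, y))

# Leverages: diagonal of V^{1/2} X (X^T V X)^{-1} X^T V^{1/2}
A = np.sqrt(mu)[:, None] * X
h = np.sum(A * np.linalg.solve(A.T @ A, A.T).T, axis=1)

# Standardized Pearson residual
r_p = (y - mu) / np.sqrt(mu * (1.0 - h))

# Deviance residual, with y*log(y/mu) taken as 0 when y = 0
ratio = np.where(y > 0, y / mu, 1.0)
unit_dev = np.maximum(2.0 * (y * np.log(ratio) - (y - mu)), 0.0)
d = np.sign(y - mu) * np.sqrt(unit_dev)
r_d = d / np.sqrt(1.0 - h)              # standardized deviance residual

# Likelihood residual: signed root of the weighted average of squares
t = np.sign(y - mu) * np.sqrt(h * r_p**2 + (1.0 - h) * r_d**2)

print("case with largest |t|:", int(np.argmax(np.abs(t))))
```

Since the weights are $h_{ii}$ and $1 - h_{ii}$, each $t_i^2$ lies between $r_{Pi}^2$ and $r_{Di}^2$, and for low-leverage cases it sits close to $r_{Di}^2$.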
- Difference of fits test (DFFITS)
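The difference of fits test for Poisson regression is defined as:
$$(\mathrm{DFFITS})_i = \frac{\hat{y}_i - \hat{y}_{i(-i)}}{\hat{\sigma}_{(-i)} \sqrt{h_{ii}}}, \quad i = 1, 2, 3, \ldots, n \tag{3.12}$$
where $\hat{y}_{i(-i)}$ and $\hat{\sigma}_{(-i)}$ are, respectively, the ith fitted response and the estimated standard error with the ith observation deleted. DFFITS can be expressed in terms of standardized Pearson residuals and leverage values as:
$$(\mathrm{DFFITS})_i = r_{Pi} \sqrt{\frac{h_{ii}}{1 - h_{ii}}} \tag{3.13}$$
with $r_{Pi}$ the standardized Pearson residual and $h_{ii}$ the leverage value of the ith observation.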
An observation is said to be influential if the value of DFFITS 2.
- Cook’s Distance:
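Cook (1977) suggested a statistic that measures the change in the parameter estimates caused by deleting each observation, defined as:
$$CD_i = \frac{(\hat{\beta}_{(-i)} - \hat{\beta})^T (X^T V X) (\hat{\beta}_{(-i)} - \hat{\beta})}{k} \tag{3.14}$$
where $\hat{\beta}_{(-i)}$ is the estimate of $\beta$ obtained without the ith observation. There is also a relationship between the difference of fits test and Cook’s distance, which can be expressed as:
Using the approximation suggested by Pregibon, CD can be expressed as:
$$CD_i = \frac{r_{Pi}^2\, h_{ii}}{k\,(1 - h_{ii})} \tag{3.16}$$
where $r_{Pi}$ is the standardized Pearson residual, $h_{ii}$ is the ith leverage value and k is the number of parameters.
An observation with a CD value greater than 1 is treated as influential.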
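The rule CD > 1 can be tried on a small example. The sketch below evaluates the Pregibon-type approximation $CD_i = r_{Pi}^2 h_{ii} / (k(1 - h_{ii}))$, with $r_{Pi}$ the standardized Pearson residual, for a log-link Poisson fit. The IRLS helper `fit_poisson` and the constructed data, in which the last case combines a remote x value with an unusually small count, are assumptions for illustration only.

```python
import numpy as np

def fit_poisson(X, y, n_iter=100, tol=1e-10):
    """Fit a log-link Poisson regression by IRLS (hypothetical helper)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())          # assumes column 0 of X is the intercept
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu         # working response
        XtV = X.T * mu                  # X^T V with V = diag(mu)
        new_beta = np.linalg.solve(XtV @ X, XtV @ z)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

# Illustrative data (assumption): the last case has remote x and a count far below the trend.
x = np.append(np.linspace(0.0, 2.0, 29), 4.0)
X = np.column_stack([np.ones_like(x), x])
y = np.round(np.exp(0.5 + 0.8 * x))
y[-1] = 1.0

n, k = X.shape
mu = np.exp(X @ fit_poisson(X, y))

# Leverages: diagonal of V^{1/2} X (X^T V X)^{-1} X^T V^{1/2}
A = np.sqrt(mu)[:, None] * X
h = np.sum(A * np.linalg.solve(A.T @ A, A.T).T, axis=1)

r_p = (y - mu) / np.sqrt(mu * (1.0 - h))      # standardized Pearson residual
cook = r_p**2 * h / (k * (1.0 - h))           # approximate Cook's distance

influential = np.flatnonzero(cook > 1.0)
print("cases with CD > 1:", influential)
```

The approximation needs only one model fit, because the effect of deleting a case is expressed through its residual and leverage from the full fit.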
- Squared Difference in Beta (SDFBETA)
This measure originates from the idea of Cook’s distance (1977), is based on the single case deletion diagnostic, and modifies DFBETA (Belsley et al., 1980). It is defined as
(SDFBETA)i = 3.17
After some necessary calculation, SDFBETA can be related to DFFITS as:
(SDFBETA)i = 3.18
The ith observation is influential if (SDFBETA)i
3.3 Diagnostic measures based on group deletion approach
This section presents the details of the group deletion measures used to identify multiple influential observations in the Poisson regression model. Multiple influential observations can distort the fit and create masking or swamping effects. Diagnostics based on group deletion are effective for identifying multiple influential observations and are free from the masking and swamping effects in the data. These measures are the generalized standardized Pearson residual (GSPR), generalized difference of fits (GDFFITS) and generalized squared difference in beta (GSDFBETA).
3.3.1 Generalized standardized Pearson residual (GSPR)
Imon and Hadi (2008) introduced the GSPR to identify multiple outliers; it is defined as:
= i 3.20
Here the weights and leverages are, respectively, the diagonal elements of V and H (hat matrix) computed from the remaining group. Observations with |GSPR| > 3 are considered outliers.
3.3.2 Generalized difference of fits (GDFFITS)
GDFFITS statistic can be expressed in terms of GSPR (Generalized standardized Pearson residual) and GWs (generalized weights).
GWs is denoted by and defined as:
for i 3.21
= for i 3.22
A value having is larger than, Median (MAD ( is considered to be influential i.e
> Median (MAD (
Finally GDFFITS is defined as
We consider the observation as influential if
3.3.3 Generalized squared difference in Beta (GSDFBETA)
In order to identify multiple outliers in a data set and to overcome the masking and swamping effects, GSDFBETA is defined as:
GSDFBETAi = for i 3.24
= for i 3.25
GSDFBETA can now be re-expressed in terms of GSPR and GWs:
GSDFBETAi = for i 3.26
= for i 3.27
A suggested cut-off value for the detection of influential observation is