Poisson Distribution and Regression (+ Zero Inflation)

Poisson Distribution (from Wikipedia)

  • “… is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.”
  • Wow, it fits the dependent variable in my research exactly:
    • a fixed interval of time and space
    • time intervals between events have a known average, but are independent of each other
  • If a discrete random variable X has a Poisson distribution with parameter lambda > 0, then
    • lambda = E(X) = mean = variance
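The mean-equals-variance property is easy to verify numerically. A minimal sketch using NumPy (the rate value 3.5 is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
lam = 3.5                            # arbitrary rate parameter lambda
samples = rng.poisson(lam, size=100_000)

# For Poisson-distributed data, both statistics estimate lambda.
print(samples.mean())                # close to 3.5
print(samples.var())                 # also close to 3.5
```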

Poisson Regression (from Wikipedia)

  • Poisson regression assumes that
    • the response variable Y has a Poisson distribution, and
    • the logarithm of its expected value can be modeled by a linear combination of unknown parameters.
  • Also called log-linear model
  • Zero inflation?
    • “Another common problem with Poisson regression is excess zeros: if there are two processes at work, one determining whether there are zero events or any events, and a Poisson process determining how many events there are, there will be more zeros than a Poisson regression would predict. An example would be the distribution of cigarettes smoked in an hour by members of a group where some individuals are non-smokers.”
    • Zero-inflated model
      • “… a zero-inflated model is a statistical model based on a zero-inflated probability distribution, i.e. a distribution that allows for frequent zero-valued observations.”
      • “The zero-inflated Poisson model concerns a random event containing excess zero-count data in unit time.”: this fits my dataset!
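The two-process description above can be simulated directly. A sketch (NumPy only; the parameter values are made up for illustration): a Bernoulli process decides whether there are any events at all, a Poisson process decides how many, and the resulting zero fraction exceeds what a single Poisson with the same mean would predict:

```python
import numpy as np

rng = np.random.default_rng(0)
n, pi, lam = 100_000, 0.3, 2.0          # pi: prob. of a structural zero

is_zero = rng.random(n) < pi            # process 1: zero events or any events
counts = np.where(is_zero, 0, rng.poisson(lam, n))  # process 2: how many

observed_zero_frac = (counts == 0).mean()
# zero fraction implied by a single Poisson with the same overall mean
poisson_zero_frac = np.exp(-counts.mean())

print(observed_zero_frac)               # noticeably larger than the
print(poisson_zero_frac)                # plain-Poisson prediction
```

For actually fitting such data, StatsModels provides a `ZeroInflatedPoisson` model in `statsmodels.discrete.count_model`; the sketch above only illustrates why the excess zeros arise.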

In practice


Multicollinearity (mainly from Wikipedia)


Meaning and Effect

  • Multicollinearity (also collinearity) is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a non-trivial degree of accuracy.
  • Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data themselves; it only affects calculations regarding individual predictors.


Detection

  • Large changes in the estimated regression coefficients when a predictor variable is added or deleted.
  • A variable that is statistically significant in a simple linear regression model but not in a multiple regression model.
  • A condition number above 30; in that case the regression is said to have significant multicollinearity. The condition number appears in the regression results summary of Python’s StatsModels.
  • The correlation matrix: off-diagonal values of at least 0.4 are sometimes interpreted as indicating a multicollinearity problem.
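Both of the last two checks can be sketched with NumPy alone (simulated data; the variable names are made up, and StatsModels reports an analogous condition number as “Cond. No.” in its OLS summary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # nearly collinear with x1
x3 = rng.normal(size=n)

# design matrix including the regression constant
X = np.column_stack([np.ones(n), x1, x2, x3])

cond_number = np.linalg.cond(X)             # ratio of singular values
corr = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)

print(cond_number)                          # far above 30 here
print(corr[0, 1])                           # near 1: x1 and x2 are collinear
```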


Remedies

  • Never include dummy variables for every category together with the regression constant: that produces perfect multicollinearity (the dummy variable trap).
  • Drop one of the variables. However, you lose information (because you’ve dropped a variable). Omission of a relevant variable results in biased coefficient estimates for the remaining explanatory variables that are correlated with the dropped variable.
  • Standardize your independent variables. This may help reduce false flagging of a condition number above 30. Scikit-learn (not NumPy) provides a function for this, `sklearn.preprocessing.scale`. Of course, the drawbacks of standardization should also be considered when interpreting the regression results.
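Standardization itself is just a column-wise z-score. A NumPy-only sketch (simulated data), equivalent to what `sklearn.preprocessing.scale` computes:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))   # raw predictors

# subtract each column's mean, divide by each column's std
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))   # ~0 for every column
print(X_std.std(axis=0))    # ~1 for every column
```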

Heteroscedasticity (mainly from Wikipedia)

Meaning and Effect

  • In statistics, a collection of random variables is heteroscedastic if there are sub-populations that have different variabilities from others. That is, if the variance of dependent values in a dataset is (significantly) different depending on the related independent values, we can say that the dataset is heteroscedastic.
  • Example 1: A classic example of heteroscedasticity is that of income versus expenditure on meals. As one’s income increases, the variability of food consumption will increase. A poorer person will spend a rather constant amount by always eating inexpensive food; a wealthier person may occasionally buy inexpensive food and at other times eat expensive meals. Those with higher incomes display a greater variability of food consumption. (Higher income, the independent variable, causes higher variance in expenditure on meals, the dependent variable.)


Heteroscedasticity does not cause ordinary least squares coefficient estimates to be biased, although it can cause ordinary least squares estimates of the variance (and, thus, standard errors) of the coefficients to be biased, possibly above or below the true or population variance. Thus, regression analysis using heteroscedastic data will still provide an unbiased estimate for the relationship between the predictor variable and the outcome, but standard errors and therefore inferences obtained from the analysis are suspect.

Biased standard errors lead to biased inference, so the results of hypothesis tests are possibly wrong. For example, if OLS is performed on a heteroscedastic data set, yielding biased standard error estimation, a researcher might fail to reject a null hypothesis at a given significance level when that null hypothesis was actually false in the population (making a type II error).
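A simulated illustration of both points (NumPy only; all numbers are made up): the OLS slope stays close to the true value even with heteroscedastic errors, but the residual spread differs sharply between sub-populations:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
income = rng.uniform(1.0, 10.0, n)
# noise standard deviation grows with income -> heteroscedastic errors
expenditure = 2.0 + 0.5 * income + rng.normal(0.0, 0.3 * income)

X = np.column_stack([np.ones(n), income])
beta, *_ = np.linalg.lstsq(X, expenditure, rcond=None)
resid = expenditure - X @ beta

low_var = resid[income < 3.0].var()     # low-income sub-population
high_var = resid[income > 8.0].var()    # high-income sub-population

print(beta[1])                          # slope still near the true 0.5
print(low_var, high_var)                # variance much larger for high incomes
```

When heteroscedasticity is confirmed, heteroscedasticity-consistent (robust) standard errors are the usual fix; StatsModels supports them via `fit(cov_type='HC1')` on an OLS model.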
