## Data Modeling: When to use a linear model

As you probably know, Excel lacks many sophisticated analytic techniques, and not everyone has access to a true analytics package like Stata or SPSS. So what can you do if you’re modeling data and trying to decide on the type of model? Will a linear model work, or do you need to consider an exponential or polynomial model to explain your dataset? Picking the wrong model is a mistake that even the best-intentioned researchers sometimes make, and the incorrect model can even end up being published. I’m going to cover one technique, the Durbin-Watson test, and explain how to use it to check whether a linear regression model is a reasonable fit for your dataset.

First off, you can use Excel to add a linear trendline to your data. Make sure you have Excel give you the equation for the trendline – we’ll need this to compute the Durbin-Watson.

Let’s say you’ve run the regression and ended up with the formula y = 0.1897x + 5.0517. The next step is to calculate your fitted values.
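If you want to cross-check the trendline outside of Excel, here is a minimal Python sketch of an ordinary least-squares fit. The function name `fit_line` and the sample data are made up for illustration and are not the dataset behind the equation above.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.4f}x + {intercept:.4f}")
```

The slope and intercept should match what Excel reports for a linear trendline on the same points, up to rounding.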

For the next two steps (Yhat and residuals), you can either get these with Excel or calculate them manually. I’ll cover the manual approach.

If your first point was (2, 6), then the Yhat (fitted value) is 2 × 0.1897 + 5.0517 = 5.4311. Do this for all of your data points.

The next step is to calculate residual values. The residual value is simply the real value minus the fitted value. So, for our above example, the residual value is 6 – 5.4311 = 0.5689.
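The same two calculations can be sketched in Python. The coefficients come from the example equation in this post; the function names `fitted_value` and `residual` are just illustrative.

```python
# Coefficients from the example trendline y = 0.1897x + 5.0517.
SLOPE = 0.1897
INTERCEPT = 5.0517


def fitted_value(x, slope=SLOPE, intercept=INTERCEPT):
    """Yhat: the value the trendline predicts for x."""
    return slope * x + intercept


def residual(x, y, slope=SLOPE, intercept=INTERCEPT):
    """Observed value minus fitted value."""
    return y - fitted_value(x, slope, intercept)


# The example point (2, 6) from the text.
x, y = 2, 6
print(round(fitted_value(x), 4))  # 5.4311
print(round(residual(x, y), 4))   # 0.5689
```

To cover a whole dataset, apply `residual` to every (x, y) pair and keep the results in the same order as the observations.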

The next step is to calculate the sum of squared differences between successive residuals, running from the second observation to the nth observation. Keep your observations in their natural order (sorted by x, or by time for time-series data), because the result depends on which residuals are neighbors. For each residual in your list starting with the second one, take that residual, subtract the previous residual, and square the result. Then sum all of these values.

The next-to-last step is to calculate the residual sum of squares. For this, square each individual residual and then sum the entire range. This is your sum of squares.
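Here is a small Python sketch of these two sums. The function names and the residual values are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
def sum_squared_successive_differences(residuals):
    """Sum of (current - previous)^2, starting from the second residual."""
    return sum(
        (curr - prev) ** 2
        for prev, curr in zip(residuals, residuals[1:])
    )


def sum_squared_residuals(residuals):
    """Sum of each residual squared."""
    return sum(e ** 2 for e in residuals)


# Hypothetical residuals, listed in observation order.
residuals = [0.5, -0.5, 1.0, -1.0]

# Differences: -1.0, 1.5, -2.0 -> squares: 1.0, 2.25, 4.0
print(sum_squared_successive_differences(residuals))  # 7.25

# Squares: 0.25, 0.25, 1.0, 1.0
print(sum_squared_residuals(residuals))  # 2.5
```

Note that the first sum has one fewer term than the second, since the first residual has no predecessor.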

Finally, to calculate the Durbin-Watson statistic, divide the sum of squared differences of successive residuals by the sum of squares. This value can then be compared to online lookup tables (such as at http://www.paolocoletti.it/statistics/exercises/Durbin-Watson.html) to see if a linear model is appropriate.
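Written as a formula, the calculation described above looks like this. The notation is my own shorthand rather than anything Excel gives you: d is the Durbin-Watson statistic, e_t is the residual for the t-th observation, and n is the number of observations.

$$
d = \frac{\sum_{t=2}^{n} \left(e_t - e_{t-1}\right)^2}{\sum_{t=1}^{n} e_t^{2}}
$$

The numerator is the sum of squared successive differences and the denominator is the sum of squares.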

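For example, if you have one independent variable and 15 observations (k = 1, n = 15), then at the 5% significance level dL = 1.08 and dU = 1.36. The statistic always falls between 0 and 4, and a value near 2 means the residuals show no autocorrelation. If your value is above dU (and below 4 − dU), there is no evidence of autocorrelation and a linear model is probably appropriate for your data. If your value falls between dL and dU, the test is inconclusive.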

However, if your value is below dL, for example 0.9, then your model exhibits positive autocorrelation (i.e. the errors are not independent), and a straight-line model is not a valid description of the data. You then have to consider other models or transform the data (natural log, square root, etc.) before modeling, and then re-compute the Durbin-Watson statistic.
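The table lookup can be sketched as a small Python function. This follows the usual textbook decision rule for the Durbin-Watson test, including the mirror-image upper bounds (4 − dU and 4 − dL) for negative autocorrelation, which this post does not otherwise cover. The function name and the returned labels are my own; dL and dU still have to come from a lookup table.

```python
def interpret_durbin_watson(d, d_lower, d_upper):
    """Classify a Durbin-Watson statistic against table bounds dL and dU."""
    if not 0 <= d <= 4:
        raise ValueError("the Durbin-Watson statistic must be between 0 and 4")
    if d < d_lower:
        return "positive autocorrelation"
    if d <= d_upper:
        return "inconclusive"
    if d < 4 - d_upper:
        return "no evidence of autocorrelation"
    if d <= 4 - d_lower:
        return "inconclusive"
    return "negative autocorrelation"


# Bounds for k = 1, n = 15 from the example above.
D_LOWER, D_UPPER = 1.08, 1.36

for d in (0.9, 1.2, 1.8):
    print(d, "->", interpret_durbin_watson(d, D_LOWER, D_UPPER))
```

Only the "no evidence of autocorrelation" outcome supports keeping the linear model as is.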

Thanks for reading this brief tutorial on data modeling and good luck with your analyses! 