Locally Weighted Regression (LWR) or LOWESS

A basic assumption of linear regression is that the relationship between the independent and dependent variables is linear. But what if it is not? Can we still apply the idea of regression? The answer is yes: we can use locally weighted regression. LOESS or LOWESS (locally weighted scatterplot smoothing) can be applied when the relationship between the independent and dependent variables is non-linear.

LOESS and LOWESS are non-parametric regression methods that combine multiple regression models in a k-nearest-neighbour-based model.

Many learning algorithms, such as classical feedforward neural networks and support vector machines, are global learning systems (global function approximators): they fit one model by minimizing a global loss function such as the sum of squared errors. In contrast, local learning systems divide the global learning problem into multiple smaller, simpler learning problems, usually by splitting the cost function into multiple independent local cost functions. A disadvantage of global methods is that sometimes no parameter values can provide a sufficiently good approximation.

An alternative to global function approximation is Locally Weighted Learning (LWL). LWL methods are non-parametric, and each prediction is made by a local function. The basic idea is that, instead of building a global model for the whole function space, a local model is created for each point of interest from the data neighboring that query point. For this purpose, each data point is assigned a weighting factor that expresses its influence on the prediction. In general, data points close to the current query point receive a higher weight than data points far away.

LWL is also called lazy learning because the processing of the training data is deferred until a query point needs to be answered. This approach can make LWL an accurate function approximation method, and it is easy to add new training points.


The leftmost figure below shows the result of fitting y = θ0 + θ1x to a dataset. We see that the data doesn’t really lie on a straight line, and so the fit is not very good.

If instead we add an extra feature x² (x squared, i.e. x*x) and fit y = θ0 + θ1x + θ2x², we obtain a slightly better fit to the data (see middle figure). It might seem that the more features we add, the better. However, there is also a danger in adding too many features: the rightmost figure is the result of fitting a 5th-order polynomial. Even though the fitted curve passes through the data perfectly, we would not expect it to be a very good predictor. The figure on the left shows an instance of underfitting, in which the data clearly shows structure not captured by the model, and the figure on the right is an example of overfitting.

The choice of features is therefore important to ensure good performance of a learning algorithm. With the locally weighted linear regression (LWR) algorithm, assuming there is sufficient training data, the choice of features becomes less critical.
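The underfitting and overfitting behaviour described above can be reproduced with a small R sketch. The data here is made up for illustration (six noisy points from a square-root curve, chosen so that a 5th-order polynomial can pass through every point); it is not the dataset behind the figures.

```r
# Hypothetical data: six noisy samples of a curved relationship.
set.seed(1)
x <- seq(0, 5, length.out = 6)
y <- sqrt(x) + rnorm(6, sd = 0.1)

# Fit polynomials of order 1, 2 and 5 and record the residual sum of squares
# on the training data itself.
degrees <- c(1, 2, 5)
rss <- sapply(degrees, function(d) sum(resid(lm(y ~ poly(x, d)))^2))
names(rss) <- paste0("degree_", degrees)
rss
```

The training error shrinks as the order grows and is essentially zero for the 5th-order fit, which is exactly why training error alone cannot tell us that the curve is a poor predictor.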

In the original linear regression algorithm, to make a prediction at a query point x (i.e., to evaluate h(x)), we would:

  1. Fit θ to minimize the sum of squared errors Σi (y(i) − θᵀx(i))²
  2. Output the predicted value θᵀx

In contrast, the locally weighted linear regression algorithm does the following:

  1. Fit θ to minimize the weighted sum of squared errors Σi w(i) (y(i) − θᵀx(i))², where the w(i) are non-negative weights
  2. Output the predicted value θᵀx

A fairly standard choice for the weights is:

  w(i) = exp(−(x(i) − x)² / (2τ²))

The parameter τ controls how quickly the weight of a training example falls off with the distance of its x(i) from the query point x; τ is called the bandwidth parameter.

Note that the weights depend on the particular point x at which we are trying to evaluate h(x). Moreover, if |x(i) − x| is small, then w(i) is close to 1; and if |x(i) − x| is large, then w(i) is small.
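To get a feel for the bandwidth, here is a small R sketch. It assumes the Gaussian-shaped weight exp(−(x(i) − x)² / (2τ²)); the function name gaussian_weight and the sample points are made up for illustration.

```r
# Weight of training inputs x_i relative to a query point, for bandwidth tau.
# gaussian_weight is an illustrative name, not a function from any package.
gaussian_weight <- function(x_i, x_query, tau) {
  exp(-(x_i - x_query)^2 / (2 * tau^2))
}

# Training inputs at distance 0, 1, 5 and 20 from the query point 0.
x_i <- c(0, 1, 5, 20)

round(gaussian_weight(x_i, 0, tau = 1), 4)   # 1.0000 0.6065 0.0000 0.0000
round(gaussian_weight(x_i, 0, tau = 10), 4)  # 1.0000 0.9950 0.8825 0.1353
```

With a small τ only the very nearest points matter; with a large τ even distant points keep a noticeable weight, so the local fit behaves more like an ordinary global regression.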

Note also that while the formula for the weights takes a form that is cosmetically similar to the density of a Gaussian distribution, the w(i)'s do not directly have anything to do with Gaussians, and in particular, the w(i) are not random variables, normally distributed or otherwise.
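Putting the two steps together, a minimal sketch of locally weighted linear regression for a single input variable might look like this in R. It assumes Gaussian-shaped weights exp(−(x(i) − x)² / (2τ²)) and solves the weighted least squares problem in closed form; lwr_predict and its argument names are illustrative, not part of any package, and the sine data is made up.

```r
# Sketch of locally weighted linear regression at one query point.
lwr_predict <- function(x, y, x_query, tau) {
  # Weights: close to 1 near the query point, close to 0 far from it.
  w <- exp(-(x - x_query)^2 / (2 * tau^2))
  # Design matrix with an intercept column, so theta = (theta0, theta1).
  X <- cbind(1, x)
  # Weighted normal equations: (X' W X) theta = X' W y, with W = diag(w).
  # w * X scales row i of X by w[i].
  theta <- solve(t(X) %*% (w * X), t(X) %*% (w * y))
  # Step 2: evaluate the local line at the query point.
  as.numeric(c(1, x_query) %*% theta)
}

# Hypothetical non-linear data: a new theta is fitted for every query point.
x <- seq(0, 10, by = 0.5)
y <- sin(x)
sapply(c(2, 5, 8), function(q) lwr_predict(x, y, x_query = q, tau = 0.5))
```

Because θ is refitted for each query, nothing is learned ahead of time; the whole training set has to be kept around and revisited for every prediction.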

Implementation In R

str(cars)
'data.frame': 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
$ dist : num 2 10 4 22 16 10 18 26 34 17 ...

plot(speed~dist,data = cars)

Looking at the plot, we can say that there is some sort of positive relationship between these two variables.

plot(speed~dist,data = cars)
abline(lm(speed~dist,data = cars)) #Add the linear trend line

Looking at the trend line, we can see that the linear model does not fit the data well, especially when the speed is around 23 to 25.

So we try fitting a non-parametric curve to the data instead. The LOWESS curve looks at the data locally (by dividing the sample space into a number of regions), calculates a fitted value for each region, and joins all the calculated points together with a line. How local each fit is depends on the smoothing parameter.

lowess(cars$speed~cars$dist) #Default setting where f=2/3

$x
[1] 2 4 10 10 14 16 17 18 20 20 22 24 26 26 26 26 28 28 32
[20] 32 32 34 34 34 36 36 40 40 42 46 46 48 50 52 54 54 56 56
[39] 60 64 66 68 70 76 80 84 85 92 93 120

$y
[1] 5.419833 6.034063 7.861475 7.861475 9.069588 9.673301 9.975386
[8] 10.277712 10.883150 10.883150 11.488354 12.085713 12.660596 12.660596
[15] 12.660596 12.660596 13.201558 13.201558 14.237556 14.237556 14.237556
[22] 14.737713 14.737713 14.737713 15.251018 15.251018 16.103033 16.103033
[29] 16.481524 17.210519 17.210519 17.508543 17.853728 18.173957 18.483517
[36] 18.483517 18.704807 18.704807 19.028842 19.416744 19.664883 19.923202
[43] 20.182120 20.935868 21.346516 21.696448 21.781930 22.396404 22.487343
[50] 25.050362

$x holds the original distances from the dataset in sorted order, and $y holds the corresponding values calculated by the LOWESS algorithm.
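As a quick check of that structure, the result can be stored and inspected like any other list. This is a small sketch; the variable names fit and smoothed are arbitrary.

```r
# Store the LOWESS result instead of printing it.
fit <- lowess(cars$speed ~ cars$dist)

# Pair each sorted distance with its smoothed speed.
smoothed <- data.frame(dist = fit$x, fitted_speed = fit$y)
head(smoothed)

# The x component is just the sorted input distances.
isTRUE(all.equal(fit$x, as.numeric(sort(cars$dist))))
```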

lines(lowess(cars$speed~cars$dist,f=2/3),col="green") #Adds the curve to the existing plot

f: the proportion of points in the plot which influence the smooth at each value. Larger values give more smoothness.

The value of f determines whether the curve overfits or underfits the data.

When the smoothing constant is small, the curve tends to overfit the data (as the yellow line shows when f is 0.01).
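The effect of f can be compared directly by drawing two curves on the same plot. The sketch below uses the default f = 2/3 and the very small f = 0.01 mentioned above; the helper names smooth_for and wiggle are made up for illustration, and the "wiggle" measure (total up-and-down movement of the curve) is just one simple way to quantify how closely a curve chases individual points.

```r
# LOWESS fit of speed on dist for a given span f.
smooth_for <- function(f) lowess(cars$dist, cars$speed, f = f)

plot(speed ~ dist, data = cars)
lines(smooth_for(2/3), col = "green")    # smooth curve
lines(smooth_for(0.01), col = "yellow")  # follows individual points closely

# Total vertical movement of the fitted curve: larger means less smooth.
wiggle <- function(f) sum(abs(diff(smooth_for(f)$y)))
c(f_0.67 = wiggle(2/3), f_0.01 = wiggle(0.01))
```

A curve that moves up and down far more than the overall trend requires is fitting noise rather than structure, which is what overfitting looks like here.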

Disadvantages

  • Need to evaluate the whole dataset every time.
  • Higher computation cost.
  • Memory requirement increases as the dataset grows.

Applications

  • Autonomous helicopter
  • Inverse dynamics learning
  • Many applications in the field of computer vision