knitr::opts_chunk$set(echo = TRUE)
library(class)

Problem Description

The purpose of this note is to design a simulation study to assess the behavior of the mean prediction error of the k-Nearest-Neighbor (knn) algorithm as a function of the flexibility parameter \(k\). The algorithm will be applied to simulated data from the following distribution:

Let \((X_1, X_2, Y)\) be a random vector where \(X_1, X_2\) are i.i.d. random variables following uniform \(U(0,1)\) distribution. The conditional distribution of \(Y|(X_1=x_1, X_2=x_2)\) is Bernoulli with parameter \(p(x_1,x_2)\) defined as follows:

\[ p(x_1,x_2) = \begin{cases} \frac{1}{2} \left( \frac{x_2}{b(x_1)}\right)^a, & x_2 \leq b(x_1)\\ 1-\frac{1}{2} \left( \frac{1-x_2}{1-b(x_1)}\right)^a, & x_2 > b(x_1)\\ \end{cases} \] where \(a\) is a parameter \(a>1\) and \(b\) is a function \(b: [0,1]\to [0,1]\).

Note that when \(x_2 = b(x_1)\), \(p(x_1, b(x_1)) = 1/2\) and, for any \(x_1\), \(p(x_1, x_2)\) is continuously increasing in \(x_2\) from 0 to 1. In this sense the curve \((x_1,b(x_1)), x_1 \in [0,1]\) represents the Bayes boundary between categories 0 and 1 of the dependent variable \(Y\).
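
Indeed, substituting \(x_2 = b(x_1)\) in either branch gives \[ \frac{1}{2}\left(\frac{b(x_1)}{b(x_1)}\right)^a = \frac{1}{2} \qquad\text{and}\qquad 1-\frac{1}{2}\left(\frac{1-b(x_1)}{1-b(x_1)}\right)^a = \frac{1}{2}, \] so the two branches agree on the boundary, and the Bayes classifier assigns class 1 exactly when \(x_2 > b(x_1)\).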

Data Generation

The following functions are designed to generate training and test sets of variable size, given specific forms of the boundary function \(b(x)\) and parameter \(a\).

We first define a function for \(p(x_1, x_2)\), given \(b(x), a\):

# Conditional probability P(Y = 1 | X1 = x1, X2 = x2), given the boundary function b and parameter a
p1=function(x1,x2,a,b)
{
    if (x2<b(x1)) return((x2/b(x1))^a/2) else
      return(1-((1-x2)/(1-b(x1)))^a/2)
}
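
As a quick sanity check, on the boundary \(x_2 = b(x_1)\) the function should return \(1/2\); the flat boundary b_flat below is a throwaway function introduced only for this check:

b_flat=function(x) 0.5                  # illustrative boundary, not used later
p1(0.3, b_flat(0.3), a=2, b=b_flat)     # equals 0.5, as expected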

The following function creates a dataframe of size \(n\), with columns x1, x2, y, p, where each row corresponds to a random observation \((x_1, x_2, y)\) from the above distribution, with \(a\) and function \(b\) parametric.

create_dataset=function(n, a, b)
{
  # draw n points (x1, x2) uniformly on the unit square
  x1=runif(n, 0, 1)
  x2=runif(n, 0, 1)
  y=rep(0,n)
  p=rep(0,n)
  for (i in 1:n)
  {
    p[i]=p1(x1[i], x2[i], a, b)   # true probability P(Y=1 | x1, x2)
    y[i]=rbinom(1,1,p[i])         # Bernoulli draw of y given p
  }
  dataset=data.frame(x1,x2,y,p)
  return(dataset)
}
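
For the sample sizes used below the loop is perfectly adequate; a vectorized variant, equivalent in distribution and sketched here only as an optional alternative, could look as follows:

create_dataset_vec=function(n, a, b)
{
  x1=runif(n, 0, 1)
  x2=runif(n, 0, 1)
  # evaluate p1 observation by observation via mapply, then draw all y at once
  p=mapply(p1, x1, x2, MoreArgs=list(a=a, b=b))
  y=rbinom(n, 1, p)
  data.frame(x1, x2, y, p)
}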

As an example, we create a dataset of size 200, with parameter \(a=2\) and boundary function \[ b(x) = 0.7 + 0.1 x - 0.8 x^2 + 4 x^3 - 2.8 x^4 . \]

b=function(x) 0.7+0.1*x-0.8*x^2+4*x^3-2.8*x^4
a=2
dataset1=create_dataset(200,a,b)

We plot the data points \((x_1,x_2)\) with different colors and symbols for observations with \(y=0\) and \(y=1\), as well as the true boundary function.

plot(dataset1$x2~dataset1$x1, xlim=c(0,1), ylim=c(0,1), col=dataset1$y+1, pch=dataset1$y+2, cex=0.8, xlab="x1", ylab="x2")
x0=seq(0,1,0.01)
y0=b(x0)
lines(y0~x0)
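
As an optional check that the generator behaves as intended, the observed proportion of ones should be close to the average of the true probabilities stored in column p:

mean(dataset1$y)   # empirical proportion of y = 1
mean(dataset1$p)   # average true probability of y = 1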

KNN Algorithm Study

We can now simulate the application of the knn algorithm to estimate the mean prediction error as a function of \(k\). The objective is to assess how the mean prediction error (i.e., the mean misclassification rate) depends on the size of the training set and on the value of the parameter \(k\). We design the simulation experiment as follows.

We first generate a test set of size \(n_2\), which we keep fixed. To assess the bias and variance of the predictions we generate a large number \(N\) of training sets of size \(n_1\). For each training set \(m=1,\dots,N\) and each value \(k\) varying in a set \(K\) we apply the knn algorithm to predict the \(Y\) values of the test observations. For observation \(i\) of the test set, the neighborhood average \[ \hat{p}_i = \frac{1}{k} \sum_{j \in N(x_{i1}, x_{i2})} y_j \] estimates \(p(x_{i1},x_{i2})\), where \(N(x_{i1}, x_{i2})\) is the \(k\)-neighborhood of \((x_{i1}, x_{i2})\) in the training set, i.e., the \(k\) training observations closest to the point \((x_{i1}, x_{i2})\) according to the Euclidean distance; the predicted class is the majority vote \(\hat{y}_i = I(\hat{p}_i > 1/2)\). We then compute the misclassification rate on the test set based on training set \(m\) and value \(k\): \[ MPE(m,k)=1- \frac{1}{n_2} \sum_{i=1}^{n_2} I(\hat{y}_i = y_i) . \] Finally, we estimate the mean prediction error as a function of \(k\) by averaging the corresponding values of \(MPE(m,k)\) over all training sets: \[ MPE(k) = \frac{1}{N} \sum_{m=1}^N MPE(m,k) . \] The following function implements the above experiment with parametric values of \(n_1, n_2, N, K, a, b\):

mpe_estimation = function(n1, n2, N, Kvalues, a, b)
{
  # Kvalues is a vector of k-values
  nk=length(Kvalues)
  # the test set is generated once and kept fixed
  test_set=create_dataset(n2, a, b)
  testx=test_set[,1:2]
  testy=as.factor(test_set[,3])
  # The values of MPE(m,k) will be stored in matrix MPE
  MPE=matrix(0, N, nk)
  # loop over training sets
  for (m in 1:N)
  {
    training_set = create_dataset(n1, a, b)
    trainx=training_set[,1:2]
    trainy=as.factor(training_set[,3])
    # loop over k values
    for (l in 1:nk)
    {
      k=Kvalues[l]
      yhat=knn(trainx, testx, trainy, k=k)
      MPE[m,l]=1-mean(yhat==testy)
    }
  }
  # column means of MPE give the estimated MPE as a function of k
  mpe_vector=colMeans(MPE)
  # The function returns a matrix with columns Kvalues and mpe_vector
  return(cbind(Kvalues, mpe_vector))
}
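
Before launching the full experiment, the function can be tried on a small scale; the seed and the small values of \(N\), \(n_2\) and \(K\) below are arbitrary choices made only for a quick check:

set.seed(123)   # arbitrary seed, for reproducibility of this check
mpe_estimation(n1=50, n2=50, N=10, Kvalues=c(1,5,10), a=a, b=b)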

We first apply the function for the above values of \(a, b\), training set size \(n_1=50\), test set size \(n_2=100\), values of \(k=1, 2, 5, 10, 20, 40, 50\) and number of simulated training sets \(N=1000\), and plot the MPE as a function of \(k\).

n1=50
n2=100
N=1000
Kvalues=c(1,2,5,10,20,40,50)
v1=mpe_estimation(n1, n2, N, Kvalues, a, b)
v1
##      Kvalues mpe_vector
## [1,]       1    0.24109
## [2,]       2    0.24746
## [3,]       5    0.20593
## [4,]      10    0.20513
## [5,]      20    0.22348
## [6,]      40    0.33780
## [7,]      50    0.46508
plot(v1[,2]~v1[,1], xlab="k", ylab="MPE(k)", type="b")
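
The value of \(k\) minimizing the estimated MPE can also be read off the result matrix directly:

v1[which.min(v1[,2]), 1]   # k with the smallest estimated MPE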

We observe that values of \(k\) around 10 minimize the MPE. Next, we increase the size of the training set to \(n_1=300\).

n1=300
n2=100
N=1000
Kvalues=c(1,2,5,10,20,50,100, 200, 300)
v2=mpe_estimation(n1, n2, N, Kvalues, a, b)
v2
##       Kvalues mpe_vector
##  [1,]       1    0.22131
##  [2,]       2    0.22189
##  [3,]       5    0.17890
##  [4,]      10    0.16931
##  [5,]      20    0.15861
##  [6,]      50    0.15315
##  [7,]     100    0.16278
##  [8,]     200    0.22599
##  [9,]     300    0.40069
plot(v2[,2]~v2[,1], xlab="k", ylab="MPE(k)", type="b")

Now the estimated MPE is minimized for values of \(k\) on the order of 50. This is reasonable: relative to a training set of size 300 such values correspond to moderate flexibility, whereas in the previous example \(k=50\) equaled the training set size, so every prediction reduced to the overall majority class, i.e., zero flexibility.
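
One way to make this comparison explicit is to overlay the two estimated MPE curves against the relative flexibility \(k/n_1\), using the matrices v1 and v2 computed above; the scaling by the respective training set sizes is meant only as an informal illustration:

plot(v1[,1]/50, v1[,2], type="b", col=1, xlab="k/n1", ylab="MPE(k)",
     ylim=range(c(v1[,2], v2[,2])))
lines(v2[,1]/300, v2[,2], type="b", col=2)
legend("topleft", legend=c("n1 = 50", "n1 = 300"), col=1:2, lty=1, pch=1)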