Fitting the fit, variable selection using surrogate models and decision analysis, a brief introduction and tutorial
Carlos Carvalho
McCombs School of Business, University of Texas at Austin
and
Richard P. Hahn
School of Mathematical and Statistical Sciences, Arizona State University
and
Robert McCulloch
School of Mathematical and Statistical Sciences, Arizona State University
April 21, 2020
Abstract

This paper describes a practical procedure for Bayesian variable selection in nonlinear regression and classification models. A first stage model is fit in which all variables are included. Typically, this first stage will not include the prior belief that only a small subset of variables is needed, but it may. Given this first stage fit, we look for functions of variable subsets which approximate the predictions from the first stage fit well. A computationally efficient surrogate model is used to search for approximating functions which depend on low numbers of predictors. Rather than assuming there is some true sparsity, we seek sparse approximations to the non-sparse truth. In the case that our first stage fit involves a Bayesian assessment of the uncertainty, we use this to gauge the uncertainty of our approximation error. If we learn that, with high probability, we can obtain a good approximation to the non-sparse truth using a subset of the variables, we deem that subset to be of interest. We demonstrate the procedure in empirical examples involving prediction and classification and simulated examples.
Keywords: BART, nonlinear, Machine Learning

1 Introduction

Model selection, and more particularly variable selection, has been, and continues to be, a major focus of statistical research and practice. In the case of the linear model, there is a huge literature on variable selection. Researchers often want to know which variables are most important.

In the case of nonlinear modeling, methods such as random forests and deep neural nets have proven remarkably effective. While it is common to hear complaints that these methods are “black box” and “uninterpretable”, they typically provide measures of variable importance.

Carvalho et al. (2020) (CHM) develop an alternative approach to variable selection for nonlinear models which is generally applicable. Rather than building on the dubious notion that a variable has a meaningful importance on its own, CHM look for subsets of variables such that a function of the subset can approximate a non-sparse fit well as a practical matter. The CHM development for nonlinear models extends the ideas in Hahn & Carvalho (2015) designed for linear models. This paper provides an introduction to CHM and a tutorial on the usage of the corresponding R package nonlinvarsel.

A simple way to get a feeling for the CHM approach is to think of the variable selection problem as one of function approximation. We outline our approach for the problem where the response Y is numeric, but the ideas extend nicely to the general case. Our basic statistical model is that we seek to know E(Y | x) where Y is a variable we wish to predict given the information in a vector of predictor variables x. The vector x may contain many variables and variable selection involves choosing a subset of variables S such that xS (the subset of predictor variables indicated by S) enables us to predict “almost” as well as the full x.

It is assumed that a first stage inference has successfully uncovered a trusted estimate f̂ such that E(Y | x) ≈ f̂(x). The approach then seeks subsets S and surrogate functions γS such that γS(x) ≈ f̂(x) for the x of interest. Crucially, γS depends on x only through the subset xS. Can we find a function which only uses a subset of the variables and, as a practical matter, is an effective surrogate in the sense that it approximates the correct function well? This suggests a three step procedure: (1) obtain an inference for f using all the variables; (2) search for subsets S and surrogates γS that approximate f̂ well; (3) use the uncertainty from the step (1) inference to assess the approximation error of any candidate γS.

But we do not need any guarantee that we have found the optimal γS. If we find a good one we are in business, and if we find a better one later that is fine too. There are no tricky inferential issues related to estimation of a true function γS or a true subset S. Given any found γS, we can simply assess our uncertainty about its effectiveness as a surrogate based on the inference in step (1). This is step (3).

Note that an essential feature of our approach, which distinguishes it from others, is that the investigator must choose the set of x values at which f(x) is needed. Because our method focuses on the practical value of variable selection, the user must specify what the model is going to be used for, and this means we need to decide on a set of x for which we need E(Y | x) = f(x).

Additionally, since steps (2) and (3) are driven by the goal of approximation, the choice of the metrics used to measure how well an approximation works is a fundamental part of the procedure. Often we use standard statistical metrics, but we emphasize that problem-dependent metrics, motivated by the practical problem being addressed, can easily be incorporated into steps (2) and (3).
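For instance, a purely hypothetical problem-dependent metric (not part of CHM or the package) might measure the disagreement between two fits only over the prediction points that matter in practice, say the low-mileage cars a dealer actually stocks. A minimal R sketch of such a metric is:

# Hypothetical problem-dependent metric: RMSE between two fitted vectors f and g
# (evaluated at the same x vectors), restricted to the rows flagged by keep.
make_dist_subset = function(keep) {
  function(f, g) sqrt(mean((f[keep] - g[keep])^2))
}
# e.g., dist_lowmileage = make_dist_subset(xp[, "mileage"] < 50000)

Such a function could then be used wherever a distance is needed in steps (2) and (3), in the same way the default RMSE distance is used later in this tutorial.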

CHM provide a more formal development of our approach based on Bayesian decision theory and discuss how it relates to other approaches. In addition, an extended example is discussed.

In Section 2, we present the cars data example, which looks at predicting the prices of used cars. We use the cars data as our running example in Sections 3 to 5. In Section 3 we illustrate using BART to do our step (1) inference. In Section 4 we show how we can use big trees to quickly approximate nonlinear functions. In Section 5.1 we illustrate step (2) and find subsets of the variables that approximate our inference from step (1) well. In Section 5.2 we illustrate step (3), showing how BART draws from the posterior can be used to assess our uncertainty. In Section 5.3 we explore an alternative approach for finding surrogate models. In Section 5.4 we compare our inferential conclusions with the more traditional out-of-sample analysis. In Section 6 we look at some simulated data. In Section 7 we look at data on penalties in hockey games, where the goal is to predict which team will get the next penalty in a National Hockey League game (Abrevaya & McCulloch (2014)). Note that the hockey data is a classification problem, whereas the previous problems were all regression problems in which we seek to predict a numeric response. Finally, in Section 8 we conclude.

2 The Cars Data

In this section we introduce the used cars data set. The goal of the study was to predict the sales price of a used car given characteristics of the car.

Let’s read in the data and have a quick look at it.

cdat = read.csv("http://www.rob-mcculloch.org/data/susedcars.csv")
dim(cdat)

[1] 1000 7

names(cdat)

[1] "price" "trim" "isOneOwner" "mileage" "year"

[6] "color" "displacement"

Each of the 1000 observations corresponds to a car. The response is the variable price, which is the price the used car was sold for in dollars. To give our example a generic feel, we will call the response y. Let's also change the units to thousands of dollars.

y = cdat[,"price"]/1000   # unit is now thousands of dollars
summary(y)

Min. 1st Qu. Median Mean 3rd Qu. Max.

0.995 12.995 29.800 30.583 43.992 79.

Note that these cars are used Mercedes and hence still quite expensive.

The rest of the variables are our explanatory x variables. Two interesting variables in x are year and mileage, which are the model year of the car and the number of miles driven, respectively. So, the bigger year is, the newer the car. The other variables are categorical.
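The code shown in this extract never constructs the design matrix x explicitly, even though the train/test split below uses it; from the column names printed later, x evidently contains mileage, year, and one 0/1 dummy column for each level of the categorical variables (15 columns in all). A minimal sketch of one way to build such a matrix, purely as an assumption about what the omitted code does (the column names and ordering may differ slightly), is:

# Assumed construction of the dummy-expanded design matrix x (not shown in the text).
xnum = cdat[, c("mileage", "year")]                          # numeric predictors
catvars = c("trim", "isOneOwner", "color", "displacement")   # categorical predictors
dummies = lapply(catvars, function(v) {
  f = factor(cdat[[v]])
  d = sapply(levels(f), function(lev) as.integer(f == lev))  # one 0/1 column per level
  colnames(d) = paste(v, levels(f), sep = ".")
  d
})
x = as.matrix(cbind(xnum, do.call(cbind, dummies)))
print(dim(x))   # 1000 rows and one column per numeric variable or factor level (15 here)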

As discussed in Section 1, we need to choose a set of x vectors at which to assess our approximations to E(Y | x) ≈ f̂(x). Ideally, the user thinks carefully about the set of x for which predictions will be needed. As a practical matter, convenience choices are useful. We will use xp in our R code to denote the matrix whose rows consist of the x vectors at which we may need to predict. The most obvious convenience choice is xp = x, where x is the observed training data. A second obvious choice is to randomly partition the data into a train/test split. This way we are careful to assess our uncertainty on x cases not used in estimation. For nonparametric methods such as BART, this can make a difference. Note that our motivation for a train/test split is very different from the usual motivation based on estimation of out-of-sample loss.

Let’s use a train/test split with 75% of the data in the training sample.

set.seed(99)
n = length(y)
ii = sample(1:n, floor(.75*n))
xp = x[-ii,]; yp = y[-ii]
x = x[ii,]; y = y[ii]
print(dim(x)); print(length(y))

[1] 750 15

[1] 750

print(dim(xp)) ; print(length(yp))

[1] 250 15

[1] 250

So, (x,y) is the training data and (xp,yp) is the test data.

3 Cars Data, BART Inference for f

In Section 1 we outlined how our approach to variable selection consists of three steps. In step (1) we obtain an inference for E(Y | x) = f (x).

In this section we illustrate the use of Bayesian Additive Regression Trees (BART, Chipman et al. (2010)) to get inference for f with the cars data as our example (Section 2).

The BART model and Markov Chain Monte Carlo algorithm provide an inference for

Y = f(x) + ε,   ε ∼ N(0, σ²),

where we assume that the errors ε are iid N(0, σ²).

We will use the R package BART (see Sparapani et al. (2020) and McCulloch et al. (2019)), which implements the BART algorithm for the model above (as well as many others). We use the function BART::wbart:

library(BART)

set.seed(14)
bf = wbart(x, y, xp, nskip=500, ndpost=3000)   # BART fit

As discussed in Section 2, (x,y) is our training data and the third argument xp is the set of x for which we want inference for f(x). The first nskip=500 MCMC iterations are used for burn-in, and the results from the subsequent ndpost=3000 MCMC iterations are used for inference.

We can get a quick feeling for how the BART model and MCMC performed by plotting the draws of the parameter σ. The component bf$sigma contains all the σ draws, including burn-in, so that there are 500 + 3000 = 3,500 draws.

plot(bf$sigma, xlab="MCMC iteration", ylab="sigma draw", pch=16, cex=.4, col="blue")
lmf = lm(y~., data.frame(x,y))   # linear model fit
abline(h=summary(lmf)$sigma, col="red", lty=2, lwd=3)
abline(v=500, col="black", lty=4, lwd=3)

[Figure: σ draws by MCMC iteration, with a horizontal line at the linear-model residual standard error and a vertical line at the end of burn-in.]

… .94 for the linear fit. (The warning about “fit may be misleading” is because we threw in all the dummies for each categorical x.)

4 Big Tree Fit to the Fit

In this section we illustrate our use of big trees to approximate nonlinear functions. We use the R library rpart to fit the trees.

library(rpart)

Above we have used BART to estimate a function f̂ such that E(Y | x) ≈ f̂(x). Now we want to use a big tree to approximate this function. We set the parameters of the control argument to rpart::rpart to make sure rpart fits a big tree.

np = length(bf$yhat.test.mean)
minbucket = floor(.5*log(np))
btA = rpart(fhat~., data=data.frame(xp, fhat=bf$yhat.test.mean),
        control = rpart.control(minbucket=minbucket, minsplit=2*minbucket,
                                maxcompete=0, maxsurrogate=0, xval=0, cp=0))

To see how big the tree is we can look at the number of bottom nodes:

cat("the number of bottom nodes is: ",length(unique(btA$where)))

the number of bottom nodes is: 107

We see that there are 107 bottom nodes.

Now let’s have a look at how well our big tree approximates the function fˆ.

yhatbtA = predict(btA, data.frame(xp))   # big tree approx, all x variables
cor(yhatbtA, bf$yhat.test.mean)

[1] 0.9987708

The correlation of 0.9987708 suggests that the big tree has done a pretty good job of approximating f̂.
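The correlation is scale-free; as an additional, optional check, one can also look at the root mean squared difference between the big-tree approximation and f̂, which is on the scale of the response (thousands of dollars):

# RMSE between the big-tree approximation and the BART fitted values,
# in thousands of dollars.
sqrt(mean((yhatbtA - bf$yhat.test.mean)^2))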

The basic idea of our method is to see if we can approximate f̂ well using functions that only use a subset of the variables in x. To quickly see if this is the case, we fit a big tree to f̂, but only use a subset of the variables. Let's use the variables (mileage, year, trim.other).

print(colnames(xp)[c(1,2,6)])

[1] "mileage" "year" "trim.other"

btS = rpart(fhat~., data=data.frame(xp[,c(1,2,6)], fhat=bf$yhat.test.mean),
        control = rpart.control(minbucket=minbucket, minsplit=2*minbucket,
                                maxcompete=0, maxsurrogate=0, xval=0, cp=0))
print(length(unique(btS$where)))

[1] 108

We have fit a big tree with 108 bottom nodes using only the three variables.

If the big tree fit with only three variables approximates f̂ well, then we might just use those three variables to predict. Let's compare the approximation using just three variables, the approximation using all the variables, and f̂. We'll look at the correlations and the three scatter plots using any two of the three.

yhatbtS = predict(btS, data.frame(xp))   # big tree approx, three variables
fmat = cbind(yhatbtS, yhatbtA, bf$yhat.test.mean)
colnames(fmat) = c("yhatbt3x","yhatbtallx","fhat")
cor(fmat)

            yhatbt3x yhatbtallx      fhat
yhatbt3x   1.0000000  0.9990113 0.9980005
yhatbtallx 0.9990113  1.0000000 0.9987708
fhat       0.9980005  0.9987708 1.0000000
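The three pairwise scatter plots mentioned above are not reproduced in this extract; one quick way to draw them from the fmat matrix just built is:

# Scatter plots of each pair: three-variable fit, all-variable fit, and the BART fhat.
pairs(fmat, pch=16, cex=.3)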

5 Variable Selection for the Cars Data

In Section 1 we outlined how our approach to variable selection consists of three steps. For step (1), we use the BART fit to the cars data from Section 3.

In Section 5.1 we illustrate step (2) and in Section 5.2 we illustrate step (3), again using the cars data in each case.

5.1 Cars Data, Fit-the-Fit Search for Variable Subsets

In this section we use our variable selection method to find subsets of our 15 x variables such that a function depending only on the subset approximates the BART fit well.

The nonlinvarsel package has methods that use (i) forward greedy search, (ii) backward elimination, and (iii) all-subsets evaluation. For each candidate variable subset, a large regression tree using only the variables in the subset is fit to the fit from the BART inference. The all-subsets search is not viable for a large number of variables.
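The calls below assume that the nonlinvarsel package accompanying CHM has been installed and attached; otherwise, prefix each function with nonlinvarsel:: as in the text.

library(nonlinvarsel)   # package accompanying CHM; assumed to be installed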

Let’s try forward greedy search using the function nonlinvarsel::vsf. Recall that bf is the data structure holding the BART fit from the training data T = (x,y) at the x values in xp. The vector bf$yhat.test.mean are the fitted values fˆ (x) = E(f (x) | T ) for x in the rows of xp.

vsfr = vsf(xp,bf$yhat.test.mean)

The returned vsfr list has a plot method and a print method. Let's look at the plot.

plot(vsfr,cex.axis=.6)

[Figure: forward variable selection — R-squared as each variable enters, with the variable names along the x-axis in order of inclusion (year, mileage, trim.other, color.White, ...).]

The variable names printed along the x-axis indicate the order in which the variables enter the forward greedy search. The y-axis is the square of the correlation (R^2) between a function of the variables included so far and the BART fit (bf$yhat.test.mean). So, for example, using only the first three variables (year, mileage, trim.other), we can construct a function of these three variables that explains almost 99.6% of the variation in f̂(x), where the x are all those in xp and f̂ is estimated using BART (the posterior mean of f(x)).

Let’s print vsfr. As the variables come in, we will see the R^2.

print(vsfr)

year mileage trim.other color.White

0.9378374 0.9927871 0.9960050 0.

vsfr$R2[3]

[1] 0.996005

We see that the subset of three variables corresponds to the variables in columns 2, 1, and 6 of xp. The R-squared value is 0.996, indicating that, using a big tree, we have found a function of these three variables which fits the unrestricted fit very well.

Now let’s try the backward elimination search using nolinvarsel::vsb. Backward elimi- nation start with all the variables in and then takes variables out one at a time.

vsbr = vsb(xp,bf$yhat.test.mean)

We can plot and print the results. Let’s have a look at the plot.

plot(vsbr, cex.axis=.5)

[Figure: backward variable selection — R-squared as variables are eliminated, with the variable names along the x-axis; trim.500 is removed first and year is the last variable remaining.]

The results tell us that the first variable thrown out in the backward elimination is trim.500 and the last variable left is year. The obvious variables year and mileage work very well, but you may want to consider adding a variable representing a trim category. The backward results are very similar to the forward results.
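As with the forward search, the returned object also has a print method; printing vsbr lists the R-squared values, presumably in the order the variables are eliminated (output omitted here):

print(vsbr)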

We can plot the forwards and backwards search together to check their similarity:

plotfb(vsfr,vsbr)

[Figure: variable subset search, forward and backward — R-squared by iteration number; the forward and backward curves coincide.]

The results are identical.

Since we only have 15 variables, we can also try all possible subsets. However, this takes substantially more time than the forward and backward searches and typically gives similar results.

vsar = vsa(xp,bf$yhat.test.mean)
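If you want to gauge the extra cost of the all-subsets search on your own machine, an optional check is to wrap the calls in system.time (this simply re-runs the searches; timings will vary by machine):

# Optional timing comparison of the forward and all-subsets searches.
system.time(vsf(xp, bf$yhat.test.mean))
system.time(vsa(xp, bf$yhat.test.mean))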

Let’s plot the results. Note that since the subsets may not be nested (as in the forward search) we need the complete set of indices for each group size. Hence we just plot the subset size on the x-axis.

## [[4]]

## [1] 1 2 6 12

print(colnames(xp)[vsar$vL[[4]]])

[1] "mileage" "year" "trim.other" "color.White"

We can use the function nonlinvarsel::plotfba to plot all three searches.

plotfba(vsfr,vsbr,vsar)

[Figure: variable subset search, forward, backward, and all subsets — R-squared by number of variables; the three curves coincide.]

We see that we obtain the same results we got from the forwards and backwards searches.

In general, it is possible for the searches to find different surrogates γ(xS), with the hope that γ(xS) ≈ f̂(x) ≈ E(Y | x). From a statistical inference point of view, it is important to remember that we are just seeking a good surrogate in this sense. It does not matter if we have the exact globally optimal surrogate, and there are no tricky multiple-comparison type inference issues related to our search over different kinds of surrogates. All uncertainty is based on our step (1) inference. We illustrate this in the next section. If you find a good surrogate, it is of real practical value in an uncomplicated way, and it is just fine if there is a better one out there, or if you find a better one at a later date. This conceptual simplicity, driven by what we feel is a more useful specification of the real practical issues, is a major strength of our approach.

5.2 Posterior Uncertainty for the Approximation Error

An advantage of using BART for our function estimation is that we get draws from the posterior distribution of f rather than just the point estimate f̂. Using these draws we can assess the uncertainty about the approximation error for each subset of x variables. This is step (3).

The component bf$yhat.test stores the evaluations fd(xj), for d = 1, 2, ..., D post burn-in draws fd of f and each xj in the test data.

dim(bf$yhat.test)

[1] 3000 250

We have 3000 post burn-in draws and 250 x vectors in the test data xp.

We can now assess our uncertainty. We use nonlinvarsel::sumpost.

sp = sumpost(bf$yhat.test,vsfr$fit,bf$yhat.test.mean,distrmse)

nd,n,p: 3000 250 14

The arguments are the draws of f (bf$yhat.test) representing our uncertainty about the true function, the fits from our candidate subsets (vsfr$fit), the fit using all the information in x (bf$yhat.test.mean), and a distance metric to measure the difference between a draw fd and a candidate function. The distance metric used above is RMSE (root mean squared error).

Drmse(f, γS) = √( (1/n) ∑_{i=1}^{n} ( f(xi) − γS(xi) )² ).

The sum is over the {xi} in the n rows of xp. Note that any distance may easily be used. We may also be interested in Drmse(f, f̂), the possible error of f̂.
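To make the formula concrete, the approximation-error draws for a particular surrogate can be computed directly from bf$yhat.test, without going through sumpost. The helper drmse below is a sketch of the distance above, not necessarily identical to the distrmse function supplied by the package.

# RMSE distance between one posterior draw of f (a row of bf$yhat.test,
# evaluated at the rows of xp) and a candidate surrogate evaluated there.
drmse = function(fdraw, g) sqrt(mean((fdraw - g)^2))

# Posterior draws of the approximation error for the three-variable big-tree
# surrogate yhatbtS from Section 4: one error value per post burn-in draw of f.
errS = apply(bf$yhat.test, 1, drmse, g = yhatbtS)

# Posterior draws of Drmse(f, fhat), the possible error of the point estimate itself.
errfhat = apply(bf$yhat.test, 1, drmse, g = bf$yhat.test.mean)

quantile(errS, c(.05, .5, .95))
quantile(errfhat, c(.05, .5, .95))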