
Abstract

LASSO is an innovative variable selection method for regression. Variable selection in regression is extremely important when we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. LASSO minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant. LASSO not only helps to improve prediction accuracy when dealing with multicollinear data, but also carries several nice properties such as interpretability and numerical stability. This project also includes a couple of numerical approaches for solving the LASSO. Case studies and simulations are presented for both the linear and logistic LASSO models and are implemented in R; the source code is available in the appendix.

Key words: l1-NORM CONSTRAINT; LASSO; VARIABLE SELECTION; SUBSET SELECTION; BAYESIAN LOGISTIC REGRESSION

University of Minnesota Duluth

Acknowledgements

I would like to express my special thanks to my advisor, Dr. Kang James. She found this interesting topic for me and helped me to overcome the difficulties throughout this research. She taught me how to do research and what attitude and characteristics will lead me to success and happiness. She also helped me develop my presentation skills and was very encouraging. I thank Drs. Barry James and Yongcheng Qi for being my committee members and helping me with my thesis defense. I would like to thank my friend Lindsey Dietz and my boyfriend Brad Jannsen for their help in proofreading my thesis. I would also like to acknowledge the support of my parents in China, who always motivated me to work hard.



Table of Contents

Acknowledgements
Abstract
Chapter 1 Introduction
Chapter 2 LASSO Application in the Linear Regression Model
  2.1 The principle of the generalized linear LASSO model
  2.2 Geometrical Interpretation of LASSO
  2.3 Prediction Error and Mean Square Error
  2.4 CV/GCV methods and Estimate of LASSO parameter
  2.5 Standard Error of LASSO Estimators
  2.6 Case Study: Diabetes data
  2.7 Simulation
Chapter 3 LASSO Application in the Logistic Regression Model
  3.1 Logistic LASSO model and Maximum likelihood estimate
  3.2 Maximum likelihood estimate and Maximum a posteriori estimate
  3.3 Case study: Kyphosis data
  3.4 Simulation
Chapter 4 Computation of LASSO
  4.1 Osborne's dual algorithm
  4.2 Least Angle Regression algorithm
References
Appendix



Chapter 1 Introduction

A “lasso” is usually recognized as a loop of rope that is designed to be thrown around a target and tightened when pulled. It is a well-known tool of the American cowboy. In this context, it is fittingly used as a metaphor for the $l_1$ constraint applied to the linear model. Coincidentally, LASSO is also an acronym for Least Absolute Shrinkage and Selection Operator.

Consider the usual linear regression model with data $(x_{i1}, x_{i2}, \ldots, x_{ip}, y_i)$, $i = 1, \ldots, n$, where the $x_{ij}$'s are the regressors and $y_i$ is the response variable of the $i$th observation. The ordinary least squares (OLS) regression method finds the unbiased linear combination of the $x_{ij}$'s that minimizes the residual sum of squares. However, if $p$ is large or the regressors are highly correlated (multicollinearity), OLS may yield estimates with large variance, which reduces the accuracy of the prediction. Two widely known methods for this problem are ridge regression and subset selection. As an alternative to these techniques, Robert Tibshirani (1996) presented the LASSO, which minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant:

$$\hat{\beta}^L = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \alpha - \sum_{j} \beta_j x_{ij} \Big)^2 \qquad (1.1)$$

subject to

$$\sum_{j=1}^{p} |\beta_j| \le t \ \text{(a constant)}. \qquad (1.2)$$
If $t \ge \sum_{j=1}^{p} |\hat{\beta}_j^o|$, where $\hat{\beta}^o$ denotes the OLS estimate, then the LASSO yields the same estimate as OLS. However, if $0 < t < \sum_{j=1}^{p} |\hat{\beta}_j^o|$, then the problem is equivalent to

$$\hat{\beta}^L = \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \Big( y_i - \alpha - \sum_{j} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j} |\beta_j| \Big\}, \qquad (1.3)$$

with $\lambda > 0$. It will be shown later that the relation between $\lambda$ and the LASSO parameter $t$ is one-to-one. Due to the nature of the constraint, the LASSO tends to set some coefficients exactly to zero. Compared to the OLS estimate $\hat{\beta}^o$, which is an unbiased estimator of $\beta$, both ridge regression and the LASSO sacrifice a little bias to reduce the variance of the predicted values and improve the overall prediction accuracy. In my project, I focus on two main aspects of this topic. In Chapter 2, definitions and principles of the LASSO in both the generalized linear case and the orthogonal case are discussed. In Chapter 3, I illustrate the principles of the LASSO logistic model and give a couple of examples. Chapter 4 introduces existing algorithms for computing LASSO estimates; there, I study two main numerical algorithms.
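Although the thesis's own implementation is in R (see the appendix), the penalized form (1.3) is easy to sketch numerically. The following Python/NumPy coordinate-descent solver is an illustration only, with made-up data and an arbitrary penalty value; it is not the algorithm studied in Chapter 4.

```python
import numpy as np

def soft_threshold(z, g):
    """Soft-thresholding operator: sign(z) * max(|z| - g, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize sum((y - X b)^2) + lam * sum(|b_j|) by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with the j-th variable removed
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / (X[:, j] @ X[:, j])
    return beta

# toy data with a sparse true coefficient vector (illustrative values)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
beta_true = np.array([3.0, 0.0, 0.0, 1.5, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(200)
beta_hat = lasso_cd(X, y, lam=5.0)
```

With a moderate penalty, the coefficients of the null variables are driven (essentially exactly) to zero, which is the selection behavior described above.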



Chapter 2 Linear LASSO Model

2.1 The principle of the generalized linear LASSO model

Let

$$G(\beta, X, Y, \lambda) = \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|.$$

$G$ can also be written in matrix form as

$$G(\beta, X, Y, \lambda) = (Y - X\beta)^T (Y - X\beta) + \lambda \sum_{j=1}^{p} |\beta_j|. \qquad (2.1.1)$$

Minimizing $G$, we get the best estimate of $\beta$, denoted $\hat{\beta}$. Let $I_p$ represent the $p \times p$ identity matrix, $X$ the design matrix, and $|\beta|$ the diagonal matrix with $j$th diagonal element $|\beta_j|$:

$$|\beta| = \begin{pmatrix} |\beta_1| & & 0 \\ & \ddots & \\ 0 & & |\beta_p| \end{pmatrix} = \mathrm{diag}\big( |\beta_j| \big), \qquad |\beta|^{-1} = \mathrm{diag}\big( |\beta_j|^{-1} \big).$$

We will use this notation throughout the rest of this paper.

First, we note that (2.1.1) can be written as

$$G(\beta, X, Y, \lambda) = Y^T Y - \beta^T X^T Y - Y^T X \beta + \beta^T X^T X \beta + \lambda \sum_{j=1}^{p} |\beta_j|. \qquad (2.1.2)$$

Taking the partial derivative of (2.1.2) with respect to $\beta$:

$$\frac{\partial G(\beta, X, Y, \lambda)}{\partial \beta} = -2 X^T Y + 2 (X^T X) \beta + \lambda\, \mathrm{sign}(\beta), \qquad (2.1.3)$$

where $\mathrm{sign}(\beta) = \big( \mathrm{sign}(\beta_1), \ldots, \mathrm{sign}(\beta_p) \big)^T$. Set

$$\frac{\partial G(\beta, X, Y, \lambda)}{\partial \beta} = 0 \qquad (2.1.4)$$

and solve for $\hat{\beta}$. For simplicity, we assume $X$ is orthonormal, $X^T X = I$. Then the OLS estimate is $\hat{\beta}^o = (X^T X)^{-1} X^T Y = X^T Y$.



By solving (2.1.4), we obtain the following result:

$$\hat{\beta}_j^L = \hat{\beta}_j^o - \frac{\lambda}{2}\, \mathrm{sign}\big( \hat{\beta}_j^L \big). \qquad (2.1.5)$$

Lemma 2.1. If $y = x - \frac{\lambda}{2} \mathrm{sign}(y)$ with $\lambda > 0$, then $x$ and $y$ share the same sign.

Proof. $y = x - \frac{\lambda}{2} \mathrm{sign}(y)$ implies $x = y + \frac{\lambda}{2} \mathrm{sign}(y)$. If $y$ is positive, $y + \frac{\lambda}{2} \mathrm{sign}(y) = y + \frac{\lambda}{2}$ must be positive, so $x$ is positive. On the contrary, if $y$ is negative, then $x$ is negative. Thus, $x$ and $y$ share the same sign. ■

According to Lemma 2.1 and (2.1.5), $\mathrm{sign}\big( \hat{\beta}_j^L \big) = \mathrm{sign}\big( \hat{\beta}_j^o \big)$. Then (2.1.5) becomes

$$\hat{\beta}_j^L = \hat{\beta}_j^o - \frac{\lambda}{2}\, \mathrm{sign}\big( \hat{\beta}_j^o \big) = \Big( \hat{\beta}_j^o - \frac{\lambda}{2} \Big) I_{[\hat{\beta}_j^o \ge 0]} + \Big( \hat{\beta}_j^o + \frac{\lambda}{2} \Big) I_{[\hat{\beta}_j^o < 0]}. \qquad (2.1.6)$$

It follows that

$$\hat{\beta}_j^L = \mathrm{sign}\big( \hat{\beta}_j^o \big) \Big( |\hat{\beta}_j^o| - \frac{\lambda}{2} \Big)_+, \qquad j = 1, \ldots, p, \qquad (2.1.7)$$

where $(\cdot)_+$ denotes the positive part of the expression inside the parentheses.



Fig 2.1 Form of coefficient shrinkage of the LASSO in the orthonormal case; the LASSO tends to produce zero coefficients.

Intuitively, $\lambda/2$ is the threshold set for nonzero estimates; in other words, if the corresponding $|\hat{\beta}_j^o|$ is less than the threshold, we set $\hat{\beta}_j^L$ equal to zero. The parameter $\lambda$ can be computed by solving the equations

$$\hat{\beta}_j^L = \mathrm{sign}\big( \hat{\beta}_j^o \big) \Big( |\hat{\beta}_j^o| - \frac{\lambda}{2} \Big)_+, \qquad (2.1.8)$$

$$\sum_{j} |\hat{\beta}_j^L| = t, \qquad (2.1.9)$$

where $\lambda$ is chosen so that the constraint (2.1.9) holds. Therefore, each $\lambda$ value corresponds to a unique $t$ value. This is illustrated by solving the simplest case, $p = 2$. Without loss of generality, we assume that the least squares estimates $\hat{\beta}_j^o$ are both positive.

$$\hat{\beta}_1^L = \Big( \hat{\beta}_1^o - \frac{\lambda}{2} \Big)_+, \qquad \hat{\beta}_2^L = \Big( \hat{\beta}_2^o - \frac{\lambda}{2} \Big)_+, \qquad (2.1.10)$$

and

$$\hat{\beta}_1^L + \hat{\beta}_2^L = t. \qquad (2.1.11)$$

First solving for $\lambda$ (when both estimates remain positive), I get

$$\lambda = \hat{\beta}_1^o + \hat{\beta}_2^o - t. \qquad (2.1.12)$$

Substituting (2.1.12) into (2.1.10), I obtain the solution for the LASSO estimates:

$$\hat{\beta}_1^L = \Big( \hat{\beta}_1^o - \frac{\hat{\beta}_1^o + \hat{\beta}_2^o - t}{2} \Big)_+ = \Big( \frac{t}{2} + \frac{\hat{\beta}_1^o - \hat{\beta}_2^o}{2} \Big)_+, \qquad \hat{\beta}_2^L = \Big( \frac{t}{2} - \frac{\hat{\beta}_1^o - \hat{\beta}_2^o}{2} \Big)_+. \qquad (2.1.13)$$

When extended to the general case (with all coefficients remaining nonzero),

$$t = \sum_{j=1}^{p} \mathrm{sign}\big( \hat{\beta}_j^o \big)\, \hat{\beta}_j^o - \frac{p\lambda}{2}. \qquad (2.1.14)$$

Now it can be seen that the relationship between $\lambda$ and the LASSO parameter $t$ is one-to-one. Unfortunately, the LASSO estimate doesn't usually have a closed-form solution. However, statisticians have already developed several numerical approximation algorithms, which will be discussed in Chapter 4.
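The closed-form rule (2.1.7) for the orthonormal case is one line of code. The sketch below applies it in Python/NumPy to a hypothetical vector of OLS estimates (the numbers are made up for illustration).

```python
import numpy as np

def lasso_orthonormal(beta_ols, lam):
    """Soft-thresholding rule (2.1.7): sign(b) * (|b| - lam/2)_+ ."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2.0, 0.0)

beta_ols = np.array([3.0, -1.0, 0.4])       # hypothetical OLS estimates
beta_lasso = lasso_orthonormal(beta_ols, lam=1.0)
# threshold lam/2 = 0.5: 3.0 -> 2.5, -1.0 -> -0.5, 0.4 -> 0.0
```

The implied absolute bound is $t = \sum_j |\hat{\beta}_j^L| = 3.0$ here, illustrating the one-to-one correspondence between $\lambda$ and $t$ discussed above.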

2.2 Geometrical Interpretation of LASSO

Fig 2.2.1 A geometrical interpretation of LASSO in two and three dimensions. The left panel is from Tibshirani (1996); the right panel is from Meinshausen (2008).

Fig 2.2.1 illustrates the geometry of LASSO regression in the two- and three-dimensional cases. Looking at the left panel, the center point of the ellipses is $\hat{\beta}^o$ (the OLS estimate). Each elliptical contour corresponds to a specific value of the residual sum of squares. The area inside the square around the origin satisfies the LASSO restriction: every $\beta = (\beta_1, \ldots, \beta_p)^T$ inside the black square satisfies the constraint $\sum_j |\beta_j| \le t$. Minimizing the residual sum of squares subject to the constraint therefore corresponds to finding the first contour that touches the constraint region. The LASSO solution is the first place where the contours touch the square; this will sometimes occur at a corner, corresponding to a zero coefficient. The same holds in three dimensions. In the right panel of Fig 2.2.1, one can see the contour touching an edge of the pyramid in the plane spanned by $\beta_1$ and $\beta_2$, so the value zero is assigned to $\beta_3$.

2.3 Prediction Error and Mean Square Error

Suppose that $Y = \eta(x) + \varepsilon$, where $\eta(x) = X\beta$, $\hat{\eta}(x) = X\hat{\beta}$ is an estimate of $\eta(x)$, and $\varepsilon$ is a normally distributed random error with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$.

The mean squared error of an estimate $\hat{\eta}(x)$ is defined by

$$\mathrm{MSE} = E[\hat{\eta}(x) - \eta(x)]^2. \qquad (2.3.1)$$

It is hard to estimate the MSE because $\eta(x) = X\beta$ is unknown. However, the predicted squared error

$$\mathrm{PSE} = E[Y - \hat{\eta}(x)]^2 \qquad (2.3.2)$$

is easier to estimate and is closely related to the MSE.

Lemma 2.3. $\mathrm{PSE} = \mathrm{MSE} + \sigma^2$.

Proof.

$$\mathrm{PSE} = E[Y - \hat{\eta}(x)]^2 = E[Y - \eta(x) + \eta(x) - \hat{\eta}(x)]^2 = E[Y - \eta(x)]^2 + 2 E[(\eta(x) - \hat{\eta}(x))(Y - \eta(x))] + E[\hat{\eta}(x) - \eta(x)]^2.$$

The first term is $E[Y - \eta(x)]^2 = \sigma^2$ and the third term is $E[\hat{\eta}(x) - \eta(x)]^2 = \mathrm{MSE}$, while the middle term $E[(\eta(x) - \hat{\eta}(x))\varepsilon] = 0$, since the error of the new observation $Y$ is independent of $\hat{\eta}(x)$. Thus, $\mathrm{PSE} = \mathrm{MSE} + \sigma^2$. ■

Minimizing PSE is therefore equivalent to minimizing MSE. In the next section, I introduce how to obtain the optimal LASSO parameter by minimizing an estimate of the predicted error.
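Lemma 2.3 is easy to confirm by simulation. The Python/NumPy sketch below (with arbitrary dimensions and error variance; OLS stands in for any fitted estimator) repeatedly refits on fresh training noise and compares Monte Carlo estimates of MSE and PSE.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 50, 3, 1.0
X = rng.standard_normal((n, p))          # fixed design
beta = np.array([1.0, -2.0, 0.5])        # arbitrary true coefficients
eta = X @ beta                            # true mean eta(x) = X beta

mse_sum, pse_sum, reps = 0.0, 0.0, 2000
for _ in range(reps):
    y_train = eta + sigma * rng.standard_normal(n)   # training response
    beta_hat = np.linalg.lstsq(X, y_train, rcond=None)[0]
    eta_hat = X @ beta_hat
    y_new = eta + sigma * rng.standard_normal(n)     # independent new response
    mse_sum += np.mean((eta_hat - eta) ** 2)
    pse_sum += np.mean((y_new - eta_hat) ** 2)

mse, pse = mse_sum / reps, pse_sum / reps
# pse should be close to mse + sigma^2, as the lemma states
```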

2.4 CV/GCV methods and Estimate of the LASSO parameter

2.4.1 Relative Bound and Absolute Bound

The tuning parameter $t$ is called the LASSO parameter and is also recognized as the absolute bound: $\sum_{j=1}^{p} |\hat{\beta}_j^L| = t$. Here I define another parameter, $s$, the relative bound:

$$s = \frac{\sum_{j=1}^{p} |\hat{\beta}_j^L|}{\sum_{j=1}^{p} |\hat{\beta}_j^o|}, \qquad s \in [0, 1]. \qquad (2.4.1)$$

The relative bound can be seen as a normalized version of the LASSO parameter. Two algorithms are mentioned in Tibshirani (1996) to compute the best $s$: N-fold cross-validation and generalized cross-validation (GCV).

2.4.2 N-fold Cross-validation

Cross-validation is a general procedure that can be applied to estimate tuning parameters in a wide variety of problems. The bias in the RSS is a result of using the same data for model fitting and model evaluation. CV reduces this bias by splitting the data into two subsamples: a training (calibration) sample for model fitting and a test (validation) sample for model evaluation. The idea behind cross-validation is to recycle the data by switching the roles of the training and test samples.



The optimal $s$ is denoted by $\hat{s}$. The prediction error of the LASSO procedure can be estimated by ten-fold cross-validation. The LASSO is indexed in terms of $s$, and the prediction error is estimated over a grid of values of $s$ from 0 to 1 inclusive. We wish to predict with small variance, so we wish to choose the constraint $s$ as small as we can. The value $\hat{s}$ that achieves the minimum predicted error of $\hat{\eta}(x)$ is selected (Tibshirani, 1996). For example, consider the Acetylene data from Marquardt et al. (1975). The response variable $y$ is the percentage of conversion of n-heptane to acetylene. The three explanatory variables are: T = reactor temperature (°C); H = ratio of H2 to n-heptane (mole ratio); C = contact time (sec). T and C are highly correlated; the correlation matrix is

        T          H          C
T    1          0.223628   -0.9582
H    0.223628   1          -0.24023
C   -0.9582    -0.24023     1

Accordingly, OLS gives a very large variance for each estimate. If we use the LASSO method instead, we get the following GCV score plot, and we choose $\hat{s} = 0.25$.

Fig 2.4.1 Acetylene data GCV versus s plot. (We will illustrate GCV in the next section)



However, sometimes the data may yield the minimum predicted error at $s = 1$, which means the OLS estimate is the LASSO estimate. In this case, we choose the elbow position of the prediction error curve as the corresponding $\hat{s}$.

Fig 2.4.2 Predicted Error as a function of relative bound from Diabetes example

An N-fold cross-validation selects a model as follows.

1. Split the whole data set into N disjoint subsamples $S_1, \ldots, S_N$.

2. For $v = 1, \ldots, N$, fit the model to the training sample $\bigcup_{i \ne v} S_i$, and compute the discrepancy on the test sample $S_v$:

$$d_v(s) = \Big[ \mathrm{PE}_{S_v} - \mathrm{PE}_{\bigcup_{i \ne v} S_i} \Big]^2.$$

3. Find the optimal $\hat{s}$ as the minimizer of the overall discrepancy $d(s) = \sum_{v=1}^{N} d_v(s)$.
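The steps above can be sketched in a few lines of Python/NumPy. Since the LASSO solver itself is beside the point here, the sketch plugs in a simple stand-in estimator (soft-thresholded OLS, which is not the LASSO in general) and scores each $s$ by its average test-sample prediction error, the most common variant of step 2; the fold count, grid, and data are illustrative assumptions.

```python
import numpy as np

def fit_stand_in(X, y, s):
    """Toy stand-in for 'LASSO at relative bound s': soft-threshold the OLS
    estimate with threshold (1 - s) * max|b_OLS|, so s = 1 gives OLS and
    s = 0 gives the zero vector."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    th = (1.0 - s) * np.abs(b).max()
    return np.sign(b) * np.maximum(np.abs(b) - th, 0.0)

def n_fold_cv(X, y, grid, fit, N=5, seed=0):
    """Split into N folds and accumulate the test prediction error per s."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), N)
    pe = np.zeros(len(grid))
    for v in range(N):
        test = folds[v]
        train = np.concatenate([folds[i] for i in range(N) if i != v])
        for k, s in enumerate(grid):
            b = fit(X[train], y[train], s)
            pe[k] += np.mean((y[test] - X[test] @ b) ** 2)
    return pe / N

grid = np.linspace(0.0, 1.0, 21)
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 6))
y = X @ np.array([2.0, 0, 0, 1.0, 0, 0]) + rng.standard_normal(200)
pe = n_fold_cv(X, y, grid, fit_stand_in)
s_hat = grid[np.argmin(pe)]
```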

The drawback of N-fold cross-validation is its lack of efficiency. Suppose I try to optimize $s$ over a grid of 40 values ranging from 0 to 1. Using five-fold cross-validation, I need to run the LASSO computation procedure 200 times, which takes a long time. The inefficiency becomes especially significant when the sample size is big and the number of variables is large. For this reason, Tibshirani (1996) introduced another criterion, generalized cross-validation (GCV), which has a significant computational advantage over N-fold cross-validation.

2.4.3 Generalized Cross-validation

The generalized cross-validation (GCV) criterion was first proposed by Craven and Wahba (1979). We have already proved in the previous section that

$$\hat{\beta}_j^L = \mathrm{sign}\big( \hat{\beta}_j^o \big) \Big( |\hat{\beta}_j^o| - \frac{\lambda}{2} \Big)_+, \qquad j = 1, \ldots, p. \qquad (2.1.7)$$

If we rewrite it in matrix form,

$$\hat{\beta}^L = \Big( X^T X + \frac{\lambda}{2} |\beta|^{-1} \Big)^{-1} X^T Y. \qquad (2.4.2)$$

Therefore, the effective number of parameters (Appendix A) in the model is

$$d(\lambda) = \mathrm{Tr}\Big[ X \Big( X^T X + \frac{\lambda}{2} |\beta|^{-1} \Big)^{-1} X^T \Big]. \qquad (2.4.3)$$

We construct the generalized cross-validation statistic for the LASSO to be

$$\mathrm{GCV}(\lambda) = \frac{\frac{1}{n} \sum_{i=1}^{n} \big( y_i - x_i \hat{\beta}^L \big)^2}{\big[ 1 - d(\lambda)/n \big]^2}. \qquad (2.4.4)$$

Compared to N-fold cross-validation, GCV is more efficient. Suppose I again evaluate $s$ over a grid of 40 values ranging from 0 to 1. Using GCV instead, the LASSO computation procedure only needs to run 40 times. However, GCV also has a drawback. A major difficulty lies in the evaluation of the cross-validation function, which requires the calculation of the trace of an inverse matrix. The problem becomes especially considerable when dealing with large-scale data. This is, however, another area of study and will not be discussed here.
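Equations (2.4.2)-(2.4.4) translate directly into code. The Python/NumPy sketch below uses made-up data and evaluates the $|\beta|^{-1}$ term at a given vector of nonzero coefficients, as the ridge-like rewrite requires. At $\lambda = 0$ the effective number of parameters reduces to $p$, the usual OLS degrees of freedom, which provides a sanity check.

```python
import numpy as np

def gcv_score(X, y, beta, lam):
    """Effective df (2.4.3) and GCV statistic (2.4.4) given the current beta."""
    n, p = X.shape
    # (X^T X + (lam/2) * diag(1/|beta_j|))^{-1}; requires nonzero beta_j
    A = np.linalg.inv(X.T @ X + (lam / 2.0) * np.diag(1.0 / np.abs(beta)))
    d = np.trace(X @ A @ X.T)                 # effective number of parameters
    beta_l = A @ X.T @ y                      # ridge-like rewrite (2.4.2)
    rss = np.mean((y - X @ beta_l) ** 2)
    return rss / (1.0 - d / n) ** 2, d

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 4))
beta = np.array([2.0, -1.0, 0.5, 0.25])      # hypothetical current estimate
y = X @ beta + rng.standard_normal(100)
gcv0, d0 = gcv_score(X, y, beta, lam=0.0)    # d0 equals p = 4
gcv1, d1 = gcv_score(X, y, beta, lam=20.0)   # the penalty lowers the effective df
```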

2.5 Standard Error of LASSO Estimators

The LASSO estimate (2.1.5) can also be written as $\hat{\beta}^L = \hat{\beta}^o - \frac{\lambda}{2} \mathrm{sign}(\hat{\beta})$, i.e.,

$$\hat{\beta}^L = \hat{\beta}^o - \frac{\lambda}{2} |\hat{\beta}^o|^{-1} \hat{\beta}^o = \Big( X^T X + \frac{\lambda}{2} |\hat{\beta}|^{-1} \Big)^{-1} X^T Y. \qquad (2.5.1)$$

Hence

$$\mathrm{Var}\big( \hat{\beta}^L \big) = \Big( X^T X + \frac{\lambda}{2} |\hat{\beta}|^{-1} \Big)^{-1} X^T\, \mathrm{Var}(Y)\, X \Big( X^T X + \frac{\lambda}{2} |\hat{\beta}|^{-1} \Big)^{-1} = \sigma^2 \Big( X^T X + \frac{\lambda}{2} |\hat{\beta}|^{-1} \Big)^{-1} X^T X \Big( X^T X + \frac{\lambda}{2} |\hat{\beta}|^{-1} \Big)^{-1}, \qquad (2.5.2)$$

where $\mathrm{Var}(Y) = \sigma^2 I$ and $\hat{\sigma}^2$ is an estimate of the error variance. In my research, I usually use the bootstrap method (repeated resampling from the original data set with replacement) to estimate the standard error of the LASSO estimates.
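A residual-bootstrap standard error can be sketched as follows (Python/NumPy). For brevity the refitted estimator here is plain OLS, standing in for the LASSO refit used in the thesis; the data and replication count are illustrative assumptions.

```python
import numpy as np

def bootstrap_se(X, y, fit, B=500, seed=0):
    """Residual bootstrap: refit on y* = X b_hat + resampled residuals."""
    rng = np.random.default_rng(seed)
    b_hat = fit(X, y)
    resid = y - X @ b_hat
    boots = np.empty((B, len(b_hat)))
    for b in range(B):
        y_star = X @ b_hat + rng.choice(resid, size=len(y), replace=True)
        boots[b] = fit(X, y_star)
    return boots.std(axis=0, ddof=1)

ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]  # stand-in for the LASSO refit
rng = np.random.default_rng(4)
X = rng.standard_normal((150, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + rng.standard_normal(150)
se = bootstrap_se(X, y, ols)
```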

2.6 Case Study: Diabetes data

This data set was originally used by Efron, Hastie, Johnstone and Tibshirani (2003). Table 2.6.1 shows a small portion of the data. There are ten baseline variables: age, sex, body mass index (BMI), average blood pressure (BP), and six blood serum measurements, measured on N = 442 diabetes patients. The response variable y is a quantitative measure of disease progression one year after baseline.

Obs  AGE  SEX  BMI   BP     S1   S2     S3   S4    S5      S6   Y
1    59   2    32.1  101    157  93.2   38   4     4.8598  87   151
2    48   1    21.6  87     183  103.2  70   3     3.8918  69   75
3    72   2    30.5  93     156  93.6   41   4     4.6728  85   141
4    24   1    25.3  84     198  131.4  40   5     4.8903  89   206
5    50   1    23    101    192  125.4  52   4     4.2905  80   135
6    23   1    22.6  89     139  64.8   61   2     4.1897  68   97
…    …    …    …     …      …    …      …    …     …       …    …
438  60   2    28.2  112    185  113.8  42   4     4.9836  93   178
439  47   2    24.9  75     225  166    42   5     4.4427  102  104
440  60   2    24.9  99.67  162  106.6  43   3.77  4.1271  95   132
441  36   1    30    95     201  125.2  42   4.79  5.1299  85   220
442  36   1    19.6  71     250  133.2  97   3     4.5951  92   57

Table 2.6.1 Data structure of the Diabetes data

Ideally, the model would produce accurate baseline predictions of the response for future patients, and the form of the model would also suggest which covariates are important factors in diabetes progression. First, I evaluate the LASSO model on a grid of 40 $s$ values, and the optimal LASSO parameter is read off the GCV scores (see Fig 2.6.1).

Fig 2.6.1 GCV score as a function of the relative bound $s$; $\hat{s} = 0.4$.

From Fig 2.6.1, one can see that the GCV score decreases dramatically until $s = 0.4$. Thus I pick 0.4 as the optimized $s$ value.



In order to see more clearly how the LASSO shrinks and selects the coefficients, the LASSO estimates are plotted as a function of the relative bound $s$. Intuitively, every coefficient is squeezed to zero as $s$ goes to zero.

Variables: 1: age; 2: sex; 3: BMI; 4: BP; 5-10: the six blood serum measurements.

Fig 2.6.2 LASSO coefficient shrinkage in the diabetes example: each curve represents a coefficient as a function of the relative bound $s$. The vertical lines show the $s$ value at which each coefficient shrinks to zero. The covariates enter the regression equation sequentially as $s$ increases, in the order i = 3, 9, 4, 7, 2, 10, 5, 8, 6, 1. If $s = 0.4$ is chosen by GCV, as shown by the vertical red line, only the coefficients entering to the left of the red line, 3 (BMI), 9 (S5), 4 (BP), 7 (S3), and 2 (sex), are assigned nonzero values.

Second, I evaluate the Diabetes data at $\hat{s} = 0.4$. The table below shows the LASSO and OLS estimates. The standard errors (SE) were estimated by bootstrap resampling of residuals from the original data set. The LASSO chooses sex, BMI, BP, S3, and S5. Notice that the LASSO yielded smaller SEs for sex, BMI, BP, S3, and S5 than those yielded by OLS; this shows that the LASSO estimates the coefficients with more accuracy. The table also shows a tendency for the LASSO estimates to be smaller than the OLS ones. This is due to the nature of the constraint: all estimates are shrunk by a threshold value. Also, the Cp statistic of OLS is 11, while the Cp statistic of the LASSO is 5; this result illustrates that the LASSO also enjoys some good properties of subset selection.



           ----------------- LASSO results -----------------   ------------------ OLS results ------------------
Predictor  Coefficient  SE        Z-score   Pr(>|t|)           Coefficient  SE       Z-score  Pr(>|t|)
Intercept  2.6461       0.30737   8.608846  0                  152.133      2.576    59.061   < 2e-16   ***
age        0            51.6555   0         1                  -10.012      59.749   -0.168   0.867
sex        -52.5341     56.18816  -0.93497  0.174903           -239.819     61.222   -3.917   0.000104  ***
bmi        509.6485     57.26192  8.900304  0                  519.84       66.534   7.813    4.30e-14  ***
BP         221.3422     55.14417  4.013882  0.00003            324.39       65.422   4.958    1.02e-06  ***
S1         0            106.5671  0         1                  -792.184     416.684  -1.901   0.057947  .
S2         0            89.25398  0         1                  476.746      339.035  1.406    0.160389
S3         -153.097     69.5406   -2.20155  0.013848           101.045      212.533  0.475    0.634721
S4         0            77.67842  0         1                  177.064      161.476  1.097    0.273456
S5         447.3803     73.6392   6.075301  0                  751.279      171.902  4.37     1.56e-05  ***
S6         0            45.80782  0         1                  67.625       65.984   1.025    0.305998

Significance codes for P-values: "***" 0, "**" 0.001, "*" 0.01, "." 0.05.
Table 2.6.2 Results from the Diabetes data example

2.7 Simulation Study

2.7.1 Autoregressive model

We use an AR(1) model to generate multi-correlated variables. The AR(1) model is $x_t = \phi_0 + \phi_1 x_{t-1} + \varepsilon_t$, where $\varepsilon_1, \varepsilon_2, \ldots$ is an i.i.d. sequence with mean 0 and variance $\sigma^2$. Assuming weak stationarity, it is easy to see that $E(x_t) = \phi_0 + \phi_1 E(x_{t-1})$. Substituting $\mu$ for $E(x_t)$, $\mu = \phi_0 + \phi_1 \mu$, so we get

$$\mu = \frac{\phi_0}{1 - \phi_1}. \qquad (2.7.1)$$

The autocovariance can be derived as follows:

$$x_t = (1 - \phi_1)\mu + \phi_1 x_{t-1} + \varepsilon_t \;\Rightarrow\; x_t - \mu = \phi_1 (x_{t-1} - \mu) + \varepsilon_t,$$

so, writing $r_l = \mathrm{Cov}(x_t, x_{t-l})$,

$$r_l = E[(x_t - \mu)(x_{t-l} - \mu)] = E\big[ \big( \phi_1 (x_{t-1} - \mu) + \varepsilon_t \big)(x_{t-l} - \mu) \big] = \phi_1 E[(x_{t-1} - \mu)(x_{t-l} - \mu)] = \phi_1 r_{l-1} = \phi_1^{\,l}\, r_0.$$

Since $\rho_{ij}$ denotes the correlation between $x_i$ and $x_j$, simplifying by writing $\phi_1 = \phi$ leads to

$$\rho_{i, i-l} = \frac{r_l}{r_0} = \phi^{\,l}. \qquad (2.7.2)$$

The signal-to-noise ratio (SNR) is defined as $\mathrm{SNR} = \beta^T X / \sigma$. In order to match a given SNR, I set the initial value of the sequence to $x_0 = \mu$; then $\mu$ and $x_0$ can easily be calculated. The details of the calculation are shown in the next section.

2.7.2 Simulation I

The model is $y = \beta^T X + \varepsilon$. I generated 1000 observations with 8 correlated variables. The parameters are designed to be $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^T$, and $\varepsilon$ is a random error with normal distribution $N(0, \sigma^2)$. The correlation between $x_i$ and $x_j$ is $\phi^{|i-j|}$, with $\phi = 0.5$ and $\sigma = 3$; that is, $\rho_{ij} = \rho_l = \phi^{|i-j|}$ with $l = |i - j|$. The data generated yields an SNR of approximately 5.7. According to $x_t = (1 - \phi_1)\mu + \phi_1 x_{t-1} + \varepsilon_t$, we have (in expectation)

$$x_1 = 0.5 x_0 + 0.5\mu;$$
$$x_2 = 0.5 x_1 + 0.5\mu = 0.5(0.5 x_0 + 0.5\mu) + 0.5\mu = 0.75\mu + 0.25 x_0;$$
$$\ldots$$
$$x_5 = 0.0625 x_0 + 0.9375\mu;$$
$$\ldots$$

If I wish to get SNR = 5.7 with $\sigma = 3$, then $\beta^T X = 5.7 \times 3 \approx 17$. Note that

$$\beta^T X = 3 x_1 + 1.5 x_2 + 2 x_5 = 3(0.5 x_0 + 0.5\mu) + 1.5(0.75\mu + 0.25 x_0) + 2(0.0625 x_0 + 0.9375\mu),$$

where $x_0$ is the initial value, which can be set to $\mu$. Solving for $\mu$ gives $\mu = x_0 = 2.6153$. To generalize the procedure,

$$\mathrm{SNR} \times \sigma = \Big( \sum_{j=1}^{p} \beta_j \Big) \mu, \quad \text{and} \quad \mu = x_0 = \frac{\mathrm{SNR} \times \sigma}{\sum_{j=1}^{p} \beta_j}. \qquad (2.7.3)$$

All initial values and means of the other simulation examples are decided by the same method; the other two simulations followed the same procedure. The result of Simulation I is shown in Fig 2.7.1.

Fig 2.7.1 LASSO coefficient shrinkage in Simulation I: each curve represents a coefficient as a function of the relative bound $s$. The covariates enter the regression equation sequentially as $s$ increases, in the order i = 1, 5, 2, 6, …, 7. If $s = 0.6$, shown by the vertical red line and chosen by GCV (shown in the right panel), variables 1, 5, and 2 are nonzero, which is consistent with $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^T$.
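The simulation design above is easy to reproduce. The Python/NumPy sketch below (an illustrative stand-in for the thesis's R code) draws the covariates directly from a multivariate normal with the AR(1) correlation structure $\rho_{ij} = \phi^{|i-j|}$ rather than running the recursion, and computes the common mean $\mu$ from (2.7.3) using $\beta^T X \approx 17$ as in the text.

```python
import numpy as np

phi, sigma = 0.5, 3.0
beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
target = 17.0                      # beta^T X ~ SNR * sigma = 5.7 * 3, as in the text
mu = target / beta.sum()           # (2.7.3): gives 17 / 6.5 = 2.6153...

p = len(beta)
lags = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
corr = phi ** lags                 # rho_ij = phi^{|i-j|}, the AR(1) correlation
rng = np.random.default_rng(5)
X = mu + rng.multivariate_normal(np.zeros(p), corr, size=1000)
y = X @ beta + sigma * rng.standard_normal(1000)
emp = np.corrcoef(X.T)             # empirical correlations, close to phi^{|i-j|}
```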



2.7.3 Simulation II

I generated 1000 observations with 8 correlated variables. The parameters are designed to be $\beta = (0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85)^T$. The correlation between $x_i$ and $x_j$ remains $\phi^{|i-j|}$ with $\phi = 0.5$, and the standard deviation of the error is $\sigma = 3$. The data generated yields a signal-to-noise ratio (SNR) of approximately 1.8. The result of Simulation II is shown in Fig 2.7.2.

Fig 2.7.2 LASSO coefficient shrinkage in Simulation II: the GCV score plot doesn't have an obvious elbow position; if one must be chosen, the optimal $s$ is 0.8. The covariates enter the regression equation sequentially as $s$ increases, in the order i = 5, 7, 3, …, 8. If $s = 0.8$, shown by the vertical red line, all variables are nonzero, which is consistent with the assigned $\beta = (0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85)^T$.

2.7.4 Simulation III

I generated 1000 observations with 8 correlated variables. The parameters are designed to be $\beta = (6, 0, 0, 0, 0, 0, 0, 0)^T$. The correlation between $x_i$ and $x_j$ is still $\phi^{|i-j|}$ with $\phi = 0.5$, and the standard deviation of the error is $\sigma = 2$. The data generated yields a signal-to-noise ratio (SNR) of approximately 7. The result of Simulation III is shown in Fig 2.7.3.

Fig 2.7.3 LASSO coefficient shrinkage in Simulation III: the GCV score plot on the right panel has an obvious elbow position; the optimal $s$ is approximately 0.6. The first variable enters the regression equation first as $s$ increases; the order of entry of the other variables is hard to tell. If $s = 0.6$, only the first variable is nonzero, which is consistent with the assigned $\beta = (6, 0, 0, 0, 0, 0, 0, 0)^T$.

From Fig 2.7.3, because one of the variables is clearly more significant than the others, the GCV score plot shows a more obvious elbow position. To generalize, the relative significance of the variables determines how distinguishable the elbow position is.



Chapter 3 Logistic LASSO Model

3.1 Logistic LASSO model and Maximum likelihood estimate

3.1.1 Logistic LASSO model

The $l_1$-norm constraint of the LASSO can also be applied to logistic regression models (Genkin et al., 2004) by minimizing the corresponding negative log-likelihood instead of the residual sum of squares. The LASSO logistic model is similar to the LASSO linear model except that the $Y_i$'s can only take two possible values, 0 and 1. I will assume that the response variable $Y_i$ is a Bernoulli random variable with probability

$$P(Y_i = 1 \mid x_i, \beta) = \pi_i, \qquad P(Y_i = 0 \mid x_i, \beta) = 1 - \pi_i. \qquad (3.1.1)$$

It is easily proven that

$$E(Y_i) = \pi_i \qquad (3.1.2)$$

and

$$\sigma_{Y_i}^2 = \pi_i (1 - \pi_i). \qquad (3.1.3)$$

Generally, when the response variable is binary, a monotonically increasing (or decreasing) S-shaped function is usually employed. The function is called the logistic response function, or logistic function:

$$P(Y_i = 1 \mid x_i, \beta) = E(Y_i) = \frac{1}{1 + \exp( -x_i^T \beta )}. \qquad (3.1.4)$$



Fig 3.1.1 Logistic Response function

Writing $\eta = x_i^T \beta$ for the linear predictor and $\pi$ for the response probability, (3.1.4) can be inverted to give

$$\eta = \ln \frac{\pi}{1 - \pi}. \qquad (3.1.5)$$

3.1.2 Laplace Priors & LASSO Logistic Regression

We extend the ordinary logistic model to the LASSO logistic regression model by imposing an $l_1$ constraint on the parameters: $\sum_j |\beta_j| \le t$.

Tibshirani (1996) first suggested that the Bayesian approach to LASSO regression involves a Laplace prior distribution of $\beta$. For the LASSO logistic regression model, Genkin et al. (2004) assumed that $\beta_j$ arises from a normal distribution with mean 0 and variance $\tau_j$, that is,

$$p(\beta_j \mid \tau_j) = \frac{1}{\sqrt{2\pi \tau_j}} \exp\Big( -\frac{\beta_j^2}{2\tau_j} \Big). \qquad (3.1.6)$$

The assumption of mean 0 indicates our belief that $\beta_j$ is close to zero. The variances $\tau_j$ are positive constants: a small value of $\tau_j$ represents a prior belief that $\beta_j$ is close to zero, while a large value of $\tau_j$ represents a less informative prior belief. In the simplest case, assume that $\tau_j = \tau$ for all $j$ and that the components of $\beta$ are independent, so the overall prior of $\beta$ is the product of the priors of the components $\beta_j$. Tibshirani (1996) suggested that $\tau_j$ arises from an exponential prior with density

$$p(\tau_j \mid \gamma_j) = \frac{\gamma_j}{2} \exp\Big( -\frac{\gamma_j}{2} \tau_j \Big). \qquad (3.1.7)$$

Integrating out $\tau_j$ gives the distribution of $\beta_j$ as follows (Lemma 3.1):

$$p(\beta_j \mid \zeta_j) = \frac{\zeta_j}{2} \exp\big( -\zeta_j |\beta_j| \big), \qquad \zeta_j = \sqrt{\gamma_j}. \qquad (3.1.8)$$

Lemma 3.1. If $\beta$ arises from a normal distribution with mean 0 and variance $\tau$, and $\tau$ arises from the exponential prior (3.1.7) with coefficient $\gamma$, then

$$p(\beta \mid \gamma) = \frac{\sqrt{\gamma}}{2} \exp\big( -\sqrt{\gamma}\, |\beta| \big).$$

Proof.

$$p(\beta \mid \gamma) = \int_0^{\infty} p(\beta \mid \tau)\, p(\tau \mid \gamma)\, d\tau = \frac{\gamma}{2} \frac{1}{\sqrt{2\pi}} \int_0^{\infty} \frac{1}{\sqrt{\tau}} \exp\Big[ -\Big( \frac{\gamma}{2}\tau + \frac{\beta^2}{2\tau} \Big) \Big] d\tau.$$

Substituting $x = \sqrt{\gamma/2}\, \sqrt{\tau}$, so that $\tau = 2x^2/\gamma$, $d\tau = \frac{4x}{\gamma}\, dx$, and $\frac{\beta^2}{2\tau} = \frac{\beta^2 \gamma}{4x^2}$, leads to

$$p(\beta \mid \gamma) = \frac{\sqrt{\gamma}}{\sqrt{\pi}} \int_0^{\infty} \exp\Big[ -\Big( \frac{\beta^2 \gamma}{4x^2} + x^2 \Big) \Big] dx.$$

We know that $\int_0^{\infty} \exp\big[ -\big( \frac{a^2}{x^2} + x^2 \big) \big]\, dx = \frac{\sqrt{\pi}}{2} \exp[-2a]$, where here $a = \frac{|\beta| \sqrt{\gamma}}{2}$. Now we reach the designated form

$$p(\beta \mid \gamma) = \frac{\sqrt{\gamma}}{2} \exp\big( -\sqrt{\gamma}\, |\beta| \big).$$

If we do the substitution $\zeta = \sqrt{\gamma}$, we have

$$p(\beta \mid \zeta) = \frac{\zeta}{2} \exp\big( -\zeta |\beta| \big). \qquad \blacksquare$$

This is the density of the double exponential (Laplace) distribution. Fig 3.1.2 gives the plot of this density function together with the normal density function.

Fig 3.1.2 Double exponential density function (black) and normal density function (red).

From Fig 3.1.2, one can see that the Laplace prior (double exponential distribution) favors values around 0, which reflects the fact that the LASSO favors zero as the estimate for some variables.
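Lemma 3.1, the normal-exponential scale mixture yielding a Laplace marginal, can also be checked by simulation; the Python/NumPy sketch below uses an arbitrary γ. A Laplace density $\frac{\zeta}{2} e^{-\zeta|\beta|}$ with $\zeta = \sqrt{\gamma}$ has variance $2/\gamma$ and mean absolute value $1/\sqrt{\gamma}$, which the mixture sample should reproduce.

```python
import numpy as np

rng = np.random.default_rng(6)
gamma = 2.0
n = 200_000

# p(tau | gamma) = (gamma/2) exp(-gamma*tau/2), i.e. exponential with scale 2/gamma
tau = rng.exponential(scale=2.0 / gamma, size=n)
# beta | tau ~ N(0, tau)
beta = rng.normal(0.0, np.sqrt(tau))

var_emp = beta.var()            # should approach 2/gamma = 1.0
mean_abs = np.abs(beta).mean()  # should approach 1/sqrt(gamma) ~ 0.7071
```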

3.2 Maximum likelihood and Maximum a Posteriori estimate

3.2.1 Maximum Likelihood (ML) Estimate

First recall the linear logistic regression model; we use maximum likelihood to estimate the parameters $\beta$. The density function of each sample observation is

$$f_i(y_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}. \qquad (3.2.1)$$

Since each observation is assumed to be independent, the likelihood function is

$$L(Y, \beta) = \prod_{i=1}^{n} f_i(y_i) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}. \qquad (3.2.2)$$

The log-likelihood function is

$$\ln L(Y, \beta) = \sum_{i=1}^{n} y_i \ln\Big( \frac{\pi_i}{1 - \pi_i} \Big) + \sum_{i=1}^{n} \ln(1 - \pi_i) = \sum_{i=1}^{n} y_i x_i^T \beta - \sum_{i=1}^{n} \ln\big[ 1 + \exp( x_i^T \beta ) \big]. \qquad (3.2.3)$$

Setting

$$\frac{d \ln L(Y, \beta)}{d \beta} = \sum_{i} y_i x_i - \sum_{i} \frac{\exp( x_i^T \beta )}{1 + \exp( x_i^T \beta )}\, x_i = 0 \qquad (3.2.4)$$

and solving (3.2.4) gives an estimate of $\beta$. However, there is no closed-form solution for this likelihood equation. The most common optimization approach in statistical software such as SAS and S-Plus/R is the multidimensional Newton-Raphson algorithm, implemented via the iteratively reweighted least squares (IRLS) algorithm (Dennis and Schnabel 1989; Hastie and Pregibon 1992). The Newton-Raphson method has the advantage of converging after very few iterations if you have a very good initial value. For details, please refer to Introduction to Linear Regression Analysis by D. Montgomery (4th edition), Appendix C.14.
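A minimal Newton-Raphson/IRLS iteration for (3.2.4) looks as follows (Python/NumPy; the data are simulated for illustration, and this is a bare sketch rather than a production routine). Each step solves a weighted least squares problem with weights $\pi_i(1 - \pi_i)$.

```python
import numpy as np

def logistic_irls(X, y, n_iter=25):
    """Newton-Raphson (IRLS) for the logistic log-likelihood (3.2.3)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))  # pi_i at current beta
        w = p * (1.0 - p)                                      # weights pi_i (1 - pi_i)
        # Newton step: beta += (X^T W X)^{-1} X^T (y - p), cf. (3.2.4)
        beta = beta + np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - p))
    return beta

rng = np.random.default_rng(7)
X = rng.standard_normal((2000, 2))
beta_true = np.array([1.0, -2.0])
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)
beta_hat = logistic_irls(X, y)
```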

3.2.2 Maximum A Posteriori (MAP) Estimate

MAP estimation was introduced initially by Harold W. Sorenson (1980). The MAP estimate can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to Fisher's method of maximum likelihood (ML), but it incorporates a prior distribution on the parameters, which allows us to apply constraints to the problem; the ML estimate can be seen as a special case of the MAP estimate. Assume that we want to estimate an unobserved population parameter θ on the basis of observations x. Let f be the sampling distribution of X, so that f(x | θ) is the probability of x when the underlying population parameter is θ. Then the function f(x | θ) is known as the likelihood function, and the estimate

\hat{\theta}_{ML}(x) = \arg\max_{\theta} f(x \mid \theta)   (3.2.5)

is the maximum likelihood estimate of θ. Now assume that θ has a prior distribution g. In Bayesian statistics we treat θ as a random variable, so the posterior distribution of θ is as follows:


\theta \mid x \sim \frac{f(x \mid \theta)\, g(\theta)}{\int_{\Theta} f(x \mid \theta')\, g(\theta')\, d\theta'} .   (3.2.6)

Here Θ is the domain of g. The MAP method then estimates θ as the argument that maximizes the posterior density of the random variable θ:

\hat{\theta}_{MAP}(x) = \arg\max_{\theta} \frac{f(x \mid \theta)\, g(\theta)}{\int_{\Theta} f(x \mid \theta')\, g(\theta')\, d\theta'} = \arg\max_{\theta} f(x \mid \theta)\, g(\theta) .   (3.2.7)

The denominator of the posterior distribution does not depend on θ; therefore, maximizing the posterior is equivalent to maximizing its numerator. Observe that the MAP estimate of θ coincides with the ML estimate when the prior g is uniform; the MAP estimate is the Bayes estimator under the uniform loss function. When MAP is applied to the LASSO, \hat{\beta}_{MAP} = \arg\max_{\beta} f(x \mid \beta)\, p(\beta), which is equivalent to maximizing the log posterior:

\hat{\beta}_{MAP} = \arg\max_{\beta} \left[ \ln L(Y, \beta) + \ln p(\beta) \right] = \arg\max_{\beta} \left\{ \sum_{i=1}^{n} y_i x_i^T \beta - \sum_{i=1}^{n} \ln\left[ 1 + \exp(x_i^T \beta) \right] - \sum_{j=1}^{p} \left( \ln 2 - \ln \lambda_j + \lambda_j |\beta_j| \right) \right\} .   (3.2.8)

Let

f(x, y, \beta) = \sum_{i=1}^{n} y_i x_i^T \beta - \sum_{i=1}^{n} \ln\left[ 1 + \exp(x_i^T \beta) \right] - \sum_{j=1}^{p} \left( \ln 2 - \ln \lambda_j + \lambda_j |\beta_j| \right) .   (3.2.9)

Taking the derivative of (3.2.9) with respect to \beta, we obtain the following equation:

\sum_{i=1}^{n} y_i x_i - \sum_{i=1}^{n} \frac{\exp(x_i^T \beta)}{1 + \exp(x_i^T \beta)}\, x_i - \sum_{j=1}^{p} \lambda_j \operatorname{sign}(\beta_j) = 0 .   (3.2.10)

There is no closed-form solution for equation (3.2.10); however, a variety of alternative optimization approaches have been explored for MAP estimation, especially when p (the number of variables) is large. For further detail, refer to Yuan, M. and Lin, Y. (2006) and Genkin, A., Lewis, D. D. and Madigan, D. (2004). Fortunately, numerical approaches have been implemented in statistical software, such as the R package lasso2 (Lokhorst, Venables and Turlach), and will be introduced in Chapter 4.
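One simple numerical scheme for (3.2.10) is proximal gradient ascent on (3.2.9): take a gradient step on the log-likelihood, then soft-threshold each coordinate. This sketch uses a single common λ and made-up data; it is not the lasso2 algorithm:

```python
import numpy as np

def soft(u, t):
    """Soft-thresholding operator induced by the lambda_j |beta_j| terms."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def map_logistic_lasso(X, y, lam, step=0.01, n_iter=5000):
    """Approximately maximize (3.2.9) by proximal gradient (ISTA-style) steps."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - pi)                 # gradient of the log-likelihood part
        beta = soft(beta + step * grad, step * lam)
    return beta

# Hypothetical data with two irrelevant covariates
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
true_beta = np.array([2.0, 0.0, 0.0, -1.5])
y = (rng.random(150) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)
beta_hat = map_logistic_lasso(X, y, lam=8.0)
```

The soft-threshold step is exactly the subgradient condition hiding in the sign(β_j) term of (3.2.10).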

3.3 Case Study: Kyphosis Data

The kyphosis data is a very popular data set in logistic regression analysis. The data frame has 81 rows representing data on 81 children who have had corrective spinal surgery. The outcome Kyphosis is a binary variable; the other three variables (columns) are numeric. It was first published by John M. Chambers and Trevor J. Hastie (1992).

Kyphosis: a factor telling whether a postoperative deformity (kyphosis) is "present" or "absent".
Age: the age of the child (unit: months).
Number: the number of vertebrae involved in the operation.
Start: the beginning of the range of vertebrae involved in the operation.

The data structure looks like

index  Kyphosis  Age  Number  Start
1      absent     71    3      5
2      absent    158    3     14
3      present   128    4      5
4      absent      2    5      1
…      …           …    …      …
74     absent      1    4     15
75     absent    168    3     18
76     absent      1    3     16
78     absent     78    6     15
79     absent    175    5     13
80     absent     80    5     16
81     absent     27    4      9

Table 3.3 data structure of Kyphosis data


Fig 3.3.1 Boxplots of kyphosis data. Age and Number do not show a strong location shift; it turns out that quadratic forms of Age and Number should be added to the model.

After adding the quadratic terms of age, number and start, we analyze the data using logistic LASSO model. The result is shown in Fig 3.3.2 and Table 3.3.

1: Intercept; 2: Age; 3: Number; 4: Start; 5: Age^2; 6: Number^2; 7: Start^2
Fig 3.3.2 LASSO coefficient shrinkage in the kyphosis example: the GCV score plot is discrete because the error of the logistic model is non-normally distributed. Each monotone decreasing curve represents a coefficient as a function of the relative bound s. The covariates enter the regression equation sequentially as s increases, in the order i = 4, 3, 5, 2, 7. If s = 0.55 (shown by the vertical red line), chosen by GCV (shown in the right panel), variables 4 (Start), 3 (Number), 5 (Age^2) and 2 (Age) are nonzero.

We can see from Table 3.3 that LASSO yields a much smaller standard error than the OLS method as expected.


            LASSO Results                          OLS Results
Predictor   Coef      SE        Z-score      Coef      SE       Z-score   Pr(>|t|)
Intercept   -1.0054   0.445053  -2.25906     -0.1942   0.6391   -0.304    0.76125
Age          0.1425   0.178466   0.798473     1.1128   0.5728    1.943    0.05204 .
Number       0.2765   0.258829   1.068271     1.0031   0.5817    1.725    0.08461 .
Start       -0.6115   0.291552  -2.09739     -2.6504   0.9361   -2.831    0.00463 **
Age^2       -0.5987   0.379761  -1.57652     -1.4006   0.664    -2.109    0.03491 *
Number^2     0        0.043329   0           -0.3076   0.2398   -1.283    0.19955
Start^2      0        0.047823   0           -1.1143   0.5281   -2.11     0.03486 *

Table 3.3 Results of kyphosis data

3.4 Simulation Study

The model is

\pi_i = \frac{\exp(\beta^T x_i + \varepsilon_i)}{1 + \exp(\beta^T x_i + \varepsilon_i)}, \qquad i = 1, 2, \ldots, n.

We also use an autoregressive model to generate multi-correlated data. ε is a random error with normal distribution N(0, σ²). There are 5 variables and a total of 40 observations. The correlation between x_i and x_j is φ^{|i−j|}, with φ = 0.3 and σ = 1. The assigned coefficients are β = (3, 1.5, 0, 0, 2)^T.

index   y    x1          x2          x3          x4          x5
1       1    1.002997    1.339722    1.556792   -1.06295    -0.98428
2       0   -1.55593    -0.61825    -0.02163    -0.02956    -3.10966
3       0   -0.25711     0.944475    0.59679    -1.21193    -0.62585
4       1    0.884903    0.558206    0.828492   -0.01518     0.285952
5       0   -1.67065     0.497213   -0.24751     1.258528    0.670419
…       …    …           …           …           …           …
37      0   -0.31754    -0.6843     -0.41027    -0.08336    -0.08773
38      0   -0.7015      0.404193   -0.58057     0.235259   -1.67983
39      0    0.305101    1.483059    0.467948    2.321507   -1.00605
40      1   -0.22636    -0.37299    -0.94524    -1.30565     1.511724

Table 3.4.1 Data structure of the simulation
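The data-generating step can be sketched as follows. The AR(1) recursion across columns is one standard way (an assumption here) to produce the φ^{|i−j|} correlation structure:

```python
import numpy as np

def simulate_logistic_ar1(n=40, p=5, phi=0.3, sigma=1.0, seed=0):
    """Covariates with corr(x_i, x_j) = phi^{|i-j|}; binary logistic response."""
    rng = np.random.default_rng(seed)
    beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0])
    X = np.empty((n, p))
    X[:, 0] = rng.normal(size=n)
    for j in range(1, p):
        # stationary AR(1) across columns keeps each column's variance at 1
        X[:, j] = phi * X[:, j - 1] + np.sqrt(1.0 - phi**2) * rng.normal(size=n)
    eps = rng.normal(scale=sigma, size=n)      # the error term in the logit
    pi = 1.0 / (1.0 + np.exp(-(X @ beta + eps)))
    y = (rng.random(n) < pi).astype(int)
    return X, y

X, y = simulate_logistic_ar1()
```

Each column has unit variance, so the implied correlation between columns j and k is φ^{|j−k|}.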

We can see the LASSO analysis procedure more clearly from Fig 3.4.1. The left diagram shows the change of GCV scores with respect to the relative bound; there is an evident elbow at approximately s = 0.6. The right diagram shows the change of the LASSO coefficients with respect to the relative bound. If we choose s to be 0.6 and draw a vertical line through the diagram, we can pick out 2 (x1), 3 (x2), and 6 (x5). The black line represents the intercept.

Fig 3.4.1 LASSO shrinkage of coefficients in the simulation example: from the left panel, the obvious elbow position is at 0.6, thus ŝ = 0.6. The right panel shows how the LASSO shrinks the coefficients as s shrinks.

Now we fix s at 0.6 and analyze the data set using both the generalized linear model and the LASSO. The table below shows how the LASSO benefits us in reducing standard error. The standard errors of the LASSO and OLS estimates are computed by bootstrapping from the original data set. The LASSO yields a much smaller standard error than the OLS method, as expected.

            LASSO Results                          OLS Results
Predictor   Coef       SE        Z-score      Coef       SE        Z-score   Pr(>|t|)
Intercept   1.127088   0.354877  3.175998     1.49703    0.96905   1.545     0.1224
x1          2.625973   0.546351  4.806383     3.961      1.78794   2.215     0.0267 *
x2          0.05362    0.207421  0.258509     0.97223    0.93206   1.043     0.2969
x3          0          0.069338  0           -0.90678    0.97318  -0.932     0.3515
x4          0          0.183042  0           -0.08364    0.70519  -0.119     0.9056
x5          0.543464   0.232676  2.335708     1.51508    0.9877    1.534     0.125

Table 3.4.2 Results for the simulation study


Chapter 4 Computation of the LASSO

Tibshirani (1996) first suggested that the computation of the LASSO solutions is a quadratic programming problem, which does not necessarily have a closed-form solution. However, statisticians have developed numerous algorithms that approach the solution numerically. The algorithm proposed by Tibshirani (1996) is adequate for moderate values of p, but it is not the most efficient possible: it is inefficient when the number of variables is large, and unusable when the number of variables is larger than the number of observations. Thus, statisticians have explored more efficient ways to estimate the parameters for years. A few significant results are the dual algorithm (M. Osborne, B. Presnell and B. Turlach, 2000); L2-boosting approaches (N. Meinshausen, G. Rocha and B. Yu, 2007); the shooting method (Wenjiang Fu, 1998); and Least Angle Regression (LARS) (B. Efron, I. Johnstone, T. Hastie and R. Tibshirani, 2002). LARS is the most widely used algorithm; it exploits the special structure of the LASSO problem and provides an efficient way to compute the solutions simultaneously for all values of s. In this chapter, I will briefly introduce Osborne's dual algorithm and Efron's LARS algorithm.

4.1 Osborne’s dual algorithm

Osborne et al. present a fast-converging algorithm in their paper On the LASSO and its Dual (2000). They first introduce a general algorithm based on the duality theory described in that paper; the algorithm can be used to compute the LASSO estimates in any setting. For the details, please refer to Osborne (2000). Here, however, I will only illustrate a simpler case: the orthogonal design case.

We have already shown that the solution to (1.1) subject to (1.2) is (2.1.7); substituting λ = 2γ, (2.1.7) becomes

\hat{\beta}_j^L = \operatorname{sign}(\hat{\beta}_j^o) \left( |\hat{\beta}_j^o| - \gamma \right)_+ , \qquad j = 1, \ldots, p.   (4.1.1)

Suppose that

\sum_j |\hat{\beta}_j^o| = t_o \qquad \text{and} \qquad \sum_j |\hat{\beta}_j^L| = t ,


then

t_o - t = \sum_j |\hat{\beta}_j^o| - \sum_j |\hat{\beta}_j^L| = \sum_j \left\{ |\hat{\beta}_j^o| - \left( |\hat{\beta}_j^o| - \gamma \right)_+ \right\}
        = \sum_j |\hat{\beta}_j^o| \, I\left( |\hat{\beta}_j^o| \le \gamma \right) + \gamma \sum_j I\left( |\hat{\beta}_j^o| > \gamma \right)
        = \sum_{j=1}^{K} b_j + \gamma (p - K) ,

where p is the number of variables, b_1 \le b_2 \le \cdots \le b_p are the order statistics of |\hat{\beta}_1^o|, \ldots, |\hat{\beta}_p^o|, and K = \max\{ j : b_j \le \gamma \}. Since usually t < t_o, we have K < p and b_K \le \gamma \le b_{K+1}.

• Start with c_0 = 0. Iterating j from 1 to p, compute

c_j = \sum_{i=1}^{j} b_i + b_j (p - j) ,

which gives 0 = c_0 \le c_1 \le \cdots \le c_p = t_o.

• Let K = \max\{ i : c_i \le t_o - t \}, which is easily computed once t is chosen by GCV.

• The corresponding threshold is

\gamma = \frac{(t_o - t) - \sum_{i=1}^{K} b_i}{p - K} .   (4.1.2)

• Obtain all the LASSO estimates by \hat{\beta}_j^L = \operatorname{sign}(\hat{\beta}_j^o) \left( |\hat{\beta}_j^o| - \gamma \right)_+ , \quad j = 1, \ldots, p.
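Under an orthogonal design the whole procedure collapses to a few lines. A sketch (the OLS coefficients below are hypothetical):

```python
import numpy as np

def lasso_orthogonal(beta_ols, t):
    """Osborne-style orthogonal-design LASSO: soft-threshold the OLS
    coefficients so that the solution has L1 norm exactly t."""
    b = np.sort(np.abs(beta_ols))               # order statistics b_1 <= ... <= b_p
    p = len(b)
    t_o = b.sum()
    if t >= t_o:                                # bound inactive: LASSO = OLS
        return beta_ols.copy()
    # c_j = sum_{i<=j} b_i + b_j (p - j), nondecreasing with c_p = t_o
    c = np.cumsum(b) + b * (p - np.arange(1, p + 1))
    K = int(np.sum(c <= t_o - t))               # K = max{ i : c_i <= t_o - t }
    gamma = ((t_o - t) - b[:K].sum()) / (p - K) # equation (4.1.2)
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - gamma, 0.0)

beta_ols = np.array([3.0, -1.5, 0.2, -0.1, 2.0])
beta_l = lasso_orthogonal(beta_ols, t=4.0)      # soft-thresholded, sum |b_j| = 4
```

Here (4.1.2) gives γ = 5/6, so the two smallest coefficients are zeroed and the remaining ones shrink toward zero by γ.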

Based on Osborne’s algorithm, Justin Lokhorst, Bill Venables and Berwin Turlach developed an R package “lasso2”. The package and the manual can be downloaded from the following link: http://www.maths.uwa.edu.au/~berwin/software/lasso.html

4.2 Least angle regression algorithm

Least Angle Regression (LARS) is a model selection algorithm introduced in detail in a paper by Brad Efron, Trevor Hastie, Iain Johnstone and Rob Tibshirani. In the paper, they establish the theory behind the algorithm and then show how LARS relates to the LASSO and to forward stagewise regression; one advantage of LARS is that simple modifications of it implement these two techniques. The modification that turns LARS into the LASSO is: if a non-zero coefficient hits zero, remove it from the active set of predictors and recompute the joint direction (Hastie 2002). The algorithm is shown below:

• Start with all coefficients β_j equal to zero.
• Find the predictor x_j most correlated with y.
• Increase the coefficient β_j in the direction of the sign of its correlation with y, taking residuals r = y − ŷ along the way. Stop when some other predictor x_k has as much correlation with r as x_j has.
• Increase (β_j, β_k) in their joint least squares direction until some other predictor x_m has as much correlation with the residual r.
• Continue in this way until all p predictors have been entered. After p steps, one arrives at the full least-squares solution.
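The flavor of the path can be reproduced with an ε-step forward stagewise sketch, a close cousin of LARS rather than the exact equiangular version; run long enough on made-up data it settles near the full least-squares solution, as the last step above states:

```python
import numpy as np

def forward_stagewise(X, y, eps=1e-3, n_steps=50000):
    """Repeatedly nudge the coefficient of the predictor most correlated
    with the current residual by eps in the sign of that correlation."""
    beta = np.zeros(X.shape[1])
    r = y.astype(float).copy()
    for _ in range(n_steps):
        corr = X.T @ r                      # current correlations with residual
        j = int(np.argmax(np.abs(corr)))
        delta = eps * np.sign(corr[j])
        beta[j] += delta                    # small step on the winning predictor
        r -= delta * X[:, j]                # update the residual
    return beta

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=60)
beta_stage = forward_stagewise(X, y)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # full least-squares solution
```

With a small enough ε, the stagewise path hugs the LARS path; the fixed-step version merely hovers within O(ε) of the least-squares solution at the end.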

Based on this algorithm, Trevor Hastie and Brad Efron developed an R package called "lars". The package and the manual can be downloaded from the following link: http://www-stat.stanford.edu/~hastie/Papers/#LARS


References

Chambers, J. and Hastie, T. (1992). Statistical Models in S. Wadsworth and Brooks, Pacific Grove, CA, p. 200.

Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions. Numerische Mathematik, 31, 377–403.

Efron, B., Johnstone, I., Hastie, T. and Tibshirani, R. (2002). Least angle regression. Annals of Statistics.

Fu, W. (1998). Penalized regressions: the bridge versus the LASSO. Journal of Computational and Graphical Statistics, 7(3), 397–416.

Genkin, A., Lewis, D. D. and Madigan, D. (2004). Large-Scale Bayesian Logistic Regression for Text Categorization.

Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.

Hastie, T., Taylor, J., Tibshirani, R. and Walther, G. Forward Stagewise Regression and the Monotone LASSO.

Meinshausen, N., Rocha, G. and Yu, B. (2007). A Tale of Three Cousins: LASSO, L2-Boosting and Dantzig. Submitted to the Annals of Statistics, 2008. UC Berkeley.

Osborne, M., Presnell, B. and Turlach, B. (2000). On the LASSO and its dual. Journal of Computational and Graphical Statistics, 9, 319–337.

Sorenson, H. W. (1980). Parameter Estimation: Principles and Problems. Marcel Dekker.

Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.

Yuan, M. and Lin, Y. (2006). Model Selection and Estimation in Regression with Grouped Variables. Journal of the Royal Statistical Society, Series B, 68, 49–67.


Appendix A: The effective Number of Parameters

The effective number of parameters was introduced in the book by Hastie, Tibshirani and Friedman (2001). The concept of "number of parameters" can be generalized, especially to models where regularization is used in the fitting.

Suppose \hat{Y} = HY, where Y is the vector of outcomes y_1, \ldots, y_n, \hat{Y} is the vector of predictions \hat{y}_1, \ldots, \hat{y}_n, and H is an n \times n matrix depending on the input vectors X. The effective number of parameters is defined as

d(H) = \operatorname{trace}(H) .
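For a linear smoother, d(H) is simple to compute. A sketch using ridge regression as the example (my choice of illustration; the definition applies to any fit of the form Ŷ = HY), where H = X(XᵀX + λI)⁻¹Xᵀ:

```python
import numpy as np

def effective_df_ridge(X, lam):
    """trace(H) for the ridge hat matrix H = X (X'X + lam*I)^{-1} X'."""
    p = X.shape[1]
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return np.trace(H)

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
df0 = effective_df_ridge(X, 0.0)     # no shrinkage: trace(H) equals p = 5
df_big = effective_df_ridge(X, 1e6)  # heavy shrinkage: trace(H) approaches 0
```

As λ grows, d(H) falls from p toward 0, which matches the intuition that shrinkage spends fewer effective parameters.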

Appendix B: R Code for Examples and Simulations

1. Acetylene data

library(lasso2)
library(lars)
plot.new()

P <- as.vector(c(49,50.2,50.5,48.5,47.5,44.5,28,31.5,34.5,35,38,38.5,15,17,20.5,29.5), mode="numeric")
Temperature <- as.vector(c(1300,1300,1300,1300,1300,1300,1200,1200,1200,1200,1200,1200,1100,1100,1100,1100), mode="numeric")
H2 <- as.vector(c(7.5,9,11,13.5,17,23,5.3,7.5,11,13.5,17,23,5.3,7.5,11,17), mode="numeric")
ContactTime <- as.vector(c(0.012,0.012,0.0115,0.013,0.0135,0.012,0.04,0.038,0.032,0.026,0.034,0.041,0.084,0.098,0.092,0.086), mode="numeric")
acetylene <- data.frame(Temperature=Temperature, H2=H2, ContactTime=ContactTime)

#plot(Temperature, ContactTime, xlab="Reactor Temperature", ylab="Contact time")
cor(acetylene)
TH <- Temperature*H2


TC <- Temperature*ContactTime
HC <- H2*ContactTime
T_2 <- Temperature^2
H2_2 <- H2^2
C_2 <- ContactTime^2
acetylene <- data.frame(T=Temperature, H=H2, C=ContactTime, TC=TC, TH=TH, HC=HC, T2=T_2, H2=H2_2, C2=C_2)
acetylene

a.mean <- apply(acetylene, 2, mean)
acetylene <- sweep(acetylene, 2, a.mean, "-")
a.var <- apply(acetylene, 2, var)
acetylene <- sweep(acetylene, 2, sqrt(a.var), "/")
xstd <- as.matrix(acetylene)
acetylene <- data.frame(y=P, acetylene)

a.lm <- lm(y~., data=acetylene)
summary(a.lm)
anova(a.lm)
#ace_step <- step(a.lm)
#summary(ace_step)

ace_lasso2 <- l1ce(y~., data=acetylene, bound=(1:40)/40)
ystd <- as.vector(P, mode="numeric")
ace_lars <- lars(xstd, ystd, type="lasso", intercept=TRUE)
#plot(ace_lars, main="shrinkage of coefficients Acetylene")

############ plot the GCV vs relative bound graph #############
lassogcv <- gcv(ace_lasso2)
lassogcv
lgcv <- matrix(lassogcv, ncol=4)
plot(lgcv[,1], lgcv[,4], type="l", main="Acetylene: GCV score vs s", xlab="relative bound", ylab="GCV")
ace <- l1ce(y~., data=acetylene, bound=0.23)
summary(ace)


2. Linear LASSO Simulation Example 1

library(lasso2)
library(lars)
par(mfrow=c(1,2))

########### generating the desired data
Generator <- function(mean, Ro, beta, sigma, ermean=0, erstd=1, dim)
{
  x <- matrix(c(rep(0, dim*9)), ncol=9)
  y <- matrix(c(rep(0, dim)), ncol=1)
  er <- matrix(c(rnorm(8*dim, 0, 1)), ncol=8)
  error <- matrix(c(rnorm(dim, mean=ermean, sd=erstd)), ncol=1)
  for(j in 1:dim)
  {
    x[j,1] <- mean
    for(i in 2:9)
    {
      x[j,i] <- (1-Ro)*mean + Ro*x[j,i-1] + er[j,i-1]
    }
    y[j] <- x[j,2:9] %*% beta + sigma*error[j]
  }
  signal <- x[,2:9] %*% beta
  SNR <- mean(signal)/erstd
  return(list(x=x[,2:9], y=y, SNR=SNR))
}

#########################################

SNR <- 5.7     # signal-to-noise ratio
sigma <- 3     # standard deviation of white noise
n <- 100       # data set size
# designated coefficients
beta <- matrix(c(3,1.5,0,0,2,0,0,0), ncol=1)
Ro <- 0.5      # correlation

signalmean <- SNR*sigma/sum(beta)
data <- Generator(signalmean, Ro, beta, sigma=sigma, dim=n)  # dim, not n, is the formal argument


data

############ Standardize the data ############################
x.mean <- apply(data$x, 2, mean)          # mean of each parameter
xros <- sweep(data$x, 2, x.mean, "-")     # subtract the corresponding mean from every element
x.std <- apply(xros, 2, var)              # variance of each parameter
xstd <- sweep(xros, 2, sqrt(x.std), "/")  # divide each centered column by its standard deviation

ystd <- as.vector(data$y, mode="numeric")

############ Analysis of the data with LASSO ##################
# Using the "lars" package
plres <- lars(xstd, ystd, type="lasso", intercept=TRUE)
plot(plres, main="shrinkage of coefficients Example 1")

# Using the "lasso2" package
l1c.P <- l1ce(ystd~xstd, xros, bound=(1:40)/40)
l1c.P

betaLasso <- matrix(coef(l1c.P), ncol=8)

############ plot the GCV vs relative bound graph #############
lassogcv <- gcv(l1c.P)
lassogcv
lgcv <- matrix(lassogcv, ncol=4)
plot(lgcv[,1], lgcv[,4], type="l", main="GCV score vs s", xlab="relative bound", ylab="GCV")

3. Diabetes data

library(lasso2)
library(lars)
data(diabetes)

diabetes_x <- diabetes[,1]
diabetes_x[1:10,]
diabetes_y <- diabetes[,2]
diab <- data.frame(age=diabetes_x[,1], sex=diabetes_x[,2], bmi=diabetes_x[,3],
                   BP=diabetes_x[,4], S1=diabetes_x[,5], S2=diabetes_x[,6],
                   S3=diabetes_x[,7], S4=diabetes_x[,8], S5=diabetes_x[,9],
                   S6=diabetes_x[,10], y=diabetes_y)   # the response still uses the original data

##### OLS ########
diabetes_OLS <- lm(y~., data=diab)
diabetes_OLS
summary(diabetes_OLS)
anova(diabetes_OLS)

##### Cross validation procedure to decide the tuning parameter t
cv.diabetes <- cv.lars(diabetes_x, diabetes_y, K=10, fraction=seq(from=0, to=1, length=40))
cv.diabetes
title("10-fold Cross Validation and Standard error")
lars_diabetes <- lars(diabetes_x, diabetes_y, type="lasso", intercept=TRUE)
plot(lars_diabetes)

# Using the "lasso2" package
l1c.diabetes <- l1ce(y ~ ., diab, bound=(1:40)/40)
l1c.diabetes
anova(l1c.diabetes)
betaLasso <- matrix(coef(l1c.diabetes), ncol=9)

lassogcv <- gcv(l1c.diabetes)
lassogcv
lgcv <- matrix(lassogcv, ncol=4)
plot(lgcv[,1], lgcv[,4], type="l", main="Diabetes: GCV vs s", xlab="s", ylab="GCV")

######## bootstrap to get the standard error ####################
resample <- function(data, m)
{
  dim <- ncol(data)   # resample every column of each drawn row
  res <- data
  r <- ceiling(runif(m)*m)
  for(i in 1:m)
  {
    for(j in 1:dim)
      res[i,j] <- data[r[i],j]
  }
  return(list(res=res))
}

###########################
l1c.diabetes <- l1ce(y~., diab, bound=0.4)
summary(l1c.diabetes)

nboot <- 500
diab_res <- resample(diab, 442)
l1c.diabetes <- l1ce(y~., diab_res$res, bound=0.4)
sum(residuals(l1c.diabetes)^2)
summary(l1c.diabetes)

l1c.C <- coefficients(l1c.diabetes)
for(m in 2:nboot)
{
  diab_res <- resample(diab, 442)
  l1c.diabetes <- l1ce(y~., diab_res$res, bound=0.4)
  l1c.C <- cbind(l1c.C, coefficients(l1c.diabetes))
}

sqrt(var(l1c.C["(Intercept)",]))
sqrt(var(l1c.C["age",]))
sqrt(var(l1c.C["sex",]))
sqrt(var(l1c.C["bmi",]))
sqrt(var(l1c.C["BP",]))
sqrt(var(l1c.C["S1",]))
sqrt(var(l1c.C["S2",]))
sqrt(var(l1c.C["S3",]))
sqrt(var(l1c.C["S4",]))
sqrt(var(l1c.C["S5",]))
sqrt(var(l1c.C["S6",]))

4. Kyphosis data

library(lasso2)
library(lars)
library(rpart)
data(kyphosis)

############ Transform Kyphosis to numeric mode ##############
n <- length(kyphosis$Kyphosis)
n
y <- matrix(c(rep(0, n)), ncol=1)
for(i in 1:n)
{
  if(kyphosis$Kyphosis[i] == "absent") y[i] <- 0
  if(kyphosis$Kyphosis[i] == "present") y[i] <- 1
}

############ Standardize the data ###########################
Temp <- kyphosis[,2:4]
k.mean <- apply(Temp, 2, mean)            # mean of each parameter
kyph <- sweep(Temp, 2, k.mean, "-")       # subtract the corresponding mean from every element
k.var <- apply(kyph, 2, var)              # variance of each parameter
kyph <- sweep(kyph, 2, sqrt(k.var), "/")  # divide each centered column by its standard deviation
kyph[,"Kyphosis"] <- y
kyph <- as.data.frame(kyph)               # the response still uses the original data

ageSq <- (kyph$Age)^2
numSq <- (kyph$Number)^2
startSq <- (kyph$Start)^2
kyph <- data.frame(Age=kyph$Age, Number=kyph$Number, Start=kyph$Start,
                   ageSq=ageSq, numSq=numSq, startSq=startSq, Kyphosis=y)

# Linear logistic fitted model
glm.K <- glm(Kyphosis~., data=kyph, family=binomial())
beta_logistic <- as.vector(coefficients(glm.K), mode="numeric")
beta_logistic

### logistic lasso with varying relative bound
gl1c.Coef <- matrix(c(rep(0, 7*40)), ncol=7)
for(i in 1:40)
{
  temp <- gl1ce(Kyphosis~., data=kyph, family=binomial(), bound=i/40)
  gl1c.Coef[i,] <- coefficients(temp)
}
gl1c.Coef <- data.frame(Intercept=gl1c.Coef[,1], Age=gl1c.Coef[,2], Number=gl1c.Coef[,3],
                        Start=gl1c.Coef[,4], AgeSq=gl1c.Coef[,5], NumberSq=gl1c.Coef[,6],
                        StartSq=gl1c.Coef[,7])
