Paper 422-2013

Being Continuously Discrete (or Discretely Continuous):

Understanding Models with Continuous and Discrete Predictors and

Testing Associated Hypotheses

David J. Pasta, ICON Late Phase & Outcomes Research, San Francisco, CA

ABSTRACT

Often a general (or generalized) linear model has both discrete predictors (included in the CLASS

statement) and continuous predictors. Binary variables can be treated either as continuous or discrete;

the resulting models are equivalent but the interpretation of the parameters differs. In many cases,

interactions between discrete and continuous variables are of interest. This paper provides practical

suggestions for building and interpreting models with both continuous and discrete predictors. It includes

some examples of the use of the STORE statement and PROC PLM to understand models and test

hypotheses without repeating the estimation step.

INTRODUCTION

First let us be clear on what we mean by discrete and continuous predictors. A discrete (or categorical)

predictor is one which is included in the CLASS statement. I use the terms discrete and categorical

interchangeably in this context. The individual values are not assumed to have any particular relationship

to each other: they are treated as just “names” for the categories and are not to be interpreted

quantitatively even if they are numbers. Variables for which the categories are considered to be names

without even a partial ordering are referred to as being nominal variables; I prefer to use the terms

discrete or categorical to emphasize the way they are being used in the model rather than the nature of

the variable itself. It is possible for the underlying variable to be continuous, but what is important for our

purposes is that we want to estimate the effect of each value separately and not to assume specific

spacing between values.

A continuous predictor is one for which the numeric values are treated as meaningful and the estimated

coefficient is interpreted as the effect of a one-unit change. In practice, ordinal variables can be treated

as discrete or as continuous (and sometimes profitably as both discrete and continuous in the same

analysis). In addition, continuous variables can be grouped into categories and converted into discrete

variables. This issue is discussed at length in Pasta (2009), but it is worthwhile to summarize a few of the

points made there. Treating an ordinal variable as continuous allows you to estimate the linear

component of the relationship, as recommended by Moses et al. (1984). On the other hand, treating an

ordinal variable as discrete allows you to capture much more complicated relationships. It seems

worthwhile to consider both aspects of the variable.
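
One way to look at both aspects at once is sketched below. This is a minimal sketch rather than code taken from this paper: the data set STUDY, the outcome Y3, and the ordinal predictor SEVERITY (assumed to be coded 1 to 5) are hypothetical names. A copy of the ordinal variable goes in the CLASS statement and is entered after the continuous version, so the sequential (Type I) sums of squares separate the linear component from the remaining departure from linearity.

```sas
/* Hypothetical data: WORK.STUDY with outcome Y3 and ordinal SEVERITY coded 1-5 */
data study2;
   set study;
   severity_c = severity;   /* copy of the ordinal variable, to be used in CLASS */
run;

proc glm data=study2;
   class severity_c;
   /* SEVERITY enters first as continuous (1 df, linear component);     */
   /* SEVERITY_C then absorbs whatever the straight line does not fit.  */
   model y3 = severity severity_c / ss1 solution;
run;
quit;
```

In the Type I table, the line for SEVERITY tests the linear trend and the line for SEVERITY_C tests the deviation from linearity. Some of the SEVERITY_C parameters will be flagged as not uniquely estimable, which is expected because the two versions of the variable overlap.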

A WORD ABOUT BINARY VARIABLES

Binary variables take on exactly two values, such as 0 and 1 or True and False or Male and Female. For

analysis purposes, they can be treated as continuous or discrete. Because you generally get an

equivalent model whether or not the binary variable is in the CLASS statement, it is easy to get lazy

about considering the implications. In fact, whether a binary variable is treated as continuous or discrete

affects the parameterization of models and therefore the interpretation of results and computational

algorithms. It is especially important to remember whether binary variables are continuous or discrete when

interpreting least squares means (LSMEANS). Generally, my recommendation is to treat binary

variables as discrete (include them in the CLASS statement), but sometimes you should treat them as

continuous.
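
The two treatments can be compared side by side, as in the following sketch. The data set STUDY and the variables Y3, RACE, and SEX (assumed to be coded 0/1) are hypothetical names used only for illustration. The fitted values are the same in both runs; what changes is the sign and meaning of the SEX parameter and the way LSMEANS for RACE handles SEX.

```sas
/* Hypothetical data: WORK.STUDY with outcome Y3, RACE (several levels), SEX coded 0/1 */

/* SEX as continuous: the coefficient is the difference SEX=1 minus SEX=0 */
proc glm data=study;
   class race;
   model y3 = race sex / solution;
   lsmeans race;   /* evaluated at the mean of SEX, i.e. the observed proportion with SEX=1 */
run;
quit;

/* SEX as discrete: the last level (SEX=1) is the reference,               */
/* so the reported parameter for SEX=0 is the difference SEX=0 minus SEX=1 */
proc glm data=study;
   class race sex;
   model y3 = race sex / solution;
   lsmeans race;   /* by default averages the two SEX levels with equal weight */
run;
quit;
```

If the two SEX groups are of unequal size, the LSMEANS for RACE from the two runs will generally differ even though the underlying models are equivalent.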

PARAMETERIZATIONS

Before getting into models that include both discrete and continuous variables and, more interestingly,

their interactions, it is important to understand the way that models are parameterized in SAS®. This

includes an understanding of the CLASS statement and both the default and the alternative

parameterizations available. This material is covered in numerous places, including several of my papers

from previous conferences (Pritchard and Pasta 2004; Pasta 2005; Pasta 2009; Pasta 2010).

Statistics and Data Analysis

SAS Global Forum 2013

One parameterization for discrete variables is the “less than full rank” approach in which dummy variables

(indicator variables) are created for each category. This parameterization, also called the GLM

parameterization, includes all the dummy variables but recognizes that there are redundancies and uses

appropriate computational methods such as generalized inverses to obtain parameter estimates. The last

category (as ordered using the formatted value) ends up as the reference category. To change the

reference category it is necessary to reorder the categories of the variable.
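
One common way to reorder the categories is to attach a format whose labels sort in the desired order. The sketch below assumes a hypothetical data set STUDY in which RACE is numeric with codes 1, 2, and 3; the format name RACEFMT and the labels are likewise invented for illustration.

```sas
/* Hypothetical coding: RACE = 1 (Black), 2 (Other), 3 (White) */
proc format;
   value racefmt 1 = '3: Black'    /* sorts last, so it becomes the reference */
                 2 = '1: Other'
                 3 = '2: White';
run;

proc glm data=study;
   class race;              /* levels are ordered by formatted value by default */
   format race racefmt.;
   model y3 = race / solution;
run;
quit;
```

In the SOLUTION output, the parameter for the last formatted level is set to zero, and the other RACE parameters are differences from that level.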

It is now possible to specify the parameterization you want to use on the CLASS statement (but be aware

that support for this varies by procedure and by the version of SAS you are using). You can

specify REFERENCE coding, which allows you to specify a reference category which is omitted from the

design matrix in various convenient ways. Alternatively you can specify EFFECT coding, which

effectively compares each category to the overall average rather than to a single category, although there

is still an omitted category that you can specify. My experience is that people find EFFECT coding rather

confusing at first, so I recommend the use of REFERENCE coding. Note that LOGISTIC now uses

EFFECT coding by default. You can specify different coding for different variables and different reference

categories (the default is LAST), making it much easier to manipulate the parameterization of discrete

variables.
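
As a brief illustration of REFERENCE coding, the sketch below uses PROC LOGISTIC with a different reference category for each variable. The data set STUDY, the binary outcome READMIT (assumed to be coded 0/1), and the predictors RACE, SEX, and AGE are hypothetical names, and the REF='White' value assumes that 'White' is one of the formatted values of RACE.

```sas
/* Hypothetical data: WORK.STUDY with binary outcome READMIT (0/1) */
proc logistic data=study;
   class race (param=ref ref='White')   /* named reference category          */
         sex  (param=ref ref=first);    /* first ordered level as reference  */
   model readmit(event='1') = race sex age;
run;
```

If the same coding applies to every CLASS variable, it can instead be given once as a global option after a slash, for example CLASS RACE SEX / PARAM=REF.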

When a model includes discrete variables, the parameter estimates are often difficult to interpret and the

test that they are zero may not be of interest. The LSMEANS statement allows the calculation of least

squares means, also called adjusted means, for the values of a variable (or of interactions among

discrete variables). There are also options to compare least squares means with or without adjustments

for multiple comparisons. One of the things to pay attention to when a model includes more than one

predictor variable is whether to specify the OBSMARGINS option (abbreviated OM) on LSMEANS. This

option causes the LSMEANS to use the observed marginal distributions of the variable rather than using

equal coefficients across classification effects (thereby assuming balance among the levels). Sometimes

you want one version and sometimes you want the other, but in my work I generally find that

OBSMARGINS more often gives me the LSMEANS I want. The issue of estimability also arises

(assuming the model is less than full rank). It is quite possible for the LSMEANS to be nonestimable with