Paper 422-
Being Continuously Discrete (or Discretely Continuous):
Understanding Models with Continuous and Discrete Predictors and
Testing Associated Hypotheses
David J. Pasta, ICON Late Phase & Outcomes Research, San Francisco, CA
ABSTRACT
Often a general (or generalized) linear model has both discrete predictors (included in the CLASS
statement) and continuous predictors. Binary variables can be treated either as continuous or discrete;
the resulting models are equivalent but the interpretation of the parameters differs. In many cases,
interactions between discrete and continuous variables are of interest. This paper provides practical
suggestions for building and interpreting models with both continuous and discrete predictors. It includes
some examples of the use of the STORE statement and PROC PLM to understand models and test
hypotheses without repeating the estimation step.
INTRODUCTION
First let us be clear on what we mean by discrete and continuous predictors. A discrete (or categorical)
predictor is one which is included in the CLASS statement. I use the terms discrete and categorical
interchangeably in this context. The individual values are not assumed to have any particular relationship
to each other: they are treated as just “names” for the categories and are not to be interpreted
quantitatively even if they are numbers. Variables for which the categories are considered to be names
without even a partial ordering are referred to as being nominal variables; I prefer to use the terms
discrete or categorical to emphasize the way they are being used in the model rather than the nature of
the variable itself. It is possible for the underlying variable to be continuous, but what is important for our
purposes is that we want to estimate the effect of each value separately and not to assume specific
spacing between values.
A continuous predictor is one for which the numeric values are treated as meaningful and the estimated
coefficient is interpreted as the effect of a one-unit change. In practice, ordinal variables can be treated
as discrete or as continuous (and sometimes profitably as both discrete and continuous in the same
analysis). In addition, continuous variables can be grouped into categories and converted into discrete
variables. This issue is discussed at length in Pasta (2009), but it is worthwhile to summarize a few of the
points made there. Treating an ordinal variable as continuous allows you to estimate the linear
component of the relationship, as recommended by Moses et al. (1984). On the other hand, treating an
ordinal variable as discrete allows you to capture much more complicated relationships. It seems
worthwhile to consider both aspects of the variable.
A WORD ABOUT BINARY VARIABLES
Binary variables take on exactly two values, such as 0 and 1 or True and False or Male and Female. For
analysis purposes, they can be treated as continuous or discrete. Because you generally get an
equivalent model whether or not the binary variable is in the CLASS statement, it is easy to get lazy
about considering the implications. In fact, whether a binary variable is treated as continuous or discrete
affects the parameterization of the model and therefore the interpretation of results and the computational
algorithms. It is especially important to remember whether binary variables are continuous or discrete when
interpreting least squares means (LSMEANS). Generally, my recommendation is to treat binary
variables as discrete (include them in the CLASS statement), but sometimes you should treat them as
continuous.
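As a minimal sketch of the two choices (the dataset and variable names here are hypothetical, not from the paper's examples), the same 0/1 variable can be fit both ways:

```sas
/* Hypothetical sketch: the same binary predictor fit both ways.
   The two fits are equivalent; only the parameterization differs. */
proc glm data=anal;
  class female;                  /* discrete: FEMALE in the CLASS statement */
  model y = female / solution;
run;

proc glm data=anal;
  model y = female / solution;   /* continuous: coefficient = effect of a one-unit change */
run;
```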
PARAMETERIZATIONS
Before getting into models that include both discrete and continuous variables and, more interestingly,
their interactions, it is important to understand the way that models are parameterized in SAS®. This
includes an understanding of the CLASS statement and both the default and the alternative
parameterizations available. This material is covered in numerous places, including several of my papers
from previous conferences (Pritchard and Pasta 2004; Pasta 2005; Pasta 2009; Pasta 2010).
One parameterization for discrete variables is the “less than full rank” approach in which dummy variables
(indicator variables) are created for each category. This parameterization, also called the GLM
parameterization, includes all the dummy variables but recognizes that there are redundancies and uses
appropriate computational methods such as generalized inverses to obtain parameter estimates. The last
category (as ordered using the formatted value) ends up as the reference category. To change the
reference category it is necessary to reorder the categories of the variable.
It is now possible to specify the parameterization you want to use on the CLASS statement (but be aware
that which procedures support this approach depends on which version of SAS you are using). You can
specify REFERENCE coding, which allows you to specify a reference category which is omitted from the
design matrix in various convenient ways. Alternatively you can specify EFFECT coding, which
effectively compares each category to the overall average rather than to a single category, although there
is still an omitted category that you can specify. My experience is that people find EFFECT coding rather
confusing at first, so I recommend the use of REFERENCE coding. Note that LOGISTIC now uses
EFFECT coding by default. You can specify different coding for different variables and different reference
categories (the default is LAST), making it much easier to manipulate the parameterization of discrete
variables.
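As a sketch of these options using PROC LOGISTIC syntax (the dataset and variable names are assumptions, and exact option support varies by procedure and SAS version):

```sas
/* Explicit coding choices on the CLASS statement:
   reference-cell coding with WHITE as the omitted category */
proc logistic data=anal;
  class race(ref='White') / param=reference;
  model y01(event='1') = race age;
run;
```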
LEAST SQUARES MEANS AND THE OBSMARGINS OPTION
When a model includes discrete variables, the parameter estimates are often difficult to interpret and the
test that they are zero may not be of interest. The LSMEANS statement allows the calculation of least
squares means, also called adjusted means, for the values of a variable (or of interactions among
discrete variables). There are also options to compare least squares means with or without adjustments
for multiple comparisons. One of the things to pay attention to when a model includes more than one
predictor variable is whether to specify the OBSMARGINS option (abbreviated OM) on LSMEANS. This
option causes the LSMEANS to use the observed marginal distributions of the variable rather than using
equal coefficients across classification effects (thereby assuming balance among the levels). Sometimes
you want one version and sometimes you want the other, but in my work I generally find that
OBSMARGINS more often gives me the LSMEANS I want. The issue of estimability also arises
(assuming the model is less than full rank). It is quite possible for the LSMEANS to be nonestimable with
the OM option but estimable without, or vice versa. Some time spent understanding the model, together
with some tools that SAS provides, make the determination of estimability less mysterious. See Pasta
(2010) for additional details.
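Requesting both flavors side by side makes the comparison easy; a sketch with hypothetical dataset and variable names:

```sas
/* Least squares means for RACE under both weighting schemes */
proc glm data=anal;
  class race sex;
  model y = race sex / solution;
  lsmeans race / stderr;      /* equal weights across the levels of SEX */
  lsmeans race / stderr om;   /* weights from the observed margins of SEX */
run;
```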
AN EXAMPLE: AN ORDINAL VARIABLE AS DISCRETE OR CONTINUOUS OR BOTH
An ordinal variable might be treated as discrete or continuous. It can also be profitably treated as both
discrete and continuous in the same model. This approach can be used to test deviations from linearity.
Let’s start by considering an ordinal variable, EDUCAT, which measures years of education in categories,
as a categorical (discrete) predictor of our dependent variable Y:
Code for GLM

proc glm data=anal;
  class educat;
  model y = educat / solution;
  title1 'EDUCAT categorical with typical labels';
run;
EDUCAT categorical with typical labels

Dependent Variable: y

                                 Sum of
Source            DF            Squares    Mean Square   F Value   Pr > F
Model              4         21702.2880      5425.5720      2.24   0.
Error             95        230398.3776      2425.
Corrected Total   99        252100.

R-Square   Coeff Var   Root MSE   y Mean
0.086086    33.46631   49.24679   147.
EDUCAT continuous

Dependent Variable: y

                                 Sum of
Source            DF            Squares    Mean Square   F Value   Pr > F
Model              1         10457.6803     10457.6803      4.24   0.
Error             98        241642.9853      2465.
Corrected Total   99        252100.

R-Square   Coeff Var   Root MSE   y Mean
0.041482    33.74458   49.65627   147.

Source      DF     Type I SS   Mean Square   F Value   Pr > F
l_educat     1   10457.68028   10457.68028      4.24   0.

Source      DF   Type III SS   Mean Square   F Value   Pr > F
l_educat     1   10457.68028   10457.68028      4.24   0.

                                 Standard
Parameter      Estimate             Error   t Value   Pr > |t|
Intercept   121.1955784       13.54728807      8.95   <.
l_educat      8.4005599        4.07910241      2.06   0.
That gave us a p-value of 0.042, so we have a statistically significant linear trend. But are the deviations
from linearity statistically significant? This is the moment we've been waiting for!
How do you go about testing for deviations from linearity? It's actually pretty easy, but it leads to output
that people find a little odd-looking at first. For any ordinal variable, (1) put the ordinal variable in the
CLASS statement, (2) make an exact copy that will not be in the CLASS statement, and (3) include both
variables in the MODEL statement. Here, we have EDUCAT with K=5 categories. We created
L_EDUCAT (L for Linear) as an exact copy, and include both in the model. What happens? L_EDUCAT
will have 0 degrees of freedom and 0 Type III effect (it doesn't add any information after the categorical
EDUCAT is included). EDUCAT will be a test of deviations from linearity with K-2=3 degrees of freedom;
1 df is lost to the overall constant, and 1 df is lost to the linear effect L_EDUCAT. There are some details
to watch out for, best expressed by looking at some SAS output.
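In code, the three steps can be sketched like this (the DATA step and dataset names are assumptions, not shown in the paper):

```sas
/* Step (2): create an exact copy of EDUCAT that stays out of the CLASS statement */
data anal2;
  set anal;
  l_educat = educat;   /* L for Linear */
run;

/* Steps (1) and (3): both copies in the MODEL, only one in the CLASS statement.
   EDUCAT's Type III test becomes a K-2 df test of deviations from linearity. */
proc glm data=anal2;
  class educat;
  model y = l_educat educat / solution;
run;
```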
EDUCAT continuous and categorical

Dependent Variable: y

                                 Sum of
Source            DF            Squares    Mean Square   F Value   Pr > F
Model              4         21702.2880      5425.5720      2.24   0.
Error             95        230398.3776      2425.
Corrected Total   99        252100.

R-Square   Coeff Var   Root MSE   y Mean
0.086086    33.46631   49.24679   147.

Source      DF     Type I SS   Mean Square   F Value   Pr > F
l_educat     1   10457.68028   10457.68028      4.31   0.
educat       3   11244.60769    3748.20256      1.55   0.

Source      DF   Type III SS   Mean Square   F Value   Pr > F
l_educat     0             0             .         .   .
educat       3   11244.60769    3748.20256      1.55   0.

                                                 Standard
Parameter                   Estimate                Error   t Value   Pr > |t|
Intercept                227.8448953 B       73.98734261       3.08   0.
l_educat                 -14.0213477 B       16.64845280      -0.84   0.
educat 1 less than HS    -74.5544302 B       59.07208796      -1.26   0.
educat 2 HS grad         -65.4997931 B       42.86841897      -1.53   0.
educat 3 some college    -49.1245137 B       26.32351518      -1.87   0.
educat 4 college grad      0.0000000 B                 .          .   .
educat 5 post college      0.0000000 B                 .          .   .
The overall p-value is the same as it was originally (0.071). As promised, the L_EDUCAT variable has 0
degrees of freedom in the Type III section. The EDUCAT variable has 3 degrees of freedom and a
p-value of 0.21, indicating a lack of statistical significance. That is, the deviations from linearity are
nonsignificant. I will leave it as an exercise for the reader to figure out how to manipulate the parameter
estimates from this run to get the values from the first two. It's a good way to make sure you understand
what SAS is doing.
BINARY VARIABLES – DISCRETE OR CONTINUOUS?
Binary variables might be treated as discrete or continuous. What difference does it make? Let’s take a
look at a fairly simple example. We have a dependent measure Y3 that we want to predict using RACE
(which takes on values BLACK, HISPANIC, and WHITE) and SEX (which takes on values FEMALE and
MALE). First look at a main-effects model with both RACE and SEX discrete. We look at the LSMEANS
both with the default weighting (equal weighting) and with OBSMARGINS specified (abbreviated OM).
Code for GLM with discrete SEX

proc glm data=analsub;
  class sex race;
  model y3 = race sex / solution;
  lsmeans race sex / stderr;
  lsmeans race sex / stderr om;
  title3 'y3 = race sex (discrete) without and with OM';
run;
Dependent Variable: y3

                                 Sum of
Source            DF            Squares   Mean Square   F Value   Pr > F
Model              3         228506.315     76168.772     30.42   <.
Error            835        2090548.362      2503.
Corrected Total  838        2319054.

R-Square   Coeff Var   Root MSE   y3 Mean
0.098534    29.44697   50.03649   169.

Source    DF     Type I SS   Mean Square   F Value   Pr > F
race       2    72961.0062    36480.5031     14.57   <.
sex        1   155545.3085   155545.3085     62.13   <.

Source    DF   Type III SS   Mean Square   F Value   Pr > F
race       2    81271.8507    40635.9253     16.23   <.
sex        1   155545.3085   155545.3085     62.13   <.
                                  Standard
Parameter          Estimate          Error   t Value   Pr > |t|
Intercept       113.1933804 B   6.86769622     16.48   <.
race Black       26.7820211 B   4.72268870      5.67   <.
race Hispanic     7.9571591 B   4.88460275      1.63   0.
race White        0.0000000 B            .         .   .
sex              29.9219029     3.79618722      7.88   <.

Least Squares Means [without OM]

                           Standard
race         y3 LSMEAN        Error   Pr > |t|
Black       191.010256     4.230100   <.
Hispanic    172.185394     4.409031   <.
White       164.228235     2.096816   <.

Least Squares Means [with OM]

                           Standard
race         y3 LSMEAN        Error   Pr > |t|
Black       191.010256     4.230100   <.
Hispanic    172.185394     4.409031   <.
White       164.228235     2.096816   <.
It appears that SAS has ignored our OBSMARGINS (OM) specification. Actually, because SAS sets
continuous variables at their mean when calculating the least squares means for discrete variables,
binary variables treated as continuous act similarly to having specified the OBSMARGINS option. That is,
instead of assuming the binary variable is balanced (half the cases at one value and half at the other
value), SAS uses the observed mean value. If we want to imitate the effect of omitting OM, we could
calculate least squares means with the binary variables set to the average of the two values (0.5 if it is
coded 0 and 1).
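One way to carry out that calculation is an ESTIMATE statement with the binary variable fixed at 0.5; a sketch assuming SEX is coded 0/1 and kept out of the CLASS statement:

```sas
proc glm data=analsub;
  class race;
  model y3 = race sex / solution;
  /* Imitates the no-OM least squares mean for Black:
     SEX is set to 0.5 rather than to its observed mean */
  estimate 'Black, sexes weighted equally'
    intercept 1 race 1 0 0 sex 0.5;
run;
```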
INTERACTING DISCRETE AND BINARY VARIABLES
When we interact binary variables that are treated as continuous with discrete variables, we lose some of
the power of the LSMEANS statement. For example, if we add in the RACE*SEX interaction with SEX
continuous, we cannot get the least squares means for the various combinations of RACE and SEX.
Code for GLM

proc glm data=analsub;
  class race;
  model y3 = race sex race*sex / solution;
  lsmeans race / stderr;
  lsmeans race / stderr om;
  title3 'y3 = race sex (continuous) race*sex without and with OM';
run;
Dependent Variable: y3

                                 Sum of
Source            DF            Squares   Mean Square   F Value   Pr > F
Model              5         273774.590     54754.918     22.30   <.
Error            833        2045280.087      2455.
Corrected Total  838        2319054.

R-Square   Coeff Var   Root MSE   y3 Mean
0.118054    29.16135   49.55117   169.

Source      DF     Type I SS   Mean Square   F Value   Pr > F
race         2    72961.0062    36480.5031     14.86   <.
sex          1   155545.3085   155545.3085     63.35   <.
sex*race     2    45268.2749    22634.1375      9.22   0.

Source      DF   Type III SS   Mean Square   F Value   Pr > F
race         2    17870.4231     8935.2116      3.64   0.
sex          1   161037.6917   161037.6917     65.59   <.
sex*race     2    45268.2749    22634.1375      9.22   0.

                                      Standard
Parameter               Estimate         Error   t Value   Pr > |t|
Intercept            126.5119270 B  8.25345173     15.33   <.
race Black           -45.1835486 B 17.66961884     -2.56   0.
race Hispanic          5.1670020 B 17.89458887      0.29   0.
race White             0.0000000 B           .         .   .
sex                   22.1911783 B  4.63675523      4.79   <.
sex*race Black        42.6693767 B 10.09492850      4.23   <.
sex*race Hispanic      1.3841721 B 10.30476992      0.13   0.
sex*race White         0.0000000 B           .         .   .

Least Squares Means [without OM]

                           Standard
race         y3 LSMEAN        Error   Pr > |t|
Black       191.954665     4.194847   <.
Hispanic    171.889092     4.383842   <.
White       164.361243     2.077003   <.

Least Squares Means [with OM]

                           Standard
race         y3 LSMEAN        Error   Pr > |t|
Black       191.954665     4.194847   <.
Hispanic    171.889092     4.383842   <.
White       164.361243     2.077003   <.
When we treat SEX as discrete, we can get more least squares means and the OBSMARGINS option
has an effect. The model is equivalent but is parameterized rather differently.
Code for GLM

proc glm data=analsub;
  class sex race;
  model y3 = race sex race*sex / solution;
  lsmeans race sex race*sex / stderr;
  lsmeans race sex race*sex / stderr om;
  title3 'y3 = race sex (discrete) race*sex without and with OM';
run;
Least Squares Means [with OM]

race        y3 LSMEAN
Black       Non-est
Hispanic    Non-est
White       Non-est

sex         y3 LSMEAN
Female      Non-est
Male        Non-est

                                    Standard
sex      race        y3 LSMEAN         Error   Pr > |t|
Female   Black      146.188933      7.386652   <.
Female   Hispanic   155.254279      7.470120   <.
Female   White      148.703105      3.942079   <.
Male     Black      211.049488      5.083843   <.
Male     Hispanic   178.829630      5.374579   <.
Male     White      170.894284      2.441211   <.
However, when we specify OM the least squares means for RACE and SEX are non-estimable. It turns
out the marginal means for RACE and SEX are non-estimable because the distribution of RACE is
different for males and females (or, equivalently, the proportion of males and females varies by race). For
more on this, see Pasta (2010). But how did we get RACE least squares means when we treated SEX as
continuous (which is similar to using the OM option)? LSMEANS sets all the continuous variables at their
grand mean (not their subgroup mean), so the overall mean of the continuous variable SEX is used. One
could do the equivalent thing when treating SEX as discrete by using an ESTIMATE statement and
specifying the coefficients for SEX as the overall proportions for male and female.
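For example, if the sample were 35% female and 65% male (illustrative values only, not from the paper's data), the ESTIMATE statement might look like the following; the E option is worth adding to verify that the coefficient ordering matches what SAS expects:

```sas
proc glm data=analsub;
  class sex race;
  model y3 = race sex sex*race / solution;
  /* Black least squares mean at the observed sex mix
     (the 0.35/0.65 proportions are made up for illustration) */
  estimate 'Black at observed sex proportions'
    intercept 1 race 1 0 0 sex 0.35 0.65
    sex*race 0.35 0 0 0.65 0 0 / e;
run;
```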
INTERACTIONS BETWEEN DISCRETE AND CONTINUOUS VARIABLES
When you have a mix of discrete and continuous variables, it is sometimes quite handy to include
interactions without main effects. Consider a discrete variable A and a continuous variable X. If you
include A, X, and A*X you get a test of the interaction between A and X which essentially asks whether
the effect of X is parallel for the different levels of A. If you find the interaction is statistically significant, it
may be easier to interpret the SOLUTION if you model A and A*X. Then the parameters estimated for
A*X are the slopes for each of the individual levels of A rather than deviations from the slope of the
reference category of A (which is what you get when you model A, X, and A*X).
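In GLM terms, the two parameterizations look like this (A, X, and the dataset are generic placeholders):

```sas
/* Deviation form: each A*X coefficient is the difference from the
   slope of X in the reference level of A */
proc glm data=anal;
  class a;
  model y = a x a*x / solution;
run;

/* Nested-slopes form: each A*X coefficient is the slope of X
   within that level of A */
proc glm data=anal;
  class a;
  model y = a a*x / solution;
run;
```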
With a more complicated high-order interaction between two discrete variables and one continuous
variable, such as A*B*X, it is almost always a good idea to include A*B in the model (this allows the effect
of X to vary across the various levels of A*B without any imposed structure). However, it may not be
especially useful to include A*X and B*X as separate terms once you have established that A*B*X is
significant in the presence of those terms. It's easier to see the results if you have just A, B, A*B, and
A*B*X as the terms in the model.
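A sketch of that reduced three-way model (generic names, not the paper's data):

```sas
/* Slopes of X free to differ across every A*B cell, with no
   separate A*X or B*X terms to untangle */
proc glm data=anal;
  class a b;
  model y = a b a*b a*b*x / solution;
run;
```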
Here’s an extended example, as usual using simulated data. First we run a model with just the discrete
variables. We find we have a sex*race interaction.
Source      DF   Type III SS   Mean Square   F Value   Pr > F
race         2    35171.7257    17585.8629      7.16   0.
sex          1   161037.6917   161037.6917     65.59   <.
sex*race     2    45268.2749    22634.1375      9.22   0.
The SOLUTION is as usual a bit hard to read:
                                             Standard
Parameter                      Estimate         Error   t Value   Pr > |t|
Intercept                   170.8942836 B  2.44121082     70.00   <.
race Black                   40.1552047 B  5.63958911      7.12   <.
race Hispanic                 7.9353461 B  5.90301678      1.34   0.
race White                    0.0000000 B           .         .   .
sex Female                  -22.1911783 B  4.63675523     -4.79   <.
sex Male                      0.0000000 B           .         .   .
sex*race Female Black       -42.6693767 B 10.09492850     -4.23   <.
sex*race Female Hispanic     -1.3841721 B 10.30476992     -0.13   0.
sex*race Female White         0.0000000 B           .         .   .
sex*race Male Black           0.0000000 B           .         .   .
sex*race Male Hispanic        0.0000000 B           .         .   .
sex*race Male White           0.0000000 B           .         .   .
However, the LSMEANS make it easier to see what is going on. These are the same whether you
specify OM or not because these are the fully interacted values so we’re just fitting the marginal means.
The LSMEANS for SEX and RACE are also available without the OM option but if you specify OM they
are non-estimable for the reasons given above.
                                    Standard
sex      race        y3 LSMEAN         Error   Pr > |t|
Female   Black      146.188933      7.386652   <.
Female   Hispanic   155.254279      7.470120   <.
Female   White      148.703105      3.942079   <.
Male     Black      211.049488      5.083843   <.
Male     Hispanic   178.829630      5.374579   <.
Male     White      170.894284      2.441211   <.
Now let’s introduce the continuous variable, EDUYRS.
Source      DF   Type III SS   Mean Square   F Value   Pr > F
race         2    35558.6216    17779.3108      7.35   0.
sex          1   161148.5447   161148.5447     66.65   <.
sex*race     2    49930.0508    24965.0254     10.33   <.
eduyrs       1    33688.3103    33688.3103     13.93   0.
It definitely has an effect – what about interactions? Start with a full model.
Source             DF   Type III SS   Mean Square   F Value   Pr > F
race                2    3847.52395    1923.76198      0.81   0.
sex                 1    7063.70798    7063.70798      2.97   0.
sex*race            2    3756.39302    1878.19651      0.79   0.
eduyrs              1   11307.44124   11307.44124      4.75   0.
eduyrs*race         2    4617.62450    2308.81225      0.97   0.
eduyrs*sex          1   29523.41710   29523.41710     12.40   0.
eduyrs*sex*race     2   10014.48954    5007.24477      2.10   0.
                                             Standard
Parameter                      Estimate         Error   t Value   Pr > |t|
Intercept                   129.9580466 B  9.19099945     14.14   <.
race Black                   41.3190203 B  5.57974484      7.41   <.
race Hispanic                12.0077010 B  5.90073985      2.03   0.
race White                    0.0000000 B           .         .   .
sex Female                   28.5285226 B 18.42262692      1.55   0.
sex Male                      0.0000000 B           .         .   .
sex*race Female Black       -43.2573867 B 10.02230453     -4.32   <.
sex*race Female Hispanic     -5.9127486 B 10.24861249     -0.58   0.
sex*race Female White         0.0000000 B           .         .   .
sex*race Male Black           0.0000000 B           .         .   .
sex*race Male Hispanic        0.0000000 B           .         .   .
sex*race Male White           0.0000000 B           .         .   .
eduyrs*sex Female            -0.7149802    1.13153942     -0.63   0.
eduyrs*sex Male               2.8537614    0.61825338      4.62   <.
We have combined the tests for EDUYRS into a single test with 2 degrees of freedom. More importantly,
we now can easily read the coefficients for the slope of EDUYRS for Females and Males. Compare the
two sets of estimates associated with EDUYRS:
                                       Standard
Parameter                Estimate         Error   t Value   Pr > |t|
eduyrs                  2.8537614 B  0.61825338      4.62   <.
eduyrs*sex Female      -3.5687416 B  1.28942573     -2.77   0.
eduyrs*sex Male         0.0000000 B           .         .   .

                                       Standard
Parameter                Estimate         Error   t Value   Pr > |t|
eduyrs*sex Female      -0.7149802    1.13153942     -0.63   0.
eduyrs*sex Male         2.8537614    0.61825338      4.62   <.
For the first set of values the coefficient for EDUYRS is the estimate for Males and the second one is the
difference Females minus Males. That provides a convenient test of the difference in slopes, which has
t=-2.77 and P=0.0058. This corresponds to the F test in the ANOVA table with F=7.66 and P=0.0058. It
is in fact the equivalent test (the square of a t is an F). The second set of values gives us the actual
estimates for the slope of EDUYRS for Female and for Male and tests each against zero. The test for
Female is nonsignificant, so under some circumstances it might be worth treating it as zero and including
the EDUYRS effect only for males. This would produce a slightly different overall model with 6 instead of
7 parameters. It turns out this matches the model used to create the data.
Source         DF   Type III SS   Mean Square   F Value   Pr > F
race            2   38797.68438   19398.84219      8.09   0.
sex             1     139.22315     139.22315      0.06   0.
sex*race        2   46624.30159   23312.15080      9.73   <.
male*eduyrs     1   51104.14833   51104.14833     21.32   <.

                                             Standard
Parameter                      Estimate         Error   t Value   Pr > |t|
Intercept                   129.9580466 B  9.18768067     14.14   <.
race Black                   41.3190203 B  5.57773004      7.41   <.
race Hispanic                12.0077010 B  5.89860914      2.04   0.
race White                    0.0000000 B           .         .   .
sex Female                   18.7450587 B  9.97914766      1.88   0.
sex Male                      0.0000000 B           .         .   .
sex*race Female Black       -43.8331922 B  9.97718550     -4.39   <.
sex*race Female Hispanic     -5.4565270 B 10.21945577     -0.53   0.
sex*race Female White         0.0000000 B           .         .   .
sex*race Male Black           0.0000000 B           .         .   .
sex*race Male Hispanic        0.0000000 B           .         .   .
sex*race Male White           0.0000000 B           .         .   .
male*eduyrs                   2.8537614    0.61803013      4.62   <.
One note of caution. Be careful about using two names for related variables – you can confuse SAS if
you’re not careful (and get wrong answers). Here we use MALE as a zero-one dummy variable that is
essentially the same as the SEX variable. It turns out everything is fine here, but if we were to mix the
two variables in constructing interactions we could get into trouble.
USING THE STORE STATEMENT AND PROC PLM TO EXPLORE YOUR MODELS
Recently SAS added the STORE statement to a variety of procedures and the associated PROC PLM.
STORE allows you to specify a location to save an “item store,” which is a special SAS file containing the
context and results of the statistical analysis. You can then use PROC PLM to manipulate the results
from that analysis without having to repeat the calculations necessary to fit the model. For example,
suppose you run a complex analysis and are happy with the results (no doubt after several tries) and
save the results. You can then retrieve the item store and examine the model in various ways, for
example specifying LSMEANS statements and a variety of ESTIMATE or CONTRAST statements. You
can use ODS Graphics to look at graphical displays of the fitted model. They are perfectly easy to use
and I believe that using STORE should become standard practice.
Here is a very simple example from the models we were just fitting. First here is the code where we use
STORE to save the item store:
Code for GLM with STORE

proc glm data=analsub;
  class sex race;
  model y3 = race sex race*sex eduyrs sex*eduyrs / solution;
  store s_mod01;
  title3 'mod01: y3 = ... eduyrs sex*eduyrs';
run;

proc glm data=analsub;
  class sex race;
  model y3 = race sex race*sex sex*eduyrs / solution;
  store s_mod02;
  title3 'mod02: y3 = ... sex*eduyrs';
run;

proc glm data=analsub;
  class sex race;
  model y3 = race sex race*sex male*eduyrs sex*eduyrs / solution;
  store s_mod03;
  title3 'mod03: y3 = ... male*eduyrs sex*eduyrs';
run;

proc glm data=analsub;
  class sex race;
  model y3 = race sex race*sex male*eduyrs / solution;
  store s_mod04;
  title3 'mod04: y3 = ... eduyrs*male';
run;
Now we can use PROC PLM to look at the models. First consider the LSMEANS statement. If we want
to use the OBSMARGINS option, we need to point SAS to the original dataset from which the “observed”
margins should be obtained. That is not part of the item store. We use the E option on LSMEANS to find
out what the weights are that SAS is using for the LSMEANS.
sex Least Squares Means

                                       Standard
sex      Margins          Estimate        Error   DF   t Value   Pr > |t|
Female   WORK.ANALSUB      Non-est            .    .         .   .
Male     WORK.ANALSUB      Non-est            .    .         .   .

Coefficients for sex*race Least Squares Means Using WORK.ANALSUB Margins

Effect        sex      race         Row1     Row2     Row3     Row4     Row5     Row6
Intercept                              1        1        1        1        1        1
race                   Black           1                          1
race                   Hispanic                 1                          1
race                   White                             1                          1
sex           Female                   1        1        1
sex           Male                                                1        1        1
sex*race      Female   Black           1
sex*race      Female   Hispanic                 1
sex*race      Female   White                             1
sex*race      Male     Black                                      1
sex*race      Male     Hispanic                                            1
sex*race      Male     White                                                        1
eduyrs                            13.969   13.969   13.969   13.969   13.969   13.969
eduyrs*sex    Female              13.969   13.969   13.969
eduyrs*sex    Male                                           13.969   13.969   13.969

sex*race Least Squares Means

                                                  Standard
sex      race        Margins         Estimate        Error    DF   t Value   Pr > |t|
Female   Black       WORK.ANALSUB      146.56       7.3245   831     20.01   <.
Female   Hispanic    WORK.ANALSUB      154.59       7.4569   831     20.73   <.
Female   White       WORK.ANALSUB      148.50       3.9096   831     37.98   <.
Male     Black       WORK.ANALSUB      211.14       5.0248   831     42.02   <.
Male     Hispanic    WORK.ANALSUB      181.83       5.3517   831     33.98   <.
Male     White       WORK.ANALSUB      169.82       2.4240   831     70.06   <.
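Output like the above can come from an LSMEANS statement in PROC PLM; OBSMARGINS= on the PLM LSMEANS statement names the data set supplying the margins (check the documentation for your SAS version). A sketch:

```sas
/* Least squares means from the stored model, with margins taken
   from the original data set; the E option prints the coefficient
   vectors SAS constructs for each LSMEAN */
proc plm source=s_mod01;
  lsmeans sex sex*race / obsmargins=analsub e;
run;
```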
Let’s say we’re interested in the estimates for white males and females. We can mimic the LSMEANS
values in an ESTIMATE statement by using the mean of EDUYRS, which we can see from the LSMEANS
output is 13.969. (We could also calculate this and obtain it to greater accuracy, but 13.969 is close
enough.) It is worth doing so just to make sure your code is right. For more on coding ESTIMATE
statements, see my earlier papers. But suppose we’re actually interested in the estimates for a high
school graduate, with EDUYRS=12. Here’s how we can do that with PROC PLM:
Code for PLM

proc plm source=s_mod01;
  estimate 'mod01 at 13.969 white female'
    intercept 1 race 0 0 1 sex 1 0
    sex*race 0 0 1 0 0 0
    eduyrs 13.969 sex*eduyrs 13.969 0;
  estimate 'mod01 at 13.969 white male'
    intercept 1 race 0 0 1 sex 0 1
    sex*race 0 0 0 0 0 1
    eduyrs 13.969 sex*eduyrs 0 13.969;
  estimate 'mod01 at 12 white female'
    intercept 1 race 0 0 1 sex 1 0
    sex*race 0 0 1 0 0 0
    eduyrs 12 sex*eduyrs 12 0;
  estimate 'mod01 at 12 white male'
    intercept 1 race 0 0 1 sex 0 1
    sex*race 0 0 0 0 0 1
    eduyrs 12 sex*eduyrs 0 12;
  title3 'plm using s_mod01';
run;
                                               Standard
Label                           Estimate          Error    DF   t Value   Pr > |t|
mod01 at 13.969 white female      148.50         3.9096   831     37.98   <.
mod01 at 13.969 white male        169.82         2.4240   831     70.06   <.
mod01 at 12 white female          149.91         4.3370   831     34.56   <.
mod01 at 12 white male            164.20         2.8148   831     58.34   <.
Now we are in a position to do similar calculations for the other three models:
Code for PLM

proc plm source=s_mod02;
  estimate 'mod02 at 12 white female'
    intercept 1 race 0 0 1 sex 1 0
    sex*race 0 0 1 0 0 0
    sex*eduyrs 12 0;
  estimate 'mod02 at 12 white male'
    intercept 1 race 0 0 1 sex 0 1
    sex*race 0 0 0 0 0 1
    sex*eduyrs 0 12;
  title3 'plm using s_mod02';
run;

proc plm source=s_mod03;
  estimate 'mod03 at 12 white female'
    intercept 1 race 0 0 1 sex 1 0
    sex*race 0 0 1 0 0 0
    male*eduyrs 0 sex*eduyrs 12 0;
  estimate 'mod03 at 12 white male'
    intercept 1 race 0 0 1 sex 0 1
    sex*race 0 0 0 0 0 1
    male*eduyrs 12 sex*eduyrs 0 12;
  title3 'plm using s_mod03';
run;

proc plm source=s_mod04;
  estimate 'mod04 at 12 white female'
    intercept 1 race 0 0 1 sex 1 0
    sex*race 0 0 1 0 0 0
    male*eduyrs 0;
  estimate 'mod04 at 12 white male'
    intercept 1 race 0 0 1 sex 0 1
    sex*race 0 0 0 0 0 1
    male*eduyrs 12;
  title3 'plm using s_mod04';
run;
ACKNOWLEDGEMENT
Some of the material in this paper previously appeared in Pasta (2009), Pasta (2010), and Pasta (2011).
My thanks to my coauthors on previous papers, Stefanie Silva Millar, Lori Potter, and Michelle Pritchard
Turner, for their help.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
David J. Pasta
Vice President, Statistical & Strategic Analysis
ICON Late Phase & Outcomes Research
188 Embarcadero, Suite 200
San Francisco, CA 94105
david.pasta@iconplc.com
www.iconplc.com
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.