SAS univariate results explained: SAS Chapter 4

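Personal course notes, Chapter 4.

Textbook: A Handbook of Statistical Analyses Using SAS

Also drawn from the school's lecture videos, slides, lab code, and reports.

Supplementary reference, the official documentation: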

https://support.sas.com/documentation/onlinedoc/stat/131/anova.pdf

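Introduction

Chapter 2 considered the relationship between a response variable and a single categorical variable. When the categorical variable has only two possible values, a two-sample t test can be used to decide whether the means are equal.

This chapter introduces the ANOVA model, which covers a wider range of situations. It requires one response variable (numerical, that is, continuous data), while the predictors can be any number of categorical variables, each with any number of levels, including more than two. The limitation is that the predictors must all be categorical. Depending on the number of categorical variables, the analysis is called: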

N-way analysis of variance: based on n independent categorical variables

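We can analyze how the response varies with a single categorical variable:

Main effect: effect of single categorical predictor

We can also allow two categorical variables to interact with each other:

First-order interaction: an interaction between two categorical predictors

n-th order interaction: interaction of a categorical predictor with n other categorical predictors

The ANOVA model as used here (PROC ANOVA) is designed for balanced data; for unbalanced designs the SAS documentation points to PROC GLM instead.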

Balanced data: data with an equal number of observations in each cell
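
Whether a data set is balanced can be checked by counting the observations in each cell. The following is a minimal sketch, assuming a data set named hyper with class variables drug, diet and biofeed (the names used in the example later in these notes); adjust the names for other data.

```sas
/* Count observations in every drug x diet x biofeed cell. */
/* The data are balanced if every row shows the same frequency. */
proc freq data=hyper;
  tables drug*diet*biofeed / list;
run;
```

The list option prints one row per cell, which makes unequal counts easy to spot.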

Anova Model

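Only the two-way ANOVA model is sketched here. With two categorical variables, the model assumes

$$y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk}$$

where i and j index the levels of the two categorical variables and k indexes the observations within a cell; with n observations per cell, k runs from 1 to n. Here $\mu$ is the overall mean, $\alpha_i$ and $\beta_j$ are the main effects, and $\gamma_{ij}$ is the interaction.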

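The errors are assumed to satisfy

$$\epsilon_{ijk} \sim N(0, \sigma^2)$$

independently, so the variance must be the same for every combination of the categorical variables, and the response must be normally distributed within each combination.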
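
One common way to look at the normality assumption group by group is PROC UNIVARIATE with the normal option. This is only a sketch, assuming the hyper data set and the variables bp and drug from the example later in these notes; it checks normality within the levels of drug only, not within every cell.

```sas
/* Normality tests and summary statistics of bp within each drug level. */
proc univariate data=hyper normal;
  class drug;
  var bp;
run;
```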

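The hypothesis tests are as follows:

$$H_0^{A}: \alpha_i = 0 \text{ for all } i, \qquad H_0^{B}: \beta_j = 0 \text{ for all } j, \qquad H_0^{AB}: \gamma_{ij} = 0 \text{ for all } i, j$$

The total sum of squares SST is partitioned as SST = SSA + SSB + SSAB + SSE. Each of SSA, SSB, and SSAB yields an F statistic, from which a p value is computed.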
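
For reference, the F statistics in the balanced two-way case can be written out as below. The symbols a and b (the number of levels of the two factors) are introduced here for illustration; n is the number of observations per cell.

```latex
\begin{align*}
SST &= SSA + SSB + SSAB + SSE \\
F_A &= \frac{SSA/(a-1)}{SSE/\bigl(ab(n-1)\bigr)}, \qquad
F_B = \frac{SSB/(b-1)}{SSE/\bigl(ab(n-1)\bigr)}, \qquad
F_{AB} = \frac{SSAB/\bigl((a-1)(b-1)\bigr)}{SSE/\bigl(ab(n-1)\bigr)}
\end{align*}
```

Each statistic compares a mean square for an effect with the error mean square; a large value gives a small p value.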

We can also define

$$R^2 = \frac{SSA + SSB + SSAB}{SST} = 1 - \frac{SSE}{SST}$$

the proportion of the total variation in y explained by the proposed model.

If $R^2$ is small but the model is significant, the model is still useful (better than noise) even though it does not explain a lot of the variation in the response. We need to include more useful predictors in order to improve $R^2$.

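ANOVA model workflow

First, check whether the ANOVA model suits the data. We need to be studying the relationship between one continuous response variable and one or more categorical variables, and the data must be balanced.

Next, check whether the two assumptions of equal variance and normal distribution hold.

If the variances are equal, the ANOVA model can be used directly to see which categorical variables, or combinations of them, are significant.

If the variances are not equal, use the welch option of the means statement in proc anova, which applies to one-way models.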

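Note: the ANOVA model is hierarchical. If we want to include a three-way interaction, we must also include all the two-way interactions among the same variables. This follows naturally from the definition of the model: if three variables interact, the two-way interactions among them are present as well.

If some factors turn out to be significant, we can go further and use Bonferroni's method, Tukey's method, Scheffe's method, and so on to compare the response means across the levels of a factor.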
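
The multiple comparison methods named above can be requested together on the means statement. This is a sketch, assuming the hyper data set and the variables bp and drug from the example that follows; in practice one method is usually chosen rather than all three.

```sas
/* One-way ANOVA of bp on drug with three multiple comparison methods. */
proc anova data=hyper;
  class drug;
  model bp = drug;
  /* bon = Bonferroni, tukey = Tukey, scheffe = Scheffe;       */
  /* cldiff prints confidence intervals for pairwise differences */
  means drug / bon tukey scheffe cldiff;
run;
quit;
```

An interval that excludes zero indicates a significant difference between the two levels under that method.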

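Example

We want to study the relationship between the variable "bp" (blood pressure) and three categorical variables "drug diet biofeed". drug is the type of drug taken (X, Y, Z); diet indicates whether the patient follows a controlled diet (N, Y); I am not sure about biofeed, but it is presumably something like psychological counseling, coded P (present) or N (not present). Put simply, there are three treatments for high blood pressure, some patients receive several at once, and we want to relate blood pressure to the three treatments to judge how well they work.

First, look at the relationship with drug.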

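proc anova data=hyper;
  class drug;        /* categorical predictor */
  model bp = drug;   /* one-way ANOVA of bp on drug */
  /* Levene's test, Tukey intervals, Welch's ANOVA */
  means drug / hovtest tukey cldiff welch;
run;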

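hovtest tests whether the variances are equal, tukey cldiff reports confidence intervals for the pairwise differences between means, and welch requests Welch's ANOVA, which is the result to use if the variances are not equal.

First look at Levene's test for equality of variances in the output.

The p value is greater than 0.05, so we do not reject equal variances and can trust the ANOVA results.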

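Next look at the ANOVA results.

Because drug takes only three values, the model has 3 - 1 = 2 degrees of freedom. The model sum of squares is 3675 and the error sum of squares, the unexplained part, is 18919. The F test gives a p value of 0.0022, so we conclude that different drugs affect blood pressure differently; in other words, drug has a significant effect on blood pressure.

R-square is the proportion of the total variation explained by the model. It is small here, so drug alone is not enough and more predictors need to be included.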
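
As a rough check, R-square can be recomputed from the two sums of squares quoted above. These are the rounded values from the notes, so the result is approximate.

```latex
\[
R^2 = \frac{SS_{\text{model}}}{SS_{\text{model}} + SS_{\text{error}}}
    = \frac{3675}{3675 + 18919}
    = \frac{3675}{22594}
    \approx 0.163
\]
```

So drug on its own accounts for roughly 16% of the variation in blood pressure.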

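Tukey's test reports two 95% confidence intervals for pairwise differences that exclude zero. Since the interval for the Y versus Z comparison, (-8.949, 13.949), contains zero, these must be the two comparisons involving X: the mean blood pressure under X differs significantly from the means under Y and under Z, and the sign of each interval in the output shows which mean is larger. For Y and Z we cannot say which is larger.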

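Example with multiple categorical variables

This part is brief; see the textbook for details. The hovtest option in proc anova is available only for one-way models, so it cannot be applied directly when there are several class variables. The book gets around this with a single variable, cell, that identifies each combination of the factors. I do not know how this is usually done in practice.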
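
A cell variable of this kind can be built by concatenating the three factors in a data step. This is a sketch rather than the book's exact code: the separator and the length are arbitrary choices, and it assumes drug, diet and biofeed are short character variables in hyper.

```sas
/* Add a variable that labels each drug x diet x biofeed combination. */
data hyper;
  set hyper;
  length cell $ 8;
  cell = catx('-', drug, diet, biofeed);  /* e.g. a value like X-N-P */
run;
```

With cell as the only class variable, the model is one-way and hovtest becomes available.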

proc anova data = hyper;
class cell;
model bp = cell;
means cell / hovtest;
run;

Now fit the model with all possible combinations of the three factors.
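
A model with every main effect and interaction can be written compactly with the bar operator. This is a sketch of one way to specify it, assuming the same hyper data set; the notes do not show the exact statement that was run.

```sas
/* Full factorial ANOVA: drug|diet|biofeed expands to the three main   */
/* effects, the three two-way interactions and the three-way interaction. */
proc anova data=hyper;
  class drug diet biofeed;
  model bp = drug|diet|biofeed;
run;
quit;
```

Writing the terms out one by one, with drug*diet and so on, gives the same model and makes it easier to drop individual terms later.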

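The model DF here is 11 because 2*3*2 - 1 = 11; in other words, the degrees of freedom equal the number of cells minus 1. R^2 is 0.58, much better than before. Looking at the individual terms, the p values of the main effects are all below 0.05, and among the two-way interactions only diet*drug comes close to 0.05. The p value of the three-way interaction is small, and as noted earlier, keeping the three-way interaction means all the two-way interactions must be kept too.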
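
The 11 model degrees of freedom can also be split by term, using 3 levels for drug and 2 each for diet and biofeed. Each interaction contributes the product of the degrees of freedom of its factors.

```latex
\begin{align*}
\underbrace{2 + 1 + 1}_{\text{drug, diet, biofeed}}
+ \underbrace{2 + 2 + 1}_{\text{drug*diet, drug*biofeed, diet*biofeed}}
+ \underbrace{2}_{\text{drug*diet*biofeed}}
= 11 = 2 \cdot 3 \cdot 2 - 1
\end{align*}
```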

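The book applies a log transformation to bp and reruns the ANOVA to decide whether to keep the three-way interaction. That step is skipped here.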
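
For readers who want to try that step, a minimal sketch could look like the following. The variable name logbp is chosen here for illustration, and the book's own code may differ.

```sas
/* Log-transform the response, then refit the full factorial model. */
data hyper;
  set hyper;
  logbp = log(bp);  /* natural logarithm of blood pressure */
run;

proc anova data=hyper;
  class drug diet biofeed;
  model logbp = drug|diet|biofeed;
run;
quit;
```

If the three-way interaction is no longer significant on the log scale, it can be dropped and the model refitted with main effects and two-way interactions only.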