Preface ....................................................... xix
Content - how the chapters fit together ....................... xxv
1 A brief introduction to R ................................... 1
1.1 An overview of R ........................................ 1
1.1.1 A short R session ................................ 1
1.1.2 The uses of R .................................... 6
1.1.3 Online help ...................................... 7
1.1.4 Input of data from a file ........................ 8
1.1.5 R packages ....................................... 9
1.1.6 Further steps in learning R ...................... 9
1.2 Vectors, factors, and univariate time series ........... 10
1.2.1 Vectors ......................................... 10
1.2.2 Concatenation - joining vector objects .......... 10
1.2.3 The use of relational operators to compare
vector elements ................................. 11
1.2.4 The use of square brackets to extract subsets
of vectors ...................................... 11
1.2.5 Patterned data .................................. 11
1.2.6 Missing values .................................. 12
1.2.7 Factors ......................................... 13
1.2.8 Time series ..................................... 14
1.3 Data frames and matrices ............................... 14
1.3.1 Accessing the columns of data frames - with()
and attach() .................................... 17
1.3.2 Aggregation, stacking, and unstacking ........... 17
1.3.3 Data frames and matrices ........................ 18
1.4 Functions, operators, and loops ........................ 19
1.4.1 Common useful built-in functions ................ 19
1.4.2 Generic functions, and the class of an object ... 21
1.4.3 User-written functions .......................... 22
1.4.4 If Statements ................................... 23
1.4.5 Selection and matching .......................... 23
1.4.6 Functions for working with missing values ....... 24
1.4.7 Looping ......................................... 24
1.5 Graphics in R .......................................... 25
1.5.1 The function plot() and allied functions ........ 25
1.5.2 The use of color ................................ 27
1.5.3 The importance of aspect ratio .................. 28
1.5.4 Dimensions and other settings for graphics
devices ......................................... 28
1.5.5 The plotting of expressions and mathematical
symbols ......................................... 29
1.5.6 Identification and location on the figure
region .......................................... 29
1.5.7 Plot methods for objects other than vectors ..... 30
1.5.8 Lattice (trellis) graphics ...................... 30
1.5.9 Good and bad graphs ............................. 32
1.5.10 Further information on graphics ................. 33
1.6 Additional points on the use of R ...................... 33
1.7 Recap .................................................. 35
1.8 Further reading ........................................ 36
1.9 Exercises .............................................. 37
2 Styles of data analysis ..................................... 43
2.1 Revealing views of the data ............................ 43
2.1.1 Views of a single sample ........................ 44
2.1.2 Patterns in univariate time series .............. 47
2.1.3 Patterns in bivariate data ...................... 49
2.1.4 Patterns in grouped data - lengths of cuckoo
eggs ............................................ 52
2.1.5 Multiple variables and times .................... 53
2.1.6 Scatterplots, broken down by multiple factors ... 56
2.1.7 What to look for in plots ....................... 58
2.2 Data summary ........................................... 59
2.2.1 Counts .......................................... 59
2.2.2 Summaries of information from data frames ....... 63
2.2.3 Standard deviation and inter-quartile range ..... 65
2.2.4 Correlation ..................................... 67
2.3 Statistical analysis questions, aims, and strategies ... 69
2.3.1 How relevant and how reliable are the data? ..... 70
2.3.2 How will results be used? ....................... 70
2.3.3 Formal and informal assessments ................. 71
2.3.4 Statistical analysis strategies ................. 72
2.3.5 Planning the formal analysis .................... 72
2.3.6 Changes to the intended plan of analysis ........ 73
2.4 Recap .................................................. 73
2.5 Further reading ........................................ 74
2.6 Exercises .............................................. 74
3 Statistical models .......................................... 77
3.1 Statistical models ..................................... 77
3.1.1 Incorporation of an error or noise component .... 78
3.1.2 Fitting models - the model formula .............. 80
3.2 Distributions: models for the random component ......... 81
3.2.1 Discrete distributions - models for counts ...... 82
3.2.2 Continuous distributions ........................ 84
3.3 Simulation of random numbers and random samples ........ 86
3.3.1 Sampling from the normal and other continuous
distributions ................................... 87
3.3.2 Simulation of regression data ................... 88
3.3.3 Simulation of the sampling distribution of the
mean ............................................ 88
3.3.4 Sampling from finite populations ................ 90
3.4 Model assumptions ...................................... 91
3.4.1 Random sampling assumptions - independence ...... 91
3.4.2 Checks for normality ............................ 92
3.4.3 Checking other model assumptions ................ 95
3.4.4 Are non-parametric methods the answer? .......... 95
3.4.5 Why models matter - adding across contingency
tables .......................................... 96
3.5 Recap .................................................. 97
3.6 Further reading ........................................ 98
3.7 Exercises .............................................. 98
4 A review of inference concepts ............................. 102
4.1 Basic concepts of estimation .......................... 102
4.1.1 Population parameters and sample statistics .... 102
4.1.2 Sampling distributions ......................... 102
4.1.3 Assessing accuracy - the standard error ........ 103
4.1.4 The standard error for the difference of
means .......................................... 103
4.1.5 The standard error of the median ............... 104
4.1.6 The sampling distribution of the t-statistic ... 105
4.2 Confidence intervals and tests of hypotheses .......... 106
4.2.1 A summary of one- and two-sample
calculations ................................... 109
4.2.2 Confidence intervals and tests for
proportions .................................... 112
4.2.3 Confidence intervals for the correlation ....... 113
4.2.4 Confidence intervals versus hypothesis tests ... 113
4.3 Contingency tables .................................... 114
4.3.1 Rare and endangered plant species .............. 116
4.3.2 Additional notes ............................... 119
4.4 One-way unstructured comparisons ...................... 119
4.4.1 Multiple comparisons ........................... 122
4.4.2 Data with a two-way structure, i.e., two
factors ........................................ 123
4.4.3 Presentation issues ............................ 124
4.5 Response curves ....................................... 125
4.6 Data with a nested variation structure ................ 126
4.6.1 Degrees of freedom considerations .............. 127
4.6.2 General multi-way analysis of variance
designs ........................................ 127
4.7 Resampling methods for standard errors, tests, and
confidence intervals .................................. 128
4.7.1 The one-sample permutation test ................ 128
4.7.2 The two-sample permutation test ................ 129
4.7.3 Estimating the standard error of the median:
bootstrapping .................................. 130
4.7.4 Bootstrap estimates of confidence intervals .... 131
4.8 Theories of inference ................................. 132
4.8.1 Maximum likelihood estimation .................. 133
4.8.2 Bayesian estimation ............................ 133
4.8.3 If there is strong prior information, use
it! ............................................ 135
4.9 Recap ................................................. 135
4.10 Further reading ....................................... 136
4.11 Exercises ............................................. 137
5 Regression with a single predictor ......................... 142
5.1 Fitting a line to data ................................ 142
5.1.1 Summary information - lawn roller example ...... 143
5.1.2 Residual plots ................................. 143
5.1.3 Iron slag example: is there a pattern in the
residuals? ..................................... 145
5.1.4 The analysis of variance table ................. 147
5.2 Outliers, influence, and robust regression ............ 147
5.3 Standard errors and confidence intervals .............. 149
5.3.1 Confidence intervals and tests for the slope ... 150
5.3.2 SEs and confidence intervals for predicted
values ......................................... 150
5.3.3 Implications for design ........................ 151
5.4 Assessing predictive accuracy ......................... 152
5.4.1 Training/test sets and cross-validation ........ 153
5.4.2 Cross-validation - an example .................. 153
5.4.3 Bootstrapping .................................. 155
5.5 Regression versus qualitative anova comparisons -
issues of power ....................................... 158
5.6 Logarithmic and other transformations ................. 160
5.6.1 A note on power transformations ................ 160
5.6.2 Size and shape data - allometric growth ........ 161
5.7 There are two regression lines! ....................... 162
5.8 The model matrix in regression ........................ 163
5.9 Bayesian regression estimation using the MCMCpack
package ............................................... 165
5.10 Recap ................................................. 166
5.11 Methodological references ............................. 167
5.12 Exercises ............................................. 167
6 Multiple linear regression ................................. 170
6.1 Basic ideas: a book weight example .................... 170
6.1.1 Omission of the intercept term ................. 172
6.1.2 Diagnostic plots ............................... 173
6.2 The interpretation of model coefficients .............. 174
6.2.1 Times for Northern Irish hill races ............ 174
6.2.2 Plots that show the contribution of
individual terms ............................... 177
6.2.3 Mouse brain weight example ..................... 179
6.2.4 Book dimensions, density, and book weight ...... 181
6.3 Multiple regression assumptions, diagnostics, and
efficacy measures ..................................... 183
6.3.1 Outliers, leverage, influence, and Cook's
distance ....................................... 183
6.3.2 Assessment and comparison of regression
models ......................................... 186
6.3.3 How accurately does the equation predict? ...... 187
6.4 A strategy for fitting multiple regression models ..... 189
6.4.1 Suggested steps ................................ 190
6.4.2 Diagnostic checks .............................. 191
6.4.3 An example - Scottish hill race data ........... 191
6.5 Problems with many explanatory variables .............. 196
6.5.1 Variable selection issues ...................... 197
6.6 Multicollinearity ..................................... 199
6.6.1 The variance inflation factor .................. 201
6.6.2 Remedies for multicollinearity ................. 203
6.7 Errors in x ........................................... 203
6.8 Multiple regression models - additional points ........ 208
6.8.1 Confusion between explanatory and response
variables ...................................... 208
6.8.2 Missing explanatory variables .................. 208
6.8.3 The use of transformations ..................... 210
6.8.4 Non-linear methods - an alternative to
transformation? ................................ 210
6.9 Recap ................................................. 212
6.10 Further reading ....................................... 212
6.11 Exercises ............................................. 214
7 Exploiting the linear model framework ...................... 217
7.1 Levels of a factor - using indicator variables ........ 217
7.1.1 Example - sugar weight ......................... 217
7.1.2 Different choices for the model matrix when
there are factors .............................. 220
7.2 Block designs and balanced incomplete block designs ... 222
7.2.1 Analysis of the rice data, allowing for block
effects ........................................ 222
7.2.2 A balanced incomplete block design ............. 223
7.3 Fitting multiple lines ................................ 224
7.4 Polynomial regression ................................. 228
7.4.1 Issues in the choice of model .................. 229
7.5 Methods for passing smooth curves through data ........ 231
7.5.1 Scatterplot smoothing - regression splines ..... 232
7.5.2 Roughness penalty methods and generalized
additive models ................................ 235
7.5.3 Distributional assumptions for automatic
choice of roughness penalty .................... 236
7.5.4 Other smoothing methods ........................ 236
7.6 Smoothing with multiple explanatory variables ......... 238
7.6.1 An additive model with two smooth terms ........ 238
7.6.2 A smooth surface ............................... 240
7.7 Further reading ....................................... 240
7.8 Exercises ............................................. 240
8 Generalized linear models and survival analysis ............ 244
8.1 Generalized linear models ............................. 244
8.1.1 Transformation of the expected value on the
left ........................................... 244
8.1.2 Noise terms need not be normal ................. 245
8.1.3 Log odds in contingency tables ................. 245
8.1.4 Logistic regression with a continuous
explanatory variable ........................... 246
8.2 Logistic multiple regression .......................... 249
8.2.1 Selection of model terms, and fitting the
model .......................................... 252
8.2.2 Fitted values .................................. 254
8.2.3 A plot of contributions of explanatory
variables ...................................... 255
8.2.4 Cross-validation estimates of predictive
accuracy ....................................... 255
8.3 Logistic models for categorical data - an example ..... 256
8.4 Poisson and quasi-Poisson regression .................. 258
8.4.1 Data on aberrant crypt foci .................... 258
8.4.2 Moth habitat example ........................... 261
8.5 Additional notes on generalized linear models ......... 266
8.5.1 Residuals, and estimating the dispersion ....... 266
8.5.2 Standard errors and z- or t-statistics for
binomial models ................................ 267
8.5.3 Leverage for binomial models ................... 268
8.6 Models with an ordered categorical or categorical
response .............................................. 268
8.6.1 Ordinal regression models ...................... 269
8.6.2 Loglinear models ............................... 272
8.7 Survival analysis ..................................... 272
8.7.1 Analysis of the Aids2 data ..................... 273
8.7.2 Right-censoring prior to the termination of
the study ...................................... 275
8.7.3 The survival curve for male homosexuals ........ 276
8.7.4 Hazard rates ................................... 276
8.7.5 The Cox proportional hazards model ............. 277
8.8 Transformations for count data ........................ 279
8.9 Further reading ....................................... 280
8.10 Exercises ............................................. 281
9 Time series models ......................................... 283
9.1 Time series - some basic ideas ........................ 283
9.1.1 Preliminary graphical explorations ............. 283
9.1.2 The autocorrelation and partial
autocorrelation function ....................... 284
9.1.3 Autoregressive models .......................... 285
9.1.4 Autoregressive moving average models -
theory ......................................... 287
9.1.5 Automatic model selection? ..................... 288
9.1.6 A time series forecast ......................... 289
9.2 Regression modeling with ARIMA errors ................. 291
9.3 Non-linear time series ................................ 298
9.4 Further reading ....................................... 300
9.5 Exercises ............................................. 301
10 Multi-level models and repeated measures ................... 303
10.1 A one-way random effects model ........................ 304
10.1.1 Analysis with aov() ............................ 305
10.1.2 A more formal approach ......................... 308
10.1.3 Analysis using lmer() .......................... 310
10.2 Survey data, with clustering .......................... 313
10.2.1 Alternative models ............................. 313
10.2.2 Instructive, though faulty, analyses ........... 318
10.2.3 Predictive accuracy ............................ 319
10.3 A multi-level experimental design ..................... 319
10.3.1 The anova table ................................ 321
10.3.2 Expected values of mean squares ................ 322
10.3.3 The analysis of variance sums of squares
breakdown ...................................... 323
10.3.4 The variance components ........................ 325
10.3.5 The mixed model analysis ....................... 326
10.3.6 Predictive accuracy ............................ 328
10.4 Within- and between-subject effects ................... 329
10.4.1 Model selection ................................ 329
10.4.2 Estimates of model parameters .................. 331
10.5 A generalized linear mixed model ...................... 332
10.6 Repeated measures in time ............................. 334
10.6.1 Example - random variation between profiles .... 336
10.6.2 Orthodontic measurements on children ........... 340
10.7 Further notes on multi-level and other models with
correlated errors ..................................... 344
10.7.1 Different sources of variance - complication
or focus of interest? .......................... 344
10.7.2 Predictions from models with a complex error
structure ...................................... 345
10.7.3 An historical perspective on multi-level
models ......................................... 345
10.7.4 Meta-analysis .................................. 347
10.7.5 Functional data analysis ....................... 347
10.7.6 Error structure in explanatory variables ....... 347
10.8 Recap ................................................. 347
10.9 Further reading ....................................... 348
10.10 Exercises ............................................ 349
11 Tree-based classification and regression ................... 351
11.1 The uses of tree-based methods ........................ 352
11.1.1 Problems for which tree-based regression may
be used ........................................ 352
11.2 Detecting email spam - an example ..................... 353
11.2.1 Choosing the number of splits .................. 356
11.3 Terminology and methodology ........................... 356
11.3.1 Choosing the split - regression trees .......... 357
11.3.2 Within and between sums of squares ............. 357
11.3.3 Choosing the split - classification trees ...... 358
11.3.4 Tree-based regression versus loess regression
smoothing ...................................... 359
11.4 Predictive accuracy and the cost-complexity
trade-off ............................................. 361
11.4.1 Cross-validation ............................... 361
11.4.2 The cost-complexity parameter .................. 362
11.4.3 Prediction error versus tree size .............. 363
11.5 Data for female heart attack patients ................. 363
11.5.1 The one-standard-deviation rule ................ 365
11.5.2 Printed information on each split .............. 366
11.6 Detecting email spam - the optimal tree ............... 366
11.7 The randomForest package .............................. 369
11.8 Additional notes on tree-based methods ................ 372
11.9 Further reading and extensions ........................ 373
11.10 Exercises ............................................
12 Multivariate data exploration and discrimination ........... 377
12.1 Multivariate exploratory data analysis ................ 378
12.1.1 Scatterplot matrices ........................... 378
12.1.2 Principal components analysis .................. 379
12.1.3 Multi-dimensional scaling ...................... 383
12.2 Discriminant analysis ................................. 385
12.2.1 Example - plant architecture ................... 386
12.2.2 Logistic discriminant analysis ................. 387
12.2.3 Linear discriminant analysis ................... 388
12.2.4 An example with more than two groups ........... 390
12.3 High-dimensional data, classification, and plots ...... 392
12.3.1 Classifications and associated graphs .......... 394
12.3.2 Flawed graphs .................................. 394
12.3.3 Accuracies and scores for test data ............ 398
12.3.4 Graphs derived from the cross-validation
process ........................................ 404
12.4 Further reading ....................................... 406
12.5 Exercises ............................................. 407
13 Regression on principal component or discriminant scores ... 410
13.1 Principal component scores in regression .............. 410
13.2 Propensity scores in regression comparisons - labor
training data ......................................... 414
13.2.1 Regression comparisons ......................... 417
13.2.2 A strategy that uses propensity scores ......... 419
13.3 Further reading ....................................... 426
13.4 Exercises ............................................. 426
14 The R system - additional topics ........................... 427
14.1 Graphical user interfaces to R ........................ 427
14.1.1 The R Commander's interface - a guide to
getting started ................................ 428
14.1.2 The rattle GUI ................................. 429
14.1.3 The creation of simple GUIs - the fgui
package ........................................ 429
14.2 Working directories, workspaces, and the search
list .................................................. 430
14.2.1 The search path ................................ 430
14.2.2 Workspace management ........................... 430
14.2.3 Utility functions .............................. 431
14.3 R system configuration ................................ 432
14.3.1 The R Windows installation directory tree ...... 432
14.3.2 The library directories ........................ 433
14.3.3 The startup mechanism .......................... 433
14.4 Data input and output ................................. 433
14.4.1 Input of data .................................. 434
14.4.2 Data output .................................... 437
14.4.3 Database connections ........................... 438
14.5 Functions and operators - some further details ........ 438
14.5.1 Function arguments ............................. 439
14.5.2 Character string and vector functions .......... 440
14.5.3 Anonymous functions ............................ 441
14.5.4 Functions for working with dates (and times) ... 441
14.5.5 Creating groups ................................ 443
14.5.6 Logical operators .............................. 443
14.6 Factors ............................................... 444
14.7 Missing values ........................................ 446
14.8 Matrices and arrays ................................... 448
14.8.1 Matrix arithmetic .............................. 450
14.8.2 Outer products ................................. 451
14.8.3 Arrays ......................................... 451
14.9 Manipulations with lists, data frames, matrices, and
time series ........................................... 452
14.9.1 Lists - an extension of the notion of
"vector" ....................................... 452
14.9.2 Changing the shape of data frames (or
matrices) ...................................... 454
14.9.3 Merging data frames - merge() .................. 455
14.9.4 Joining data frames, matrices, and vectors -
cbind() ........................................ 455
14.9.5 The apply family of functions .................. 456
14.9.6 Splitting vectors and data frames into lists
- split() ...................................... 457
14.9.7 Multivariate time series ....................... 458
14.10 Classes and methods .................................. 458
14.10.1 Printing and summarizing model objects ........ 459
14.10.2 Extracting information from model objects ..... 460
14.10.3 S4 classes and methods ........................ 460
14.11 Manipulation of language constructs .................. 461
14.11.1 Model and graphics formulae ................... 461
14.11.2 The use of a list to pass arguments ........... 462
14.11.3 Expressions ................................... 463
14.11.4 Environments .................................. 463
14.11.5 Function environments and lazy evaluation ..... 464
14.12 Creation of R packages ............................... 465
14.13 Document preparation - Sweave() and xtable() ......... 467
14.14 Further reading ...................................... 468
14.15 Exercises ............................................ 469
15 Graphs in R ................................................ 472
15.1 Hardcopy graphics devices ............................. 472
15.2 Plotting characters, symbols, line types, and
colors ................................................ 472
15.3 Formatting and plotting of text and equations ......... 474
15.3.1 Symbolic substitution of symbols in an
expression ..................................... 475
15.3.2 Plotting expressions in parallel ............... 475
15.4 Multiple graphs on a single graphics page ............. 476
15.5 Lattice graphics and the grid package ................. 477
15.5.1 Groups within data, and/or columns in
parallel ....................................... 478
15.5.2 Lattice parameter settings ..................... 480
15.5.3 Panel functions, strip functions, strip
labels, and other annotation ................... 483
15.5.4 Interaction with lattice (and other) plots -
the playwith package ........................... 485
15.5.5 Interaction with lattice plots - focus,
interact, unfocus .............................. 485
15.5.6 Overlaid plots with different scales ........... 486
15.6 An implementation of Wilkinson's Grammar of
Graphics .............................................. 487
15.7 Dynamic graphics - the rgl and rggobi packages ........ 491
15.8 Further reading ....................................... 492
Epilogue ...................................................... 493
References .................................................... 495
Index of R symbols and functions .............................. 507
Index of terms ................................................ 514
Index of authors .............................................. 523
The color plates will be found between pages 328 and 329.