










An overview of the R package 'lattice' for data visualization, focusing on Trellis graphics and its capabilities for multivariate data. Topics include histograms, density plots, contour plots, and scatterplots, as well as the use of conditioning and grouping variables. The document also discusses the advantages and limitations of lattice graphics.
DEEPAYAN SARKAR
Lattice graphics
lattice is an add-on package that implements Trellis graphics (originally developed for S and S-PLUS) in R. It is a powerful and elegant high-level data visualization system, with an emphasis on multivariate data, that is sufficient for typical graphics needs, and is also flexible enough to handle most nonstandard requirements. This tutorial covers the basics of lattice and gives pointers to further resources.
Some examples
To fix ideas, we start with a few simple examples. We use the Chem97 dataset from the mlmRev package.
data(Chem97, package = "mlmRev")
head(Chem97)

  lea school student score gender age gcsescore gcsecnt
1   1      1       1     4      F   3     6.625      0.
2   1      1       2    10      F  -3     7.625      1.
3   1      1       3    10      F  -4     7.250      0.
4   1      1       4    10      F  -2     7.500      1.
5   1      1       5     8      F  -1     6.444      0.
6   1      1       6    10      F   4     7.750      1.
The dataset records information on students appearing in the 1997 A-level chemistry examination in Britain. We are only interested in the following variables:
Using lattice, we can draw a histogram of all the gcsescore values using
library(lattice)  # attach lattice before calling its functions
histogram(~ gcsescore, data = Chem97)
Date: November 2015.
[Figure: histogram of gcsescore; x-axis: gcsescore, y-axis: Percent of Total]
This plot shows a reasonably symmetric unimodal distribution, but is otherwise uninteresting. A more interesting display would be one where the distribution of gcsescore is compared across different subgroups, say those defined by the A-level exam score. This can be done using
histogram(~ gcsescore | factor(score), data = Chem97)
[Figure: histograms of gcsescore, one panel per level of factor(score); x-axis: gcsescore, y-axis: Percent of Total]
More effective comparison is enabled by direct superposition. This is hard to do with conventional histograms, but easier using kernel density estimates. In the following example, we use the same subgroups
Trellis displays are defined by the type of graphic and the role different variables play in it. Each display type is associated with a corresponding high-level function (histogram, densityplot, etc.). Possible roles depend on the type of display, but typical ones are
The following display types are available in lattice.
Function          Default Display
histogram()       Histogram
densityplot()     Kernel Density Plot
qqmath()          Theoretical Quantile Plot
qq()              Two-sample Quantile Plot
stripplot()       Stripchart (Comparative 1-D Scatterplots)
bwplot()          Comparative Box-and-Whisker Plots
dotplot()         Cleveland Dot Plot
barchart()        Bar Plot
xyplot()          Scatterplot
splom()           Scatterplot Matrix
contourplot()     Contour Plot of Surfaces
levelplot()       False Color Level Plot of Surfaces
wireframe()       Three-dimensional Perspective Plot of Surfaces
cloud()           Three-dimensional Scatterplot
parallelplot()    Parallel Coordinates Plot
New high-level functions can be written to represent further visualization types; examples are ecdfplot() and mapplot() in the latticeExtra package.
Design goals
Visualization is an art, but it can benefit greatly from a systematic, scientific approach. In particular,? has shown that it is possible to come up with general rules that can be applied to design more effective graphs.
One of the primary goals of Trellis graphics is to provide tools that make it easy to apply these rules, so that the burden of compliance is shifted from the user to the software to the extent possible. Some obvious examples of such rules are:
These design goals have some technical drawbacks. For example, avoiding wasted space requires the complete display to be known when plotting begins, so the incremental approach common in traditional R graphics (e.g., adding a main title after the main plot is finished) does not fit in. lattice deals with this using an object-based paradigm: plots are represented as regular R objects, and incremental updates are performed by modifying such objects and re-plotting them.
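The object-based paradigm can be sketched as follows. This is a minimal illustration, not the tutorial's own example: it assumes the built-in quakes dataset (so that it runs without mlmRev), and the object names p and p.titled are made up here.

```r
library(lattice)

# A high-level call returns a "trellis" object; nothing is drawn yet.
p <- histogram(~ mag, data = quakes)

# Adding a main title afterwards means modifying the object, not the picture.
# update() returns a new object and leaves p unchanged.
p.titled <- update(p, main = "Magnitudes of earthquakes near Fiji")

# Both are ordinary R objects that can be stored, inspected, and re-plotted.
class(p.titled)
```

Typing p.titled at the prompt, or calling print(p.titled), then redraws the complete display with the title in place.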
Although rules are useful, any serious graphics system must also be flexible. lattice is designed to be flexible, but obviously there is a trade-off between flexibility and ease of use for the more common tasks. lattice tries to achieve a balance using the following model:
In each case, additional arguments to the high-level calls can be used to activate common variants, and full flexibility is allowed through arbitrary user-defined functions. This is particularly useful for controlling the primary display through panel functions.
Most nontrivial use of lattice involves manipulation of one or more of these elements. Not all graphical designs segregate neatly into these elements; lattice may not be a good tool for such displays.
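The two levels of control mentioned above (an extra argument for a common variant, a user-defined panel function for full flexibility) might look as follows. This is a sketch under assumptions: it uses the built-in quakes dataset, and the name panel.points.and.fit is hypothetical; panel.xyplot() and panel.lmline() are lattice functions.

```r
library(lattice)

# Common variant through an argument: type = c("p", "r") asks the default
# panel function for points plus a fitted regression line.
p.arg <- xyplot(stations ~ mag, data = quakes, type = c("p", "r"))

# Full control through a user-defined panel function (hypothetical name),
# assembled from lattice building blocks.
panel.points.and.fit <- function(x, y, ...) {
    panel.xyplot(x, y, ...)            # the usual scatterplot
    panel.lmline(x, y, col = "black")  # overlay a least-squares line
}
p.fun <- xyplot(stations ~ mag, data = quakes, panel = panel.points.and.fit)
```

The panel function is called once per panel with the data for that panel, so the same code applies unchanged when a conditioning variable is added to the formula.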
Common high-level functions
Visualizing univariate distributions. Several standard statistical graphics are intended to visualize the distribution of a continuous random variable. We have already seen histograms and density plots, which are both estimates of the probability density function. Another useful display is the normal Q-Q plot, which is related to the distribution function F(x) = P(X ≤ x). Normal Q-Q plots can be produced by the lattice function qqmath().
qqmath(~ gcsescore | factor(score), Chem97, groups = gender,
       f.value = ppoints(100), auto.key = list(columns = 2),
       type = c("p", "g"), aspect = "xy")
[Figure: normal Q-Q plots of gcsescore, one panel per level of factor(score), grouped by gender; x-axis: qnorm, y-axis: gcsescore]
Normal Q-Q plots plot empirical quantiles of the data against quantiles of the normal distribution (or some other theoretical distribution). They can be regarded as an estimate of the distribution function F, with the probability axis transformed by the normal quantile function. They are designed to detect departures from normality; for a good fit, the points lie approximately along a straight line. In the plot above, the
Box-and-whisker plots can be produced by the lattice function bwplot().
bwplot(factor(score) ~ gcsescore | gender, Chem97)
[Figure: horizontal box-and-whisker plots of gcsescore for each level of factor(score), in panels by gender; x-axis: gcsescore]
The decreasing lengths of the boxes and whiskers suggest decreasing variance, and the large number of outliers on one side indicates heavier left tails (characteristic of a left-skewed distribution).
The same box-and-whisker plots can be displayed in a slightly different layout to emphasize a more subtle effect in the data: for example, the median gcsescore does not uniformly increase from left to right in the following plot, as one might have expected.
bwplot(gcsescore ~ gender | factor(score), Chem97, layout = c(6, 1))
[Figure: vertical box-and-whisker plots of gcsescore by gender (M, F), six panels in one row, one per level of factor(score); y-axis: gcsescore]
The layout argument controls the layout of panels in columns, rows, and pages (the default would not have been as useful in this example). Note that the box-and-whisker plots are now vertical, because of a switch in the order of variables in the formula.
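The effect of formula order and of layout can be tried without the Chem97 data. The following is a small sketch, assuming the built-in quakes dataset and a three-level depth.band factor made up for the illustration; the object names are likewise hypothetical.

```r
library(lattice)

# Hypothetical grouping: cut the depth into three bands of equal width.
q <- transform(quakes, depth.band = cut(depth, breaks = 3))

# Factor on the left of ~ : one horizontal box per level.
p.horizontal <- bwplot(depth.band ~ mag, data = q)

# Factor on the right of ~ : the same boxes, drawn vertically.
p.vertical <- bwplot(mag ~ depth.band, data = q)

# layout = c(columns, rows): place the three panels in a single row.
p.row <- histogram(~ mag | depth.band, data = q, layout = c(3, 1))
```

Printing p.horizontal and p.vertical in turn shows the switch in orientation; printing p.row with and without the layout argument shows how the default arrangement differs.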
Exercise 3. All the plots we have seen suggest that the distribution of gcsescore is slightly skewed, and has unequal variances in the subgroups of interest. Using a Box–Cox transformation often helps in such situations. The boxcox() function in the MASS package can be used to find the “optimal” Box–Cox transformation, which in this case is approximately 2.34. Reproduce the previous plots replacing gcsescore by gcsescore^2.34. Would you say the transformation was successful?
Exercise 4. Not all tools are useful for all problems. Box-and-whisker plots, and to a lesser extent Q-Q plots, are mostly useful when the distributions are symmetric and unimodal, and can be misleading otherwise. For example, consider the display produced by
data(gvhd10, package = "latticeExtra")
bwplot(Days ~ log(FSC.H), data = gvhd10)
What would you conclude about the distribution of log(FSC.H) from this plot? Now draw kernel density plots of log(FSC.H) conditioning on Days. Would you reach the same conclusions as before?
dataset, which gives death rates in the U.S. state of Virginia in 1940 among different population subgroups. VADeaths is a matrix.
VADeaths
      Rural Male Rural Female Urban Male Urban Female
50-54       11.7          8.7       15.4           8.
55-59       18.1         11.7       24.3          13.
60-64       26.9         20.3       37.0          19.
65-69       41.0         30.9       54.6          35.
70-74       66.0         54.3       71.1          50.
To use the lattice formula interface, we first need to convert it into a data frame.
VADeathsDF <- as.data.frame.table(VADeaths, responseName = "Rate")
VADeathsDF

    Var1         Var2 Rate
1  50-54   Rural Male  11.
2  55-59   Rural Male  18.
3  60-64   Rural Male  26.
4  65-69   Rural Male  41.
5  70-74   Rural Male  66.
6  50-54 Rural Female   8.
7  55-59 Rural Female  11.
8  60-64 Rural Female  20.
9  65-69 Rural Female  30.
10 70-74 Rural Female  54.
11 50-54   Urban Male  15.
12 55-59   Urban Male  24.
13 60-64   Urban Male  37.
14 65-69   Urban Male  54.
15 70-74   Urban Male  71.
16 50-54 Urban Female   8.
17 55-59 Urban Female  13.
18 60-64 Urban Female  19.
19 65-69 Urban Female  35.
20 70-74 Urban Female  50.
Bar charts are produced by the barchart() function, and Cleveland dot plots by dotplot(). Both allow a formula of the form y ~ x (plus additional conditioning and grouping variables), where one of x and y should be a factor.
A bar chart of the VADeathsDF data is produced by
barchart(Var1 ~ Rate | Var2, VADeathsDF, layout = c(4, 1))
[Figure: bar chart of Rate by age group, in four panels: Rural Male, Rural Female, Urban Male, Urban Female; x-axis: Rate, from 20 to 60]
This plot is potentially misleading, because a strong visual effect in the plot is the comparison of the areas of the shaded bars, which do not mean anything. This problem can be addressed by making the areas proportional to the values they encode.
barchart(Var1 ~ Rate | Var2, VADeathsDF, layout = c(4, 1), origin = 0)
[Figure: bar chart of Rate by age group with bars starting at 0, in four panels: Rural Male, Rural Female, Urban Male, Urban Female; x-axis: Rate, from 0 to 60]
A better design is to forgo the bars altogether, since they distract from the primary comparison of the endpoint positions, and to use a dot plot instead.
dotplot(Var1 ~ Rate | Var2, VADeathsDF, layout = c(4, 1))
used for multiway tables stored as arrays, lattice also includes suitable methods that bypass the conversion to a data frame that would be required otherwise. For example, an alternative to the last example is
dotplot(VADeaths, type = "o", auto.key = list(points = TRUE, lines = TRUE, space = "right"))
[Figure: grouped dot plot of VADeaths with points joined by lines; x-axis: Freq, y-axis: age group; key: Rural Male, Rural Female, Urban Male, Urban Female]
Methods available for a particular generic can be listed as follows (see footnote 2):
methods(generic.function = "dotplot")
[1] dotplot.array*   dotplot.default* dotplot.formula* dotplot.matrix*
[5] dotplot.numeric* dotplot.table*
see '?methods' for accessing help and source code
The special features of the methods, if any, are described in their respective help pages; for example, ?dotplot.matrix for the example above.
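To see the matrix method and the formula interface side by side, the two calls can be built as objects and compared. This is a sketch only: the grouped formula call is an assumption about what a comparable formula-based display would look like, and the names p.formula and p.matrix are made up here.

```r
library(lattice)

# Formula interface: the matrix must first be converted to a data frame.
VADeathsDF <- as.data.frame.table(VADeaths, responseName = "Rate")
p.formula <- dotplot(Var1 ~ Rate, data = VADeathsDF, groups = Var2,
                     type = "o",
                     auto.key = list(points = TRUE, lines = TRUE,
                                     space = "right"))

# Matrix method: dotplot() dispatches on the class of its first argument,
# so the matrix can be passed directly, with no conversion step.
p.matrix <- dotplot(VADeaths, type = "o",
                    auto.key = list(points = TRUE, lines = TRUE,
                                    space = "right"))
```

Printing both objects allows the displays to be compared; the axis label is one visible difference, since the matrix method has no response name to use.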
Exercise 5. Reproduce the ungrouped dot plot using the “matrix” method.
Scatterplots and extensions. Scatterplots are commonly used for continuous bivariate data, as well as for time-series data. We use the Earthquake data, which contains measurements recorded at various seismometer locations for 23 large earthquakes in western North America between 1940 and 1980. Our first example plots the maximum horizontal acceleration measured against distance of the measuring station from the epicenter.
data(Earthquake, package = "nlme")
xyplot(accel ~ distance, data = Earthquake)
Footnote 2: This is only true for S3 generics and methods. lattice can be extended to use the S4 system as well, although we will not discuss such extensions here.
[Figure: scatterplot of accel against distance; x-axis: distance, y-axis: accel]
The plot shows patterns typical of a right-skewed distribution, and can be improved by plotting the data on a log scale. It is common to add a reference grid and some sort of smooth; for example,
xyplot(accel ~ distance, data = Earthquake,
       scales = list(log = TRUE),
       type = c("p", "g", "smooth"),
       xlab = "Distance From Epicenter (km)",
       ylab = "Maximum Horizontal Acceleration (g)")
[Figure: scatterplots of lat against long in panels labelled Depth; x-axis: long, from 165 to 185, y-axis: lat]
Trivariate displays. Of course, for continuous trivariate data, it may be more effective to use a three-dimensional scatterplot.
cloud(depth ~ lat * long, data = quakes,
      zlim = rev(range(quakes$depth)),
      screen = list(z = 105, x = -70), panel.aspect = 0.75,
      # in z ~ x * y the first term (lat) is the x variable
      xlab = "Latitude", ylab = "Longitude", zlab = "Depth")
[Figure: three-dimensional scatterplot of the quakes data; axes labelled Longitude, Latitude, Depth]
Static three-dimensional scatterplots are not very useful because of the strong effect of “camera” direction. Unfortunately, lattice does not allow interactive manipulation of the viewing direction. Still, looking at a few such plots suggests that the epicenter locations are concentrated around two planes in three-dimensional space.
Other trivariate functions are wireframe() and levelplot(), which display data in the form of a three-dimensional surface. We will not explicitly discuss these and the other high-level functions in lattice, but examples can be found in their help pages.
Exercise 6. Try changing viewing direction in the previous plot by modifying the screen argument.
The “trellis” object. One important feature of lattice that makes it different from traditional R graphics is that high-level functions do not actually plot anything. Instead, they return an object of class “trellis”, which needs to be print()-ed or plot()-ed. R’s automatic printing rule means that in most cases the user does not see any difference in behaviour. Here is one example where we use optional arguments of the plot() method for “trellis” objects to display two plots side by side.
dp.uspe <-
    dotplot(t(USPersonalExpenditure), groups = FALSE, layout = c(1, 5),
            xlab = "Expenditure (billion dollars)")
dp.uspe.log <-
    dotplot(t(USPersonalExpenditure), groups = FALSE, layout = c(1, 5),
            scales = list(x = list(log = 2)),
            xlab = "Expenditure (billion dollars)")
plot(dp.uspe, split = c(1, 1, 2, 1))
plot(dp.uspe.log, split = c(2, 1, 2, 1), newpage = FALSE)
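One place where the difference in behaviour does become visible is inside a loop or a function, where automatic printing does not apply. The following sketch assumes the built-in quakes dataset and writes to a temporary PDF file whose name is made up for the illustration.

```r
library(lattice)

# Hypothetical output file in the session's temporary directory.
out <- file.path(tempdir(), "quakes-histograms.pdf")
pdf(out)

for (v in c("mag", "depth")) {
    # Build the "trellis" object for one variable at a time.
    p <- histogram(as.formula(paste("~", v)), data = quakes)
    # Automatic printing only happens at the top level, so inside a loop
    # the object has to be printed explicitly; otherwise nothing is drawn.
    print(p)
}

invisible(dev.off())
```

Removing the print() call leaves the loop running without error, but no histogram pages are drawn, which is a common source of confusion when lattice code is moved from the prompt into a function or script.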