Introduction to R
Data Visualization with ggplot
Summary
In this illustration, you will learn how to produce some basic graphs (hopefully some useful ones!) using the package
ggplot2. You will be using an R dataset that you import directly into RStudio.
Introduction: Framingham Heart Study (Didactic Dataset)
1 Introduction to ggplot2
   a. Syntax of ggplot
   b. Illustration: Build Your Plot Layer by Layer
2 Preliminaries
3 Single Variable Graphs
   a. Discrete Variable: Bar Chart
   b. Continuous Variable: Histogram
   c. Continuous Variable: Box Plot
4 Multiple Variable Graphs
   a. Continuous, by Group (Discrete): Side-by-side Box Plot
   b. Continuous, by Group (Discrete): Side-by-side Histogram
   c. Continuous: X-Y Plot (Scatterplot)
   d. Continuous: X-Y Plot, with Overlay Linear Regression Model Fit
   e. Continuous: X-Y Plot, by Group (Discrete)
Before You Begin: Be sure to have downloaded framingham.Rdata from the course website.
Before You Begin: Be sure to have installed (one time) the following packages.
From the console pane only, the command is install.packages("nameofpackage").
1. Hmisc
2. stargazer
3. summarytools
4. ggplot2
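One way to see which of the four packages still need installing is sketched below, using base R only. The names `needed` and `missing_pkgs` are illustrative, and the install line is left commented out so that nothing is installed by accident.

```r
# Packages named in the list above
needed <- c("Hmisc", "stargazer", "summarytools", "ggplot2")

# Which of them are not yet in the local library?
missing_pkgs <- setdiff(needed, rownames(installed.packages()))
print(missing_pkgs)

# Install only what is missing (run once, from the console):
# install.packages(missing_pkgs)
```

Installation is a one-time step; library() is what loads a package in each new session.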
Introduction
Framingham Heart Study (Didactic Dataset)
The dataset you are using in this illustration (framingham.Rdata) is a subset of the data from the Framingham Heart
Study, Levy (1999) National Heart Lung and Blood Institute, Center for Bio-Medical Communication.
The objective of the Framingham Heart Study was to identify the common factors or characteristics that contribute to
cardiovascular disease (CVD) by following its development over a long period of time in a large group of participants
who had not yet developed overt symptoms of CVD or suffered a heart attack or stroke. The researchers recruited 5,
men and women between the ages of 30 and 62 from the town of Framingham, Massachusetts, and began the first round
of extensive physical examinations and lifestyle interviews that they would later analyze for common patterns related to
CVD development. Since 1948, the subjects have continued to return to the study every two years for a detailed medical
history, physical examination, and laboratory tests, and in 1971, the study enrolled a second generation - 5,124 of the
original participants' adult children and their spouses - to participate in similar examinations. In April 2002 the Study
entered a new phase: the enrollment of a third generation of participants, the grandchildren of the original cohort. This
step is of vital importance to increase our understanding of heart disease and stroke and how these conditions affect
families. Over the years, careful monitoring of the Framingham Study population has led to the identification of the major
CVD risk factors - high blood pressure, high blood cholesterol, smoking, obesity, diabetes, and physical inactivity - as
well as a great deal of valuable information on the effects of related factors such as blood triglyceride and HDL
cholesterol levels, age, gender, and psychosocial issues. With the help of another generation of participants, the Study
may close in on the root causes of cardiovascular disease and help in the development of new and better ways to prevent,
diagnose and treat cardiovascular disease.
This dataset is a HIPAA de-identified subset of the 40-year data. It consists of measurements of 9 variables on 4699
patients who were free of coronary heart disease at their baseline exam.
Coding Manual
Position  Variable  Variable Label                            Codes
1         id        Patient identifier
2         sex       Patient gender                            1 = male, 2 = female
3         sbp       Systolic blood pressure, mm Hg
4         dbp       Diastolic blood pressure, mm Hg
5         scl       Serum cholesterol, mg/100 ml
6         age       Age at baseline exam, years
7         bmi       Body mass index, kg/m^2
8         month     Month of year of baseline exam
9         followup  Subject’s follow-up, days since baseline
10        chdfate   Event of CHD at end of follow-up          1 = patient developed CHD at follow-up, 0 = otherwise
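The codes in the manual can be attached to the data as value labels with factor(). The sketch below uses a tiny made-up data frame (`demo_df` is a stand-in, not the real data) and the label text for chdfate used later in this illustration.

```r
# Made-up stand-in rows; the real framinghamdf has 4699 rows
demo_df <- data.frame(sex = c(1, 2, 2, 1), chdfate = c(1, 0, 0, 1))

# Attach the coding-manual labels to the numeric codes
demo_df$sex <- factor(demo_df$sex, levels = c(1, 2),
                      labels = c("male", "female"))
demo_df$chdfate <- factor(demo_df$chdfate, levels = c(0, 1),
                          labels = c("0=Other", "1=Event of CHD"))

# Tabulate the labelled variable
table(demo_df$sex)
```

Once a variable is a factor, ggplot2 treats it as discrete, which is what grouping by color or by panel requires.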
1b. Illustration: Build Your Plot Layer by Layer
Hack: The continuation character "+" must go at the end of the line, NOT at the start of the next line!
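The reason for this rule can be seen with plain arithmetic, no ggplot2 required. In this small sketch (variable names `a` and `b` are arbitrary), R treats a line as finished whenever it forms a complete expression, so a leading "+" starts a new, separate expression.

```r
# "+" at the END of the line: R knows the expression continues
a <- 1 +
  2

# "+" at the START of the next line: the first line is already complete,
# so b is 1 and "+ 2" is evaluated on its own and then discarded
b <- 1
+ 2

print(c(a = a, b = b))
```

With a plot the consequence is similar: the first line draws the plot without the later layers, and the line that begins with "+" is evaluated by itself, which ggplot2 reports as an error.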
# Base: DEFAULT DATASET AND AESTHETIC MAPPINGS
# Tell R which data to use. Tell R how to map variables to the graph.
ggplot(data=framinghamdf, aes(x=bmi)) +
# LAYER 1. + GEOM
# Tell R which kind of plot to produce.
# Note: newer versions of ggplot2 prefer aes(y=after_stat(density)) over aes(y=..density..)
  geom_histogram(binwidth=1, colour="blue", aes(y=..density..)) +
# LAYER 2. + STAT  Note: This is actually an argument of the GEOM
# Here we are telling R to overlay a statistical calculation, in particular an overlay normal curve.
# IMPORTANT: Be sure to include na.rm=TRUE since calculations will not happen if there are missing values.
  stat_function(fun=dnorm, color="red", args=list(mean=mean(framinghamdf$bmi, na.rm=TRUE), sd=sd(framinghamdf$bmi, na.rm=TRUE))) +
# LAYER 3. + ADD TITLE, LABELS, AXIS LIMITS, etc
  ggtitle("Framingham Heart Study Didactic (n=4699): \nHistogram of Body Mass Index (kg/m2)") + xlab("Body Mass Index (kg/m2)") + ylab("Density")
# EXTRA. Carol decides to go back in and edit Layer 3 so as to get the superscript for meters squared.
# Layer 3 then becomes:
#   ggtitle("Framingham Heart Study Didactic (n=4699): \nHistogram of Body Mass Index") + xlab(expression("Body Mass Index, kg/m"^{2})) + ylab("Density")
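The same layer-by-layer idea can also be sketched by storing the plot in an object and adding one layer at a time. The data here are simulated stand-ins (`demo_df` is not the Framingham data), the object names are illustrative, and `after_stat()` assumes a reasonably recent ggplot2 (3.3.0 or later).

```r
library(ggplot2)

# Simulated stand-in for framinghamdf$bmi (NOT the real data), with two NAs
set.seed(1)
demo_df <- data.frame(bmi = c(rnorm(200, mean = 25.6, sd = 4.1), NA, NA))

# Base: data and aesthetic mapping only; no layer has been added yet
p_base <- ggplot(data = demo_df, aes(x = bmi))

# Layer 1: histogram on the density scale
p_hist <- p_base +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1,
                 colour = "blue", na.rm = TRUE)

# Layer 2: normal curve using the sample mean and SD (missing values removed)
p_full <- p_hist +
  stat_function(fun = dnorm, colour = "red",
                args = list(mean = mean(demo_df$bmi, na.rm = TRUE),
                            sd = sd(demo_df$bmi, na.rm = TRUE)))

# Printing the object draws the plot
p_full
```

Keeping the intermediate objects makes it easy to print each stage and see what a given layer contributed.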
Preliminaries
setwd("/Users/cbigelow/Desktop")
library(Hmisc)
library(stargazer)
library(summarytools)
library(ggplot2)
Input data. Check. Label variables. Label variable values.
load(file="framingham.Rdata")
str(framinghamdf)
'data.frame': 4699 obs. of 10 variables:
$ id : int 2642 4627 2568 4192 3977 659 2290 4267 2035 3587 ...
$ sex : int 1 1 1 1 1 2 1 1 1 1 ...
$ sbp : int 120 130 144 92 162 212 140 174 142 115 ...
$ dbp : int 80 78 90 66 98 118 85 102 94 70 ...
$ scl : int 267 192 207 231 271 182 276 259 242 242 ...
$ age : int 55 53 61 48 39 61 44 39 47 60 ...
$ bmi : num 25 28.4 25.1 26.2 28. ...
$ month : int 8 12 8 11 11 2 6 11 5 10 ...
$ followup: int 18 35 109 147 169 199 201 209 265 278 ...
$ chdfate : int 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "datalabel")= chr ""
- attr(*, "time.stamp")= chr "17 Apr 2014 14:25"
- attr(*, "formats")= chr "%8.0g" "%8.0g" "%8.0g" "%8.0g" ...
- attr(*, "types")= int 252 251 252 252 252 251 254 251 252 251
- attr(*, "val.labels")= chr "" "" "" "" ...
- attr(*, "var.labels")= chr "" "" "" "" ...
- attr(*, "version")= int 12
label(framinghamdf$bmi) <- "bmi: Body Mass Index (kg/m2)"
label(framinghamdf$age) <- "age: Age (years)"
# Convert chdfate to a factor first, then label it: factor() builds a new vector and would drop a label attached earlier.
framinghamdf$chdfate <- factor(framinghamdf$chdfate, levels=c(0, 1), labels=c("0=Other", "1=Event of CHD"))
label(framinghamdf$chdfate) <- "chdfate: Event of CHD (0/1)"
Descriptives on the variables used in this illustration
freq(as.factor(framinghamdf$chdfate))
Frequencies
                       Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
-------------------- ------ --------- -------------- --------- --------------
             0=Other   3226     68.65          68.65     68.65          68.65
      1=Event of CHD   1473     31.35         100.00     31.35         100.00
                <NA>      0                               0.00         100.00
               Total   4699    100.00         100.00    100.00         100.00
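The percentages in the frequency table can be reproduced from the two counts with a few lines of base R. This is only a check of the arithmetic, not what freq() runs internally; the vector names are illustrative.

```r
# Counts taken from the frequency table above
counts <- c("0=Other" = 3226, "1=Event of CHD" = 1473)

# Percent of total, rounded to two decimals
pct <- round(100 * counts / sum(counts), 2)
print(pct)

# Cumulative percent
print(cumsum(pct))
```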
stargazer(framinghamdf[c("bmi","age")], type="text", summary.stat=c("n","mean","sd","min","max"))
=============================================
Statistic N Mean St. Dev. Min Max
---------------------------------------------
bmi 4,690 25.632 4.095 16.200 57.
age 4,699 46.041 8.504 30 68
---------------------------------------------
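The summary reports N = 4,690 for bmi against 4,699 for age, which implies 9 missing bmi values. A sketch of how missing values could be counted per variable, and why na.rm=TRUE matters, on a small made-up data frame (`demo_df` is a stand-in):

```r
# Made-up stand-in data with two missing bmi values
demo_df <- data.frame(bmi = c(25.0, NA, 28.4, NA, 26.2),
                      age = c(55, 53, 61, 48, 39))

# Number of missing values in each column
n_missing <- colSums(is.na(demo_df))
print(n_missing)

# mean() returns NA unless missing values are removed
mean(demo_df$bmi)
mean(demo_df$bmi, na.rm = TRUE)
```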
ggsave(filename="histogram.tiff", plot=p2, width=7, height=5, units="in")
3c. Continuous: Box Plot
# SINGLE CONTINUOUS VARIABLE: BOX PLOT - Vertical
# ggplot(data=DATAFRAME, aes(x="", y=CONTINUOUSVARIABLE)) + geom_boxplot()
p3 <- ggplot(data=framinghamdf, aes(x="", y=age)) +
  geom_boxplot(color="black", fill="blue") +
  xlab("") +
  ylab("Age, years") +
  ggtitle("Framingham Heart Study Didactic (n=4699): \nBox Plot of Age (years)") +
  theme_bw()
p3
# SINGLE CONTINUOUS VARIABLE: BOX PLOT - Horizontal
# ggplot(data=DATAFRAME, aes(x="", y=CONTINUOUSVARIABLE)) + geom_boxplot() + coord_flip()
p4 <- ggplot(data=framinghamdf, aes(x="", y=age)) +
  geom_boxplot(color="black", fill="blue") +
  coord_flip() +
  xlab("") +
  ylab("Age, years") +
  ggtitle("Framingham Heart Study Didactic (n=4699): \nBox Plot of Age (years)") +
  theme_bw()
p4
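To read a box plot it helps to know the numbers behind it. As documented for geom_boxplot(), the box spans the 25th to 75th percentiles and the whiskers reach the most extreme observations within 1.5 times the interquartile range of the box. The sketch below computes those quantities by hand on made-up values; it is not ggplot2's internal code, and the variable names are illustrative.

```r
# Made-up values, including one extreme observation
x <- c(1:9, 30)

# Lower hinge, median, upper hinge
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
iqr <- unname(q[3] - q[1])

# Whiskers reach the most extreme points within 1.5 * IQR of the hinges
upper_whisker <- max(x[x <= q[3] + 1.5 * iqr])
lower_whisker <- min(x[x >= q[1] - 1.5 * iqr])

# Anything beyond the whiskers is drawn as an individual point
outliers <- x[x > upper_whisker | x < lower_whisker]

print(list(hinges = q,
           whiskers = c(lower_whisker, upper_whisker),
           outliers = outliers))
```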
ggsave(filename="boxplot_vertical.tiff", plot=p5, width=7, height=5, units="in")
4b. Continuous, by Group (Discrete): Side-by-Side Histogram
# CONTINUOUS VARIABLE, BY GROUP: SIDE-BY-SIDE HISTOGRAM - Separate Panels
# ggplot(data=DATAFRAME, aes(x=CONTINUOUSVARIABLE)) + geom_histogram() + facet_grid(GROUPVARIABLE ~ .) + options
p7 <- ggplot(data=framinghamdf, aes(x=bmi)) +
  geom_histogram(binwidth=5, color="blue", fill="grey") +
  facet_grid(chdfate ~ .) +
  scale_color_grey() + scale_fill_grey() +
  ggtitle("Framingham Heart Study Didactic (n=4699): \nBody Mass Index (kg/m2)") +
  xlab("Body Mass Index (kg/m2)") +
  ylab("Count") +   # geom_histogram() plots counts by default, not densities
  theme_bw() +      # apply the complete theme first ...
  theme(axis.title=element_text(size=9), plot.title=element_text(size=10))   # ... then the tweaks, so theme_bw() does not overwrite them
p7
# CONTINUOUS VARIABLE, BY GROUP: SIDE-BY-SIDE HISTOGRAM - Overlay, slight transparency
# ggplot(data=DATAFRAME, aes(x=CONTINUOUSVARIABLE, fill=GROUPVARIABLE, color=GROUPVARIABLE)) +
#   geom_histogram(binwidth=#, position="identity", alpha=0.5) + options
p8 <- ggplot(data=framinghamdf, aes(x=bmi, fill=chdfate, color=chdfate)) +
  geom_histogram(binwidth=1, position="identity", alpha=0.5) +
  ggtitle("Framingham Heart Study Didactic (n=4699): \nBody Mass Index (kg/m2)") +
  labs(y="Count", x="Body Mass Index (kg/m2)", caption="Your nifty caption here") +   # counts are plotted by default
  scale_color_grey() + scale_fill_grey() +
  theme(axis.title=element_text(size=9), plot.title=element_text(size=10))
p8
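Whether a histogram's vertical axis shows counts or densities matters when the groups differ in size, as the two chdfate groups do. The distinction is sketched below with base R's hist() on simulated values (a stand-in for bmi, not the real data): counts add up to the number of observations, while densities times bin widths add up to 1.

```r
set.seed(2)
x <- rnorm(500, mean = 25.6, sd = 4.1)  # simulated stand-in for bmi

# Bin the values (bin width 1) without drawing anything
h <- hist(x, breaks = seq(floor(min(x)), ceiling(max(x)), by = 1),
          plot = FALSE)

# Counts add up to the number of observations
sum(h$counts)

# Densities times bin widths add up to 1
sum(h$density * diff(h$breaks))
```

In ggplot2, a density scale can be requested by mapping y to the computed density inside geom_histogram(), as done in the layer-by-layer illustration in section 1b.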
ASIDE: For the next plots I want to work with a random sample of size n=100 from my dataframe.
# KEY:
# NEWDATAFRAME <- OLDDATAFRAME[sample(nrow(OLDDATAFRAME), 100), ]
smalldf <- framinghamdf[sample(nrow(framinghamdf), 100), ]
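Because sample() draws different rows on every run, the plots that follow will change from run to run. If a reproducible sample is wanted, one option is to call set.seed() immediately before sampling. The sketch below uses a made-up data frame (`demo_df`) and an arbitrary seed value; neither comes from the original illustration.

```r
# Made-up stand-in rows
demo_df <- data.frame(id = 1:1000, bmi = rnorm(1000, 25.6, 4.1))

set.seed(2024)  # any fixed integer works; 2024 is arbitrary
small_a <- demo_df[sample(nrow(demo_df), 100), ]

set.seed(2024)  # same seed, same rows
small_b <- demo_df[sample(nrow(demo_df), 100), ]

nrow(small_a)
identical(small_a$id, small_b$id)
```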
4e. Continuous: X-Y Plot, by Group (Discrete)
# TWO CONTINUOUS VARIABLES, GROUP: XY SCATTERPLOT - BY GROUP
# ggplot(data=DATAFRAME, aes(x=XVARIABLE, y=YVARIABLE, color=GROUPVARIABLE)) + geom_point() + options
p11 <- ggplot(data=smalldf, aes(x=age, y=bmi, color=chdfate)) +
  geom_point() +
  xlab("Age, years") +
  ylab("Body Mass Index (kg/m2)") +
  ggtitle("Framingham Heart Study Didactic \n(n=4699)") +
  theme_bw()
p11