Mar 24, 2012

ggplot2 for SEER data in a jiffy

ggplot2 produces wonderful plots. But it represents a complete change in thought from how I usually draw plots. And this is a problem in a time crunch.

Objective: I need to plot a simple dataset from SEER that has combined incidence and mortality rates of breast cancer by race. Simple, yes? NO! The wretched thing is complicated!

Resources: Hadley Wickham's site, random resources on the net (Stack etc.) and The R Cookbook

Notes:
Plotting is done in Layers that consist of:
A data set
A set of mappings between variables and aesthetic properties
A geom, and parameters
A stat, and parameters
A position adjustment
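
To make the list above concrete, here is a minimal sketch of a single layer with every component spelled out. The data frame `toy` and its columns are made up for illustration; normally the defaults supply the stat and position for you, so this is more explicit than everyday code needs to be.

```r
library(ggplot2)

# Hypothetical toy data, only here to show the moving parts of a layer
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 5, 6))

p <- ggplot(data = toy,                     # the data set
            mapping = aes(x = x, y = y)) +  # mappings between variables and aesthetics
  geom_point(stat = "identity",             # the stat (here: use the values as they are)
             position = "identity",         # the position adjustment (here: none)
             size = 3)                      # a geom parameter
```

Printing `p` draws the plot; adding another `geom_*()` call adds another layer on top.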

Jargon: ggplot has Geoms, stats, scales, facets and coordinate systems.

My dataset from SEER: http://seer.cancer.gov/csr/1975_2008/browse_csr.php?section=4&page=sect_04_zfig.01.html

Preliminary data manipulation:
I combined the data for blacks and whites into one dataset called allrates, which has an extra variable called race. Combining consisted of cleaning the extra headers and footers out of each csv, reading the files in, adding a variable called race and setting its value to black or white. For renaming variables, I found the package gregmisc useful (its rename.vars function is also available from the gdata package). The syntax is:

data <- rename.vars(data, c("x","y","z"), c("first","second","third"))
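
The combining step described above can be sketched in base R roughly like this. The numbers are placeholders, NOT real SEER rates, and the names `whites`, `blacks`, `Rate` and `allrates_demo` are invented for the example; in practice each data frame would come from read.csv() on a cleaned-up file.

```r
# Placeholder values, NOT real SEER rates; names are illustrative only
whites <- data.frame(Year = c(1975, 1976), Rate = c(1.0, 2.0))
blacks <- data.frame(Year = c(1975, 1976), Rate = c(3.0, 4.0))

# Tag each data frame with its race, then stack them
whites$Race <- "White"
blacks$Race <- "Black"
allrates_demo <- rbind(whites, blacks)

# Base R way of renaming one column, as an alternative to rename.vars()
names(allrates_demo)[names(allrates_demo) == "Rate"] <- "Mortality"
```

rbind() only works here because both data frames have the same column names in the same order, which is why the header and footer cleanup has to happen first.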


The Variables and the relationship between the variables (aes)

names(allrates) <- c("Year", "DAIncidence", "DAIncidenceJP", "ObsIncidence", "ObsIncidenceJP", "Mortality", "MortalityJP", "Race")


ggplot2 code skeleton from Wickham's site:

ggplot(data, mapping) +
  layer(
    stat = "",
    geom = "",
    position = "",
    geom_params = list(),
    stat_params = list()
  )


With this, I somehow got to:


palette <- c("black", "red")
colscale <- scale_color_manual(values = palette)

# One point layer per rate series. Shapes are fixed settings, so they go
# outside aes(); span only applies to geom_smooth, not geom_point.
plot <- ggplot(allrates, aes(x = Year, color = Race, group = Race)) +
  geom_point(aes(y = DAIncidence), shape = 18) +
  geom_smooth(aes(y = DAIncidence), span = 0.5) +
  geom_point(aes(y = ObsIncidence), shape = 4) +
  geom_point(aes(y = Mortality), shape = 1) +
  geom_smooth(aes(y = Mortality), span = 0.2) +
  colscale + xlab('') + ylab('') +
  opts(legend.position = c(0.62, 0.47)) +
  opts(legend.background = theme_rect(col = 0)) +
  opts(legend.text = theme_text(size = 10))
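
A note for anyone on a newer ggplot2: opts(), theme_rect() and theme_text() were later replaced by theme(), element_rect() and element_text() (from version 0.9.2 onwards, as far as I can tell). Below is a rough sketch of the same idea in that style. The data frame `demo` is a made-up stand-in with placeholder numbers, not SEER data, and I have left the legend position out because the way to place a legend inside the panel differs between versions.

```r
library(ggplot2)

# Made-up stand-in for allrates (placeholder numbers, NOT SEER data)
demo <- data.frame(
  Year        = rep(1975:1984, times = 2),
  DAIncidence = c(seq(100, 118, by = 2), seq(90, 108, by = 2)),
  Mortality   = c(seq(30, 34.5, by = 0.5), seq(28, 32.5, by = 0.5)),
  Race        = rep(c("White", "Black"), each = 10)
)

p <- ggplot(demo, aes(x = Year, color = Race, group = Race)) +
  geom_point(aes(y = DAIncidence), shape = 18) +   # incidence as filled diamonds
  geom_point(aes(y = Mortality), shape = 1) +      # mortality as open circles
  scale_color_manual(values = c("black", "red")) +
  xlab("") + ylab("") +
  theme(legend.background = element_rect(colour = NA),  # no border around the legend
        legend.text = element_text(size = 10))
```

The geom_smooth layers can be added back the same way as before; I dropped them here only to keep the sketch short.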


Q. How do you change the font in the graph?
A. Change the family= specification in geom_text (for text drawn inside the plot) or in theme_text (for titles, axis labels and legend text).
Q. What fontfamilies are available? Can I just find out what the default font of ggplot2 is?
A. Found the answer to this at The R Cookbook: Font families. The default sans font in ggplot2 is most likely Helvetica (my best guess, from comparing the font on the legend to all the fonts on that page).
Q. How do you add subtitles?
A. Apparently, this is not possible.
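
For what it's worth, later ggplot2 releases changed the picture on both questions: as far as I know, labs() accepts a subtitle from version 2.2.0 onwards, and fonts are set through the family argument. A minimal sketch, assuming a reasonably recent ggplot2; the data frame `toy` and the title strings are made up.

```r
library(ggplot2)

# Hypothetical toy data with a label column
toy <- data.frame(x = 1:3, y = c(2, 1, 3), lab = c("a", "b", "c"))

p <- ggplot(toy, aes(x, y)) +
  geom_text(aes(label = lab), family = "serif") +    # font of text drawn in the panel
  labs(title = "Main title",
       subtitle = "A subtitle (needs a newer ggplot2)") +
  theme(text = element_text(family = "serif"))        # font of titles, axes, legend
```

"sans", "serif" and "mono" should work on any graphics device; other family names depend on what the device knows about.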

Q. So where are we in our SEER graph plotting exercise?
A. We are stuck. Period. Luckily, the Perl philosophy now comes in: whatever works.

I ended up importing the graph into GIMP, scaling the image to the dimensions of my thesis, adding all the text by hand and then including it via the \includegraphics route.


\begin{figure}[H]
\begin{center}
\includegraphics{chapter1/seerrates.png}
\end{center}
\caption{SEER Incidence and Mortality Rates for breast cancer in women, 1975-2008}
\label{fig:seer}
\end{figure}


Goal accomplished, but I really wish I hadn't had to waste all this time and go through so much agony and two different programs to finally arrive at what I wanted.
---------------
Lessons from this exercise
  • ggplot2 produces pretty graphs but is super-tricky to figure out, and the frustration is terrible.
  • The ggplot2 documentation "looks" simple and logical but is actually not that simple at all.
  • It takes a HUGE amount of googling and internet trawling to achieve the desired results.
  • At some point I had to draw the line and polish the rough edges (e.g. subtitles, extra legends) with GIMP.
---------------


Some additional Q/A I ended up learning:
Q. How do you change the size of the font on the x-axis and y-axis?
A. Use this piece of code:

opts_plot <- opts(
  strip.text.x = theme_text(size = 7),
  axis.text.x = theme_text(size = 7),
  axis.text.y = theme_text(size = 8),
  strip.background = theme_blank(),
  plot.title = theme_text(size = 8)
)

Inspired by this post at Stack Overflow. As a bonus, the above code will change the dark grey background of the strip headings (on a facet graph) to a decent blank and change the size of the plot title.
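
In current ggplot2 the same settings would, as far as I know, go through theme() with element_text() and element_blank() in place of theme_text() and theme_blank(). A small sketch, using the built-in mtcars data purely as a stand-in; the name `theme_plot` is mine.

```r
library(ggplot2)

# Same five settings as above, in the newer theme() syntax
theme_plot <- theme(
  strip.text.x = element_text(size = 7),   # facet strip labels
  axis.text.x = element_text(size = 7),    # x-axis tick labels
  axis.text.y = element_text(size = 8),    # y-axis tick labels
  strip.background = element_blank(),      # drop the grey strip background
  plot.title = element_text(size = 8)      # plot title
)

# Usage on a faceted plot (mtcars is only a convenient built-in example)
p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  facet_wrap(~ cyl) +
  ggtitle("Example") +
  theme_plot
```

Keeping the settings in one object means they can be added to every plot in a document for a consistent look.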

--------------------------------------------------------------------

Some other random references I found through this exercise:

More from Hadley Wickham
http://had.co.nz/
Entertaining Statistics Courses:
http://stat310.had.co.nz/
http://had.co.nz/stat480/
http://had.co.nz/hon322f/
http://had.co.nz/stat645/


Alternative to Sweave
https://github.com/hadley/decumar

Stats in Practice
at310.had.co.nz/stats-in-practice/

Book
2nd Chapter of his book:
http://had.co.nz/ggplot2/book/qplot.pdf
