Friday, September 30, 2011

R tutorial on visualizations/graphics

Recology has moved, go to http://recology.info/2011/09/r-tutorial-on-visualizationsgraphics

Rolf Lohaus, a Huxley postdoctoral fellow here in the EEB dept at Rice University, gave our R course a talk on basic visualizations in R this morning.

Enjoy!


#######################################################
#### Created by Rolf Lohaus, EEB Dept., Rice University
####
#### Part of the code modified and extended from
#### Robert I. Kabacoff's Quick-R website
#### (http://www.statmethods.net)
#######################################################
############### HIGH-LEVEL PLOTTING FUNCTIONS ###############
# load some data (dataset of types of cars and their specs used earlier in our meetings)
data(mtcars)
# create a basic scatterplot (e.g. weight vs. miles per gallon)
plot(mtcars$wt, mtcars$mpg)
# specify better axis labels and add a title (note that plot() always creates a new plot, erasing the current one)
plot(mtcars$wt, mtcars$mpg, xlab="Weight", ylab="Miles per gallon", main="Weight vs. Miles per Gallon")
##### SAVING GRAPHS #####
# save the graph either by using 'Export' in the 'Plots' tab in RStudio or using the following functions
pdf("weight_mpg.pdf") # create PDF in current working directory
plot(mtcars$wt, mtcars$mpg, xlab="Weight", ylab="Miles per gallon", main="Weight vs. Miles per Gallon")
dev.off() # "save" the PDF
# pdf("C:/Graphs/weight_mpg.pdf") # create PDF in some other directory
# plot(mtcars$wt, mtcars$mpg, xlab="Weight", ylab="Miles per gallon", main="Weight vs. Miles per Gallon")
# dev.off() # "save" the PDF
# change the size of the PDF (in inches)
pdf("weight_mpg.pdf", width=6, height=5) # create PDF in current working directory
plot(mtcars$wt, mtcars$mpg, xlab="Weight", ylab="Miles per gallon", main="Weight vs. Miles per Gallon")
dev.off() # "save" the PDF
# change the font family of the text in the PDF (default is sans-serif "Helvetica")
pdf("weight_mpg.pdf", width=6, height=5, family="Times") # create PDF in current working directory
plot(mtcars$wt, mtcars$mpg, xlab="Weight", ylab="Miles per gallon", main="Weight vs. Miles per Gallon")
dev.off() # "save" the PDF
png("weight_mpg.png") # create PNG in current working directory
plot(mtcars$wt, mtcars$mpg, xlab="Weight", ylab="Miles per gallon", main="Weight vs. Miles per Gallon")
dev.off() # "save" the PNG
# jpeg("weight_mpg.jpg") # create JPEG in current working directory
# plot(mtcars$wt, mtcars$mpg, xlab="Weight", ylab="Miles per gallon", main="Weight vs. Miles per Gallon")
# dev.off() # "save" the JPEG
##### HISTOGRAMS #####
# create a basic histogram
hist(mtcars$mpg)
hist(mtcars$mpg, xlab="Miles per gallon", main="Histogram of Miles per Gallon")
# control number of bins with parameter 'breaks'
hist(mtcars$mpg, breaks=12)
hist(mtcars$mpg, breaks=seq(10,34))
# get information about histogram
hInfo <- hist(mtcars$mpg, breaks=seq(10,34))
hInfo
# plot probability density instead of frequencies
hist(mtcars$mpg, breaks=seq(10,34), freq=F)
# create a kernel density plot
plot(density(mtcars$mpg))
##### BAR PLOTS #####
# create a basic barplot
gearCounts <- table(mtcars$gear)
gearCounts
barplot(gearCounts)
barplot(gearCounts, xlab="Number of gears")
# label the bars individually with parameter 'names.arg'
barplot(gearCounts, names.arg=c("3 Gears", "4 Gears", "5 Gears"))
# horizontal bar plot
barplot(gearCounts, names.arg=c("3 Gears", "4 Gears", "5 Gears"), horiz=TRUE)
# stacked bar plot
gearTransmissionCounts <- table(mtcars$am, mtcars$gear)
gearTransmissionCounts
barplot(gearTransmissionCounts, names.arg=c("3 Gears", "4 Gears", "5 Gears"))
# add a legend with automatic labels with parameter 'legend.text'
barplot(gearTransmissionCounts, names.arg=c("3 Gears", "4 Gears", "5 Gears"), legend.text=T)
# add a legend with custom labels
barplot(gearTransmissionCounts, names.arg=c("3 Gears", "4 Gears", "5 Gears"), legend.text=c("automatic", "manual"))
# grouped bar plot with parameter 'beside=TRUE'
barplot(gearTransmissionCounts, names.arg=c("3 Gears", "4 Gears", "5 Gears"), legend.text=c("automatic", "manual"), beside=TRUE)
##### BOX PLOTS #####
# create a boxplot for an individual variable (e.g. miles per gallon)
boxplot(mtcars$mpg, ylab="Miles per gallon")
# create a boxplot for a variable by group (e.g. miles per gallon by cylinders)
boxplot(mpg ~ cyl, data=mtcars, xlab="Number of cylinders", ylab="Miles per gallon")
# do not draw outliers
boxplot(mpg ~ cyl, data=mtcars, xlab="Number of cylinders", ylab="Miles per gallon", outline=FALSE)
##### LINE PLOTS #####
# load some time data (dataset of growth of five orange trees over time)
data(Orange)
Orange
# extract data for 1st tree
tree1 <- subset(Orange, Orange$Tree==1)
# plot age vs circumference
plot(tree1$age, tree1$circumference, xlab="Age (days)", ylab="Circumference (mm)")
# use lines with parameter 'type="l"'
plot(tree1$age, tree1$circumference, xlab="Age (days)", ylab="Circumference (mm)", type="l")
# use both points and lines with parameter 'type="b"'
plot(tree1$age, tree1$circumference, xlab="Age (days)", ylab="Circumference (mm)", type="b")
# use both points and lines overlayed with parameter 'type="o"'
plot(tree1$age, tree1$circumference, xlab="Age (days)", ylab="Circumference (mm)", type="o")
##### PLOTS OF THREE VARIABLES / 3D PLOTS #####
# image(x, y, z, ...)
# contour(x, y, z, ...)
# persp(x, y, z, ...)
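# a quick illustration of these three functions using the built-in 'volcano'
# dataset, a matrix of elevation values for Maunga Whau (Mt Eden) in Auckland
z <- volcano # 87 x 61 matrix of elevations
x <- 1:nrow(z)
y <- 1:ncol(z)
image(x, y, z, xlab="Row", ylab="Column") # colors encode elevation
contour(x, y, z, add=TRUE) # overlay contour lines on the image
persp(x, y, z, theta=30, phi=30, xlab="Row", ylab="Column", zlab="Elevation") # 3D perspective surface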
############### LOW-LEVEL PLOTTING FUNCTIONS & GRAPHICAL PARAMETERS ###############
##### plot lines and points separately #####
# set up a plot without any points and lines
plot(tree1$age, tree1$circumference, xlab="Age (days)", ylab="Circumference (mm)", type="n")
# draw lines
lines(tree1$age, tree1$circumference)
# draw points
points(tree1$age, tree1$circumference)
# draw points with black border and white fill color
plot(tree1$age, tree1$circumference, xlab="Age (days)", ylab="Circumference (mm)", type="n")
lines(tree1$age, tree1$circumference)
points(tree1$age, tree1$circumference, pch=21, bg="white") # specify point type and background fill color
##### plot multiple data series in one graph (data for all trees) and add a legend #####
# get the range for x and y axes
xRange <- range(Orange$age)
xRange
yRange <- range(Orange$circumference)
yRange
# set up plot using data
plot(xRange, yRange, xlab="Age (days)", ylab="Circumference (mm)", type="n")
# set up plot using x and y axis limits
plot(0, xlab="Age (days)", ylab="Circumference (mm)", type="n", xlim=xRange, ylim=yRange)
# draw points and lines for all trees
for (t in 1:5) {
tree <- subset(Orange, Orange$Tree==t)
lines(tree$age, tree$circumference)
points(tree$age, tree$circumference, pch=21, bg="white")
}
# plot data for all trees using different line types
plot(0, xlab="Age (days)", ylab="Circumference (mm)", type="n", xlim=xRange, ylim=yRange)
# create a vector with 5 elements specifying the line type to be used for each tree
lineTypes <- c(2, 1, 4, 5, 3)
for (t in 1:5) {
tree <- subset(Orange, Orange$Tree==t)
lines(tree$age, tree$circumference, lty=lineTypes[t]) # specify line type
points(tree$age, tree$circumference, pch=21, bg="white")
}
# add a legend
legend("topleft", legend=1:5, lty=lineTypes, title="Tree")
# plot data for all trees using different line types, point types and colors
plot(0, xlab="Age (days)", ylab="Circumference (mm)", type="n", xlim=xRange, ylim=yRange)
# create a vector with 5 elements specifying the point type to be used for each tree
pointTypes <- seq(21,25)
# create a vector with 5 elements specifying the color to be used for each tree
colors <- rainbow(5)
for (t in 1:5) {
tree <- subset(Orange, Orange$Tree==t)
lines(tree$age, tree$circumference, lty=lineTypes[t], col=colors[t])
points(tree$age, tree$circumference, pch=pointTypes[t], bg="white", col=colors[t])
}
# add the corresponding legend but this time omit the legend box and title,
# and place it in the bottom right corner
legend("bottomright", legend=c("Tree 1", "Tree 2", "Tree 3", "Tree 4", "Tree 5"), lty=lineTypes, pch=pointTypes, col=colors, pt.bg="white", bty="n")
##### COLORS #####
# list all color names
colors()
# create color(s) by specifying red, green, blue intensities
yellowColor = rgb(red=1, green=1, blue=0)
yellowColor = rgb(red=255, green=255, blue=0, maxColorValue=255) # same color, specified on a 0-255 scale
transparentGrayColor = rgb(0.3, 0.3, 0.3, alpha=0.5)
##### calculate and plot mean circumference of trees with error bars ######
# calculate means at each age
meanCircumferences <- tapply(Orange$circumference, Orange$age, mean)
meanCircumferences
# calculate standard deviation at each age
sdCircumferences <- tapply(Orange$circumference, Orange$age, sd)
sdCircumferences
# get vector of ages
ages <- as.numeric(names(sdCircumferences))
ages
# set up plot
plot(0, xlab="Age (days)", ylab="Mean circumference (mm)", type="n", xlim=xRange, ylim=yRange)
# draw lines for means
lines(ages, meanCircumferences)
# draw error bars as simple lines
segments(ages, meanCircumferences-sdCircumferences, ages, meanCircumferences+sdCircumferences)
# draw points for means
points(ages, meanCircumferences, pch=21, bg="white")
# draw error bars with cross bars instead
plot(0, xlab="Age (days)", ylab="Mean circumference (mm)", type="n", xlim=xRange, ylim=yRange)
lines(ages, meanCircumferences)
# draw error bars using 'arrows' function
arrows(ages, meanCircumferences-sdCircumferences, ages, meanCircumferences+sdCircumferences, angle=90, code=3, length=0.05)
points(ages, meanCircumferences, pch=21, bg="white")
##### AXES #####
# set up plot without a box around it with parameter 'bty="n"'
plot(0, xlab="Age (days)", ylab="Mean circumference (mm)", type="n", xlim=xRange, ylim=yRange, bty="n")
# use math and/or symbols in axis labels
# (for more details see 'help(plotmath)', 'example(plotmath)' and 'demo(plotmath)')
yLabel = expression(paste("Mean circumference ", bar(italic(C)), " (mm)"))
plot(0, xlab="Age (days)", ylab=yLabel, type="n", xlim=xRange, ylim=yRange, bty="n")
# set up plot without axes with parameter 'axes=FALSE'
# alternatively:
# - 'xaxt="n"' for no x axis
# - 'yaxt="n"' for no y axis
plot(0, xlab="Age (days)", ylab=yLabel, type="n", xlim=c(100, 1600), ylim=c(0, 250), axes=FALSE, bty="n")
# add custom x axis ('side=1' for bottom side of the graph)
axis(side=1, at=c(100, 600, 1100, 1600), labels=c("100", "600", "1,100", "1,600"))
# add custom y axis ('side=2' for left side of the graph)
axis(side=2, at=c(0, 50, 100, 150, 200, 250))
lines(ages, meanCircumferences)
segments(ages, meanCircumferences-sdCircumferences, ages, meanCircumferences+sdCircumferences)
points(ages, meanCircumferences, pch=21, bg="white")
##### MULTI-PANEL PLOTS #####
# split the graph area into 2 "screens" (1 row and 2 columns)
split.screen(c(1,2))
# select the 1st screen for plotting
screen(1)
plot(0, xlab="Age (days)", ylab=expression(paste("Circumference ", italic(C), " (mm)")), type="n", xlim=c(100, 1600), ylim=c(0, 250), axes=FALSE, bty="n")
axis(side=1, at=c(100, 600, 1100, 1600), labels=c("100", "600", "1,100", "1,600"))
axis(side=2, at=c(0, 50, 100, 150, 200, 250))
for (t in 1:5) {
tree <- subset(Orange, Orange$Tree==t)
lines(tree$age, tree$circumference, lty=lineTypes[t])
points(tree$age, tree$circumference, pch=pointTypes[t], bg="white")
}
legend("bottomright", legend=c("Tree 1", "Tree 2", "Tree 3", "Tree 4", "Tree 5"), lty=lineTypes, pch=pointTypes, pt.bg="white", bty="n")
# select the 2nd screen for plotting
screen(2)
plot(0, xlab="Age (days)", ylab=yLabel, type="n", xlim=c(100, 1600), ylim=c(0, 250), axes=FALSE, bty="n")
axis(side=1, at=c(100, 600, 1100, 1600), labels=c("100", "600", "1,100", "1,600"))
axis(side=2, at=c(0, 50, 100, 150, 200, 250))
lines(ages, meanCircumferences)
segments(ages, meanCircumferences-sdCircumferences, ages, meanCircumferences+sdCircumferences)
points(ages, meanCircumferences, pch=21, bg="white")
# exit the split-screen mode (close all screens)
close.screen(all = TRUE)
# add panel labels
split.screen(c(1,2))
screen(1)
plot(0, xlab="Age (days)", ylab=expression(paste("Circumference ", italic(C), " (mm)")), type="n", xlim=c(100, 1600), ylim=c(0, 250), axes=FALSE, bty="n")
axis(side=1, at=c(100, 600, 1100, 1600), labels=c("100", "600", "1,100", "1,600"))
axis(side=2, at=c(0, 50, 100, 150, 200, 250))
for (t in 1:5) {
tree <- subset(Orange, Orange$Tree==t)
lines(tree$age, tree$circumference, lty=lineTypes[t])
points(tree$age, tree$circumference, pch=pointTypes[t], bg="white")
}
legend("bottomright", legend=c("Tree 1", "Tree 2", "Tree 3", "Tree 4", "Tree 5"), lty=lineTypes, pch=pointTypes, pt.bg="white", bty="n")
# add bold "A" on top left side of plot
mtext(expression(bold(A)), side=3, adj=0, cex=2, line=1)
screen(2)
plot(0, xlab="Age (days)", ylab=yLabel, type="n", xlim=c(100, 1600), ylim=c(0, 250), axes=FALSE, bty="n")
axis(side=1, at=c(100, 600, 1100, 1600), labels=c("100", "600", "1,100", "1,600"))
axis(side=2, at=c(0, 50, 100, 150, 200, 250))
lines(ages, meanCircumferences)
segments(ages, meanCircumferences-sdCircumferences, ages, meanCircumferences+sdCircumferences)
points(ages, meanCircumferences, pch=21, bg="white")
# add bold "B" on top left side of plot
mtext(expression(bold(B)), side=3, adj=0, cex=2, line=1)
close.screen(all = TRUE)
##### GENERAL GRAPHICAL PARAMETERS: MARGINS, TEXT & SYMBOL SIZE, LINE WIDTH, ... #####
# (for more details see 'help(par)')
pdf("orangetree_growth.pdf", width=11, height=5)
# specify margin sizes ('mar', in lines), increase text size of axis labels ('cex.lab')
# and text size of axes annotation/tick labels ('cex.axis')
par(mar=c(4.5, 4.5, 2.5, 1.5), cex.lab=1.4, cex.axis=1.2)
split.screen(c(1,2))
screen(1)
plot(0, xlab="Age (days)", ylab=expression(paste("Circumference ", italic(C), " (mm)")), type="n", xlim=c(100, 1600), ylim=c(0, 250), axes=FALSE, bty="n")
# increase axes line width with 'lwd'
axis(side=1, at=c(100, 600, 1100, 1600), labels=c("100", "600", "1,100", "1,600"), lwd=1.2)
axis(side=2, at=c(0, 50, 100, 150, 200, 250), lwd=1.2)
for (t in 1:5) {
tree <- subset(Orange, Orange$Tree==t)
# increase line width with 'lwd'
lines(tree$age, tree$circumference, lty=lineTypes[t], lwd=1.2)
# increase point border line width with 'lwd' and point size with 'cex'
points(tree$age, tree$circumference, pch=pointTypes[t], bg="white", lwd=1.2, cex=1.25)
}
legend("bottomright", legend=c("Tree 1", "Tree 2", "Tree 3", "Tree 4", "Tree 5"), lty=lineTypes, pch=pointTypes, pt.bg="white", bty="n", inset=0.016, lwd=1.2, pt.cex=1.25)
mtext(expression(bold(A)), side=3, adj=0, cex=2, line=1)
screen(2)
plot(0, xlab="Age (days)", ylab=yLabel, type="n", xlim=c(100, 1600), ylim=c(0, 250), axes=FALSE, bty="n")
# increase axes line width with 'lwd'
axis(side=1, at=c(100, 600, 1100, 1600), labels=c("100", "600", "1,100", "1,600"), lwd=1.2)
axis(side=2, at=c(0, 50, 100, 150, 200, 250), lwd=1.2)
# increase line width with 'lwd'
lines(ages, meanCircumferences, lwd=1.2)
# increase line width with 'lwd'
segments(ages, meanCircumferences-sdCircumferences, ages, meanCircumferences+sdCircumferences, lwd=1.2)
# increase point border line width with 'lwd' and point size with 'cex'
points(ages, meanCircumferences, pch=21, bg="white", lwd=1.2, cex=1.25)
mtext(expression(bold(B)), side=3, adj=0, cex=2, line=1)
close.screen(all = TRUE)
dev.off()
############### ADDITIONAL RESOURCES ###############
# Quick-R: http://www.statmethods.net/index.html
# R Graph Gallery: http://addictedtor.free.fr/graphiques/
# R Graphical Manual: http://www.oga-lab.net/RGM2
# ggplot2 package: http://had.co.nz/ggplot2/
# lattice package: http://cran.r-project.org/web/packages/lattice/index.html
# grid package: http://www.stat.auckland.ac.nz/~paul/grid/grid.html
# Edward Tufte: http://www.edwardtufte.com/tufte/
# The Visual Display of Quantitative Information: http://www.edwardtufte.com/tufte/books_vdqi

Tuesday, September 27, 2011

Short on funding? Can't get a grant? Crowdfunding! #SciFund

Recology has moved, go to http://recology.info/2011/09/short-on-funding-cant-get-grant

Crowdsourced funding is becoming a sustainable way for artists, entrepreneurs, and others to get their ideas funded by individuals. For example, think of Kickstarter and RocketHub.


Jai Ranganathan and Jarrett Byrnes have started an experiment to determine how well crowdfunding can work for scientists: The SciFund Challenge. Go here to sign up and here for their website.


The deadline to sign up is Oct. 1.

Thursday, September 22, 2011

@drewconway interview on @DataNoBorders at the Strata conference

Recology has moved, go to http://recology.info/2011/09/drewconway-interview-on-datanoborders

The O'Reilly Media Strata Summit has many interviews on YouTube (just search YouTube for it).

Drew Conway is the author of several R packages, including infochimps, an R wrapper for the Infochimps API service.

The YouTube video:


Open science talk by Carl Boettiger

Recology has moved, go to http://recology.info/2011/09/open-science-talk-by-carl-boettiger

Carl Boettiger gave a talk on the topic of open science to incoming UC Davis graduate students.

Here is the audio: http://www.archive.org/details/ThingsIWishIKnewThreeYearsAgo-ByTheDavisOpenScienceGroup&reCache=1

Here are the slides: http://hazelnusse.github.com/DOS_WOW2011/#title-slide

Friday, September 9, 2011

My take on an R introduction talk

Recology has moved, go to http://recology.info/2011/09/my-take-on-r-introduction-talk

UPDATE: I put in an R tutorial as a Github gist below.


Here is a short intro R talk I gave today...for what it's worth...



# Intro Tutorial to R
# Scott Chamberlain <myrmecocystus@gmail.com>
# w/ code edited from Nathan Kraft and Dan Bunker
########## HELP ##########
# Getting Help with R on listserves online, etc.
# Go here for correct way to make reproducible examples to submit to help lists:
# https://github.com/hadley/devtools/wiki/Reproducibility
# Code writing style:
# Google style guide: http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html#generallayout
# Henrik Bengtsson style guide: http://www1.maths.lth.se/help/R/RCC/
# Hadley Wickham's style guide: https://github.com/hadley/devtools/wiki/Style
# Generally use <- instead of = when assigning an object to a name
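# for reference, a few ways to get help from within R itself:
help(mean) # open the help page for a function
?mean # shorthand for help()
help.search("standard deviation") # search installed help pages by topic
example(mean) # run the examples from a function's help page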
########## Installing R ##########
# go to www.r-project.org and follow the links to 'download'
# download a 'precompiled binary' for your system from the CRAN mirror nearest you
# follow the directions to install
# starting R: double click the R icon, or double click a .R file (a script)
########## R is case sensitive ##########
myvector <- c(8, 9, 5)
myvector # OK
Myvector # wrong case - throws an error: object 'Myvector' not found
########## R as a calculator and simple operations ##########
# try these simple commands by typing them into the 'console'
2 + 2
c(1, 2, 3)
x = c(1, 2, 3)
x
x = c(1:10, 13)
y = x^2
y
ls() # this lists all objects you have created that are still in memory
rm(x) # Use the rm command to remove objects after you're finished.
rm(list = ls()) # removes everything!!!
ls()
########## Assignments ##########
# everything in R is an 'object'. That means that R knows something
# about each object you create, and that there are 'methods'
# available to work with objects of various classes.
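# for example, every object has a class, and generic functions dispatch on it:
class(c(1, 2, 3)) # "numeric"
class("some text") # "character"
class(data.frame(a = 1:3)) # "data.frame"
methods(summary) # list the methods defined for the generic function summary()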
########## Scalars ##########
a = 10; b <- 10
# print to console
b
########## Vectors ##########
x = c(3, 5, 2, 8, 5, 6, 8, 10)
y = c(1, 3, 4, 5, 3, 9, 8, 3)
x; y
########### Lists ##########
# lists are arbitrary collections of variables
mylist = list(dd = "foo", ee = 66) # use list(), not c(), so elements keep their different types
mylist
########## Character ##########
char1 = "this is some text"
char1
########## Factors ##########
# factors are like character vectors but only contain values in predefined 'levels'
fac1 = factor(rep(c("a", "b"), 4))
fac1
########## Matrices ##########
xx = matrix(10, 2, 3) # a 2 x 3 matrix filled with the value 10
xx
########## Dataframes ##########
# dataframes are special objects in R that we will use a lot.
# They are special because each column of a dataframe can
# itself be of a different class (UNLIKE MATRICES).
df = data.frame(x, y, fac1)
df # print data frame
summary(df) # basic stats on each column
str(df) # description of columns and format style
head(df) # top 6 rows
tail(df) # bottom 6 rows
df[ , 3] # pick columns
df[1 , ] # pick rows
########## Immutable data frames ##########
# Special case of a dataframe used in the plyr package created by Hadley
# Wickham. When summarising data in plyr, you can use immutable data frames to
# do calculations faster, e.g., see the example below
library(plyr)
bigdf <- data.frame(a = rnorm(10000), b = rnorm(10000))
bigdf_i <- idata.frame(bigdf) # make an immutable data frame
system.time( replicate(100, (ddply(bigdf, .(), mean))) )
system.time( replicate(100, (ddply(bigdf_i, .(), mean))) ) # faster!
########## Transforming variables ##########
# Many options: log10, log, log2, sqrt, asin, etc.
df$logy = log10(df$y) # add a log10-transformed column
df
########## Indexing ##########
# indexing dataframes
df[1,2] # returns row 1, column 2
df[1, 'y'] # same
df[,2] # returns column y as a vector
df[,'y'] # same
df$y # same
df[1:4,] # rows 1-4
df[df$fac1=="a",] # rows where fac1 is 'a'
df[df$x>mean(df$x),] # rows where x is above the mean for x
df[order(df$x), ] # returns whole dataframe sorted by x
df[df$y%in%c(3:7),] # rows where 'y' matches a list of values
# indexing lists
mylist
mylist[1] # item 1 in list, with label
mylist[[1]] # the value of item 1
########## Sequences/regular patterns ##########
####Sequence, or seq(), is useful for making vectors in regular patterns:
seq(from=1, to=10, by=1)
##can also be written as '1:10' in some contexts:
print(1:10)
#you can change the interval between elements:
seq(from=-10, to=30, by=5)
##same as:
seq(-10, 30, 5)
####replicate, or rep(), is also useful for regular patterns
#repeats something as many times as you want:
rep(5, times=5)
##same as:
rep(5,5)
##more complex:
rep(c(1,4,5,2), 3)
##also works with text (aka "strings"):
rep(c("red", "blue", "gray"), 3)
## the vector() function creates an "empty" vector of a given mode and length
vector(mode="character", length=5) # five empty strings
############ BOOLEANS (aka "TRUE" and "FALSE") and tests ##########
########## Booleans ##########
# Booleans are a special class in R, and include "TRUE" and "FALSE".
# Note that they need to be typed in all caps. In an R editor they should
# turn yellow (or a different color) when typed in. On the Mac,
# the built-in editor works fine for this; on a PC, you can use an
# editing program like TINN-R (http://www.sciviews.org/Tinn-R/).
# You can abbreviate them as T and F, but this is discouraged because T and F
# are ordinary variables that can be reassigned, which can cause errors.
##Many functions include boolean arguments:
vector <- as.integer(c(1:10, NA)) # integers 1-10 plus a missing value (NA)
sum(vector) ##in this case sum() returns NA because of the NA in the data
?sum ## see that there is an argument "na.rm" that we can set to TRUE
sum(vector, na.rm=TRUE) ##this removes the NA and sums the remaining 10 integers
##equivalently you can use the na.omit() command:
sum(na.omit(vector))
##Many tests in R return booleans:
vector > 5 ##this returns true for each part of the vector greater than 5
vector[which(vector<3)] ##which() can be used to get the indices of an object where the test is TRUE
vector >= 5 ##etc
vector == 7 ## this returns true for the element in the vector equal to 7
##Note that two equals signs ("==") are used for testing equivalence, while one ("=") is used to assign something to a variable.
##there are a bunch of commands that start with "is." that are very useful:
is.na(vector) ##shows us that element 11 is an NA
is.finite(vector) ## checks if values are finite numbers
##Booleans can be added: TRUE=1,FALSE=0
sum(is.na(vector)) ##means that there is one NA in the vector
sum(vector<=3, na.rm=TRUE) ## shows that there are 3 elements of the vector <= to 3.
## the "!" (known as a "bang" to people with a background in older
## programming languages) is used as a "not" to reverse TRUE and FALSE:
is.na(vector)
!is.na(vector)
vector != 10 ##which part of the vector is not equal to 10??
########## "for" loops: ##########
# for loops are a very powerful way of repeating commands in a simulation
# or working through each part of an object.
# the part after "for" sets up a counter variable 'i' (can be called anything),
# which starts (in this case) at 1 and increments by 1 to 10. Each time through
# the loop it does whatever is in the { }
for(i in 1:10){
print("hello")
}
##you can refer to the counter variable in the loop:
for(j in 1:10){
print(j)
}
##you can use loops to work through each element of a list or a vector:
colors <- c("red", "blue", "green", "silver")
colors
length(colors)
#this loop adds some text to each element of the 'colors' vector
for(k in 1:length(colors)){
print(colors[k])
colors[k] <- paste(colors[k], "is a color")
}
colors
# You can also use for loops to run through any arbitrary set of values:
for (k in colors){
print(k)
}
# compare this 'colors' loop with the previous one- here the colors
# themselves (not the numbers used to index the colors vector) are what
# the loop moves over.
## your loop does not have to be sequential from 1:
for (k in seq(2,10,2)) {
print(k)
}
################ "If/else" statements ################
# "if" statements are similar in structure to for loops, and are useful for
# doing something if a certain condition is met
##let's set up a boolean (true/false) variable:
friendly <- TRUE
##if statements just do whatever is in the {} if the statement in ()
# following "if" is TRUE:
if(friendly){
print("hello")
}
##input the entire statement - you should get a greeting
##let's change "friendly" to be false:
friendly<-FALSE
if(friendly){
print("hello")
}
##now you should get no greeting at all
## you can follow up the "if" with an "else" statement that gives an
## alternative if the first test is false:
if(friendly){
print("hello")
}else{
print("grrrrr...")
}
##switch "friendly" and run it again:
friendly <- !friendly ##remember the "!" means "not"
if(friendly){
print("hello")
}else{
print("grrrrr...")
}
#The statement in the parentheses can also be a comparison that generates a
# true or false result. In this case, remember that the double equals is used
# to make a comparison, single equals to assign a value:
x <- 1
if (x==1) print('x = 1')
## notice that you can omit the {} if you only have one command to perform after
## the if(test...)
########## Functions ###############
# Basic example
greet=function(){
print("hello there!")
}
# this is the format of function writing: a function name of your choosing-
# "greet" in this case- (action verbs are recommended to keep them distinct
# from variables), followed by the "=function" plus a () containing any
# arguments that the function requires. Here the function doesn't require
# an argument. The code inside the {} is performed whenever the function
# is called.
# You need to copy and paste the entire text of the function into the
# console before you can use the function. There are other ways to do this
# we'll cover in a little while. Once this is input, type:
greet()
## and you should see a message in the console.
##Here is an example with an argument:
square=function(x){
x*x
}
##as with function names, you can call arguments anything you want. When you call functions, you can refer to the arguments explicitly:
square(x=2) ## should tell you "4"
## this should also work without a hitch:
square(2)
##once a function is input, you can always see the code for it by typing it in without the ()
square
##you can write functions with multiple arguments:
divide=function(x,y){
x/y
}
##try it out:
divide(x=8, y=2)
##same as:
divide(8,2)
##not the same as:
divide(2,8)
## But if you explicitly name the arguments, the order of the arguments does not matter:
divide(y=2, x=8)
##You need to specify something for all arguments in this kind of function. For example, this should fail:
divide(4)
##You can however specify defaults for certain arguments when you write a function. Here is an example:
increase=function(x, factor=2){
x*factor
}
##input the function then run these:
increase(x=3, factor=2)
##is the same as
increase(x=3)
##because we specified a default for 'factor'
##same as:
increase(3)
##but you can also modify the argument "factor" when you call the function even though it has a default:
increase(3, factor=10)
##same as
increase(3,10)
##the "return" command is very useful in functions for "outputting" what you want from the function. Here we've modified the "increase" function to assign the output of the operation to a variable called "output"
increase_V2=function(x, factor=2){
output <- (x*factor)
}
##now run:
increase_V2(50)
##whoops! No result!
##try calling the "output" object that we created:
output ##(this should fail if you've run the script as is)
##Just as with "for" loops, anything created in the function (such as the 'output' variable) vanishes when it is done running.
## here is a third version that has added a line of code telling the function to "return" the output variable- that is- send the value of it out to the world once it is done:
increase_V3=function(x, factor=2){
output <- (x*factor)
return(output)
}
##now run:
increase_V3(50)
##now we have the value of the "output" object to play with again
##notice that we still do not have a variable named output:
output ##should fail again
## Assign the returned value to a variable for further use:
result <- increase_V3(50)
result
##Here's one final example. It is very common to use booleans (true/false) as arguments with defaults in a function. Let's say you want a function that returns the range of values in a vector, and sometimes you want the absolute minimum and maximum numbers, but sometimes you want to know how far apart the extremes are. You could put both options into one function:
get_range=function(vector, minMax=FALSE){
vector_min <- min(vector)
vector_max <- max(vector)
if(minMax){
temp <- data.frame(vector_min, vector_max)
return(temp)
}else{
range <- vector_max-vector_min
return(range)
}
}
##you actually *don't* need the "else" here - but let's keep it in for clarity.
##once it's input, try it out:
##here is a practice vector:
v <- seq(5,20)
v
get_range(vector=v, minMax=FALSE)
##same as:
get_range(v)
##different from:
get_range(v, minMax=TRUE)
# notice that if minMax is FALSE the function returns a number, but if it is
# TRUE the function returns a dataframe. This might cause errors in other
# things you might be doing with the output of this function, so you might
# want to avoid things like this when writing functions for your own use.
# as a final note, you can place functions in a .r text file and then
# import them into your workspace with the source() command, such as:
# source("my_favorite_functions.r"). This can be faster than copying and
# pasting the functions you always use into the start of a script.
############ Error catching ##########
# try(expr) - evaluate an expression and keep going even if it throws an error
# tryCatch(expr, ...) - evaluate an expression with custom handlers for errors and warnings
# see help(try) and help(tryCatch)
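# a small sketch of how these are used: try() lets a script continue past an error
res <- try(stop("something went wrong"), silent=TRUE) # the error is caught instead of stopping the script
class(res) # "try-error"
# tryCatch() lets you supply handler functions for errors (and warnings, etc.)
tryCatch(stop("something went wrong"),
  error = function(e) print("recovered from the error"),
  finally = print("this runs no matter what"))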
# traceback
foo <- function(x) { print(1); bar(2) }
bar <- function(x) { x + a.variable.which.does.not.exist }
## Not run:
foo(2) # gives a strange error
traceback()
## End(Not run)
## 2: bar(2)
## 1: foo(2)
bar ## Ah, this is the culprit ...
############ Data Manipulation ##########
# The following are five packages that do data manipulation
library(plyr); library(data.table); library(doBy); library(sqldf)
library(reshape2)
# 'Pivot tables'
dataset <- data.frame(var1 = rep(c("a","b","c","d","e","f"), each = 4),
var2 = rep(c("level1","level1","level2","level2"), 6),
var3 = rep(c("h","m"), 12), meas = rep(1:12))
dataset_d <- dcast(dataset, var1 ~ var2 + var3) # pivot table
melt(dataset_d, id = 1) # pivot table back, not ideal, but works
# reshape and summarise the dataframe
data(iris)
iris_m <- melt(iris, id="Species", na.rm=TRUE)
head(iris_m)
dcast(iris_m, Species ~ variable, length) # oops, just length of data vector
dcast(iris_m, Species ~ variable, mean) # that's better - mean of each variable by species
# Summarise by grouping levels of variables
ddply(dataset, .(var1, var2), summarise,
mean = mean(meas),
sd = sd(meas))
# Good way to go from lists to dataframes
mylist <- list(var1 = c(1,2,3,4),
var2 = c(5,6,7,8))
library(plyr) # load plyr package
ldply(mylist) # makes a dataframe from the list
####### Looking at your data - pairwise comparison plots #######
require(lattice)
require(ggplot2)
require(car)
# base graphics
pairs(iris[1:4])
# lattice
splom(~iris[1:4])
# ggplot2
plotmatrix(iris[1:4])
# other
source("/Users/ScottMac/Dropbox/R Group/Week1_R-Intro/ggcorplot.R")
ggcorplot(
data = iris[1:4],
var_text_size = 5,
cor_text_limits = c(5,10))
# another option: a custom panel function showing correlations and significance
panel.cor <- function(x, y, digits=2, prefix="", cex.cor)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
if(missing(cex.cor)) cex <- 0.8/strwidth(txt)
test <- cor.test(x,y)
# borrowed from printCoefmat
Signif <- symnum(test$p.value, corr = FALSE, na = FALSE,
cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
symbols = c("***", "**", "*", ".", " "))
text(0.5, 0.5, txt, cex = cex * r)
text(.8, .8, Signif, cex=cex, col=2)
}
pairs(iris[1:4], lower.panel=panel.smooth, upper.panel=panel.cor)
# car package, function scatterplotMatrix (spm abbreviation)
spm(iris[1:4])
# compare run times for different methods
system.time(pairs(iris[1:4]))
system.time(splom(~iris[1:4]))
system.time(plotmatrix(iris[1:4]))
system.time(ggcorplot(
data = iris[1:4],
var_text_size = 5,
cor_text_limits = c(5,10)))
system.time(pairs(iris[1:4], lower.panel=panel.smooth, upper.panel=panel.cor))
system.time(spm(iris[1:4]))
########## Comparison of speed of data summarisation packages #########
# Modified from the Recipes, scripts, and genomics blog
library(sqldf)
library(doBy)
library(plyr)
library(data.table)
n<-10000
grp1<-sample(1:750, n, replace=T)
grp2<-sample(1:750, n, replace=T)
d <- data.frame(x=rnorm(n), y=rnorm(n), grp1=grp1, grp2=grp2)
# sqldf
rsqldf <- system.time(sqldf("select grp1, grp2, sum(x), sum(y) from d
group by grp1, grp2"))
#doBy
rdoby <- system.time(summaryBy(x+y~grp1+grp2, data=d, FUN=c(sum)))
#aggregate
raggregate <- system.time(aggregate(d, list(d$grp1, d$grp2),
function(x)sum(x)))
#plyr
rplyr <- system.time(ddply(d, .(grp1, grp2), summarise, avx = sum(x),
avy=sum(y)))
#data.table
DT = data.table(d)
rdataT <- system.time(DT[,list(sum(x),sum(y)),by=list(grp1,grp2)])
rsqldf
rdoby
raggregate
rplyr
rdataT
install.packages("gplots")
library(gplots)
x <- c(rdataT[3],rsqldf[3],rdoby[3],raggregate[3],rplyr[3])
balloonplot( rep("time.elapsed",5),c("data.table","sqldf","doBy","aggregate","plyr"),
round(x,1), ylab ="Method", xlab="",sorted=F,dotcolor=rev(heat.colors(5)),
main="time.elapsed for different methods of grouping")
##### Comparison of lattice and ggplot2 graphics ########
library(lattice)
library(ggplot2)
head(iris)
# Base graphics
# Not sure how to do the specific example here?
# lattice graphics
pl <- densityplot(~Sepal.Length, data = iris, groups = Species,
plot.points = FALSE, ref = TRUE)
print(pl)
# ggplot2 graphics
pg <- ggplot(iris, aes(Sepal.Length)) +
stat_density(geom = "path", position = "identity",
aes(colour = factor(Species)))
print(pg)

A Data Visualization Book

Recology has moved, go to http://recology.info/2011/09/data-visualization-book

Note: thanks to Scott for inviting me to contribute to the Recology blog despite being an ecology outsider; my work is primarily in atomic physics. -Pascal
A part of me has always liked thinking about how to effectively present information, but until the past year, I had not read much to support my (idle) interest in information visualization. That changed in the spring when I read Edward Tufte's The Visual Display of Quantitative Information, a book that stimulated me to think more deeply about presenting information. I originally started with a specific task in mind--a wonderful tool for focusing one's interests--but quickly found that Tufte's book was less a practical guide and more a list of general design principles. Then, a few months ago, I stumbled upon Nathan Yau's blog, FlowingData, and found out he was writing a practical guide to design and visualization. Conveniently enough for me, Yau's book, Visualize This, would be released within a month of my discovery of his blog; what follows are my impressions of Visualize This.
I have liked Visualize This a lot.  Yau writes with much the same informal tone as on his blog, and the layout is visually pleasing (good thing, too, for a book about visualizing information!).  The first few chapters are pretty basic if you have done much data manipulation before, but it is really nice to have something laid out so concisely.  The examples are good, too, in that he is very explicit about every step: there is no intuiting what that missing step should be.  The author even acknowledges in the introduction that the first part of the book is at an introductory level.
Early in the book, Yau discusses where to obtain data. This compilation of sources is potentially a useful reference for someone, like me, who almost always generates his own data in the lab. Unfortunately, Yau does not talk much about preparation of (or best practices for) your own data.  Additionally, from the perspective of a practicing scientist, it would have been nice to hear about how to archive data to make sure it is readable far into the future, but that is probably outside the scope of the book.
Yau seems really big into using open source software for getting and analyzing data (e.g. Python, R, etc…), but he is surprisingly attached to the proprietary Adobe Illustrator for turning figures into presentation quality graphics.  He says that he feels like the default options in most analysis programs do not make for very good quality graphics (and he is right), but he does not really acknowledge that you can generate nice output if you go beyond the default settings.  For me, the primary advantage of generating output programmatically is that it is easy to regenerate when you need to change the data or the formatting on the plot.  Using a graphical user interface, like in Adobe Illustrator, is nice if you are only doing something once (how often does that happen?), but when you have to regenerate the darn figure fifty times to satisfy your advisor, it gets tedious to move things around pixel by pixel.
By the time I reached the middle chapters, I started finding many of the details to be repetitive. Part of this repetition stems from the fact that Yau divides these chapters by the type of visualization. For example, "Visualizing Proportions" and "Visualizing Relationships" are two of the chapter titles. While I think these distinctions are important ones for telling the right story about one's data, creating figures for the different data types often boils down to choosing different functions in R or Python. People with less analysis and presentation experience should find the repetition helpful, but I increasingly skimmed these sections as I went along.  
Working through Yau's examples for steps you do not already know would probably be the most useful way of getting something out of the book.  So, for example, I started trying to use Python to scrape data from a webpage, something I had not previously done.  I followed the book's example of this data-scraping just fine, but as with most things in programming, you find all sorts of minor hurdles to clear when you try your own thing. In my case, I am re-learning the Python I briefly learned about 10 years ago--partly in anticipation of not having access to Matlab licenses once I vacate the academy--since I have forgotten a lot of the syntax.  A lot of this stuff would be faster if I were working in Matlab, which I grew more familiar with in graduate school.
Overall, Visualize This is a really nice looking book and will continue to be useful to me as a reference. Yau concludes his book with a refreshing reminder to provide context for the data we present. This advice is particularly relevant when presenting to a wider or lay audience, but it is still important for us, as scientists, to clearly communicate our findings in the literature. Patterns in the data are not often self-evident, and therefore we should think carefully about which visualization tools will best convey the meaning of our results.
Edited to add a link to Visualize This here and in the introductory paragraph.






Thursday, September 8, 2011

FigShare Talk

Recology has moved, go to http://recology.info/2011/09/figshare-talk

FigShare - I very much like this idea of a place to put your data online that is NOT published. Dryad is a nice place for datasets linked to published papers, but there isn't really a place for datasets that didn't make the cut for a published paper. If such datasets were known to the scientific community, they could potentially help resolve the "file-drawer" effect in meta-analyses.

 
"Figshare - Why don't you publish all your research?" Mark Hahnel Imperial College London from London Biogeeks on Vimeo.