Sunday, January 30, 2011

Good riddance to Excel pivot tables

Excel pivot tables have been how I have reorganized data...up until now. Here are just a few examples of why R is superior to Excel for reorganizing data:

UPDATE: I fixed the code to use 'dcast' instead of 'cast'. And library(ggplot2) instead of library(plyr) [plyr is called along with ggplot2]. Thanks Bob!

Also, see another post on this topic:
-http://news.mrdwab.com/2010/08/08/using-the-reshape-packagein-r/



################ Good riddance to pivot tables ############
library(reshape2)
library(ggplot2) 
dataset <- data.frame(var1 = rep(c("a","b","c","d","e","f"), each = 4), 
 var2 = rep(c("level1","level1","level2","level2"), 6), 
 var3 = rep(c("h","m"), 12), meas = rep(1:12))


# simple pivot table
dcast(dataset, var1 ~ var2 + var3)
Using meas as value column.  Use the value argument to cast to override this choice
  var1 level1_h level1_m level2_h level2_m
1    a        1        2        3        4
2    b        5        6        7        8
3    c        9       10       11       12
4    d        1        2        3        4
5    e        5        6        7        8
6    f        9       10       11       12
 
# mean by var1 and var2
dcast(dataset, var1 ~ var2, mean)
Using meas as value column.  Use the value argument to cast to override this choice
  var1 level1 level2
1    a    1.5    3.5
2    b    5.5    7.5
3    c    9.5   11.5
4    d    1.5    3.5
5    e    5.5    7.5
6    f    9.5   11.5

 
# mean by var1 and var3
dcast(dataset, var1 ~ var3, mean)
Using meas as value column.  Use the value argument to cast to override this choice
  var1  h  m
1    a  2  3
2    b  6  7
3    c 10 11
4    d  2  3
5    e  6  7
6    f 10 11
 
# mean by var1, var2 and var3 (version 1)
dcast(dataset, var1 ~ var2 + var3, mean)
Using meas as value column.  Use the value argument to cast to override this choice
  var1 level1_h level1_m level2_h level2_m
1    a        1        2        3        4
2    b        5        6        7        8
3    c        9       10       11       12
4    d        1        2        3        4
5    e        5        6        7        8
6    f        9       10       11       12
 
# mean by var1, var2 and var3 (version 2)
dcast(dataset, var1 + var2 ~ var3, mean)
Using meas as value column.  Use the value argument to cast to override this choice
   var1   var2  h  m
1     a level1  1  2
2     a level2  3  4
3     b level1  5  6
4     b level2  7  8
5     c level1  9 10
6     c level2 11 12
7     d level1  1  2
8     d level2  3  4
9     e level1  5  6
10    e level2  7  8
11    f level1  9 10
12    f level2 11 12
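For comparison, base R alone can reproduce the mean tables above with tapply(), no extra packages needed. A quick sketch (the dataset is redefined here so the snippet runs on its own):

```r
dataset <- data.frame(var1 = rep(c("a","b","c","d","e","f"), each = 4),
                      var2 = rep(c("level1","level1","level2","level2"), 6),
                      var3 = rep(c("h","m"), 12), meas = rep(1:12))

# mean by var1 and var2 -- same numbers as dcast(dataset, var1 ~ var2, mean)
m_var2 <- tapply(dataset$meas, list(dataset$var1, dataset$var2), mean)
m_var2

# mean by var1 and var3 -- same numbers as dcast(dataset, var1 ~ var3, mean)
m_var3 <- tapply(dataset$meas, list(dataset$var1, dataset$var3), mean)
m_var3
```

The difference is mostly cosmetic: tapply() returns a matrix with var1 as row names, while dcast() returns a data frame with var1 as a proper column.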

 
# use package plyr to create flexible data frames...
dataset_plyr <- ddply(dataset, .(var1, var2), summarise, 
 mean = mean(meas), 
 se = sd(meas),
 CV = sd(meas)/mean(meas)
)
dataset_plyr
   var1   var2 mean        se         CV
1     a level1  1.5 0.7071068 0.47140452
2     a level2  3.5 0.7071068 0.20203051
3     b level1  5.5 0.7071068 0.12856487
4     b level2  7.5 0.7071068 0.09428090
5     c level1  9.5 0.7071068 0.07443229
6     c level2 11.5 0.7071068 0.06148755
7     d level1  1.5 0.7071068 0.47140452
8     d level2  3.5 0.7071068 0.20203051
9     e level1  5.5 0.7071068 0.12856487
10    e level2  7.5 0.7071068 0.09428090
11    f level1  9.5 0.7071068 0.07443229
12    f level2 11.5 0.7071068 0.06148755
 
# ...to use for plotting
qplot(var1, mean, colour = var2, size = CV, data = dataset_plyr, geom = "point")
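The same summary table can also be built in base R with aggregate() (dataset redefined so this runs standalone). One caveat about the plyr code above: sd(meas) by itself is the standard deviation, not the standard error; divide by sqrt(length(meas)) if you want a true s.e.

```r
dataset <- data.frame(var1 = rep(c("a","b","c","d","e","f"), each = 4),
                      var2 = rep(c("level1","level1","level2","level2"), 6),
                      var3 = rep(c("h","m"), 12), meas = rep(1:12))

# aggregate() with a FUN returning a named vector packs the results
# into a matrix column; do.call(data.frame, ...) flattens it into
# separate columns meas.mean, meas.sd, meas.CV
smry <- aggregate(meas ~ var1 + var2, data = dataset,
                  FUN = function(x) c(mean = mean(x), sd = sd(x),
                                      CV = sd(x) / mean(x)))
smry <- do.call(data.frame, smry)
smry
```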




Monday, January 17, 2011

R and Google Visualization API: Fish harvests

I recently gathered fish harvest data from the U.S. National Oceanic and Atmospheric Administration (NOAA), which I downloaded from Infochimps. The data are fish harvest by weight and value, by species, for 21 years (1985 to 2005).

Here is a link to a Google document of the data I used below. I had to do some minor processing in Excel first; hence the link to this data.
https://spreadsheets.google.com/ccc?key=0Aq6aW8n11tS_dFRySXQzYkppLXFaU2F5aC04d19ZS0E&hl=en

Get the original data from Infochimps here:
http://infochimps.com/datasets/domestic-fish-and-shellfish-catch-value-and-price-by-species-198




################# Fish harvest data ########################################
setwd("/Mac/R_stuff/Blog_etc/Infochimps/Fishharvest") # Set path
library(ggplot2)
library(reshape2) # for melt()
library(plyr) # for ddply()
library(googleVis)
library(Hmisc)
 
fish <- read.csv("fishharvest.csv") # read data
fish2 <- melt(fish,id=1:3,measure=4:24) # melt table to long format
year <- rep(1985:2005, each = 117)
fish2 <- data.frame(fish2,year) # add a column of actual year values
 
# Google Visualization API
fishdata <- data.frame(
 subset(fish2, fish2$var == "quantity_1000lbs", -4),
 value_1000dollars = subset(fish2, fish2$var == "value_1000dollars", -4)[, 4]
)
names(fishdata)[4] <- "quantity_1000lbs"
fishharvest <- gvisMotionChart(fishdata, idvar="species", timevar="year")
plot(fishharvest)
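The one-liner that builds fishdata above is dense. An equivalent, more readable construction uses two subsets and a base R merge(); the column names (species / year / var / value) are assumed to match the melted table, and a small toy stand-in for fish2 is used here so the sketch runs on its own:

```r
# Toy stand-in for the melted fish table
fish2_toy <- data.frame(
  species = rep(c("cod", "tuna"), each = 4),
  year    = rep(c(1985, 1986), 4),
  var     = rep(rep(c("quantity_1000lbs", "value_1000dollars"), each = 2), 2),
  value   = c(10, 12, 100, 130, 5, 6, 200, 260)
)

# pull each measure out as its own table, then join on species + year
qty <- subset(fish2_toy, var == "quantity_1000lbs", select = c(species, year, value))
val <- subset(fish2_toy, var == "value_1000dollars", select = c(species, year, value))
names(qty)[3] <- "quantity_1000lbs"
names(val)[3] <- "value_1000dollars"
fishdata_toy <- merge(qty, val, by = c("species", "year"))
fishdata_toy
```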




[Embedded Google motion chart -- Data: fishdata, Chart ID: MotionChart_2011-01-17-08-09-24; made with R version 2.12.1 (2010-12-16); Google Terms of Use apply]

fishdatagg2 <- ddply(fish2,.(species,var),summarise,
 mean = mean(value),
 se = sd(value)/sqrt(length(value))
)
fishdatagg2 <- subset(fishdatagg2,fishdatagg2$var %in% c("quantity_1000lbs","value_1000dollars"))
limit3 <- aes(ymax = mean + se, ymin = mean - se)
bysppfgrid <- ggplot(fishdatagg2, aes(x = reorder(species, rank(mean)), y = mean, colour = species)) +
 geom_point() +
 geom_errorbar(limit3) +
 facet_grid(. ~ var, scales = "free") +
 opts(legend.position = "none") +
 coord_flip() +
 scale_y_continuous(trans = "log")
ggsave("bysppfgrid.jpeg")


R and Google Visualization API: Wikispeedia

Wikispeedia is a website trying to gather all speed limit signs on Earth.  I recently created a Google Visualization for some of their data, specifically on speed limit signs that change speed throughout the day.

Check it out here: http://groups.google.com/group/wikispeedia/browse_thread/thread/c9c712125a597b16

Here is how to see and comment on what they are doing:
Website: http://www.wikispeedia.org/
Google groups: http://groups.google.com/group/wikispeedia?lnk=

Friday, January 14, 2011

Bipartite networks and R

Earlier, I posted about generating networks from abundance distributions that you specify. If that post was interesting, check out Jeff Kilpatrick's website, where he provides code, in R and Octave, to compare real bipartite networks to ones generated from ecological variables measured in the field (in our case, abundance, body size, and nectar production). We used that code for a paper we published; the code was modified from code produced by Diego P. Vazquez.

Tuesday, January 11, 2011

just for fun: Recovery.gov data snooping





Okay, so this isn't ecology related at all, but I like exploring data sets. So here goes...

Propublica has some awesome data sets available at their website: http://www.propublica.org/tools/
I played around with their data set on Recovery.gov (see hyperlink below in code). Here are some figures:

Mean award amount by state, ranked by mean amount and also categorized by number of grants received ("nfund"), shown by point size and color. Yes, there are 56 "states", which includes things like the Northern Mariana Islands (MP). Notice that California got the largest number of awards, but its mean award size was relatively small.
Here is a figure by the government organization that made each award, showing mean award size (y-axis), number of awards (x-axis), and number of jobs created (numjobs = text size). Notice that the FCC (Federal Communications Commission) created nearly the most jobs despite not giving very large awards (although it did give a lot of awards).



Here is a figure of mean awards by state on a map of the US:


And by number of awards by state:





Here is the code:

################################################################################
#################### Propublica Recovery.gov data           ####################
################################################################################
install.packages(c("ggplot2","maps","plyr","stringr"))
library(ggplot2)
library(maps)
library(plyr) # for ddply()
library(stringr)
setwd("/Mac/R_stuff/Blog_etc") # Set working directory
theme_set(theme_bw())
 
# Read propublica data from file (download from here: http://propublica.s3.amazonaws.com/assets/recoverygov/propublica-recoverygov-primary-2.xls)
propubdat <- read.csv("propublica-recoverygov-primary-2.csv")
str(propubdat)
 
# Summarize data
fundbystate <- ddply(propubdat,.(prime_state),summarise,
 meanfund = mean(award_amount),
 sefund = sd(award_amount)/sqrt(length(award_amount)),
 nfund = length(award_amount),
 numjobs = mean(number_of_jobs)
)
 
fundbyagency <- ddply(propubdat,.(funding_agency_name),summarise,
 meanfund = mean(award_amount),
 sefund = sd(award_amount)/sqrt(length(award_amount)),
 nfund = length(award_amount),
 numjobs = mean(number_of_jobs)
)
 
 
# Fxn to make funding agency name abbreviations (first capital letter of each word) within ddply below
fun1 <- function(a) {
 str_c(paste(na.omit(str_extract(unlist(str_split(unlist(as.character(a[1])), " ")), "[A-Z]{1}"))), collapse = "")
}
 
fundbyagency2 <- ddply(fundbyagency,.(funding_agency_name),transform, # add to table funding agency name abbreviations
 agency_abbrev = fun1(funding_agency_name)
)
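As an aside, the abbreviation helper fun1 above can be condensed to a single base-R gsub() that deletes every character that is not an uppercase letter. For title-case agency names this yields the same initials (it differs only for names containing all-caps words like "NASA"):

```r
# Delete every non-uppercase character, leaving just the initials
abbrev <- function(a) gsub("[^A-Z]", "", as.character(a))
abbrev("Federal Communications Commission")  # "FCC"
```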
 
# Plot data, means and se's by state
limits <- aes(ymax = meanfund + sefund, ymin = meanfund - sefund)
dodge <- position_dodge(width=0.6)
awardbystate <- ggplot(fundbystate, aes(x = reorder(prime_state, meanfund), y = meanfund, colour = nfund)) +
 geom_point(aes(size = nfund), position = dodge) +
 coord_flip() +
 geom_errorbar(limits, width = 0.2, position = dodge) +
 opts(panel.grid.major = theme_blank(), panel.grid.minor = theme_blank(), legend.position = c(0.7, 0.2)) +
 labs(x = "State", y = "Mean grant amount awarded +/- 1 s.e.")
ggsave("awardbystate.jpeg")
 
# Plot data, means and se's by funding agency
limits2 <- aes(ymax = meanfund + sefund, ymin = meanfund - sefund)
dodge <- position_dodge(width=0.6)
awardbyagency <- ggplot(fundbyagency2,aes(y=log(meanfund),x=log(nfund),label=agency_abbrev)) + geom_text(aes(size=numjobs)) 
ggsave("awardbyagency.jpeg")
 
 
# On US map
fundbystate2 <- read.csv("fundbystate.csv")
 
states <- map_data("state") # get state geographic data from the maps package
recovmap <- merge(states,fundbystate2,by="region") # merge datasets
 
qplot(long,lat,data=recovmap,group=group,fill=meanfund,geom="polygon")
ggsave("bystatemapmeans.jpeg")
 
qplot(long,lat,data=recovmap,group=group,fill=nfund,geom="polygon")
ggsave("bystatemapnumber.jpeg")
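One gotcha worth flagging in the map step above: merge() re-sorts rows by the "by" columns, but ggplot2 draws each polygon in row order, and map_data() supplies an 'order' column for exactly this reason. If the state outlines come out scrambled, re-sort on 'order' after the merge. A toy demonstration of the problem and the fix:

```r
# Toy polygon table: intended drawing order 1..4, stored with the
# "b" rows first, the way rows can end up around a merge()
states_toy <- data.frame(region = c("b", "b", "a", "a"),
                         order  = c(1, 2, 3, 4),
                         long   = c(1, 2, 3, 4))
vals <- data.frame(region = c("a", "b"), meanfund = c(10, 20))
m <- merge(states_toy, vals, by = "region")  # merge() sorts by region: "a" rows first
m <- m[order(m$order), ]                     # restore drawing order before plotting
m
```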


And the text file fundbystate2 is here. I had to make this file separately so I could get in the spelled-out state names, as they were not provided in the Propublica dataset.

Source and disclaimer:
Data provided by Propublica. Data may contain errors and/or omissions.