Monday, January 17, 2011

R and Google Visualization API: Fish harvests

I recently gathered fish harvest data from the U.S. National Oceanic and Atmospheric Administarion (NOAA), which I downloaded from Infochimps. The data is fish harvest by weight and value, by species for 21 years, from 1985 to 2005.

Here is a link to a google document of the data I used below. I had to do some minor pocessing in Excel first; thus the link to this data.
https://spreadsheets.google.com/ccc?key=0Aq6aW8n11tS_dFRySXQzYkppLXFaU2F5aC04d19ZS0E&hl=en

Get the original data from Infochimps here:
http://infochimps.com/datasets/domestic-fish-and-shellfish-catch-value-and-price-by-species-198




################# Fish harvest data ########################################
setwd("/Mac/R_stuff/Blog_etc/Infochimps/Fishharvest") # Set path
library(ggplot2)
library(googleVis)
library(Hmisc)
 
fish <- read.csv("fishharvest.csv") # read data
fish2 <- melt(fish,id=1:3,measure=4:24) # melt table
year <- rep(1985:2005, each = 117)
fish2 <- data.frame(fish2,year) # replace year with actual values
 
# Google visusalization API
fishdata <- data.frame(subset(fish2,fish2$var == "quantity_1000lbs",-4),value_1000dollars=subset(fish2,fish2$var == "value_1000dollars",-4)[,4])
names(fishdata)[4] <- "quantity_1000lbs"
fishharvest <- gvisMotionChart(fishdata, idvar="species", timevar="year")
plot(fishharvest)
Created by Pretty R at inside-R.org




Data: fishdata, Chart ID: MotionChart_2011-01-17-08-09-24


R version 2.12.1 (2010-12-16),

Google Terms of Use



fishdatagg2 <- ddply(fish2,.(species,var),summarise,
 mean = mean(value),
 se = sd(value)/sqrt(length(value))
)
fishdatagg2 <- subset(fishdatagg2,fishdatagg2$var %in% c("quantity_1000lbs","value_1000dollars"))
limit3 <- aes(ymax = mean + se, ymin = mean - se)
bysppfgrid <- ggplot(fishdatagg2,aes(x=reorder(species,rank(mean)),y=mean,colour=species)) + geom_point() + geom_errorbar(limit3) + facet_grid(. ~ var, scales="free") + opts(legend.position="none") + coord_flip() + scale_y_continuous(trans="log")
ggsave("bysppfgrid.jpeg")
Created by Pretty R at inside-R.org


10 comments:

  1. Could you post a link to the original data please?

    ReplyDelete
  2. Thanks for pointing that out. It is done.

    ReplyDelete
  3. A few comments and questions:

    1. Thanks for posting the spreadsheet. Since I'm in a similar position (transition from traditionally excel based spreadsheets to using R) I would like to know - how often do you "comb" larger datasets in order to extract subsets for use in R?

    for example: I will have multiple time series datasets that need to be combined into (one?) spreadsheet for analysis. From what I am reading/learning in R, seems the program is capable of this. I find it harder to look at the data when it's in an R data frame vs. just in a bulk spreadsheet in excel.

    2. The Google visualization API is interesting. My current time-series work may benefit from an animated visualization such as this. Great job!

    This information is meaningful when applied to fisheries, but the data I'm using might require some pooling before I could illustrate the (currently) separate categories.

    ReplyDelete
  4. Hi hawright,

    1. What do you mean by comb larger datasets? Do you mean from the internet, a URL? Or after you import the dataset into R? You can combine datasets in R quite easily with functions like merge() or match(), e.g.

    2. You could send me (myrmecocystus@gmail.com) a dataset and I could try to help with what format it should be in to use the google vis api. An alternative to google vis api is using animated series of plots, where the function creates a figure for each time step, and then you can visualize the plots in order of time steps. I can't remember where I saw the code for the time series plots though.

    Scott

    ReplyDelete
  5. Ciao Scott
    1. Cleaning up null values, missing points, etc. from the original data file. Yes, you can combine datasets in R, but if they aren't created in a standardized format in the first place, you might have problems.

    In this case, I refer to combing the dataset as "trunkating" the complete set - or taking a subset.

    Re:
    2. Not ready to send a dataset (yet) and would have to determine if it's appropriate. However, timeseries in R is probably the next step. Is that what you are speaking of - using a function to create a figure and then animating it in R?! :) beginner language for me please.

    Thanks! and I enjoy reading your blog.

    ReplyDelete
  6. Right, you don't want to throw a messy excel type table into R. But there are easy ways to clean up the data. For example:

    -NA's (when there are missing values in a cell, R assigns an "NA")
    This: dataframe2 <- subset(dataframe, !is.na(column1))
    will remove all rows in the dataframe if the cell in column1 has an NA. The "!" says the opposite of the command you call (so instead of keeping the rows with NA's in column1, you get rid of those rows. I think na.omit(dataframe) would remove ANY rows with NA's

    -Infinite values
    This: dataframe2 <- subset(dataframe, !is.finite(column1))
    will remove all rows in the dataframe if the cell in column1 has an infinite ("Inf" in R).


    I'll see if I can find the time series thing I was talking about...


    Thanks! I saw your blog. Not my area, but cool stuff. -Scott

    ReplyDelete
  7. Next step is using R for all of these intermediate steps because I'm still relying on 'xl' to do. I want to cut the umbilical cord from that program altogether.

    Ok, missing data values = NA
    What if you have <1 or values that don't make sense to even consider? Is it simply a matter of writing another line to say omit all values between [0.5:-0.5] ?

    ti<-# command for time series format (?)

    I worked a little bit with 'format' yesterday to configure dates. Tricky, but once you establish the ~format then you can break down the data analysis into months, weeks, days, etc. Still learning on this!

    Thank for the advice. Venturing into this time series analysis is an extension of all the work I've done up until now. This is providing context for the data. Biting off a large portion, but should be the 'next step' for me.

    ReplyDelete
  8. Let's cut that cord!

    Right, NA is a missing value. If you have an empty cell in an excel file you import it will replace i with an NA.

    For values that you want to exclude from analysis or plotting:
    -If you have a dataframe:
    df <- data.frame(var1 = c("a","b","c","d","e"), value1 = c(1,2,3,4,5), value2 = c(10,20,30,40,50))

    subset(df, value1 < 3)

    subset(df, value1 < 4 & value2 > 20)

    subset(df, value1 == 3)

    subset(df, var1 == "c")


    Often you might have mixed types of data in a single column in excel, but you can't really work efficiently with mixed data types in R in a single column. If you read in some data set like "it", and then look for certain characters if there aren't that many and replace them with something, convert to a numeric column and recreate dataframe:
    it <- read.csv("it.csv")
    > it
    column1 column2
    1 d 1
    2 q 2
    3 b b
    4 d 5
    5 d 6
    6 g 7
    column2_ <- gsub("b", 3, it$column2)
    it2 <- data.frame(column1 = it[,1], column2_ = as.numeric(column2_))
    > str(it2)
    'data.frame': 6 obs. of 2 variables:
    $ column1 : Factor w/ 4 levels "b","d","g","q": 2 4 1 2 2 3
    $ column2_: num 1 2 3 5 6 7

    ReplyDelete