Here is a link to a Google document of the data I used below. I had to do some minor processing in Excel first; thus the link to this data.
https://spreadsheets.google.com/ccc?key=0Aq6aW8n11tS_dFRySXQzYkppLXFaU2F5aC04d19ZS0E&hl=en
Get the original data from Infochimps here:
http://infochimps.com/datasets/domestic-fish-and-shellfish-catch-value-and-price-by-species-198
################# Fish harvest data ########################################
setwd("/Mac/R_stuff/Blog_etc/Infochimps/Fishharvest") # set path
library(ggplot2)
library(googleVis)
library(Hmisc)
library(reshape) # provides melt(), used below

fish <- read.csv("fishharvest.csv") # read data
fish2 <- melt(fish, id = 1:3, measure = 4:24) # melt table
year <- rep(1985:2005, each = 117)
fish2 <- data.frame(fish2, year) # replace year with actual values

# Google Visualization API
fishdata <- data.frame(subset(fish2, fish2$var == "quantity_1000lbs", -4),
  value_1000dollars = subset(fish2, fish2$var == "value_1000dollars", -4)[, 4])
names(fishdata)[4] <- "quantity_1000lbs"
fishharvest <- gvisMotionChart(fishdata, idvar = "species", timevar = "year")
plot(fishharvest)
library(plyr) # provides ddply()
fishdatagg2 <- ddply(fish2, .(species, var), summarise,
  mean = mean(value),
  se = sd(value) / sqrt(length(value)))
fishdatagg2 <- subset(fishdatagg2,
  fishdatagg2$var %in% c("quantity_1000lbs", "value_1000dollars"))
limit3 <- aes(ymax = mean + se, ymin = mean - se)
bysppfgrid <- ggplot(fishdatagg2, aes(x = reorder(species, rank(mean)), y = mean, colour = species)) +
  geom_point() +
  geom_errorbar(limit3) +
  facet_grid(. ~ var, scales = "free") +
  opts(legend.position = "none") +
  coord_flip() +
  scale_y_continuous(trans = "log")
ggsave("bysppfgrid.jpeg")
Could you post a link to the original data please?
Thanks for pointing that out. It is done.
Amazing!
Thanks JA...
A few comments and questions:
1. Thanks for posting the spreadsheet. Since I'm in a similar position (transitioning from traditional Excel-based spreadsheets to R), I would like to know: how often do you "comb" larger datasets in order to extract subsets for use in R?
For example: I will have multiple time series datasets that need to be combined into one (?) spreadsheet for analysis. From what I am reading/learning about R, it seems the program is capable of this. I find it harder to look at the data when it's in an R data frame vs. just in a bulk spreadsheet in Excel.
2. The Google visualization API is interesting. My current time-series work may benefit from an animated visualization such as this. Great job!
This information is meaningful when applied to fisheries, but the data I'm using might require some pooling before I could illustrate the (currently) separate categories.
Hi hawright,
1. What do you mean by "comb" larger datasets? Do you mean from the internet, a URL? Or after you import the dataset into R? You can combine datasets in R quite easily with functions like merge() or match(), e.g.
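For example, a small sketch of merge() and match() using made-up data frames (the species and numbers here are hypothetical, just for illustration):

```r
# Two small made-up data frames sharing a "species" key
catch <- data.frame(species = c("cod", "haddock", "pollock"),
                    quantity = c(100, 50, 75))
price <- data.frame(species = c("cod", "pollock", "haddock"),
                    dollars = c(12, 9, 14))

# merge() joins the two tables on their common column
combined <- merge(catch, price, by = "species")

# match() gives the positions of one key vector in another --
# handy for a quick lookup without a full join
looked_up <- price$dollars[match(catch$species, price$species)]
```

Here `combined` has one row per species with both quantity and dollars, and `looked_up` returns the prices in the order of `catch$species`.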
2. You could send me (myrmecocystus@gmail.com) a dataset and I could try to help with what format it should be in to use the Google Vis API. An alternative to the Google Vis API is an animated series of plots, where a function creates a figure for each time step, and then you can view the plots in order of time steps. I can't remember where I saw the code for the time series plots though.
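A minimal sketch of that animated-series idea, using made-up data and the pdf() device so it runs anywhere (the file names and values are hypothetical):

```r
# Hypothetical time series: one value per year
dat <- data.frame(year = 2000:2005, value = c(3, 5, 4, 7, 6, 8))

# Write one plot per time step; the resulting files can be strung
# together into an animation (e.g. with ImageMagick)
for (i in seq_along(dat$year)) {
  pdf(sprintf("step_%02d.pdf", i))
  plot(dat$year[1:i], dat$value[1:i], type = "b",
       xlim = range(dat$year), ylim = range(dat$value),
       xlab = "year", ylab = "value")
  dev.off()
}
```

Fixing xlim and ylim to the full data range keeps the axes constant across frames, so the animation doesn't jump around.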
Scott
Ciao Scott
1. Cleaning up null values, missing points, etc. from the original data file. Yes, you can combine datasets in R, but if they aren't created in a standardized format in the first place, you might have problems.
In this case, I refer to combing the dataset as "truncating" the complete set, or taking a subset.
Re:
2. Not ready to send a dataset (yet) and would have to determine if it's appropriate. However, timeseries in R is probably the next step. Is that what you are speaking of - using a function to create a figure and then animating it in R?! :) beginner language for me please.
Thanks! and I enjoy reading your blog.
Right, you don't want to throw a messy excel type table into R. But there are easy ways to clean up the data. For example:
-NA's (when there is a missing value in a cell, R assigns an "NA")
This: dataframe2 <- subset(dataframe, !is.na(column1))
will remove all rows of the dataframe where the cell in column1 is NA. The "!" negates the condition you call (so instead of keeping the rows with NA's in column1, you get rid of those rows). And na.omit(dataframe) would remove ANY rows with NA's in any column.
-Infinite values
This: dataframe2 <- subset(dataframe, is.finite(column1))
will keep only the rows where the cell in column1 is finite, removing infinite values ("Inf" in R). Note there is no "!" here: is.finite() is already FALSE for Inf (and for NA), so you keep the finite rows directly.
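Here's a tiny reproducible example of both clean-up steps, on a made-up data frame with one NA and one Inf:

```r
# Made-up data with an NA and an Inf in column1
dataframe <- data.frame(column1 = c(1, NA, 3, Inf, 5),
                        column2 = c("a", "b", "c", "d", "e"))

# Drop rows where column1 is NA (leaves the Inf row in place)
no_na <- subset(dataframe, !is.na(column1))

# Keep only rows where column1 is finite (drops both NA and Inf)
clean <- subset(dataframe, is.finite(column1))
```

After this, `no_na` has 4 rows and `clean` has 3.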
I'll see if I can find the time series thing I was talking about...
Thanks! I saw your blog. Not my area, but cool stuff. -Scott
The next step is using R for all of these intermediate steps, because I'm still relying on Excel to do them. I want to cut the umbilical cord from that program altogether.
Ok, missing data values = NA
What if you have values <1, or values that don't make sense to even consider? Is it simply a matter of writing another line to say omit all values between [-0.5, 0.5]?
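A filter like that can indeed be written in one line with subset() and a compound condition; a sketch with made-up values:

```r
# Made-up data with some near-zero values we want to exclude
df <- data.frame(value = c(-2, -0.3, 0.1, 0.5, 1.2, 3))

# Keep only values strictly outside the [-0.5, 0.5] band
trimmed <- subset(df, value < -0.5 | value > 0.5)

# Equivalent, using abs()
trimmed2 <- subset(df, abs(value) > 0.5)
```

Both versions keep the rows with -2, 1.2, and 3; the abs() form is just more compact when the band is symmetric around zero.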
ti <- # command for time series format (?)
I worked a little bit with format() yesterday to configure dates. Tricky, but once you establish the format, you can break the data analysis down into months, weeks, days, etc. Still learning on this!
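For that date handling, a minimal sketch with base R's as.Date() and format() (the dates here are hypothetical):

```r
# Hypothetical character dates, as they might come out of Excel
dates <- c("2011-01-03", "2011-01-17", "2011-02-02")
d <- as.Date(dates, format = "%Y-%m-%d")

# Once dates are in Date format, format() pulls out the pieces
months <- format(d, "%m") # month of year, e.g. "01"
weeks  <- format(d, "%U") # week of year
days   <- format(d, "%d") # day of month
```

The extracted pieces can then be used as grouping variables, e.g. to summarize the data by month.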
Thanks for the advice. Venturing into this time series analysis is an extension of all the work I've done up until now. It provides context for the data. I'm biting off a large portion, but it should be the 'next step' for me.
Let's cut that cord!
Right, NA is a missing value. If you have an empty cell in an Excel file you import, R will replace it with an NA.
For values that you want to exclude from analysis or plotting:
-If you have a dataframe:
df <- data.frame(var1 = c("a","b","c","d","e"), value1 = c(1,2,3,4,5), value2 = c(10,20,30,40,50))
subset(df, value1 < 3)
subset(df, value1 < 4 & value2 > 20)
subset(df, value1 == 3)
subset(df, var1 == "c")
Often you might have mixed types of data in a single column in Excel, but you can't really work efficiently with mixed data types in a single column in R. If you read in a data set like "it" below, you can look for the stray characters (if there aren't too many), replace them with something sensible, convert the column to numeric, and recreate the dataframe:
it <- read.csv("it.csv")
> it
column1 column2
1 d 1
2 q 2
3 b b
4 d 5
5 d 6
6 g 7
column2_ <- gsub("b", 3, it$column2)
it2 <- data.frame(column1 = it[,1], column2_ = as.numeric(column2_))
> str(it2)
'data.frame': 6 obs. of 2 variables:
$ column1 : Factor w/ 4 levels "b","d","g","q": 2 4 1 2 2 3
$ column2_: num 1 2 3 5 6 7