Sunday, July 31, 2011

Blogging/tweeting from #ESA11


I will be blogging about the upcoming Ecological Society of America meeting in Austin, TX. I will focus on discussing talks/posters that:

  1. Have taken a cool approach to using data, or
  2. Have focused on open science/data, or
  3. Have done something cool with R software, or
  4. Are just exciting in general

I will also tweet throughout the meeting from @recology_ (yes, the underscore is part of the name; recology was already taken).

The hashtag for the meeting this year is #ESA11.

Friday, July 15, 2011

Archiving ecology/evolution data sets online


We now have many options for archiving data sets online:

Dryad, KNB, Ecological Archives, Ecology Data Papers, Ecological Data, etc.

However, as far as I know these portals largely do not communicate with one another, and there is no way to search across all of these data set sources at once. So I wonder whether it would be easier to find all these different data sets if the sites cloned their data sets to a place like Infochimps, or at least linked to them from Infochimps. Infochimps already has APIs (and Drew Conway has already set up an R wrapper for the Infochimps API: http://cran.r-project.org/web/packages/infochimps/index.html), they have discussions set up there, etc.
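For illustration, here is a minimal sketch of what querying a JSON API like Infochimps' could look like from R, using RCurl and rjson; the endpoint, query parameters, and key below are hypothetical (the infochimps package wraps the real calls, and its function names may differ):

library(RCurl)
library(rjson)

# hypothetical API key and endpoint; check the Infochimps docs for the real ones
api_key <- "YOUR_API_KEY"
url <- paste("http://api.infochimps.com/datasets?query=ecology&apikey=",
             api_key, sep = "")

raw <- getURL(url)        # fetch the JSON response as a string
results <- fromJSON(raw)  # parse the JSON into an R list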

Does it make sense to post data sets linked to published works on Infochimps? Probably not, now that I think about it. But perhaps it makes sense for other data sets, or subsets of data sets that are not linked to published works, to be posted there; as far as I know, Dryad at least only accepts data sets linked to published papers.

One use case: someone tweeted recently that his students were excited about getting their data sets on their resumes/CVs, but he didn't think there was anywhere to put them without the precondition that the data set be linked to a published work. This seems like a good opportunity to place these data sets on Infochimps; at least then they would be available where a lot of people are already searching for data sets.

What I think would be ideal is if Dryad, KNB, etc. linked their data sets to Infochimps, where they could be found; users could then either get them from Infochimps or, if necessary, go to the source site (e.g., Dryad). At the very least you could then search over all ecological data sets in one place.

Thursday, July 14, 2011

CRdata vs. Cloudnumbers

Cloudnumbers and CRdata are two new cloud computing services.


I tested the two services with a very simple script. The script creates a data frame of 10,000 numbers drawn via rnorm and assigns them to a factor with one of two levels (a or b). I then take the mean of each factor level with the aggregate function.


In CRdata you need to add some extra code to format the output for a browser window. For example, the output object in the last line below needs to be wrapped in <crdata_object>...</crdata_object> tags so it can be rendered in a browser, and likewise for anything else you would normally print to the console. You don't need this extra code in Cloudnumbers.

 
# create 10,000 random normal values, assigned to two groups of 5,000 each
dat <- data.frame(n = rnorm(10000), p = rep(c('a','b'), each=5000))

# mean of n for each level of the factor p
out <- aggregate(n ~ p, data = dat, mean)

# CRdata-specific: wrap the output object in these tags so it renders in the browser
#<crdata_object>out</crdata_object>


Here is a screenshot of the output from CRdata with the simple script above.

This simple script ran in about 20 seconds from starting the job to finishing. However, it seems like the only output option is HTML. Can that be right? HTML seems like a poor choice as the only option.


In Cloudnumbers you have to:

  1. Start a workspace and upload your R code file.
  2. Start a session...
  3. Choose your software platform...
  4. Choose your packages (one at a time; very slow)...
  5. Choose the number of clusters, etc.
  6. Finally, start the job.

It then initializes, and once you open the console it is like running R as you normally would, except on the web.
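For example, once the console is open, the same test script from above runs unchanged; this is ordinary R, nothing Cloudnumbers-specific:

# time the same test job from the Cloudnumbers console
system.time({
  dat <- data.frame(n = rnorm(10000), p = rep(c('a','b'), each=5000))
  out <- aggregate(n ~ p, data = dat, mean)
})
out  # prints to the console as usual; no special markup needed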


Who wins (at least for our very minimal example above)?

  1. Speed of entire process (not just running code): CRdata
  2. Ease of use: CRdata
  3. Cost: CRdata (free only)
  4. Least annoying: Cloudnumbers (you don't have to add in extra code to run your own code)
  5. Open source: CRdata (you can use publicly available code on the site)
  6. Long-term use: Cloudnumbers (more powerful, flexible, etc.)

I imagine Cloudnumbers could be faster for larger jobs, but you would have to pay for the speed of course. 
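To give a flavor of what such a larger job might look like, here is a minimal sketch using the snow package, assuming it is available on the cluster; I haven't actually run this on Cloudnumbers, and the worker count is arbitrary:

library(snow)

# 4 workers; on Cloudnumbers these would be the provisioned cluster nodes
cl <- makeCluster(4, type = "SOCK")

# 100 replicates of a bigger version of the test job, spread across the workers
res <- parLapply(cl, 1:100, function(i) {
  dat <- data.frame(n = rnorm(1e6), p = rep(c('a','b'), each = 5e5))
  aggregate(n ~ p, data = dat, mean)
})

stopCluster(cl)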

What I really want to see is a cloud computing service that accepts code run directly from R or RStudio. Hmmm...that would be so tasty indeed. I think Cloudnumbers may be able to do this, but I haven't tested it yet.

Perhaps using the server version of RStudio along with Amazon's EC2 is a better option than both of these. See Karthik Ram's post about using RStudio Server along with Amazon's EC2. Even just running RStudio Server on your Ubuntu machine or virtual machine is a pretty cool option, even without EC2 (it works like a charm on my Parallels Ubuntu VM on my Mac).