Thursday, December 29, 2011

Weecology can has new mammal dataset

Recology has moved - go to

So the Weecology folks have published a large dataset on mammal communities in a data paper in Ecology.  I know nothing about mammal communities, but that doesn't mean one can't play with the data...

Their dataset consists of five csv files:  communities, references, sites, species, and trapping data.

Where are these sites, and by the way, do they vary much in altitude?

Let's zoom in on just 'the states'?
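Zooming in boils down to subsetting the sites table by a latitude/longitude bounding box. Here is a minimal sketch in base R, assuming the sites have columns named latitude, longitude, and elevation (hypothetical names, check the actual csv header) and using a rough box for the contiguous US. The four rows are made-up toy values, not data from the paper.

```r
# Toy stand-in for the sites csv; column names and values are assumptions
sites <- data.frame(
  site_id   = 1:4,
  latitude  = c(35.1, 51.5, -33.9, 44.0),
  longitude = c(-106.6, -0.1, 18.4, -72.7),
  elevation = c(1600, 20, 50, 300)
)

# Rough bounding box for the contiguous US (approximate limits)
in_states <- with(sites,
  longitude >= -125 & longitude <= -66 &
  latitude  >=   24 & latitude  <=  50)

us_sites <- sites[in_states, ]

# How much do the retained sites vary in altitude?
elev_range <- range(us_sites$elevation)
```

The same logical index can be passed to whatever plotting function you like to draw only the retained sites.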

What phylogenies can we get for the species in this dataset?
We can use the rOpenSci package treebase to search the online phylogeny repository TreeBASE.  Limiting each query to a maximum of 1 tree (to save time), we can check which species appear in at least 1 tree in the TreeBASE database.  Nice. 

So 321 species in the dataset have at least 1 tree in the TreeBASE database.  Of course there could be many more trees per species, but we limited results from TreeBASE to just 1 tree per query. 
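The counting step can be sketched roughly like this. To keep it runnable offline, the TreeBASE query is replaced by a stand-in function, fake_search, which is purely hypothetical; in practice you would swap in a call to the treebase package. The species names are just examples.

```r
# Example species names; the real list comes from the species csv
species <- c("Peromyscus maniculatus", "Mus musculus", "Made up species")

# Hypothetical stand-in for a TreeBASE query: returns a list of trees,
# empty if nothing is found. Replace with a real search function.
fake_search <- function(taxon) {
  if (taxon == "Made up species") list() else list("tree")
}

# For each species, record whether the search returned at least 1 tree;
# errors (e.g., failed queries) are treated as "no tree found"
has_tree <- vapply(species, function(sp) {
  res <- tryCatch(fake_search(sp), error = function(e) list())
  length(res) > 0
}, logical(1))

n_with_tree <- sum(has_tree)
```

Wrapping each query in tryCatch means one failed lookup does not stop the loop over all species.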

Here's the code:

Friday, December 23, 2011

Recology is 1 yr old...

This blog has lasted a whole year already.  Thanks for reading and commenting.

There are a couple of announcements:
  1. Less blogging:  I hope to put in many more years blogging here, but in full disclosure, I am blogging for Journal of Ecology now, so I am going to be (and already have been) blogging less here. 
  2. More blogging:  If anyone wants to write guest posts at Recology on the topics of using R for ecology and evolution, or open science, please contact me! 
  3. Different blogging:  I was going to roll out the new dynamic views for this blog, but Google doesn't allow javascript, which is how I include code using GitHub gists. Oh well... 

Anywho, here is the breakdown of visits to this blog, visualized using #ggplot2, of course.  There were a total of about 23,000 pageviews in the first year of this blog.  Here is the pie chart code I used:

Visits to top ten posts:

Visits by pages:

Visits by top referring sites:

Visits by country:

Visits by browsers:

Visits by operating system:

Thursday, December 22, 2011

Tuesday, December 13, 2011

I Work For The Internet !

UPDATE: code and figure updated at 6:47 AM CST on 19 Dec '11.  Also, see Jarrett Byrnes' (improved) fork of my gist here.

The site I WORK FOR THE INTERNET is collecting pictures and first names (last name initials only) to show collective support against SOPA (the Stop Online Piracy Act).  Please stop by their site and add your name/picture.

I used the #rstats package twitteR, created by Jeff Gentry, to search for tweets from people signing this site with their picture, then plotted using ggplot2, and also used Hadley's lubridate to round timestamps on tweets to be able to bin tweets into time slots for plotting.
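The binning idea looks roughly like the following. This is a minimal sketch in base R rather than the lubridate code from the gist, and the four timestamps are made up for illustration.

```r
# Made-up tweet timestamps (not real data)
times <- as.POSIXct(
  c("2011-12-13 09:05:00", "2011-12-13 09:40:00",
    "2011-12-13 10:15:00", "2011-12-13 12:59:00"),
  tz = "UTC")

# Round each timestamp down to the start of its hour
hour_bin <- format(times, "%Y-%m-%d %H:00", tz = "UTC")

# Count tweets per hourly bin
counts <- as.data.frame(table(hour_bin), stringsAsFactors = FALSE)
names(counts) <- c("hour", "tweets")
```

The resulting counts data frame has one row per time slot, which is the shape a bar or line plot wants.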

Tweets containing the phrase 'I work for the internet' by time:

Here's the code as a GitHub gist.   Sometimes the searchTwitter function returns an error, which I don't understand, but you can play with it:

Wednesday, November 30, 2011

rOpenSci won 3rd place in the PLoS-Mendeley Binary Battle!

I am part of the rOpenSci development team (along with Carl Boettiger, Karthik Ram, and Nick Fabina).   Our website:  Code at Github:

We entered two of our R packages for integrating with PLoS Journals (rplos) and Mendeley (RMendeley) in the Mendeley-PLoS Binary Battle.  Get them at GitHub (rplos, RMendeley).

These two packages allow users (from R! of course) to search and retrieve data from PLoS journals (including their altmetrics data), and from Mendeley.  You could surely mash up data from both PLoS and Mendeley.  That's what's cool about rOpenSci - we provide the tools, and leave it up to users' vast creativity to do awesome things.

3rd place gives us a $1,000 prize, plus a Parrot AR Drone helicopter.

Friday, November 18, 2011

My talk on doing phylogenetics in R

I gave a talk today on doing very basic phylogenetics in R, including getting sequence data, aligning sequence data, plotting trees, doing trait evolution stuff, etc.

Please comment if you have code for doing Bayesian phylogenetic inference in R.  I know phyloch has the function mrbayes, but I can't get it to work...

Tuesday, November 1, 2011

Check out a video of my research at RocketHub

Okay, so this post isn't at all about R - but I can't resist begging my readers for some help.

I’m trying to get some crowdfunding for my research on the evolution of native plants in agricultural landscapes. My campaign is part of a larger project by about 50 other scientists and me to see how well it works to go straight to the public to get funding for science research. All these projects, including mine, are hosted at a site called RocketHub - a site that hosts crowdfunding projects of all sorts – and now they have science.

It is important to get a few bucks at the beginning so that people with deep pockets who don’t know me will hopefully chip in once they see the money ball rolling.

The funding will go towards paying some students to collect data in the lab for me.

Here’s the link if you want to donate, or just to check out the video I made about my research!

And watch the video here too:

Thursday, October 27, 2011

Two new rOpenSci R packages are on CRAN

Carl Boettiger, a graduate student at UC Davis, just got two packages on CRAN.  One is treebase, which handshakes with the Treebase API.  The other is rfishbase, which connects with Fishbase, although I believe it just scrapes XML content as there is no API.  See development on GitHub for treebase here, and for rfishbase here.  Carl has some tutorials on treebase and rfishbase at his website here, and we have an official rOpenSci tutorial for treebase here.

Basically, these two R packages let you search and pull down data from Treebase and Fishbase - pretty awesome.  This improves workflow, and puts your data search and acquisition component into your code, instead of being a bunch of mouse clicks in a browser.

These two packages are part of the rOpenSci project.

Wednesday, October 26, 2011

Two-sex demographic models in R

Tom Miller (a prof here at Rice) and Brian Inouye have a paper out in Ecology (paper, appendices) that confronts two-sex models of dispersal with empirical data.

They conducted the first confrontation of two-sex demographic models with empirical data on lab populations of bean beetles Callosobruchus.

Their R code for the modeling work is available at Ecological Archives (link here).

Here is a figure made from running the five blocks of code in 'Miller_and_Inouye_figures.txt' that reproduces Fig. 4 (A-E) in their Ecology paper (p = proportion female, Nt = density).  Nice!
A: Saturating density dependence
B: Over-compensatory density dependence
C: Sex-specific gamma's (but bM=bF=0.5)
D: Sex-specific b's (but gammaM=gammaF=1)
E: Sex-specific b's (but gammaM=gammaF=2)

Friday, October 14, 2011

New food web dataset

So, there is a new food web dataset out that was put in Ecological Archives here, and I thought I would play with it. The food web is from Otago Harbour, an intertidal mudflat ecosystem in New Zealand. The web contains 180 nodes, with 1,924 links.
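Those two numbers are enough to get a couple of standard food web summaries. A quick sketch, assuming the usual definitions of link density (L/S) and directed connectance (L/S^2); whether the data paper uses these exact definitions is not something I checked.

```r
S <- 180    # number of nodes in the web
L <- 1924   # number of links

link_density <- L / S     # links per node, about 10.7
connectance  <- L / S^2   # fraction of possible links realized, about 0.059
```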

Fun stuff...

igraph, default layout plot

igraph, circle layout plot, nice

My funky little gggraph function plot
get the gggraph function, and make it better, here at Github

Thursday, October 13, 2011

Phylogenetic community structure: PGLMMs

So, I've blogged about this topic before, way back on 5 Jan this year.

Matt Helmus, a postdoc in the Wootton lab at the University of Chicago, published a paper with Anthony Ives in Ecological Monographs this year (abstract here).  The paper addressed a new statistical approach to phylogenetic community structure.

As I said in the original post, part of the power of the PGLMM (phylogenetic generalized linear mixed models) approach is that you don't have to conduct quite so many separate statistical tests as with the previous null model/randomization approach.

Their original code was written in Matlab.  Here I provide the R code that Matt has so graciously shared with me.  There are four functions and a fifth file has an example use case.  The example and output are shown below.

Look for the inclusion of Matt's PGLMM in the picante R package in the future.

Here are links to the files as GitHub gists:


The example

..and the figures...

Thursday, October 6, 2011

R talk on regular expressions (regex)

Regular expressions are a powerful tool in any language to manipulate, search, etc. data.

For example:

> fruit <- c("apple", "banana", "pear", "pineapple")
> fruit
[1] "apple"     "banana"    "pear"      "pineapple"
> grep("a", fruit) # there is an "a" in each of the words
[1] 1 2 3 4
> strsplit("a string", "s") # strsplit splits the string on the "s"
[[1]]
[1] "a "    "tring"

R base has many functions for regular expressions, see slide 9 of Ed's talk below.  The package stringr, created by Hadley Wickham, is a nice alternative that wraps the base regex functions for easier use. I highly recommend stringr.
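A few more of the base regex functions, as a small sketch using the same fruit vector (base R only; stringr has close equivalents for each of these):

```r
fruit <- c("apple", "banana", "pear", "pineapple")

# grepl returns TRUE/FALSE per element: which words start with "p"?
starts_p <- grepl("^p", fruit)

# sub replaces only the first match in each string
first_a_upper <- sub("a", "A", fruit)

# gsub replaces every match: strip all vowels
no_vowels <- gsub("[aeiou]", "", fruit)

# regexpr + regmatches extract the first run of vowels in each word
first_vowels <- regmatches(fruit, regexpr("[aeiou]+", fruit))
```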

Ed Goodwin, the coordinator of the Houston R Users group, gave a presentation to the group last night on regular expressions in R. It was a great talk, and he is allowing me to post his talk here.

Enjoy!  And thanks for sharing Ed!

Friday, September 30, 2011

R tutorial on visualizations/graphics

Rolf Lohaus, a Huxley postdoctoral fellow here in the EEB dept at Rice University, gave our R course a talk on basic visualizations in R this morning.


Tuesday, September 27, 2011

Short on funding? Can't get a grant? Crowdfunding! #SciFund

Crowdsourced funding is becoming a sustainable way for various artists, entrepreneurs, etc. to get their idea funded from individuals. For example, think of Kickstarter and RocketHub.

Jai Ranganathan and Jarrett Byrnes have started an experiment to determine how well crowdfunding can work for scientists: The SciFund Challenge. Go here to signup and here for their website.

The deadline to sign up is Oct. 1.

Thursday, September 22, 2011

@drewconway interview on @DataNoBorders at the Strata conference

The O'Reilly Media Strata Summit has many interviews on YouTube (just search YouTube for it).

Drew Conway is the author of R packages including infochimps, an R wrapper to the Infochimps API service.

The YouTube video:

Open science talk by Carl Boettiger

Carl Boettiger gave a talk on the topic of open science to incoming UC Davis graduate students.

Here is the audio:

Here are the slides:

Friday, September 9, 2011

My take on an R introduction talk

UPDATE: I put in an R tutorial as a Github gist below.

Here is a short intro R talk I gave today...for what it's worth...

A Data Visualization Book

Note: thanks to Scott for inviting me to contribute to the Recology blog despite being an ecology outsider; my work is primarily in atomic physics. -Pascal
A part of me has always liked thinking about how to effectively present information, but until the past year, I had not read much to support my (idle) interest in information visualization. That changed in the spring when I read Edward Tufte's The Visual Display of Quantitative Information, a book that stimulated me to think more deeply about presenting information. I originally started with a specific task in mind--a wonderful tool for focusing one's interests--but quickly found that Tufte's book was less a practical guide and more a list of general design principles. Then, a few months ago, I stumbled upon Nathan Yau's blog, FlowingData, and found out he was writing a practical guide to design and visualization. Conveniently enough for me, Yau's book, Visualize This, would be released within a month of my discovery of his blog; what follows are my impressions of Visualize This.
I have liked Visualize This a lot.  Yau writes with much the same informal tone as on his blog, and the layout is visually pleasing (good thing, too, for a book about visualizing information!).  The first few chapters are pretty basic if you have done much data manipulation before, but it is really nice to have something laid out so concisely.  The examples are good, too, in that he is very explicit about every step: there is no intuiting what that missing step should be.  The author even acknowledges in the introduction that the first part of the book is at an introductory level.
Early in the book, Yau discusses where to obtain data. This compilation of sources is potentially a useful reference for someone, like me, who almost always generates his own data in the lab. Unfortunately, Yau does not talk much about preparation of (or best practices for) your own data.  Additionally, from the perspective of a practicing scientist, it would have been nice to hear about how to archive data to make sure it is readable far into the future, but that is probably outside the scope of the book.
Yau seems really big into using open source software for getting and analyzing data (e.g. Python, R, etc…), but he is surprisingly attached to the proprietary Adobe Illustrator for turning figures into presentation quality graphics.  He says that he feels like the default options in most analysis programs do not make for very good quality graphics (and he is right), but he does not really acknowledge that you can generate nice output if you go beyond the default settings.  For me, the primary advantage of generating output programmatically is that it is easy to regenerate when you need to change the data or the formatting on the plot.  Using a graphical user interface, like in Adobe Illustrator, is nice if you are only doing something once (how often does that happen?), but when you have to regenerate the darn figure fifty times to satisfy your advisor, it gets tedious to move things around pixel by pixel.
By the time I reached the middle chapters, I started finding many of the details to be repetitive. Part of this repetition stems from the fact that Yau divides these chapters by the type of visualization. For example, "Visualizing Proportions" and "Visualizing Relationships" are two of the chapter titles. While I think these distinctions are important ones for telling the right story about one's data, creating figures for the different data types often boils down to choosing different functions in R or Python. People with less analysis and presentation experience should find the repetition helpful, but I increasingly skimmed these sections as I went along.  
Working through Yau's examples for steps you do not already know would probably be the most useful way of getting something out of the book.  So, for example, I started trying to use Python to scrape data from a webpage, something I had not previously done.  I followed the book's example of this data-scraping just fine, but as with most things in programming, you find all sorts of minor hurdles to clear when you try your own thing. In my case, I am re-learning the Python I briefly learned about 10 years ago--partly in anticipation of not having access to Matlab licenses once I vacate the academy--since I have forgotten a lot of the syntax.  A lot of this stuff would be faster if I were working in Matlab which I grew more familiar with in graduate school.
Overall, Visualize This is a really nice looking book and will continue to be useful to me as a reference. Yau concludes his book with a refreshing reminder to provide context for the data we present. This advice is particularly relevant when presenting to a wider or lay audience, but it is still important for us, as scientists, to clearly communicate our findings in the literature. Patterns in the data are not often self-evident, and therefore we should think carefully about which visualization tools will best convey the meaning of our results.
Edited to add a link to Visualize This here and in the introductory paragraph.

Thursday, September 8, 2011

FigShare Talk

FigShare - I very much like this idea of a place to put your data online that is NOT published. Dryad is a nice place for datasets linked with published papers, but there isn't really a place for datasets that perhaps did not make the cut for a published paper, and if known to the scientific community, could potentially help resolve the "file-drawer" effect in meta-analyses. (wow, run on sentence)

"Figshare - Why don't you publish all your research?" Mark Hahnel Imperial College London from London Biogeeks on Vimeo.

Wednesday, August 31, 2011

rnpn: An R interface for the National Phenology Network

The team at rOpenSci and I have been working on a wrapper for the USA National Phenology Network API. The following is a demo of some of the current possibilities. We will have more functions down the road. Get the publicly available code, and contribute, at Github here. If you try this out look at the Description file for the required R packages to run rnpn. Let us know at Github (here) or at our website, or in the comments below, or on twitter (@rOpenSci), what use cases you would like to see with the rnpn package.

Method and demo of each:
Get observations for species by day
From the documentation: "This function will return a list of species, containing all the dates which observations were made about the species, and a count of the number of such observations made on that date."

#### Note, the data below is truncated for blogging brevity...
> getobsspbyday(c(1, 2), '2008-01-01', '2011-12-31') # Searched for species 1 and 2 from Jan 1, 2008 to Dec 31, 2011
          date count   species
1   2009-03-08     2 species 1
2   2009-03-15     1 species 1
3   2009-03-22     1 species 1
4   2009-03-24     1 species 1
5   2009-03-26     1 species 1
6   2009-04-17     1 species 1
7   2009-04-24     1 species 1
8   2009-05-12     1 species 1
9   2009-05-20     1 species 1
10  2009-11-24     1 species 1
11  2009-12-07     1 species 1
12  2010-01-18     1 species 1
13  2010-01-23     1 species 1
62  2011-05-29     1 species 1
63  2011-06-27     1 species 1
64  2011-06-30     2 species 1
65  2009-03-17     1 species 2
66  2009-04-03     3 species 2
67  2009-04-05     3 species 2
68  2009-04-10     3 species 2
69  2009-04-17     3 species 2

Get individuals at specific stations
From the documentation: "This function returns all of the individuals at a series of stations."

> getindsatstations(c(507, 523)) # Searched for any individuals at stations 507 and 523
   individual_id individual_name species_id kingdom
1           1200         dogwood         12 Plantae
2           1197    purple lilac         36 Plantae
3           1193         white t         38 Plantae
4           3569     forsythia-1         73 Plantae
5           1206            jack        150 Plantae
6           1199      trout lily        161 Plantae
7           1198           dandy        189 Plantae
8           1192           red t        192 Plantae
9           1710    common lilac         36 Plantae
10          1711  common lilac 2         36 Plantae
11          1712       dandelion        189 Plantae

Get individuals of species at stations
From the documentation: "This function will return a list of all the individuals, which are members of a species, among  any number of stations."

> getindspatstations(35, c(60, 259), 2009)  # Search for individuals of species 35 at stations 60 and 259 in year 2009
  individual_id individual_name number_observations
1          1715            west                   5
2          1716            east                   5

Get comment associated with particular observation
From the documentation: "This function will return the comment associated with a particular observation."

> getobscomm(1938) # The comment for observation number 1938
[1] "some lower branches are bare"

Monday, August 22, 2011

Tenure track position in systematics at the University of Vermont

There is an awesome position opening up for an assistant professor in systematics at the University of Vermont. Below is the announcement, and see the original post at the Distributed Ecology blog. Why is this related to R? One can do a lot of systematics work in R, including retrieving scientific collections data through an upcoming package handshaking with VertNet (part of the rOpenSci project), managing large data sets, retrieval of GenBank data through the ape package (see the function read.genbank), phylogenetic reconstruction and analysis, and more. So I am sure a systematist with R ninja skills will have a leg up on the rest of the field. 

Assistant Professor in Systematics

Department of Biology
University of Vermont
Burlington, Vermont

The Department of Biology of the University of Vermont seeks applications for a tenure-track Assistant Professor position in Systematics and Evolutionary Biology of arthropods, especially insects. The position will be open in the fall of 2012. The successful candidate will have expertise in classical and molecular systematics, including analysis of complex data sets. Candidates pursuing phylogenomics and innovative methods in bioinformatics in combination with taxonomy are especially encouraged to apply. Department information at:

All applicants are expected to: 1) hold a Ph.D. degree in relevant disciplines and have two or more years of postdoctoral experience; 2) develop a competitively funded research program; 3) teach undergraduate courses (chosen from among general biology, evolution, systematic entomology, and others in the candidate's expertise); 4) teach, mentor and advise undergraduate and graduate students; and 5) oversee a natural history collection of historic significance.

Candidates must apply online: On left see "Search Postings" then find "Biology" under "HCM Department" then posting 0040090 (first posting). Sorry, but we cannot supply the direct link because it will time out.

Attach a cover letter with a statement of research focus and teaching interests (one document), a curriculum vitae, representative publications, and the contact information of three references.

Review of applications will begin on September 15, 2011, and will continue until the position is filled. Questions and up to three additional publications may be directed to Dr. Jos. J. Schall:

The University of Vermont recently identified several "Spires of Excellence" in which it will strategically focus institutional investments and growth over the next several years. One spire associated with the position is Complex Systems. Candidates whose research interests align with this spire are especially encouraged to apply.
The University seeks faculty who can contribute to the diversity and excellence of the academic community through their research, teaching, and/or service. Applicants are requested to include in their cover letter information about how they will further this goal. The University of Vermont is an Affirmative Action/Equal Opportunity employer. The Department is committed to increasing faculty diversity and welcomes applications from women, underrepresented ethnic, racial and cultural groups, and from people with disabilities.

Friday, August 12, 2011

Thursday at #ESA11

Interesting talks/posters:

Richard Lankau presented research on trade-offs and competitive ability. He suggests that during range expansion, selection for increased intraspecific competitive ability in older populations leads to a loss of interspecific competitive ability, due to trade-offs between these traits.

Ellner emphatically states that rapid evolution DOES matter for ecological responses, and longer-term evolutionary patterns as well. [His paper on the talk he was giving came out prior to his talk, which he pointed out, good form sir]

Lauren Sullivan gave an interesting talk on bottom up and top down effects on plant reproduction in one site of a huge network of sites doing similar nutrient and herbivory manipulations around the globe - NutNet (go here:

Laura Prugh shows in California that the engineering effects (i.e., the mounds that they make) of giant kangaroo rats are more important for the associated food web than the species interaction effects (the proxy used was just density of rats).

Kristy Deiner suggests that chironomids are more phylogenetically similar in lakes with stocked fish relative to fishless lakes, in high elevation lakes in the Sierra Nevada. She used barcode data to generate her phylogeny of chironomids. If you have barcode data and want to search the BOLD Systems site, one option is doing it from R using rbold, a package under development at rOpenSci (code at Github).

Jessica Gurevitch presented a large working group's methods/approach to a set of reviews on invasion biology. We didn't get to see a lot of results from this work, but I personally was glad to see her explaining to a packed room the utility of meta-analysis, and comparing to the medical field in which meta-analysis is sort of the gold standard by which to draw conclusions.

Following Jessica, Jason Fridley told us about the Evolutionary Imbalance Hypothesis (EIH) (see my notes here). He posed the problem of, when two biotas come together, what determines which species are retained in this new community and which species are left out. He listed a litany of traits/responses to measure to get at this problem, but suggested that with a little bit of "desktop ecology", we could simply ask: Is the invasibility of X region related to the phylogenetic diversity of that region? In three destination regions (Eastern Deciduous Forests, Mediterranean California, and the Czech Republic) out of four there was a positive relationship between proportion of invasive plant species in a source region and the phylogenetic diversity of the source regions.

    Thursday, August 11, 2011

    Wednesday at #ESA11

    Interesting talks/posters:

    • Ethan White's poster describing was of course awesome given my interest in getting data into the hands of ecologists over at Ethan also has software you can download on your machine to get the datasets you want easily - EcoData Retriever. [rOpenSci will try to take advantage of their work and allow you to call the retriever from R]
    • Carl Boettiger's talk was awesome. He explained how we need better tools to be able to predict collapses using early warning signals. He developed a way to estimate the statistical distribution of probabilities of system collapse. 
    • Jennifer Dunne: Explained how she put together an ancient network from Germany. Bravo. 
    • Carlos Melian explained his model of network buildup that starts from individuals, allows speciation, and other evolutionary processes. 
    • Rachel Winfree told us that in two sets of mutualistic plant-pollinator networks in New Jersey and California, the least connected pollinator species were the most likely to be lost from the network with increasing agricultural intensity. 
    • Dan Cariveau suggests that pollination crop services can be stabilized even with increasing agriculture intensity if in fact pollinator species respond in different ways. That is, some pollinators may decrease in abundance with increasing ag intensity, while other species may increase - retaining overall pollination services to crops.

      Monday, August 8, 2011

      Monday at #ESA11

      Monday was a good day at ESA in Austin. There were a few topics I promised to report on in my blogging/tweeting.

      ...focused on open source data. Carly Strasser's presentation on guidelines for data management was awesome (including other talks in the symposium on Creating Effective Data Management Plans for Ecological Research). Although this was a good session, I can't help but wish that they had hammered home the need for open science more. Oh well. Also, they talked a lot about how, and not a lot of why we should properly curate data. Still, a good session. One issue Carly and I talked about was tracking code in versioning systems such as Github. There doesn't seem to be a culture of versioning code for analyses/simulations in ecology. But when we get will be easier to share/track/collaborate on  code.

      ...used R software. David Jennings talked about a meta-analysis asking if phylogenetic distance influences competition strength in pairwise experiments. David used the metafor package in R to do his meta-analysis. Good form sir.

      ...did cool science. Matt Helmus presented a great talk on phylogenetic species area curves (likely using R, or Matlab maybe?).

      p.s. We launched rOpenSci today.

      • The Tilman effect - Tilman's talk was so packed it looked like there was a line waiting to get into a trendy bar. Here's a picture (credit: Jaime Ashander). Bigger room next time anyone? 
      • Wiley came out with an open access journal called Ecology and Evolution. This brings them to 3 open access journals (the other two are in other fields). We (rOpenSci) will attempt to hand-shake with these journals. 
      • The vegetarian lunch option was surprisingly good. Nice. 

      (#ESA11) rOpenSci: a collaborative effort to develop R-based tools for facilitating Open Science

      Our development team would like to announce the launch of rOpenSci. As the title states, this project aims to create R packages to make open science more available to researchers.

      What this means is that we seek to connect researchers using R with as much open data as possible, mainly through APIs. There are a number of R packages that already do this (e.g., infochimps, twitteR), but we are making more packages, e.g., for Mendeley, PLoS Journals, and taxonomic sources (ITIS, EOL, TNRS, Phylomatic, UBio).

      Importantly, we are creating a package called rOpenSci, which aims to integrate functions from packages for individual open data sources.

      If you are somewhat interested, follow our progress on our website, on Twitter, or contact us. If you are really^2 interested you could go to Github and contribute. If  you are really^3 interested, join our development team.

      Sunday, July 31, 2011

      Blogging/tweeting from #ESA11

      I will be blogging about the upcoming Ecological Society of America meeting in Austin, TX. I will focus on discussing talks/posters that:

      1. Have taken a cool approach to using data, or
      2. Have focused on open science/data, or
      3. Have done something cool with R software, or
      4. Are just exciting in general

      I will also tweet throughout the meeting from @recology_ (yes the underscore is part of the name, recology was already taken). 

      The hashtag for the meeting this year is #ESA11

      Friday, July 15, 2011

      Archiving ecology/evolution data sets online

      We now have many options for archiving data sets online:

      Dryad, KNB, Ecological Archives, Ecology Data Papers, Ecological Data, etc.

      However, these portals largely do not communicate with one another as far as I know, and there is no way to search over all data set sources, again, as far as I know. So, I wonder if it would ease finding of all these different data sets to get these different sites to get their data sets cloned on a site like Infochimps, or have links from Infochimps.  Infochimps already has APIs (and there's an R wrapper for the Infochimps API already set up here: by Drew Conway), and they have discussions set up there, etc.

      Does it make sense to post data sets linked to published works on Infochimps? I think probably not know that I think about it. But perhaps it makes sense for other data sets, or subsets of data sets that are not linked with published works to be posted there as I know at least Dryad only accepts data sets linked with published papers.

      One use case: someone tweeted recently that his students were excited about getting their data sets on their resume/CV, but didn't think there was any place to put them that didn't require the data set to be linked with a published work. This seems like a good opportunity to place these datasets on Infochimps, where at least they would be available in a place where a lot of people are searching for data sets.

      What I think would be ideal is if Dryad, KNB, etc. could link their datasets to Infochimps, where they could be found. Users could then either get them from Infochimps, or perhaps they would have to go to, e.g., the Dryad site. But at least you could then search over all ecological data sets.

      Thursday, July 14, 2011

      CRdata vs. Cloudnumbers

      Cloudnumbers and CRdata are two new cloud computing services.

      I tested the two services with a very simple script. The script simply creates a dataframe of 10000 numbers via rnorm, and assigns them to a factor of one of two levels (a or b). I then take the mean of the two factor levels with the aggregate function.

      In CRdata you need to add some extra code to format the output for a browser window. For example, the last line below needs to have '<crdata_object>' on both sides of the output object so it can be rendered in a browser, and likewise for anything else that one would print to a console. You don't need this extra code in Cloudnumbers.

      dat <- data.frame(n = rnorm(10000), p = rep(c('a','b'), each=5000))
      out <- aggregate(n ~ p, data = dat, mean)

      Here is a screenshot of the output from CRdata with the simple script above.

      This simple script ran in about 20 seconds from starting the job to finishing. However, it seems like the only output option is html. Can this be right? If so, that is a terribly limiting choice.

      In Cloudnumbers you have to start a workspace and upload your R code file.
      Then, start a session...
      choose your software platform...
      choose packages (one at a time, very slow)...
      then choose number of clusters, etc.
      Then finally start the job.
      Then it initializes, and finally you can open the console.
      From here it is like running R as you normally would, except on the web.

      Who wins (at least for our very minimal example above)?

      1. Speed of entire process (not just running code): CRdata
      2. Ease of use: CRdata
      3. Cost: CRdata (free only)
      4. Least annoying: Cloudnumbers (you don't have to add in extra code to run your own code)
      5. Opensource: CRdata (you can use publicly available code on the site)
      6. Long-term use: Cloudnumbers (more powerful, flexible, etc.)

      I imagine Cloudnumbers could be faster for larger jobs, but you would have to pay for the speed of course. 

      What I really want to see is a cloud computing service that accepts code directly run from R or RStudio. Hmmm...that would be so tasty indeed. I think Cloudnumbers may be able to do this, but haven't tested it yet.  

      Perhaps using the server version of RStudio along with Amazon's EC2 is a better option than both of these. See Karthik Ram's post about using RStudio server along with Amazon's EC2. Even just running RStudio server on your Ubuntu machine or virtual machine is a pretty cool option, even without EC2 (works like a charm on my Parallels Ubuntu vm on my Mac).

      Tuesday, June 28, 2011

      rbold: An R Interface for Bold Systems barcode repository

      Have you ever wanted to search and fetch barcode data from Bold Systems?

      I am developing functions to interface with Bold from R. I just started, but hopefully folks will find it useful.

      The code is at Github here. The two functions are still very buggy, so please bring up issues below, or in the Issues area on Github. For example, some searches work and other similar searches don't. Apologies in advance for the bugs.

      Below is a screenshot of an example query using function getsampleids to get barcode identifiers for specimens. You can then use getseqs function to grab barcode data for specific specimens or many specimens.
      [Screenshot: example query using getsampleids]

      Wednesday, June 22, 2011

      iEvoBio 2011 Synopsis

      We just wrapped up the 2011 iEvoBio meeting. It was awesome! If you didn't go this year or last year, definitely think about going next year.

      Here is a list of the cool projects that were discussed at the meeting (apologies if I left some out):
      1. Vistrails: workflow tool, awesome project by Claudio Silva
      2. Commplish: purpose is to use via APIs, not with the web UI
      3. Phylopic: a database of life-form silhouettes, including an API for remote access, sweet!
      4. Gloome
      5. MappingLife: awesome geographic/etc data visualization interface on the web
      6. SuiteSMA: visualizing multiple alignments
      7. treeBASE: R interface to treebase, by Carl Boettiger
      8. VertNet: database for vertebrate natural history collections
      9. RevBayes: revamp of MrBayes, with GUI, etc. 
      10. Phenoscape Knowledge Base
        • Peter Midford lightning talk: talked about matching taxonomic and genetic data
      11. BiSciCol: biological science collections tracker
      12. Ontogrator 
      13. TNRS: taxonomic name resolution service
      14. Barcode of Life data systems, and remote access
      15. Moorea Biocode Project
      16. Microbial LTER's data
      17. BirdVis: interactive bird data visualization (Claudio Silva in collaboration with Cornell Lab of Ornithology)
      18. Crowdlabs: I think the site is down right now, another project by Claudio Silva
      19. Phycas: Bayesian phylogenetics, can you just call this from R?
      20. RIP MrBayes!!!! replaced by RevBayes (see 9 above)
      21. Slides of presentations will be at Slideshare (not all presentations up yet)          
      22. A birds of a feather group I was involved in proposed an idea (TOL-o-matic) like Phylomatic, but of broader scope, for easy access and submission of trees, and perhaps even social (think just pushing a 'SHARE' button within PAUP, RevBayes, or other phylogenetics software)! 
      23. Synopses of Birds of a Feather discussion groups:

      Tuesday, June 21, 2011

      PLoS journals API from R: "rplos"

      The Public Library of Science (PLoS) has an API so that developers can create cool tools to access their data (including full text papers!!).

      Carl Boettiger at UC Davis and I are working on R functions that use the PLoS API. See our code on Github here. See the wiki at the Github page for examples of use. We hope to deploy rplos as a package someday soon. Please feel free to suggest changes/additions to rplos in the comments below or on the Github/rplos site.

      Get your own API key here.

      Friday, June 10, 2011

      OpenStates from R via API: watch your elected representatives

      I am writing some functions to acquire data from the OpenStates project, via their API. They have a great support community at Google Groups as well.

      On its face this post is not obviously about ecology or evolution, but well, our elected representatives do, so to speak, hold our environment in a noose, ready to let the Earth hang any day.

      Code I am developing is over at Github.

      Here is an example of its use in R, using the Bill Search option (billsearch.R on my Github site). In this case you do not provide your API key in the function call; instead you put it in your .Rprofile file, which is run when you open R. Here we search for the term 'agriculture' in Texas ('tx'), in the 'upper' chamber.

      > temp <- billsearch('agriculture', state = 'tx', chamber = 'upper')
      > length(temp)
      [1] 21
      > temp[[1]]
      [1] "Congratulating John C. Padalino of El Paso for being appointed to the United States Department of Agriculture."
      [1] "2010-08-11 07:59:46"
      [1] "2010-09-02 03:34:39"
      [1] "upper"
      [1] "tx"
      [1] "81"
      [1] "resolution"
      [1] "Resolutions"
      [1] "Other"
      [1] "SR 1042"

      Apparently, the first bill that came up (SR 1042, see $bill_id at the bottom of the list output) was to congratulate John C. Padalino for being appointed to the USDA.
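
The .Rprofile approach mentioned above could look something like the sketch below. The option name OpenStatesKey and the helper get_api_key are hypothetical, chosen only for illustration; they are not necessarily what billsearch.R actually uses.

```r
# In ~/.Rprofile you would add a line like this
# (OpenStatesKey is a hypothetical option name):
# options(OpenStatesKey = "your-api-key-here")

# Hypothetical helper: look up the key from options() so it never
# has to be typed into a function call or saved in a script.
get_api_key <- function(key = getOption("OpenStatesKey")) {
  if (is.null(key)) {
    stop("No API key found; set options(OpenStatesKey = ...) in your .Rprofile")
  }
  key
}
```

A function like billsearch could then use get_api_key() as the default value of its key argument, so the key stays out of any code you share.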

      The other function I have ready is getting basic metadata on a state, called statemetasearch.

      I plan to develop more functions for all the possible API calls to the OpenStates project.

      Tuesday, June 7, 2011

      How to fit power laws

      A new paper out in Ecology by Xiao and colleagues (in press, here) compares the use of log-transformation to non-linear regression for analyzing power-laws.

      They suggest that the error distribution should determine which method performs better. When your errors are additive, homoscedastic, and normally distributed, they propose using non-linear regression. When errors are multiplicative, heteroscedastic, and lognormally distributed, they suggest using linear regression on log-transformed data. The two methods make different assumptions about the error structure, so both cannot be correct for a single dataset.
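
To make the two approaches concrete, here is a minimal sketch on simulated data using only base R (lm and nls). This is not the authors' code; the parameter values, sample size, and variable names are all made up for illustration.

```r
set.seed(42)

# Simulate a power law y = a * x^b under the two error structures.
# The values of a_true, b_true, n and the error sds are arbitrary.
n      <- 200
a_true <- 2
b_true <- 0.75
x      <- runif(n, 1, 100)

# Multiplicative, lognormal error: suits regression on log-transformed data
y_mult <- a_true * x^b_true * exp(rnorm(n, sd = 0.3))
fit_lr <- lm(log(y_mult) ~ log(x))

# Additive, normal, homoscedastic error: suits non-linear regression
y_add <- a_true * x^b_true + rnorm(n, sd = 1)

# Get rough starting values for nls from a log-log fit on the positive values
pos    <- y_add > 0
rough  <- coef(lm(log(y_add[pos]) ~ log(x[pos])))
fit_nl <- nls(y_add ~ a * x^b,
              start = list(a = exp(rough[[1]]), b = rough[[2]]))

# Exponent estimates from each method
b_lr <- coef(fit_lr)[["log(x)"]]
b_nl <- coef(fit_nl)[["b"]]
```

In the log-log fit the exponent is the slope and the intercept estimates log(a); in the nls fit both a and b are estimated on the original scale.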

      They will provide their R code for their methods once they are up on Ecological Archives (they weren't up there by the time of this post).

      Friday, June 3, 2011

      searching ITIS and fetching Phylomatic trees

      I am writing a set of functions to search ITIS for taxonomic information (more databases to come) and functions to fetch plant phylogenetic trees from Phylomatic. Code at github.

      Also, see the examples in the demos folder on the Github site above.

      Wednesday, May 18, 2011

      phylogenetic signal simulations

      I did a little simulation to examine how K and lambda vary in response to tree size (and how they compare to each other on the same simulated trees). I use Liam Revell's functions fastBM to generate traits, and phylosig to measure phylogenetic signal.

      Two observations: 

      First, it seems that lambda is more sensitive than K to tree size, but then lambda levels out at about 40 species, whereas K continues to vary around a mean of 1.

      Second, K is more variable than lambda at all levels of tree size (compare standard error bars).

      Does this make sense to those smart folks out there?

      Tuesday, May 17, 2011

      A simple function for plotting phylogenies in ggplot2

      UPDATE: Greg Jordan has a much more elegant way of plotting trees with ggplot2. See his links in the comments below.

      I wrote a simple function for plotting a phylogeny in ggplot2. However, it only handles a 3 species tree right now, as I haven't figured out how to generalize the approach to N species.

      Any ideas on how to improve this?

      Friday, May 13, 2011

      plyr's idata.frame VS. data.frame

      I had seen the function idata.frame in plyr before, but not really tested it. From the plyr documentation:

      "An immutable data frame works like an ordinary data frame, except that when you subset it, it
      returns a reference to the original data frame, not a a copy. This makes subsetting substantially
      faster and has a big impact when you are working with large datasets with many groups."

      For example, although baseball is a data.frame, its immutable counterpart is a reference to it:

      > idata.frame(baseball)
      <environment: 0x1022c74e8>
      [1] "idf"         "environment"
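
The "environment" class in that output is the key to the reference behaviour. Here is a small base R sketch (it does not use plyr at all) of how environments differ from ordinary data frames when you assign and then modify them:

```r
# Ordinary data frames are copied when the copy is modified
df  <- data.frame(x = 1:3)
df2 <- df
df2$x[1] <- 99L
df_first <- df$x[1]    # the original is unchanged

# Environments are shared by reference: e and e2 are the same object
e   <- new.env()
e$x <- 1:3
e2  <- e
e2$x[1] <- 99L
env_first <- e$x[1]    # the change made through e2 is visible through e
```

Avoiding the copy is plausibly where the speed gains below come from, although this sketch only shows the general mechanism and not plyr's internals.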

      Here are a few comparisons of operations on normal data frames and immutable data frames. Immutable data frames don't work with the doBy package, but do work with aggregate in base R.  Overall, the speed gains using idata.frame are quite impressive - I will use it more often for sure.

      Get the github code below here.

      Here are the comparisons of idata.frames and data.frames:

      > # load packages
      require(plyr); require(reshape2)
      > # Make immutable data frame
      baseball_i <- idata.frame(baseball)
      > # Example 1 - idata.frame more than twice as fast
      system.time( replicate(50, ddply( baseball, "year", summarise, mean(rbi))) )
         user  system elapsed 
       14.812   0.252  15.065 
      > system.time( replicate(50, ddply( baseball_i, "year", summarise, mean(rbi))) )
         user  system elapsed 
        6.895   0.020   6.915 
      > # Example 2 - Bummer, this does not work with idata.frame's
      > colwise(max, is.numeric) ( baseball ) # works
        year stint   g  ab   r   h X2b X3b hr rbi sb cs  bb so ibb hbp sh sf gidp
      1 2007     4 165 705 177 257  64  28 73  NA NA NA 232 NA  NA  NA NA NA   NA
      > colwise(max, is.numeric) ( baseball_i ) # doesn't work
      Error: is not TRUE
      > # Example 3 - idata.frame twice as fast
      system.time( replicate(100, baseball[baseball$year == "1884", ] ) )
         user  system elapsed 
        1.155   0.048   1.203 
      > system.time( replicate(100, baseball_i[baseball_i$year == "1884", ] ) )
         user  system elapsed 
        0.598   0.011   0.609 
      > # Example 4 - idata.frame faster
      system.time( replicate(50, melt(baseball[, 1:4], id = 1) ) )
         user  system elapsed 
       16.587   1.169  17.755 
      > system.time( replicate(50, melt(baseball_i[, 1:4], id = 1) ) )
         user  system elapsed 
        0.869   0.196   1.065 
      > # And you can go back to a data frame by 
      d <-
      'data.frame': 21699 obs. of  23 variables:
       $ id   : chr  "ansonca01" "forceda01" "mathebo01" "startjo01" ...
       $ year : int  1871 1871 1871 1871 1871 1871 1871 1872 1872 1872 ...
       $ stint: int  1 1 1 1 1 1 1 1 1 1 ...
       $ team : chr  "RC1" "WS3" "FW1" "NY2" ...
       $ lg   : chr  "" "" "" "" ...
       $ g    : int  25 32 19 33 29 29 29 46 37 25 ...
       $ ab   : int  120 162 89 161 128 146 145 217 174 130 ...
       $ r    : int  29 45 15 35 35 40 36 60 26 40 ...
       $ h    : int  39 45 24 58 45 47 37 90 46 53 ...
       $ X2b  : int  11 9 3 5 3 6 5 10 3 11 ...
       $ X3b  : int  3 4 1 1 7 5 7 7 0 0 ...
       $ hr   : int  0 0 0 1 3 1 2 0 0 0 ...
       $ rbi  : int  16 29 10 34 23 21 23 50 15 16 ...
       $ sb   : int  6 8 2 4 3 2 2 6 0 2 ...
       $ cs   : int  2 0 1 2 1 2 2 6 1 2 ...
       $ bb   : int  2 4 2 3 1 4 9 16 1 1 ...
       $ so   : int  1 0 0 0 0 1 1 3 1 0 ...
       $ ibb  : int  NA NA NA NA NA NA NA NA NA NA ...
       $ hbp  : int  NA NA NA NA NA NA NA NA NA NA ...
       $ sh   : int  NA NA NA NA NA NA NA NA NA NA ...
       $ sf   : int  NA NA NA NA NA NA NA NA NA NA ...
       $ gidp : int  NA NA NA NA NA NA NA NA NA NA ...
       $ teamf: Factor w/ 132 levels "ALT","ANA","ARI",..: 99 127 51 79 35 35 122 86 16 122 ...
      > # idata.frame doesn't work with the doBy package
      summaryBy(rbi ~ year, baseball_i, FUN=c(mean), na.rm=T)
      Error in as.vector(x, mode) : 
        cannot coerce type 'environment' to vector of type 'any'
      > # But idata.frame works with aggregate in base (but with minimal speed gains)
      # and aggregate is faster than ddply of course 
      system.time( replicate(100, aggregate(rbi ~ year, baseball, mean) ) )
         user  system elapsed 
        4.117   0.423   4.541 
      > system.time( replicate(100, aggregate(rbi ~ year, baseball_i, mean) ) )
         user  system elapsed 
        3.908   0.383   4.291 
      > system.time( replicate(100, ddply( baseball_i, "year", summarise, mean(rbi)) ) )
         user  system elapsed 
       14.015   0.048  14.082 