Kay Cichini (MSc, Biology), Nature Protection Expert, Tyrol/Austria. I'm the author of theBioBucket-Blog, where I share examples for data management, data processing, statistical analysis and report generation using R. Kay is a DZone MVB, is not an employee of DZone, and has posted 28 posts at DZone.

Taxonomy with R: Exploring the Taxize-Package

12.24.2012

First off, I'd really like to give a shout-out to the brave people who have created and maintain this great package - the fame is yours!

While exploring the capabilities of the package I ran into some issues with the ITIS server, and with large datasets things weren't working out well for me.

I then switched to the NCBI API and got much better results (much quicker and, at first glance, with higher coverage).

At the moment there is no taxize function that pulls taxonomic details from a classification returned by NCBI, so I put together a little wrapper:

# some species data:
spec <- data.frame("Species" = I(c("Bryum schleicheri", "Bryum capillare", "Bryum argentum", "Escherichia coli", "Glis glis")))
spl <- strsplit(spec$Species, " ")
spec$Genus <- as.character(sapply(spl, "[[", 1))
 
# for pulling taxonomic details it is best to submit higher-rank taxa,
# in this case genera. That way the genus Bryum is submitted only once,
# which saves some computation time (this might matter if you deal
# with large datasets)
 
gen_uniq <- unique(spec$Genus)
 
# function for pulling classification details ("phylum" in this case)
library(taxize)

get_sys_level <- function(x) {
  z <- tryCatch({
    a <- classification(get_uid(x))
    y <- data.frame(a[[1]])                         # if there are multiple results, take the first
    as.character(y[which(y[, 2] == "phylum"), 1])   # column 1 holds the name, column 2 the rank
  }, error = function(e) NA)                        # on any error (e.g. no match found) fall back to NA
  z <- if (length(z) != 0) z[1] else NA             # if the taxonomic detail is not covered return NA
  data.frame(Taxon = x, Syslevel = z)
}
 
# call function and rbind the returned values 
result <- do.call(rbind, lapply(gen_uniq, get_sys_level))
print(result)
#         Taxon       Syslevel
# 1       Bryum   Streptophyta
# 2 Escherichia Proteobacteria
# 3        Glis       Chordata
 
# now merge back to the original data frame
spec_new <- merge(spec, result, by.x = "Genus", by.y = "Taxon")
print(spec_new)
#         Genus           Species       Syslevel
# 1       Bryum Bryum schleicheri   Streptophyta
# 2       Bryum   Bryum capillare   Streptophyta
# 3       Bryum    Bryum argentum   Streptophyta
# 4 Escherichia  Escherichia coli Proteobacteria
# 5        Glis         Glis glis       Chordata
#
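
The rank lookup inside get_sys_level can be tried without a network connection. What follows is only a minimal sketch, assuming a classification table with a name column and a rank column as used in the code above. The helper extract_rank and the shortened mock table are made up for illustration and are not part of taxize.

```r
# hypothetical, shortened mock of one classification table (name, rank);
# it only stands in for what a real lookup might return
mock_class <- data.frame(
  name = c("Eukaryota", "Streptophyta", "Bryopsida", "Bryum"),
  rank = c("superkingdom", "phylum", "class", "genus"),
  stringsAsFactors = FALSE
)

# hypothetical helper: return the name found at the requested rank,
# or NA if that rank is not present in the table
extract_rank <- function(tbl, rank) {
  hit <- tbl$name[tbl$rank == rank]
  if (length(hit) == 0) NA_character_ else hit[1]
}

extract_rank(mock_class, "phylum")   # "Streptophyta"
extract_rank(mock_class, "family")   # NA, because the mock has no family row
```

Passing the rank as an argument would make it possible to reuse the same wrapper for other levels such as class or order instead of hard-coding "phylum".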

Published at DZone with permission of Kay Cichini, author and DZone MVB. (source)