I had the privilege of being sent to the Biodiversity Information Standards (TDWG) group annual conference a few weeks ago, in December 2016, held in Costa Rica. The event was hosted by what must be the most friendly and kind group of people in the country (either that or all Costa Ricans are incredibly kind), at the Tecnologico de Costa Rica.
Costa Rica was as amazing as when I last visited, although La Fortuna (near which the conference was held) was extremely rainy. I’m talking pretty much constant rain; the whole week we were there I think I saw the sun perhaps twice. The conference included a really cool field trip into the rainforest, where my colleague and I got absolutely soaked (no surprise there!) but we did see an awesome tarantula in its hole and a beautiful waterfall.
The conference itself was one of the best experiences I’ve had of conservation biology since I completed my CB course at the Fitzpatrick Institute. Everyone I met was intelligent, knowledgeable, turned on, engaged and curious about the field. It felt like a very meaningful conference: biodiversity data might not be the sexiest subject ever (and “Biodiversity Information Standards” sounds downright boring), but it’s absolutely crucial for solving our modern environmental problems and that’s something which was at the forefront of everyone’s minds.
I must admit to not being incredibly interested in the nuances and ins and outs of the standardisation of terms, although these are obviously important so that sharing data is easier. Things I found really interesting were:
- There is a bit of a gap between the techy types and the more traditional biologists who just ended up working with data. You had people who weren’t sure what GitHub was vs people who were talking about NoSQL and graph-based databases. Coming from a full stack development background (as in, I read techy blogs and news sites and keep up with the latest trends in web dev) it was a bit jarring. There are also a lot of biological concepts that passed me by in my education, so things which were obvious to almost everyone weren’t obvious to me. I did get the feeling it was a sector which had definite trends though: some things were cool (GitHub, ontologies) and some things were not (relational databases).
- Some things I thought were very difficult and I didn’t really understand how they could be done or what the point of them was. For example, there was a group of people interested in somehow standardising web APIs. Surely one would just follow best practice when it comes to designing the API anyway? I don’t see how it can be standardised or what the point would be; I mean, if you want your data harvested by GBIF you use their IPT. It’s not like anyone has set up crawlers to try and combine massive datasets from lots of different APIs.

  Another example was the group who wanted to standardise species traits. So come up with a standard list of plant traits (e.g. height of plant, colour of flowers, etc) and animal traits (e.g. … ? I don’t know, number of legs or something?). I’m just not sure what the point of this is either. I suppose if you don’t try and come up with an exhaustive standardised list and just have a set of recommended traits to capture data for, you could make it useful ones, like sensitivity to climatic change, etc.
- One thing I’ve had difficulty tackling at work (and which there was a great deal of discussion about at the conference) is the best way to represent a scientific name taxonomic hierarchy. I actually shouldn’t use the word hierarchy, because there are complexities in the taxonomic tree which are difficult to express hierarchically. For example, two or three separate nodes might merge into one and move to a different branch of your tree at the same time, and your system needs to be able to keep track of both the old nodes and the new single merged node. And it needs to understand the relationship between these new and old nodes too.

  Because of this complexity, a lot of people at the conference suggested using an ontology created in the Web Ontology Language (OWL, such a cute acronym which makes me think of Winnie the Pooh). I’d never heard of this thing before, but it apparently stores ‘triples’ (subject node, predicate/relationship, object node) and kind of works by inheritance – the child node inherits the attributes of the parent like with OO programming. As far as I understand, this data is written in a kind of XML in plain text files, which you use some kind of software to query using a language called SPARQL. Presumably there are different clients you can use – I’m guessing you just tell the client where your XML files are and then you can write queries as you like.

  Most of the super techy types seemed to be using Neo4j as a graph database to store their ontology rather than OWL. I had a quick look and it seems there are a few frameworks you can use to easily build web applications – https://github.com/mchengal/ANNE-stack looks promising. It’s going to be a bit of a wrench though, coming from a very well used framework like Django.

  The other problem is that apart from the taxonomic tree I would say all the rest of our data is relational. I have the same problem that I do with other NoSQL databases; they’re very trendy but I’m not convinced they are appropriate for my data.
- Sort of similar to the above, a lot of people were wanting some kind of universal system for referring to taxon concepts. As in, something which defines nodes on the taxonomic hierarchy – i.e. that the Wahlenbergia undulata you are talking about is the same thing as the Wahlenbergia undulata I am talking about. This would ensure data validity upon amalgamation.
- There was an amazing spirit of togetherness and exchange – there was only one proprietary piece of software that people were talking about. Apart from that, everyone had their code up on GitHub and everyone asked for collaborators and suggestions and wanted to build things together as a community. There was a massive open source (Java, R and Python mostly) technology push, which is completely opposite to what is happening at my own institute.
- Something which this conference had in common with the PyCon 2016 conference in Cape Town I went to is that there was a surprising amount of interest in machine learning and neural networks. I was particularly impressed by the Cornell eBird web application which uses machine learning to automatically identify birds. I think it still needed a bit of help in identifying birds in an image – you have to draw a bounding box around the bird, but still, amazing. They went from 18% accuracy in 2011 to 90% accuracy in 2016, so it’s obviously possible to do this kind of thing with enough training. I think they were using Google’s TensorFlow, which was getting a lot of interest from the PyCon folks as well.

  Another app aiming to automatically identify plants using machine learning techniques was something called Pl@ntNet, developed by the French. I’m sure there were other ones as well, but I don’t remember now.
- There was a lot of emphasis on automatic extraction of data from digitised texts. For example, a lot of the projects focused on parsing the Biodiversity Heritage Library books. This is something that SANBI is spending quite a lot of effort on as well, except we haven’t attempted to automate it. People were using regex and natural language processing, sometimes together with crowdsourcing.
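
To make the ‘triples’ idea from the taxonomy point above a bit more concrete for myself, here is a minimal sketch in plain Python. It is not OWL or SPARQL and not how anyone at the conference did it; it is just a toy set of (subject, predicate, object) tuples with a wildcard lookup. The taxon identifiers and the predicate names `hasParent` and `mergedInto` are made up for illustration. It shows the case described above: two nodes merging into one on a different branch, while the old nodes and their relationship to the new one stay queryable.

```python
from typing import Optional

# A toy triple store: each fact is a (subject, predicate, object) tuple.
# All identifiers and predicate names here are hypothetical.
Triple = tuple[str, str, str]

triples: set[Triple] = {
    ("taxon:A", "hasParent", "genus:X"),
    ("taxon:B", "hasParent", "genus:X"),
    ("taxon:C", "hasParent", "genus:Y"),
    # A and B were later merged into C, which sits on a different branch.
    # The old nodes are kept, and the merge is recorded as its own fact.
    ("taxon:A", "mergedInto", "taxon:C"),
    ("taxon:B", "mergedInto", "taxon:C"),
}


def match(s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> list[Triple]:
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def current_concept(taxon: str) -> str:
    """Follow mergedInto links until reaching a taxon that was not merged."""
    seen: set[str] = set()
    while True:
        if taxon in seen:
            raise ValueError(f"merge cycle detected at {taxon}")
        seen.add(taxon)
        merges = match(s=taxon, p="mergedInto")
        if not merges:
            return taxon
        taxon = merges[0][2]


def merged_from(taxon: str) -> list[str]:
    """List the old taxa that were merged directly into the given taxon."""
    return [s for s, _, _ in match(p="mergedInto", o=taxon)]


if __name__ == "__main__":
    print(current_concept("taxon:A"))
    print(merged_from("taxon:C"))
    print(match(s="taxon:C", p="hasParent"))
```

The `match` function is a very rough stand-in for the kind of pattern a SPARQL query expresses. A real ontology would presumably live in a proper triple store or graph database rather than a Python set, but the shape of the data is the same, and the merge history is just more facts rather than a change to the tree structure.
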
Oh, I should also mention that I did a super quick 5 minute lightning talk with my colleague on a web application I built to aid georeferencing in my organisation. It was basically just a presentation zooming around a poster which I made:
All in all this conference was extremely stimulating and thought provoking. I don’t think I’ll be lucky enough to go to the next one for a few reasons, but I’m very grateful and glad I got to attend this one. Pura vida!