Googling species and how to find them

Ricardo Correia asks whether considering synonyms is essential for digital data mining of species with multiple names.

Check out the full paper here

Human-nature interactions are going digital

In the past decade, access to the internet and other forms of digital technologies, such as smartphones and social media, has greatly increased across the world. As a result, more and more people are engaging with these technologies as they go about their daily lives. Activities such as shopping online for that special gift, rating a recently visited restaurant or writing a blog or social media post have become the norm in modern life. These actions generate digital information – a ‘digital footprint’ – detailing how we interact with, and perceive, the world around us.

Interestingly, this also applies to how we interact with our natural environment. Despite evidence that we are not spending as much time ‘in nature’ as in the past1, we are increasingly using technology to interact with the natural environment. Whether by engaging with mobile applications developed for nature enthusiasts (like eBird or iNaturalist), participating in dedicated forums or social media groups or simply posting online pictures of that nice bird that showed up in the garden, more and more of our interactions with nature are being digitally and publicly recorded. This impulse is not new – some of the earliest recorded expressions of human culture (Fig 1) include images of animals2. What is unique about the recent digital representations of human-nature interactions is the enormous volume of data generated, its widespread availability, and its global geographical range. Such representations clearly hold great potential to contribute towards a better understanding of how different people and cultures view, value and engage with nature.

Fig 1: Horse drawings in Chauvet cave, dated 30000-32000 years ago

Culturomics: one way to study digital human-nature interactions

Enter culturomics: the quantitative study of human culture through the analysis of word frequencies in large bodies of digital text3. Could culturomics provide a way to use these data to better understand the way we interact with nature? The approach was first applied in 2011 to data emerging from Google’s efforts to digitize English language books, and researchers have since expanded its reach to other forms of digital text, including internet pages, social networks, search engines and blogs like this one. Culturomics has already provided fascinating insights into aspects of human culture, including phenomena such as collective memory, the adoption of technology, the pursuit of fame, and historical epidemiology, to name a few. For example, culturomics has been used to highlight the suppression of artists and writers (particularly those of Jewish origin) during the Nazi regime in Germany3. Because the success of conservation actions depends strongly on cultural and social perceptions, the capacity of culturomics to contribute towards conservation science and policy was quickly recognized4. Potential applications include demonstrating public interest in nature by analysing internet searches for nature-related topics5, or identifying potential conservation icons6,7 by assessing how often people read or write about specific species on the internet.

Technical challenges: when is a jaguar not a jaguar?

There are numerous technical challenges to the implementation of culturomics approaches. For example, gaining access to data may not always be straightforward, and access can change through time depending on the provider’s or users’ privacy settings. There are also questions associated with language complexity, which require researchers to validate their data before analysis. This is often the case when addressing animal or plant species, as many have more than one popular name or their names have gained multiple meanings over time. A clear example of this issue is the adoption of the word ‘Jaguar’ as the name for a luxury car brand in the 1940s. As a result, many instances of the word in digital text refer to the brand rather than the species. Thus, looking for the presence of the word ‘jaguar’ in digital texts would result in hits referring to both the predatory big cat and the elite car manufacturer.

One possible way around this problem is to use the species’ scientific name instead. When a species is first described, it is assigned a unique name using a system originally developed by Carl Linnaeus in the 1700s. This approach avoids some of the problems associated with popular language highlighted above and allows scientists to communicate more precisely the identity of species by using a binomial name. For example, the jaguar’s scientific name is Panthera onca. Scientific names also feature frequently in digital texts and are strongly associated with the occurrence of common names8. While this suggests they could present an alternative to using popular names while searching for species in digital texts, scientific name synonyms can also emerge over time. This is generally due to either incorrect spelling or changes in taxonomy, for example as a result of updating evolutionary relationships. Thus, in contrast to the inclusion of spurious car references, failure to consider scientific name synonyms in digital searches may omit a considerable amount of relevant digital text. The extent to which this might be a problem has never been quantified, hence the motivation behind our present study!

How important is it to consider species scientific name synonyms in digital data mining?

In order to answer this question, we first have to know how many scientific name synonyms exist for each species. Doing so for all living organisms is a difficult task, so (naturally) we decided to focus our study on birds. We took the list of bird species recognized by BirdLife International and the International Union for Conservation of Nature (IUCN) as our baseline reference list and compared it to seven other taxonomic references. We searched through more than 11,000 bird species and found at least one synonym for 3984 of them. In other words, searches for more than one third of bird species could be affected by the presence of synonyms. Although most of these species had only one synonym, up to five scientific name synonyms were identified for a single species (Fig 2).
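The comparison between a baseline list and other taxonomic references can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the study: it assumes that every reference can be keyed by a shared taxon identifier (the `taxon_id` values below are hypothetical), and the alternative names shown for the Chestnut-vented Warbler are illustrative rather than taken from the study's data.

```python
from collections import defaultdict


def collect_synonyms(baseline, other_references):
    """Return {accepted name: sorted list of differing names} for each species.

    baseline: dict mapping a (hypothetical) taxon id to the accepted name.
    other_references: list of dicts with the same id -> name layout.
    """
    synonyms = defaultdict(set)
    for reference in other_references:
        for taxon_id, name in reference.items():
            accepted = baseline.get(taxon_id)
            if accepted is None:
                continue  # taxon not present in the baseline list
            # Compare case-insensitively so capitalisation is not a "synonym"
            if name.strip().lower() != accepted.strip().lower():
                synonyms[accepted].add(name.strip())
    return {name: sorted(alternatives) for name, alternatives in synonyms.items()}


# Hypothetical identifiers and illustrative names (assumptions, not study data)
baseline = {"t1": "Sylvia subcoerulea", "t2": "Passer domesticus"}
other_references = [
    {"t1": "Parisoma subcaeruleum", "t2": "Passer domesticus"},
    {"t1": "Curruca subcoerulea", "t2": "passer domesticus"},
]

synonyms = collect_synonyms(baseline, other_references)
share_with_synonyms = len(synonyms) / len(baseline)
print(synonyms)
print(f"{share_with_synonyms:.0%} of species have at least one synonym")
```

In practice, matching taxa across references is the hard part, because splits and lumps mean that one name in one list may correspond to several in another; the shared identifier here hides that complexity.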

Fig 2: Up to five scientific name synonyms were found for the Chestnut-vented Warbler (Sylvia subcoerulea)!

We then wanted to explore the potential impact of these synonyms on culturomics methods. To do this, we quantified how often mentions of a species in digital texts are missed when searching only for the recognized scientific name from the reference list. As with many problems in the modern world, we solved this by googling! We decided to use Google’s search engine as it provides a simple yet powerful tool to search and index digital texts from a variety of sources. We considered only species for which at least one scientific name synonym had been identified. For each of these species, we first searched only for the scientific name present in the reference list and recorded the number of results returned by the search engine. Often, these results also included mentions of synonyms, somewhat ameliorating the sensitivity problem expected from searching only for the reference name. Next, we carried out additional searches using synonyms but excluding any results mentioning the reference name. This allowed us to identify how many mentions could only be obtained using name synonyms.
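The two kinds of search described above could be written as query strings along the following lines. This is a rough sketch under stated assumptions: the exact query syntax used in the study is not given here, so the code relies only on two widely used search conventions, double quotes for an exact phrase and a leading minus sign to exclude a phrase. The function names and the synonyms are illustrative.

```python
def reference_query(reference_name):
    """Exact-phrase query for the reference scientific name only."""
    return f'"{reference_name}"'


def synonym_only_queries(reference_name, synonyms):
    """One query per synonym, excluding pages that mention the reference name."""
    return [f'"{synonym}" -"{reference_name}"' for synonym in synonyms]


# Illustrative names (assumptions, not taken from the study's data)
reference = "Sylvia subcoerulea"
synonyms = ["Parisoma subcaeruleum", "Curruca subcoerulea"]

print(reference_query(reference))
for query in synonym_only_queries(reference, synonyms):
    print(query)
```

Running one query per synonym keeps each string simple, but a page mentioning two synonyms would be counted twice if the result counts were simply added together, so the totals from such a sketch should be read as approximate.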

The results were quite striking: for a massive 98% of all species evaluated, searches missed at least a few webpages when synonyms were not considered. Furthermore, for 24% of species the reference name was not used in the majority of pages returned. This means that failing to consider synonyms in internet searches may result in a smaller and potentially biased subset of texts. As you might expect, as the number of synonyms for a species increased, so did the percentage of texts missed when only the reference name was searched for.
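To make the two summary figures concrete, here is a minimal sketch of how they could be derived from per-species result counts. The counts and species labels below are invented purely for illustration, and the calculation is an assumption about how such percentages might be computed rather than the study's actual analysis.

```python
def share_missed(reference_hits, synonym_only_hits):
    """Fraction of all retrieved pages that only the synonym searches found."""
    total = reference_hits + synonym_only_hits
    if total == 0:
        return 0.0  # nothing retrieved at all, so nothing was missed
    return synonym_only_hits / total


# Hypothetical (reference hits, synonym-only hits) per species
counts = {
    "Species A": (900, 100),
    "Species B": (200, 800),
    "Species C": (500, 0),
}

shares = {name: share_missed(ref, syn) for name, (ref, syn) in counts.items()}

# Species for which at least some pages are missed without synonyms
any_missed = sum(share > 0 for share in shares.values()) / len(shares)
# Species for which most pages do not use the reference name
majority_missed = sum(share > 0.5 for share in shares.values()) / len(shares)

for name, share in shares.items():
    print(f"{name}: {share:.0%} of pages found only through synonyms")
print(f"Any pages missed: {any_missed:.0%} of species")
print(f"Majority of pages missed: {majority_missed:.0%} of species")
```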

In summary, these results highlight the need to consider scientific name synonyms when searching for species in digital texts, a need which becomes more important the more synonyms a species has. Missing texts associated with scientific name synonyms can have an impact on study results, particularly if the aim is to evaluate how frequently species are mentioned over time. Our results likely apply to any type of text search, including searches for academic works, images or videos, so make sure you always check for synonyms when you go searching for “digital species”!

 

References and/or further reading:

1 Soga, M., & Gaston, K. J. (2016). Extinction of experience: the loss of human-nature interactions. Frontiers in Ecology and the Environment, 14(2), 94-101. doi:10.1002/fee.1225

2 https://www.nature.com/news/cave-of-forgotten-dreams-may-hold-earliest-painting-of-volcanic-eruption-1.19177?WT.mc_id=TWT_NatureNews

3 Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Pickett, J. P., . . . Orwant, J. (2011). Quantitative analysis of culture using millions of digitized books. Science, 331(6014), 176-182. doi:10.1126/science.1199644

4 Ladle, R. J., Correia, R. A., Do, Y., Joo, G. J., Malhado, A. C. M., Proulx, R., . . . Jepson, P. (2016). Conservation culturomics. Frontiers in Ecology and the Environment, 14(5), 270-276. doi:10.1002/fee.1260

5 Funk, S. M., & Rusowsky, D. (2014). The importance of cultural knowledge and scale for analysing internet search data as a proxy for public interest toward the environment. Biodiversity and Conservation, 23(12), 3101-3112. doi:10.1007/s10531-014-0767-6

6 Correia, R. A., Jepson, P. R., Malhado, A. C. M., & Ladle, R. J. (2016). Familiarity breeds content: assessing bird species popularity with culturomics. PeerJ, 4, e1728.

7 Roll, U., Mittermeier, J. C., Diaz, G. I., Novosolov, M., Feldman, A., Itescu, Y., . . . Grenyer, R. (2016). Using Wikipedia page views to explore the cultural importance of global reptiles. Biological Conservation, 204, 42-50. doi:10.1016/j.biocon.2016.03.037

8 Correia, R. A., Jepson, P., Malhado, A. C. M., & Ladle, R. J. (2017). Internet scientific name frequency as an indicator of cultural salience of biodiversity. Ecological Indicators, 78, 549-555. doi:10.1016/j.ecolind.2017.03.052

About the author

Ricardo Correia completed his PhD at the University of East Anglia and is currently a post-doctoral researcher at the University of Aveiro (Portugal) and the Federal University of Alagoas (Brazil). He is interested in all aspects of human-nature interactions, particularly from a conservation perspective. His interdisciplinary research covers a broad range of topics including environmental change impacts on ecosystems, protected area management and the application of new technological developments to conservation science.

Links:

http://www.cesam.ua.pt/racorreia

https://scholar.google.com/citations?user=sU-JXvwAAAAJ&hl=en

https://www.researchgate.net/profile/Ricardo_Correia