Dear All,
Thank you very much for taking the time to participate in the 2nd e-conference within the framework of the European project ENBI, on Data Validation and Access Restrictions in the GBIF network (www.enbi.info).
I currently work for Fauna Europaea (www.faunaeur.org) at the MNHN in Paris, where we are in charge of, among other things, the so-called validation process. I also worked for ENHSIN, for which I wrote a paper on the topics that interest us in this forum (www.nhm.ac.uk/science/rco/enhsin). I also work for BioCASE as the French National Node, for FishBase (I am currently the vice-president of the FishBase Consortium), and for another ENBI WP (11: translation).
I have performed many checks on various name and specimen databases, including the MNHN fish collection database.
Below is the opening statement summarising the topics that we would like to discuss during these 2 weeks. In addition, every day between 07:30 and 08:30, sometimes a little earlier, I will post some questions on a specific, restricted topic to be discussed during the day.
I hope we will have constructive discussions and useful suggestions for the GBIF databasing and networking activities.
Finally, I would like to thank Francisco Pando from the GBIF Spanish node at the Botanical Garden in Madrid for having invited me to moderate this forum, and his ENBI team, Joaquín Hortal then Marisa Esteban, the ENBI Forums Project Manager, for their very good and hard work, their help and patience during discussions on the topics (which involved other Spanish colleagues), and for their technical preparation of the forum.
My very best wishes.
Nicolas Bailly
Opening Statement by Nicolas Bailly (moderator)
The rise of the internet and the web has made an enormous amount of data accessible to the general public. Before, biodiversity data were available only through paper publications, specimens and labels, or direct requests to specialists or dedicated agencies, whether these held local databases, non-disseminated technical reports or only oral expertise.
Before the web, the quality of data was more or less ensured, mainly by the peer-review system in scientific publications (this was less true of books for the general public), and data availability was limited by the fact that grey literature was not accessible to the general public, not to mention local databases or manuscript catalogues in collections.
The development of GBIF needs to take these changes into account. On the one hand, to be useful, biodiversity data need a quality control process to assess the correctness of taxonomic identification and nomenclature, the accuracy of the spatial reference, etc. How to assess and communicate "quality indicators" on data is still an open question. On the other hand, restrictions may be applied to the information provided in order to i) protect endangered species or habitats, ii) prevent doubtful data, or data yet to be published, from being accessed and then misused, or iii) protect Intellectual Property (IP) rights.
Is it possible to propose a peer-review system for biodiversity data? Is it possible to have automatic quality assessment for huge databases?
Data validation
The fundamental concept here is the quality of data. When discussing with the co-authors of a paper on a procedure to clean collection databases (Froese et al., 1999), I proposed that, at the least, quality = reliability + accuracy. Hence, the NIACC (Name, Identifier, Area FAO, Country, Coordinates) index we proposed for collection databases was "only" a reliability index.
But quality may well be more than that. The discussion should start by trying to define the concept more clearly, even briefly. Maybe we will not reach a complete definition, but at least we can propose a starting point for further consideration.
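To make the idea of a per-record reliability index more concrete, here is a minimal sketch in Python. It is not the scoring published in Froese et al. (1999), which is not reproduced here. It only assumes, for illustration, one yes/no flag per NIACC component, based on whether the field is filled in and, for coordinates, whether the values are within range. The class `SpecimenRecord`, its field names and the functions `niacc_flags` and `niacc_score` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SpecimenRecord:
    """Hypothetical collection record; field names are illustrative only."""
    name: Optional[str] = None         # N: scientific name
    identifier: Optional[str] = None   # I: person who identified the specimen
    fao_area: Optional[str] = None     # A: FAO area
    country: Optional[str] = None      # C: country
    latitude: Optional[float] = None   # C: coordinates (decimal degrees)
    longitude: Optional[float] = None


def _filled(value: Optional[str]) -> bool:
    """True if a text field is present and not just whitespace."""
    return bool(value and value.strip())


def niacc_flags(rec: SpecimenRecord) -> Dict[str, bool]:
    """One flag per component (assumed rule: filled in and plausible)."""
    coords_ok = (
        rec.latitude is not None
        and rec.longitude is not None
        and -90.0 <= rec.latitude <= 90.0
        and -180.0 <= rec.longitude <= 180.0
    )
    return {
        "name": _filled(rec.name),
        "identifier": _filled(rec.identifier),
        "fao_area": _filled(rec.fao_area),
        "country": _filled(rec.country),
        "coordinates": coords_ok,
    }


def niacc_score(rec: SpecimenRecord) -> int:
    """Number of components that pass, from 0 to 5 (assumed scale)."""
    return sum(niacc_flags(rec).values())
```

Such a score says nothing about accuracy: a record can have all five fields filled in and still carry a wrong identification, which is why the index is "only" a reliability index.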
The next issue is how to evaluate this quality. In two oral communications (Bailly, 2001; Bailly et al., 1999), I showed that quality control needs to be revisited when disseminating data through databases: they may contain non-published data such as collection data, and even when data are extracted from peer-reviewed publications, errors can be made during data entry, including misinterpretations when standardising to the database format. Hence, specialists have to check data after they are entered. In other words, how do we organise a peer-review system for databases?
Assessing the quality of data from paper publications, especially from peer-reviewed journals, is something that we can imagine, as is assessing the identification of a collection specimen. But assessing the quality of non-published associated information, sometimes only hand-written labels, or, worse, of observational data, is still a challenge.
We have to discuss what the possible ways are, and whether such assessments have already been attempted. Assessment grids have already been proposed to assess collections, and sometimes they include the associated data, but not in a complete way. We still have to revisit them.
Maybe we have to explore the possibility of a GBIF quality certification, just like ISO quality certification in the industrial and service sectors. We have to start a discussion on standard procedures to be developed for each type of data, and on how far it is possible to automate such processes.
This may imply important modifications to database structures: new fields, new queries, maybe new tables? It may also imply the development and dissemination of standards, whether they are new, already in place, or borrowed from other domains.
How can specialists be involved in the validation process more than they are currently? The problem is that if the same publications and non-published data are entered in various databases, the specialist will have to check all the databases, not to mention regular checks of the same database when new entries and corrections are made. This work model is not sustainable as such. Proposals to change the organisation of taxonomic work, like that of Godfray (2002), have to be seriously considered and applied, so as to include collaborative databasing from the beginning of a piece of work and not only as a final repository of data.
Moreover, this work organisation may imply that we find a way to have a unique source of validated data, to avoid the dissemination of inconsistent information throughout the web. This is not to say that we have to prevent contradictory information from being disseminated. Two different interpretations of the same facts may both be validated in the absence of further work showing which one is correct.
Restrictions to data access
Five issues can be considered under the topic of access criteria, the first three being more relevant to all users, the last two particularly to taxonomists:
- Controlled access to sensitive data, such as those on endangered species and exploitable natural resources.
- Controlled access to imprecise, uncertain, and unreliable data, such as identification and location; also of importance here is the question as to how to proceed with missing data and warn users.
- Different access for different users: do we need a user registry system?
- Special specimens: e.g., types, historically important material. It is a convention of the codes of nomenclature that types "are the property of science", which implies that access to data about them should always be free to bona fide researchers.
- Specimens in the process of being described as new taxa: a specialized taxonomic type of sensitive data.
Points 1, 3 and 5 address which types of data are to be made accessible to whom, while points 2 and 4 refer to the means by which data are made accessible. Should the decision to make data available remain under the authority of the data providers (although the portal may be used secondarily as a filter)?
In addition, where both observational and collection data are available, it may be useful to be able to request only one of those types of data, especially when you are looking for one specimen, e.g. a type, and your request returns thousands of observation records.
Access to poor-quality data may be restricted depending on the quality of the data versus the level of the user. This implies that we are able to calculate and use various forms of quality indices, from a simple one-digit index for the general public to a multi-digit index for specialists, each digit representing a different component of the quality.
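As a starting point for discussion, the one-digit versus multi-digit idea could be sketched as follows in Python. Everything here is an assumption made for illustration: the three components, the 0 to 9 scale, the choice of the weakest component as the one-digit summary, and the thresholds per user level. None of it is an existing GBIF scheme.

```python
from typing import Dict

# Hypothetical quality components, each scored from 0 (worst) to 9 (best).
COMPONENTS = ("identification", "nomenclature", "location")

# Hypothetical minimum one-digit quality required for each user level.
MIN_QUALITY_BY_USER = {"public": 7, "registered": 4, "specialist": 0}


def multi_digit_index(scores: Dict[str, int]) -> str:
    """Index for specialists: one digit per component, in a fixed order."""
    for component in COMPONENTS:
        if not 0 <= scores[component] <= 9:
            raise ValueError(f"{component} score must be between 0 and 9")
    return "".join(str(scores[component]) for component in COMPONENTS)


def single_digit_index(scores: Dict[str, int]) -> int:
    """Index for the general public: the weakest component (assumed rule)."""
    return min(scores[component] for component in COMPONENTS)


def can_access(scores: Dict[str, int], user_level: str) -> bool:
    """Restrict poor-quality records according to the user level."""
    return single_digit_index(scores) >= MIN_QUALITY_BY_USER[user_level]
```

For example, a record scored 9 for identification, 8 for nomenclature and 3 for location would be shown to specialists with the index "983", but hidden from the general public because its weakest component falls below the assumed threshold. Taking the minimum rather than the average is a deliberate choice in this sketch: one unreliable component is enough to make a record risky for non-specialists.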
Recommended References