Early in December I attended a workshop in Budapest for a project called the Comparative Mind Database, a component of the CompCog European Research Networking Programme. The objective of CompCog is to promote the development of “real” comparative cognition, seen to involve a coherent theoretical background, unified terminology, and standard methods. The role of the Comparative Mind Database is to develop applications of advanced information technologies and methods to support comparative cognition research. The CMD project is at an early stage of development, and the purpose of the workshop was to present the initial ideas and some pilot studies, and get feedback from relevant researchers with a view to shaping future directions.
From the perspective of a philosopher with an interest in the evolution of cognition and multidisciplinary integration, the CMD project is a fascinating venture, not just because it may become a valuable resource, but also conceptually, because it raises interesting questions about the design of mechanisms to promote scientific integration. What follows are some ideas on the design of a comparative mind database that have arisen during and since the workshop. I’m not a member of the CMD team, so this doesn’t reflect internal thinking, and I’m interested primarily in conceptual design rather than technical nuts and bolts.
Why a comparative mind database might be a good thing
Comparative cognition research faces some extremely difficult problems: it aims to investigate and compare the cognitive abilities of different species in circumstances where the conceptualization of those abilities is uncertain and changing, methods are evolving, and major physical and behavioral differences between species make it necessary to modify methods for different species even when attempting to measure the same cognitive ability. A database that integrates comparative cognition research could help in a variety of ways. XML-based data codification schemes and software tools might serve as a good way to promote standardization of methods, and at the same time facilitate sophisticated large-scale analyses. Advanced data mining and visualization techniques could help to detect subtle patterns largely invisible at the level of the individual study. “Community” tools like wikis have the potential to help researchers interact in richer and more dynamic ways, further promoting conceptual integration.
But although there are some interesting possibilities, it’s not obvious what specific shape a comparative mind database might have. What I’ll do now is pose some design questions. Many of these are in an “X vs Y” form, but since a likely answer is often “both” the point is usually to highlight a distinction rather than suggest that the distinction corresponds to a discrete choice.
Value adding service vs repository
Not all databases are storehouses for original data; some harvest data from existing sources and provide value-adding services. ISI Web of Knowledge is an example of the latter approach, as is the just-launched PhilPapers. An advantage of harvesting is that it is a relatively easy way to obtain a large amount of data in a short amount of time, which means that the database can be up and running quickly. On the other hand, the services provided need to be reasonably compelling. If the database is a repository for a unique kind of data it has a more obvious value as a resource, but it could take a while to acquire enough data to be useful.
Clearly a mixed approach is possible. Harvesting could enable the database to begin providing a service relatively quickly, and the development of value-adding services might be a way to explore what functions the comparative cognition community will find most useful. In the meantime, the repository could be developed.
Type of data
If the database is to be a repository then there is the question of what kind of data to store. Options include metadata structured according to a scheme crafted for the specific domain and supplied directly by researchers in some way, full-text papers along the lines of a preprint archive, experimental data, or some combination of these.
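To make the metadata option concrete, here is a minimal sketch of what a sparse, researcher-supplied record might look like, built with Python’s standard library. The element names (`study`, `species`, `paradigm`, `ability`, `doi`) are purely illustrative assumptions, not a proposed CMD schema; a real scheme would be defined by the project and its community.

```python
import xml.etree.ElementTree as ET

def build_study_record(species, paradigm, ability, doi):
    """Build a minimal XML metadata record for one study.

    The element names are illustrative only; a real coding scheme
    would be crafted for the domain by the database's community.
    """
    study = ET.Element("study")
    ET.SubElement(study, "species").text = species
    ET.SubElement(study, "paradigm").text = paradigm
    ET.SubElement(study, "ability").text = ability
    ET.SubElement(study, "doi").text = doi
    return ET.tostring(study, encoding="unicode")

# A hypothetical record for a single study.
record = build_study_record(
    species="Corvus moneduloides",
    paradigm="trap-tube task",
    ability="causal reasoning",
    doi="10.0000/example",
)
print(record)
```

Even a record this sparse would support cross-species queries; the later design questions concern how much richer than this the coding should get.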
Rich vs sparse data coding
Whatever kind of data is stored, it needs to be coded in some way. Here a basic choice is between rich and sparse coding schemes. A rich coding scheme formally codes many attributes of the data, whilst a sparse scheme captures only a few attributes. A rich coding scheme can provide more power and can therefore support more informative analyses. For example, Poldrack (2006) conducted a meta-analysis to evaluate the strength of ‘reverse inferences’ in fMRI research using neural imaging data held by the BrainMap database. These reverse inferences involve taking activation of a particular brain area during a task to indicate the involvement of a particular cognitive process, on the basis that other research has found an association between that cognitive process and the brain area in question. Poldrack found that reverse inferences are relatively weak, but noted that stronger inferences could be drawn if imaging databases used more fine-grained cognitive coding. The databases code imaging data using very broad cognitive categories, whereas the researcher will usually be interested in much more specific cognitive processes.
So rich coding can make a database more valuable as a resource, but it also raises problems. Simply by providing many more decision points, both when the scheme is formulated and during the coding process, a rich scheme gives error more opportunities to creep in. The more complex the coding scheme is, the harder it will be to gain community acceptance, and if the coding categories reflect the conceptual distinctions being drawn in current research they will also almost inevitably tend to be somewhat controversial. These kinds of factors can reduce the value of the database as a resource: scientists may be reluctant to base research on a disputed coding scheme, may face resistance during peer review if they do, and errors in the data can seriously taint the database. This last problem has already arisen as an issue for DNA databases. Sparse coding reduces the exposure to error and controversy, but at the expense of representational power.
Effectively, the rich vs sparse coding scheme choice faces a kind of type I/type II error tradeoff. As such, there are factors pushing in both directions. However there are some reasons favoring a conservative approach. If the database is to be used for publishable research its coding scheme and data will need to be robust against a wide range of challenges. If the database becomes widely used then any problems that arise have the potential to compromise large swathes of the literature.
Controlled vs open coding schemes
Another key choice is whether the coding scheme should be controlled or open. A controlled vocabulary is a centrally managed coding scheme, whereas open schemes have a more folksonomic character, allowing individuals to add new terms as they see fit. Controlled coding schemes can have the advantages of being consistent and well-organized, but they impose a significant management burden because mechanisms for formulation and revision are needed. The more ambitious the coding scheme is, the more onerous the management requirements will be. The DSM is a well-known controlled coding scheme that illustrates both the value such a scheme can have and just how demanding the management process can be. Because the revision cycle can be long, a controlled classification scheme may lag well behind the categories used by current research.
An open coding scheme can respond rapidly to current developments, but at the expense of the consistency of terms and coherent organization of the scheme. To some extent tools like social tagging systems can ameliorate the chaos by creating a central record of terms and suggesting terms during the tagging process (Delicious is an example of how this kind of thing can work). These methods are unlikely to produce the consistency of a controlled scheme, however.
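A minimal sketch of the term-suggestion mechanism just mentioned, assuming only that the system keeps frequency counts of previously used tags (as Delicious-style systems do). The function name and the tag vocabulary below are hypothetical:

```python
from collections import Counter

def suggest_tags(prefix, tag_counts, limit=3):
    """Suggest existing tags matching a typed prefix, most-used first.

    Surfacing terms already in the central record nudges taggers
    toward shared vocabulary, reducing (though not eliminating)
    fragmentation in an open coding scheme.
    """
    matches = [t for t in tag_counts if t.startswith(prefix.lower())]
    matches.sort(key=lambda t: (-tag_counts[t], t))
    return matches[:limit]

# Hypothetical usage counts accumulated from an open coding scheme.
counts = Counter({
    "tool-use": 42,
    "tool-manufacture": 7,
    "social-learning": 31,
    "theory-of-mind": 12,
})

print(suggest_tags("tool", counts))  # most-used matching tag first
```

Frequency-ranked suggestion is only a soft constraint: nothing stops a tagger ignoring the suggestions, which is exactly why this approach cannot match the consistency of a controlled scheme.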
Again, though, this doesn’t have to be a strictly either/or choice. One kind of mixed strategy would be to use both a sparse controlled vocabulary and a rich and flexible open scheme. From the point of view of data quality, the controlled scheme would be “gold standard” and the open scheme would be “use with caution”, but when used with appropriate caution the open scheme might still be very valuable. Moreover, information derived from trends in the open coding scheme could be used to inform revisions of the controlled scheme.
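The mixed strategy can be made concrete with a small sketch. The vocabulary terms and field names below are invented for illustration: controlled fields are validated against a fixed list, while open tags are accepted as-is and kept separate as lower-confidence data.

```python
# Hypothetical sparse controlled vocabulary; a real one would be
# centrally formulated and revised by the CMD community.
CONTROLLED_ABILITIES = {"social learning", "tool use", "numerical cognition"}

def code_study(ability, open_tags):
    """Return a coded record separating 'gold standard' controlled
    codes from 'use with caution' open codes."""
    if ability not in CONTROLLED_ABILITIES:
        raise ValueError(f"'{ability}' is not in the controlled vocabulary")
    return {
        "ability": ability,            # gold standard
        "open_tags": list(open_tags),  # use with caution
    }

record = code_study("tool use", ["stick-tool", "sequential-tool-use"])
print(record)
```

Because the two tiers are stored separately, analyses can be run over the controlled codes alone, over both tiers, or over open-tag trends specifically as input to revising the controlled vocabulary.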
Minimalist vs maximalist software tools for data coding
The kind of coding system chosen has an impact on the mechanisms needed to code the data and get it into the database. At one end of the spectrum, authors could provide simple metadata to the database using a basic web form. At the other end, a complete software suite would take the author from the first stages of experiment design to the final paper, smoothly adding a multitude of codes along the way. Somewhere in the middle, plugin software, something like EndNote, could work with existing word processing and statistics programs.
Open vs focused functional objectives
A further type of question concerns the kinds of uses envisaged for the data. The most agnostic approach is to simply leave this open; ‘low level’ data is made available to researchers to do with as they will. At the other extreme the database is built around a very specific high level purpose. An internet-based taxonomy database, as envisaged by Godfray (2002), is an example of the latter possibility.
Taking the taxonomy example as a model, a comparative mind database might adopt a high level representational framework designed to efficiently capture the information of most interest to comparative cognition research. For instance, a species-oriented ‘view’ showing phylogenetic relations mapped with cognitive abilities might be treated as a core function. A cognitive ability-oriented view might display all the species in which a particular ability has been demonstrated, together with key variations. A method-oriented view might show variations in the application of a particular paradigm within and across species.
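Assuming studies were stored as simple records, the three views described above could each be a different grouping over the same underlying data. A minimal sketch, with entirely invented records and field names:

```python
from collections import defaultdict

# Toy study records; the species, abilities and paradigms are
# illustrative placeholders, not real findings.
studies = [
    {"species": "chimpanzee", "ability": "tool use", "paradigm": "trap-tube"},
    {"species": "New Caledonian crow", "ability": "tool use", "paradigm": "trap-tube"},
    {"species": "chimpanzee", "ability": "social learning", "paradigm": "two-action"},
]

def view_by(records, key):
    """Group study records by one field, yielding a species-oriented,
    ability-oriented, or method-oriented 'view' of the same data."""
    view = defaultdict(list)
    for r in records:
        view[r[key]].append(r)
    return dict(view)

# Ability-oriented view: all species in which 'tool use' has been reported.
tool_use_species = {r["species"] for r in view_by(studies, "ability")["tool use"]}
print(sorted(tool_use_species))
```

The point of the sketch is that the high level representational framework need not be baked into the storage format: the same records support all three orientations, with only the grouping key changing.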
Warehouse vs knowledge environment
The database might function as a warehouse, being primarily oriented to storing information and providing only a simple interface for accessing the data. On the other hand, it might be more like a ‘knowledge environment’, providing textual resources like annotation, reference information, and conceptual and methodological discussion.
Closed vs ‘crowd-sourced’ content creation
If the database is to be something like a knowledge environment, the source of content could be closed, e.g. using an editor/solicited contribution model, or it could be ‘crowd-sourced’ by the community. The latter option would have a strong ‘Web 2.0’ flavor.
One interesting possibility is that users could add tags and annotations to existing items in the database. For instance, Reader and Laland (2002) conducted a meta-analysis examining relations between brain size, behavioral innovation, social learning and tool use. They examined more than 1000 articles, and part of the analysis effectively involved re-coding papers, such that behavioral descriptions using keywords such as “novel” and “never seen before” were counted as instances of behavioral innovation. If these papers had been held in a CMD database Reader and Laland could have performed their recoding within the database, with those tags being associated with the original papers and available to other researchers for further use or critical scrutiny. In this case the recoding was quite straightforward, but in other cases it can involve more complex interpretation (with an important class of interpretations being “not really a case of x after all”). Appending such interpretations to papers would in effect be a modern recreation of the classical commentary.
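The recoding step just described can be sketched as a simple keyword scan. This is a schematic reconstruction of the kind of procedure involved, not Reader and Laland’s actual method, and the paper texts and tag name are invented:

```python
# Keywords treated as markers of behavioral innovation; the exact
# list here is illustrative.
INNOVATION_KEYWORDS = ("novel", "never seen before")

def recode_innovation(papers):
    """Tag each paper whose description contains an innovation keyword.

    The tags are returned as data keyed to paper identifiers, so in a
    database they could be attached to the original papers and reused
    or disputed by other researchers.
    """
    tags = {}
    for paper_id, text in papers.items():
        lowered = text.lower()
        if any(k in lowered for k in INNOVATION_KEYWORDS):
            tags[paper_id] = ["behavioral-innovation"]
    return tags

# Two invented paper descriptions.
papers = {
    "p1": "A novel foraging technique was observed in one individual.",
    "p2": "Subjects matched samples in a standard delayed task.",
}
print(recode_innovation(papers))  # only p1 is tagged
```

Stored this way, the recoding becomes a persistent, inspectable layer over the literature rather than a one-off step buried inside a single meta-analysis.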
This kind of possibility connects back to the issue of the controlled coding scheme. Zsófia Virányi has pointed out in conversation that conducting meta-analyses would be an effective way of zeroing in on the kind of information that should be included in a controlled coding scheme. More generally, as noted earlier, an open coding system could serve as a useful source of information guiding the formulation and revision of a controlled vocabulary. A system that allowed appended recoding and commentary would provide a direct form of ongoing evaluation for the controlled scheme.
Conclusion: two paths to integration
Returning to the larger objectives, in principle a CMD could help bring coherence to comparative cognition research in a variety of ways: it can be a vehicle for the standardization of terminology and methods; by collating data it can facilitate research that draws on a much wider range of the available information; and by means such as wikis it can provide a forum for conceptual and theoretical integration. A final set of questions concerns the pros and cons of each of these kinds of goals; I’ll focus on the first and third.
The standardization of terminology and methods is a worthy general objective, but an issue that came up in several of the talks at the workshop is that methods need to be adapted and revised, so enforcing standardization too strictly can be counterproductive. This debate has happened before: Cassman and colleagues gave a caustic assessment of the disorganization of systems biology, together with a recommendation for the creation of:
…a central organization that would serve both as a software repository and as a mechanism for validating and documenting each program, including standardizing of the data input/output formats. …
This repository would serve as a central coordinator to help develop uniform standards, to direct users to appropriate online resources, and to identify — through user feedback — problems with the software. The repository should be organized through consultation with the community, and will require the support of an international consortium of funding agencies.
This call to action prompted a stern response from Quackenbush, who argued that such standardized systems are appropriate for mature research fields, but not for emerging fields, where innovation and diversity are essential. Quackenbush concludes:
We believe that the centralized approach proposed by Cassman and colleagues would not fare well compared with more democratic, community-based approaches that understand and include research-driven development efforts. Creating a rigid standard before a field has matured can result in a failed and unused standard, in the best of circumstances, and, in the worst, can have the effect of stifling innovation.
The point is important to consider for a CMD since comparative cognition is still a relatively immature field. But it need not count against any kind of attempt at centralized integration. It does count in favor of a sparse, cautious approach to controlled coding and method standardization, but open coding and wiki systems are compatible with innovation and diversity, whilst still promoting overall integration.
One way to think about it is in terms of two paths to the integration of a field: a ‘low path’ centered on standardized terminology and methods, and a ‘high path’ centered on concepts and theory. In an immature field both paths have many difficulties, but somewhat counterintuitively the high path may be more feasible and important in the earlier phases of development. More feasible because some degree of qualitative conceptual integration is still possible even without precisely defined terms and methods. More important because a reasonably coherent high level understanding of the field is needed in order to decide how to standardize basic terms and methods. For example, Cajal’s neuron doctrine provided a basic conceptual framework that profoundly shaped modern neuroscience, but it flowed from a keen attention to the ‘big picture’, including global brain organization and evolutionary and ecological context (Llinás 2003). It is hard to imagine that he could have developed such a productive conceptualization without this larger understanding. Seen as a vehicle for promoting change, low path integration is the more obvious strategy for a database to pursue, but current technologies make it possible for a database to also or instead aim at facilitating high path integration.