The iPlant collaborative will have a core infrastructure of hardware, software tools, and staff to support the work of grand challenge teams and the larger plant sciences community. The collaborative will also have an Integrated Solutions Team (IST), which will function as a bridge between biologists and the core infrastructure.
The Integrated Solutions Team (IST): Bridging Biology and Infrastructure
The Integrated Solutions Team is responsible for helping to design and develop “Discovery Environments” (DEs). Each DE is a software platform custom-designed to help biologists in the community address and solve a grand challenge (GC) problem. It both provides a virtual meeting place for a GC team and gives the team access to the underlying physical infrastructure. In particular, a DE allows GC team participants to access relevant data sets, integrate across them to identify connections, visualize them in ways that allow the ‘big picture’ to appear, manipulate the data with analytic tools, and share results by facilitating computational steering. The DEs designed for different grand challenges will overlap and coalesce into a comprehensive cyberinfrastructure.
Our model for Discovery Environments is the Internet “mashup,” also known as a Web 2.0 application, which allows community members to build content in a democratic way, to make and label connections between different types of content, and to integrate a variety of different types of information into a single user interface. Beneath their surface simplicity, Discovery Environments will support sophisticated systems for scientific analyses as well as semantic integration, description, and manipulation of diverse biological data. DEs will be integrated into the growing infrastructure of the iPC, becoming in time an open source resource that is expanded and maintained by the community as a whole.
IST members will have experience in biology, computer science, enterprise data management, bioinformatics, genomics, and software engineering. Other IST key personnel will add expertise in image analysis, machine learning, workflow management, cluster computing, statistical analysis, and large-scale data management. Two types of staff members will contribute to the IST: the research staff, consisting of postdoctoral researchers, graduate students, and their mentors; and the software engineering staff, who help the research staff design and prototype the software applications that make up the Discovery Environments needed to support GC projects.
IS research team members will be drawn from diverse backgrounds, including bioinformatics, computational biology, computer science, information systems, statistics, physics, and mathematics. The majority of research team members will be trainees at the graduate student or postdoctoral level. We believe that having students and postdocs at the center of IS research, contributing directly to the GCP projects, is the most direct way to foster a new generation of plant scientists skilled in quantitative, computational, and integrative thinking. Team members will attend grand challenge symposia and will become members of the resulting GC teams, providing expertise in exploratory algorithm development, data management, and data mining. They will also participate in the design of Discovery Environments to support their grand challenge projects and work with IS software engineers to prototype them. When the prototyped Discovery Environments are sufficiently stable, IS research team members will hand the work over to the core Infrastructure Development team, which will turn the ideas developed by the GC teams into production-quality, portable software, databases, and visualization engines. IS research team members will continue to liaise between the core infrastructure staff and grand challenge collaborators to ensure that the software gets into the hands of the collaborators and does what is intended.
IS software engineering staff will be professional software engineers whose role is to support the research staff in data mining, algorithm implementation, data management, and application development. These software engineers are distinguished from those who work in the infrastructure core by their skills in agile software development, a paradigm that emphasizes rapid development, flexible requirements analysis, and extensive early prototyping and testing. Software engineers with more traditional training easily become frustrated when dealing with biological applications because the problems are fluid and underspecified. Agile software developers, who often come from open source software development backgrounds, are temperamentally better suited to the fluid environment of the GCP projects, but are typically poorly suited to the task of creating the finished, hardened software that is the domain of the infrastructure core.
Using this combined model of research and development staff, the IST will help develop state-of-the-art data management capabilities for the iPC, design and implement new algorithms, and build and integrate workflow management components into the DEs. The core infrastructure team will then take the prototype DEs from the IST and turn them into production-quality systems to be field-tested and deployed.
The Core Infrastructure
The core infrastructure will contain computational facilities to support software development as well as the computing and visualization requirements of scientists doing computational modeling, analysis, data discovery, and other computing-intensive experiments. The core will contain shared-memory multiprocessors and clusters and provide an interface to grid computing facilities. It will also contain large storage systems to provide persistent, reliable, and effectively unbounded storage for plant science data. The repository will ensure that key data sets are preserved beyond the lifetime of the projects that produced them. It will also support reproducibility of experimental results by providing mechanisms to archive snapshots of experimental configurations, including all software and data used to generate a given set of results. The infrastructure will be developed and managed so that it is kept at the leading edge of technologies required to solve grand challenge problems in plant science.
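One small piece of the snapshot idea above can be illustrated with a content-hash manifest: record a digest for every file in an experiment's configuration, then check later that an archived copy still matches. The following is a minimal sketch under stated assumptions, not a description of the iPC design. The function names (file_digest, build_manifest, verify_manifest) and the choice of SHA-256 are illustrative assumptions, and the sketch does not address capturing the software environment itself.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file under root (relative POSIX path) to its digest.

    Hypothetical helper: a real archive would also need to record
    software versions, parameters, and provenance metadata.
    """
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict) -> bool:
    """Return True only if the files under root match the manifest exactly."""
    return build_manifest(root) == manifest
```

A manifest built when results are generated could be stored alongside the archived data and scripts; verify_manifest would then report whether any file has been added, removed, or altered since the snapshot was taken.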
The iPC Infrastructure Development team will have a full-time staff to install, develop, document, and maintain software tools in support of Grand Challenge teams, administer the physical infrastructure, and provide help-desk support for users. The staff will also include a small research and development team to design, prototype, and eventually deploy software systems that are needed but not available elsewhere. For example, creating a ‘reproducible experiment’ archive as described above is a hard, as yet unsolved problem. One key function of the infrastructure staff will be to ensure that the software scales to both large numbers of processors and large datasets.





