I often enjoy thinking about how much software has changed our lives, how much software exists in the world, and how much is being written. I also like to consider how quickly the rate at which we write software is changing and the implications that this has for society. This is especially important for science, where publications tend to summarize work done from some perspective, but the real record of the work may be the software. But what do we really do today to preserve and, if you will, curate collections of software, especially scientific software and the business software that supports science?
About two years ago I was asked to join an effort by the US Department of Energy’s (DOE) Office of Scientific and Technical Information (OSTI) that, in part, looked at this question. The effort was to develop a new software services and search platform for the DOE’s vast – and I do mean vast – software collection, including both open and closed source projects. This effort came to be known as DOE CODE and the Alpha version of the platform was released in November 2017.
How vast is vast?
DOE CODE is the latest in a long line of names for a software center that has supported the scientific community since 1960 and was started by Margaret Butler at Argonne National Laboratory. At the time it was called the Argonne Code Center, and it later became the National Energy Software Center. In 1991, the center moved from Argonne National Laboratory to OSTI headquarters in Oak Ridge, Tennessee, and was renamed the Energy Science and Technology Software Center (ESTSC). The ESTSC website was launched in 1997 and the effort to develop DOE CODE as the new public-facing platform for the software center started in 2017. In the 58 years between then and now, over 3600 software products have been submitted by national laboratories and DOE grantees, many of which are still active. Each record includes all of the metadata about the software, as described by DOE Order 241.4, as well as the code, either in binary or source form.
3600 software packages is a truly vast collection of software. However, when we started the project, we noticed while searching around on GitHub that many projects supported by DOE funds were not cataloged. How many? Well, based on the fact that a GitHub search of “Department of Energy” returned over one million hits at the time and using the assumption that a file or class would be between one hundred and one thousand lines, we estimated that the number of DOE software packages on GitHub alone that were also not in the existing catalog was between one thousand and ten thousand packages. Further investigation by Ian Lee from LLNL suggested that it was closer to the lower of the two numbers. This does not include projects on other sites such as BitBucket or SourceForge.net, but if we assume that there are roughly as many packages on those sites, then our estimate of the total number of DOE software packages is somewhere between 4000 and 7000 packages. While we may never catalog all of those packages, it is clear that open source software is a very important part of the DOE’s software development community and that the effort to redevelop the software services and search platform needed to strongly consider this point.
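For readers who want to see the back-of-the-envelope arithmetic, here is a minimal sketch of one way to read that GitHub estimate. It assumes the search hit count is simply divided by the assumed size range of one hundred to one thousand; the function name and that interpretation are mine, not a record of how the team actually computed it.

```python
def estimate_uncataloged_packages(search_hits, small_unit=100, large_unit=1000):
    """Return a (low, high) package estimate from a raw search hit count.

    Assumption (not from the original analysis): each package accounts for
    somewhere between `small_unit` and `large_unit` hits, so dividing the
    hit count by each bound gives the high and low ends of the range.
    """
    low = search_hits // large_unit   # larger units -> fewer packages
    high = search_hits // small_unit  # smaller units -> more packages
    return low, high


# "Over one million hits" is rounded down to one million for illustration.
low, high = estimate_uncataloged_packages(1_000_000)
print(f"Estimated uncataloged packages on GitHub: {low} to {high}")
```

Under these assumptions the range comes out to one thousand to ten thousand, which matches the figures quoted above.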
It became clear over the course of initial requirements gathering exercises that the DOE needed a new software services and search platform that could simultaneously meet the needs of both open and closed source projects. The platform also needed to assist with OSTI’s continuing mission to collect, preserve, and disseminate software, which is considered by the DOE to be another type of scientific and technical information. (This post will not address the topic of limited and/or classified software.) Figuring out exactly how the DOE community worked with open source software would be a challenge on its own, but establishing a balance between the needs of both open and closed software projects required significantly more effort. This new effort and the new service it would spawn were distinct enough that a new name was warranted, thus the adoption of the much simpler “DOE CODE” over previous names.
DOE CODE supports OSTI’s efforts to collect, preserve, and disseminate software artifacts by acting as a single point of entry for those who need to discover, submit, or create projects. Instead of mandating that all DOE software exist in one place, DOE CODE embraces the reality that most projects exist somewhere on the internet and are generally accessible in one way or another. DOE CODE reaches out to these repos directly or, in the case of GitHub, BitBucket, and SourceForge.net, integrates with their programming interfaces. DOE CODE itself exposes a programming interface so it can be used the same way by libraries or similar services around the world.
Users can provide their own repositories or use repositories hosted by OSTI through GitLab or through a dedicated DOE CODE GitHub community. DOE CODE also centralizes information on software policy for the DOE and links to developer resources from, for example, the Better Scientific Software project. The platform can also mint Digital Object Identifiers (DOIs) for software projects, which was a big request from the community early in development. To date, many of the projects that exist in OSTI’s full catalog have been migrated to DOE CODE and many of these projects have been assigned DOIs as well.
All of this is on top of an interface that is streamlined and easy to use. Adding project metadata is often as simple as providing the repository address and letting DOE CODE do the rest by scraping the metadata from the repo!
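To give a flavor of what "providing the repository address" can look like in practice, here is a minimal sketch, not DOE CODE's actual implementation. It turns a GitHub repository address into the corresponding GitHub REST API URL and maps a repository response onto a small metadata record. The output field names (`software_title`, `repository_link`, and so on) and the example repository are hypothetical; the input keys (`name`, `description`, `html_url`, `license`, `language`) are ones the GitHub repositories endpoint returns.

```python
from urllib.parse import urlparse


def github_api_url(repo_url):
    """Build the GitHub REST API URL for a repository web address."""
    parts = urlparse(repo_url)
    if parts.netloc.lower() not in ("github.com", "www.github.com"):
        raise ValueError(f"Not a GitHub address: {repo_url}")
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2:
        raise ValueError(f"Expected an owner and a repo name in: {repo_url}")
    owner, repo = segments[0], segments[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return f"https://api.github.com/repos/{owner}/{repo}"


def to_metadata(api_response):
    """Map a GitHub repository response (a dict) to a metadata record.

    The output field names are illustrative only.
    """
    license_info = api_response.get("license") or {}  # license may be null
    return {
        "software_title": api_response.get("name"),
        "description": api_response.get("description"),
        "repository_link": api_response.get("html_url"),
        "license": license_info.get("spdx_id"),
        "language": api_response.get("language"),
    }


# Hypothetical repository address and a hand-written sample response.
print(github_api_url("https://github.com/example-org/example-repo.git"))
sample = {
    "name": "example-repo",
    "description": "An example project",
    "html_url": "https://github.com/example-org/example-repo",
    "license": {"spdx_id": "BSD-3-Clause"},
    "language": "Java",
}
print(to_metadata(sample))
```

Fetching the API URL and decoding the JSON is left out so that the sketch runs without network access; any HTTP client would do for that step.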
These features combine to provide an experience that is focused on enabling social coding while simultaneously integrating software, publications, data, and researcher details to create a holistic picture of DOE development activities. Part of this includes embracing social media and allowing users to share what they find through their favorite social media platform.
Searching is as simple as using the search bar, but advanced options such as language and license are also available. The Alpha release of DOE CODE contained about 700 open source software packages and the total number of packages has since grown to 874, which is about 1.5 new additions per day since the launch.
My favorite feature of DOE CODE, as its lead architect, is that it is open source itself. It is, in the words of my nephew, “epically meta” to build a service like DOE CODE that can list itself as an open source project. In fact, throughout the development process we used the DOE CODE repo on GitHub as our primary test case for working with source code repositories.
The open source nature of DOE CODE is my favorite feature because it means that the code can be reused and that this level of software project curation can be adopted, modified, and explored wherever it is needed. OSTI’s deployment of DOE CODE fits into their existing infrastructure as a plugin of sorts. It feeds information to their ELink service, which ingests the metadata and executes a number of data processing workflows in the background to process the information according to a number of DOE orders, policies, business rules, and basic technical requirements. ELink then publishes this information to the main OSTI.gov site and provides some additional metadata to DOE CODE. It doesn’t have to work that way though. Oak Ridge National Laboratory (ORNL) is in the process of deploying a DOE CODE clone, called ORNL Code, that leaves out the backend processing, restyles the site, and adds Single Sign-On (SSO) authentication to integrate with ORNL’s other applications.
What we have found with the deployment of ORNL Code, which I am also leading, is that it is relatively straightforward to do custom deployments of DOE CODE. That was by design, but it is always good to verify it! We are also taking the next step at ORNL by putting ORNL Code in the cloud on Amazon Web Services. I remain hopeful that other organizations will try this too.
Building the Platform with Strong Community Backing
The effort to build DOE CODE was one of the most vibrant and fast-paced projects I’ve worked on in my time in the National Laboratories. Yes, I have definitely worked on projects that were shorter than sixteen months from conception to Alpha launch, but I have rarely worked on projects with such a large amount of engagement and scope that launched on time sixteen months later. The key to this success, in my opinion, was that we engaged as many people from the DOE community as possible and we kept every possible line of communication open. Part of this included, as previously discussed, releasing DOE CODE itself as an open source project.
Early in our development process we established about eighteen separate requirements teams that we used throughout development for guidance and testing. I lost count of the number of people that we interviewed when it was around eighty-eight, and that was early in January 2017. These teams were composed of members from various communities of interest from within the US national laboratories. Each team was about five to eight people, to start, but some of the teams quickly swelled to eight to ten people. One team went from eight people to twenty-seven, which was the phone call where I learned the consequences of saying “Sure, invite your friends!” We also had good community interactions on the GitHub site, on Twitter, and at conferences during the development cycle. I personally presented a talk on the project many times and sometimes multiple times in a single day. By the end of the year, we had presented thirteen invited talks on DOE CODE, which is the most invited talks I have ever presented in a single year.
To say that the DOE CODE team is grateful and indebted to the broader DOE community is an understatement, but it is a good start. We certainly could not have built the platform without their help and the many great people behind the scenes at OSTI and ORNL as well.
If you are interested in getting involved or learning more, you should check out the DOE CODE site or the GitHub community. You can also reach out on Twitter: OSTI maintains an active Twitter account (@OSTIgov) and I am always available (@jayjaybillings).