Update: Preserving Electronic Government Information (PEGI)
Feb 01, 2018
Guest Blogger: Tim Baker, Maryland State Archivist and CoSA President
In October 2017, the PEGI Project Team converged at GPO’s FDLP annual meeting in Arlington, Virginia to facilitate a mini forum for government information librarians. PEGI leaders pitched a series of questions designed to explore the nature and breadth of problems concerning preservation of born-digital government information. Participants were seated in small groups, and each table took an interesting and unique direction. Here is a recap of discussion from Table #2.
WHAT BORN-DIGITAL INFORMATION IS ALREADY PRESERVED?
At first, we quickly rattled off some websites where one can find government information, but we weren't sure whether the host organizations were actually "preserving" it. We quickly realized we needed a common understanding of what it means for digital information to be preserved.
There was a vague assumption that the simple fact of online availability is a sort of de facto “preservation,” and this impression could have arisen from experience using the Internet Archive’s Wayback Machine. But it is not technically possible to capture some websites, and many of those that can be captured are still missed. We return to the mantra that Access is Not the Same as Preservation. Lacking web preservation experts at our table, none of us could say definitively how rigorous efforts must be in order to qualify a website for trustworthy, long-term access.
At this point we naturally segued into a discussion of archival formats. We were aware that some digital formats are considered archival while others are not, such as PDF/A versus PDF. We realized that it will probably be necessary to continually migrate information to newer formats whenever current formats are dropped by advancing technology. This will require an ongoing commitment, funds and IT expertise. It may be tempting to migrate information just one time into the most lasting format available. But this is not really a solution. Data must be usable in the future information environment. For example, it wouldn’t make sense to preserve datasets by printing them out on archival paper, even though the pages might last hundreds of years. Whatever format is selected must be practical for actual use.
Despite ambiguities regarding the definition of “preservation,” our table answered the forum’s first question by rattling off a list of collections that are probably being preserved for future use:
- Certain types of legal information, including laws and regulations, but with some exceptions. For example, certain information from the U.S. Supreme Court is not made public.
- We presume that any title having a PURL in the U.S. GPO’s Catalog of Government Publications (CGP) database is being preserved by the GPO.
- Born-digital government information that was on the internet in the 1990s may still be available on agency websites under links for “historic” data.
WHAT ARE THE BIGGEST GAPS IN BORN-DIGITAL PRESERVATION?
- Is it reasonable or possible to have a gap-less digital collection? Years ago, some libraries worked to collect every printed item in a defined subject area because they could not predict what future historians would consider important. But the sheer quantity of digital information generated today makes this sort of unlimited collection development approach impractical.
- Web-based information can change daily, hourly, or even more frequently. Failure to document one of these edits would leave a gap. Websites that are dynamically generated or edited with a Content Management System represent a special problem. How often would information have to be recaptured in order to document turning points in history? Like detectives in an investigation, future historians may need to know what happened over a 5-minute period of time on a crucial day at a certain government agency.
- Technology to completely collect and preserve web information does not yet exist. If not even one day of the New York Times can be successfully crawled in 24 hours, then what is our hope of crawling the entire federal government web? Current technology for crawling the web is 5-6 years old. The web itself is advancing faster than the technology needed to preserve it.
- The word “gap” suggests a small missing piece. We may be looking at a “gap” of 95% of all we would wish to preserve. Prospects for born-digital preservation will be bleak unless there is a major turnaround in the way our government handles openness. It almost seems that every word and action taken by our government would need to be open by default, with classification and privacy settings placed very selectively, rather than the other way around.
WHAT ELECTRONIC INFORMATION DO YOU THINK IS MOST AT RISK TODAY?
WHAT IS THE NATURE OF THOSE RISKS?
Here are examples of at-risk information that immediately sprang to our minds:
- Information that could be politicized is at risk. For example, the commercial sector might pressure the government to suppress data if it would interfere with business or profits.
- Overly specific laws and regulations describing government information can inadvertently put digital information at risk. Presently, if federal information is not in a "publication" and not considered a "permanent record," then it falls into the gap between GPO and NARA and is not subject to any laws for collection or preservation.
- Special formats can present a problem for preservation. Audio-visual files, databases and GIS data are all at risk. The GAO has its PDFs in a database, but web crawlers cannot reach them.
- Any government information that has versions may be at risk.
Regarding the nature of risks, some of the risk can be ascribed to the actions (or inactions) of those who create the information. Agencies probably have an audience in mind, but are not necessarily thinking of future users.
We also noted a big disconnect between laws requiring agencies to preserve their information, and the appropriations they receive to carry out this function. Federal records managers remain among the most underfunded of positions. Many times, the reason why digital information is not made available to the public is that agencies do not have sufficient personnel to attend to these tasks in addition to doing what they consider to be their primary responsibilities. When time and resources fall short they must prioritize, and preservation might fall too far down on the list. Some projects will drop off the radar, especially those that have no consequences if they are not carried out.
We found it useful to consider the difference between the culture of IT (information technologists) and that of librarians/archivists. Collecting, organizing and managing information is something librarians and archivists feel is central to their work. Yet work involving digital assets is often assigned to IT departments. The two cultures have completely different approaches, and so far there is very little communication between them.
We discussed how misconceptions sometimes undermine our attempts to educate people about the born-digital preservation problem. When we hear in FDLP circles that "95% of all government information is on the internet," it can leave the sanguine impression that only 5% of government information is unaccounted for. In reality, that statement is true only of tangible items shipped to depository libraries in boxes. It is true that 95% of physical documents shipped through the FDLP are also available on the internet. But what the FDLP distributes in boxes is an extraordinarily tiny fraction of all government information.
The fact is that most government information never gets posted to the web in the first place. Rather, it sits on hard drives in government employees' offices. It gets uploaded to SharePoint servers for limited access to others in the same agency. It might be the sort of data that resides in a personnel system, protected by privacy laws. To focus only on what ultimately reaches the internet would miss the vast majority of it. NARA’s best estimate is that they receive only 5% of all information that the government generates. NARA receives what the agency itself has officially defined as "permanent records." Understanding the definition of “permanent records” is key to understanding gaps in preservation. Agencies are free to discard whatever the law does not require them to keep. Once the agency discards the information, it is gone forever.
Looking back to the 20th century, we recalled examples of print-based information that was lost for a variety of reasons. Some of those same issues threaten digital information today. Understanding what went wrong in the past could help us avoid similar errors as we move forward.