Processing and Preserving Governor’s Office Emails using Predictive Coding Tools
Dec 04, 2019
The Illinois State Archives (ISA) has served as the primary repository for Illinois government records of permanent value since its establishment in 1921. Illinois has a well-established legal framework for the management of government records of all types and the ISA has successfully overseen the efficient management, disposition, and preservation of government records for many years. Until recently, this process has been almost entirely paper-based. However, with the increased use of electronic records by government agencies at all levels, the ISA has made great strides to update and adapt its procedures and processes to accommodate records in this new environment.
In 2016, the Illinois State Archives collaborated with the Records and Information Management Services (RIMS) office of the University of Illinois at Urbana-Champaign, in conjunction with the Library at the University of Illinois, on a test project to acquire, process, and provide access to a collection of email messages of enduring value from senior state government officials. This project, Processing Capstone Email Using Predictive Coding, is generously funded by a three-year grant through the National Historical Publications and Records Commission (NHPRC). A unique aspect of this project is the evaluation of commercial tools that may help the archives community effectively process large collections of email messages. Furthermore, the Illinois State Archives sees collaborating with the University of Illinois as an opportunity to develop a partnership that, in turn, can provide a more sustainable digital program with access to electronic records for both institutions.
For this project, e-discovery tools were reviewed using email messages secured through the implementation of the Capstone approach developed by the National Archives and Records Administration. Illinois State Archives Director, Dr. David Joens, considers email messages from senior administrators in key state agencies to be the modern equivalent of subject or general correspondence files that have long been held to have permanent value. However, email continues to present unique accessioning challenges due to a variety of factors, including: large volume and duplication; conversation threads; diverse file formats and attachments; links to external documents and resources; inconsistent classification and mixing of personal, informal, and official communications; and the prevalence of sensitive content.
The ISA secured support from the Office of the Governor to give the project team access to email messages previously identified as having potential archival value. All email selected for use by the project came from persons working in offices that report up through the Office of the Governor. Created primarily between 2003 and 2015, the messages consisted of approximately 500 GB of content, or approximately 5.3 million individual messages (not counting attachments). After removing duplicate messages, we were left with approximately 3.6 million messages and another 1.1 million attachments, for a total of approximately 4.7 million digital objects. Upon approval from the Office of the Governor, the IT department produced an initial set of email as PST files and transferred them to the project team on an external hard drive. Ownership of the records remained with the Office of the Governor, and, for purposes of this test program, the emails are considered user copies. Once the dataset was secured, text recognition and extraction processes were applied.
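The article does not say how duplicates were detected. As a minimal sketch of one common approach, the Python below hashes a few content-bearing fields of each message and keeps only the first copy of each hash. The choice of fields and the names `fingerprint` and `deduplicate` are assumptions made for illustration, not the project's actual method.

```python
import hashlib
from email import message_from_string
from email.message import Message


def fingerprint(msg: Message) -> str:
    """Hash the fields that identify a message's content, ignoring delivery headers."""
    # For simplicity, multipart bodies are ignored in this sketch.
    body = msg.get_payload() if not msg.is_multipart() else ""
    parts = [
        (msg.get("From") or "").strip().lower(),
        (msg.get("To") or "").strip().lower(),
        (msg.get("Date") or "").strip(),
        (msg.get("Subject") or "").strip(),
        " ".join(str(body).split()),  # collapse whitespace differences
    ]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def deduplicate(raw_messages):
    """Return the first copy of each distinct message, preserving order."""
    seen = set()
    unique = []
    for raw in raw_messages:
        digest = fingerprint(message_from_string(raw))
        if digest not in seen:
            seen.add(digest)
            unique.append(raw)
    return unique
```

Because routing headers such as `Received` are left out of the hash, two copies of the same message pulled from different mailboxes collapse into one. A production workflow would likely also compare attachments and normalize time zones.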
Once we began reviewing various e-discovery tools, we learned that some offer specific functionality while others offer a suite of services and features. We selected two open source tools and four commercial tools for hands-on analysis. The open source tools are ePADD and TAR Evaluation Toolkit, while the commercial tools are Microsoft Office 365 Advanced eDiscovery, Open Text’s Recommind, FTI Consulting’s Ringtail (now owned by Nuix), and Luminoso Analytics.
Features desired or required for an email analysis workflow were identified for three categories: pre-processing, content analysis, and content preparation.
When assessing ePADD, we did not focus on the tool for the analysis of the email, as we wanted to test predictive coding as a method of quickly analyzing a large email dataset. Our review of ePADD revealed that while it does not have predictive coding capabilities, the tool does provide a user-friendly public access interface. Therefore, additional exploration is planned to review ePADD as a possible end-user access point for the emails. Upon review of the TAR Evaluation Toolkit, we found it to be very effective when used in close coordination with its creators and programmers, who manage the work behind the scenes. Without input from the creators, the open source version of Toolkit does not have the same capabilities, and it does not include an end-user interface that would allow someone to work with the tool on their own. While our initial review of Toolkit ruled it out for the purpose of the project, we believe it is a useful tool worth further exploring as an addition to existing tools used by the archival community.
Our assessment of the commercial tools afforded the most opportunity to test active learning and predictive coding. Our preliminary focus in assessing Advanced eDiscovery was to learn how the tool labels emails based on “themes” automatically identified through analysis of the content found in the body of the messages. With this particular dataset, we found too few relevant messages for each theme to create a useful training set. A review to identify documents in a broader theme such as “restricted” was also unsuccessful. Upon reviewing Recommind, we learned that it is an extremely robust tool and that the company provides ample access to tutorials and “white glove” support. However, this support model required us to reach out for help any time we had a question or wanted to do something beyond tagging documents, which made it difficult to get a good understanding of how the tool operated behind the scenes. Luminoso is used mostly to find content by keyword and by terms related to that keyword. It provides a word cloud display of the content of the documents (email), highlighting the words that are most prevalent. The user can also add concepts and request a download of the documents that match them. The tool did not actually provide a predictive coding or active learning option; therefore, we chose not to pursue it, given the focus of the grant.
Ultimately, Ringtail was the product with predictive coding capabilities that provided us the most direct access to its inner workings. We were able to manage its setup and create cases. It provides a comfortable user interface and keeps track of statistics about the documents and tagging efforts. The predictive models in Ringtail provide a high level of insight into the performance of a model, allowing one to fine-tune the desired level of recall, precision, and accuracy based on the manual review effort required and the tolerance for mistakes.
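To illustrate the trade-off among recall, precision, and accuracy, the following Python sketch computes all three at a chosen score threshold. It is a generic illustration, not Ringtail's implementation. The function name `metrics` and the idea that the model emits a score between 0 and 1 per document are assumptions.

```python
def metrics(scores, labels, threshold):
    """Return (precision, recall, accuracy) when documents scoring at or above
    `threshold` are predicted responsive. `labels` holds the reviewers' decisions."""
    tp = fp = tn = fn = 0
    for score, is_responsive in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_responsive:
            tp += 1
        elif predicted and not is_responsive:
            fp += 1
        elif not predicted and is_responsive:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(labels) if labels else 0.0
    return precision, recall, accuracy


if __name__ == "__main__":
    # Made-up scores and reviewer labels, for illustration only.
    scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
    labels = [True, True, False, True, False, False]
    for t in (0.35, 0.50, 0.70):
        p, r, a = metrics(scores, labels, t)
        print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f} accuracy={a:.2f}")
```

On this toy data, lowering the threshold from 0.70 to 0.35 raises recall from about 0.67 to 1.00 while precision falls from 1.00 to 0.75. Higher recall means fewer responsive messages missed, at the cost of more false positives for reviewers to clear.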
[Figure: Ringtail predictive model]
[Figure: Ringtail predictive model projections]
Working with predictive coding tools means working with tools that are learning what content is most responsive to one’s inquiry, often doing so using an iterative training process. Initially, the process requires human reviewers to read email messages and tag them according to pre-determined criteria indicating whether the messages are responsive. Using this iterative learning process, the project team focused its initial efforts on identifying sensitive content. Staff reviewers looked for content that contained personally identifiable information or sensitive communications, such as what may be found in a personnel file. Additionally, potentially sensitive concepts such as “Family Matters,” “Endearments,” and “Health Concerns and Conditions” were defined so as to flag email messages that may be personal in nature. Once sensitive content was identified, the tools could be used to further review and identify concepts that may be of interest to researchers. This approach has highlighted the effect that variances in human reviewers’ judgment can have on the results, due to the nuances of language in the email messages. As a result, every time a decision is made to change the criteria being used to code the documents, however subtle, the accuracy of the algorithm is affected.
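The iterative process described above can be sketched as a small active-learning loop. The Python below is an illustration only. It uses a deliberately naive word-count scorer as a stand-in for a real predictive model, and `train`, `score`, `review_loop`, and `ask_reviewer` are hypothetical names that do not reflect how Ringtail or any other product works internally.

```python
import math
from collections import Counter


def train(labeled):
    """Count how often each word appears in sensitive vs. non-sensitive messages."""
    pos, neg = Counter(), Counter()
    for text, is_sensitive in labeled:
        (pos if is_sensitive else neg).update(text.lower().split())
    return pos, neg


def score(model, text):
    """Sum of smoothed log-odds: above 0 leans sensitive, below 0 leans not."""
    pos, neg = model
    return sum(math.log((pos[w] + 1) / (neg[w] + 1)) for w in text.lower().split())


def review_loop(seed, unlabeled, ask_reviewer, rounds):
    """Each round: retrain, pick the message the model is least sure about,
    and ask a human reviewer to tag it."""
    labeled = list(seed)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        model = train(labeled)
        most_uncertain = min(pool, key=lambda text: abs(score(model, text)))
        pool.remove(most_uncertain)
        labeled.append((most_uncertain, ask_reviewer(most_uncertain)))
    return train(labeled), labeled, pool
```

Here `ask_reviewer` stands for the human tagging step. If reviewers apply the criteria inconsistently, or the criteria change between rounds, the word counts absorb contradictory labels and the scores shift accordingly, which is the sensitivity to reviewer judgment noted above.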
We originally received 5.3 million messages for this project. Through the application of predictive coding, we reduced the total amount of material to be made publicly available to approximately 1.8 million objects, or approximately 38% of the 4.7 million digital objects that remained after de-duplication and attachment extraction. Approximately 0.5 million objects were identified as archival but also restricted and will not be made publicly available at this time; however, these will be preserved for later review.
In preparation for moving content out of Ringtail and onto a stand-alone computer that will be used to provide public and staff access, we have identified the metadata fields associated with each email message that we would like to retain when we export the data. There are hundreds of metadata fields potentially available for consideration. Our focus is on those that preserve information about the digital objects and provide a reasonable level of trust regarding the provenance and file format information. Retained metadata include, among others, the content creator and receiver(s), the date the message was sent, and whether the digital object was an attachment and, if so, its native file format.
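As a rough sketch of what retaining such fields can look like outside any particular product, the Python below uses the standard library `email` package to pull sender, recipients, sent date, and attachment file types from a raw message. The function name `export_metadata` and the output keys are hypothetical and do not correspond to Ringtail's export fields.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def export_metadata(raw):
    """Extract a small, provenance-oriented set of fields from one raw message."""
    msg = message_from_string(raw)
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    # Any MIME part that carries a filename is treated as an attachment here.
    attachments = [
        {"filename": part.get_filename(), "content_type": part.get_content_type()}
        for part in msg.walk()
        if part.get_filename()
    ]
    sent = msg.get("Date")
    return {
        "from": msg.get("From"),
        "to": [address for _name, address in recipients],
        "date_sent": parsedate_to_datetime(sent).isoformat() if sent else None,
        "subject": msg.get("Subject"),
        "attachments": attachments,
    }
```

The resulting dictionary can be serialized alongside each exported object so that creator, recipient, date, and format information travel with the content.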
To provide public access to the archival emails, we investigated several options and ruled out Microsoft Outlook, which we had previously considered. Outlook is difficult to lock down, and it would leave end users too much freedom to alter the email. For this reason, we have decided instead to use a simple MBOX viewer that does not allow any tampering with or erasure of the email.
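The article does not name the MBOX viewer chosen. To show the read-only idea in miniature, the Python sketch below lists messages from an MBOX file with the standard library `mailbox` module and never calls any method that adds, removes, or rewrites messages. The function name `list_messages` is hypothetical.

```python
import mailbox


def list_messages(mbox_path):
    """Yield (sender, date, subject) for each message in an MBOX file.

    Only read operations are used, so the file's contents are left unchanged.
    """
    # create=False raises an error instead of creating an empty file
    # when the path does not exist.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            yield (
                message.get("From", ""),
                message.get("Date", ""),
                message.get("Subject", ""),
            )
    finally:
        box.close()
```

For public access, operating system controls such as read-only file permissions would still be advisable, since the guarantee here rests only on the code never requesting a write.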
In October 2019, NHPRC approved a six-month no-cost extension to our grant that has allowed our team an opportunity to focus on ways to ensure this project remains sustainable after the grant has ended. Thus, we have continued exploring methods for ingesting the archival email into Preservica, the preservation repository used by the Illinois State Archives. This exploration has proven challenging due to various metadata elements that must be taken into consideration from the original email as well as Ringtail, and how that metadata is ingested into Ringtail. Because providing direct or easy access to preservation copies of the email is beyond the scope of the original objectives of this project, we have uploaded the email and its metadata as insurance against losing the content.
To foster transparency and accountability in governance, researchers and the public must have access to information from government officials that provides insight into their actions and decisions. To reliably preserve large collections of digital documentation of diverse government operations, archivists increasingly need to leverage scalable technology. Although the tools are not perfect, preliminary findings from this project support the use of e-discovery tools to complete archival workflows more efficiently and to enhance access. As is often the case, more research is needed.