Web Harvesting at the Maryland State Archives

Nov 01, 2016



The Maryland State Archives was originally founded in 1935 as the Hall of Records, integrated into the Department of General Services in 1970, and in 1984 was assigned its current name and status of independent agency within the Office of the Governor. It functions as the central depository for permanent government records from Maryland's municipalities, counties and statewide agencies. Records in its custody are as old as the foundation of the colony in 1634, and as new as digital land records from 2016, which are transferred nightly to the Archives.

The History of Web Harvesting

While Maryland's work with electronic records began over a decade ago, its efforts to capture web data have been a comparatively recent phenomenon. This reflects broader trends: websites have been publicly accessible since the early 1990s, and a form of web archiving was being developed by 1996, but serious efforts to archive them only picked up steam over the past ten years.

Part of the hesitance to archive this content has been an argument against the permanence of websites. Much like transitory correspondence, websites are sometimes believed to be an inherently disposable, short-term medium for the delivery of information. Additionally, archiving them is a fairly complex task. Websites are dynamic accumulations of interrelated data, and their content and structures are updated frequently. Even in the simplest scenarios, the value of a website is a combination of presentation and substance; without maintaining the appearance of a website, and the continuity of its hyperlinks, the integrity of the archived record is easily compromised.

Over time, a consensus emerged among archivists that web content was valuable, and experts began to dedicate their efforts to saving this data. In 1996, Brewster Kahle founded the Internet Archive, a 501(c)(3) non-profit which aimed to be the internet’s library, and co-founded Alexa Internet, which provided tools to analyze web usage and content. In 1999, Alexa Internet was sold to Amazon and the Internet Archive began to actively seek a larger range of collections, including text, sound, video, and software. This led to the public launch of the Wayback Machine in 2001, a web-based service allowing internet users to view previous versions of websites organized by date, and ultimately resulted in the release of Archive-It in 2006. Archive-It, a subscription web archiving service, offers software, storage, and support for institutions looking to archive and present collections of websites.

Archive-It and The Maryland State Archives

When the Maryland State Archives turned its attention to preserving the government's web presence in March 2011, Archive-It was believed to be the most efficient way to address two major concerns: the transitory nature of state government web presence and the difficulty encountered in capturing government publications. State Archivist Tim Baker (then Deputy) and a staff member of the Administrative Services Department, Stephanie Smith, established a partnership with Archive-It and laid much of the groundwork for the agency's current efforts.

Web collections were designed at the start to mirror the structure of Maryland's government at the state, county and municipal levels. In particular, collections were organized first by branch of government, then by government entity within those categories. Quality assurance was carried out manually by Baker and Smith to confirm that the web captures met the standard of displaying websites that were visually and functionally similar to their live presence. The Maryland State Archives' IT Department was brought into the project to produce a landing page, which facilitates access to the state's harvested websites from the Archives' home page, rather than requiring users to navigate to them externally through the Internet Archive.

A two-year subscription to Archive-It's harvesting services was purchased in 2011. This subscription permitted the annual harvesting of 300 seeds across one to three active collections, with a maximum of 13 million URLs and 2 terabytes of data. The most recent subscription was activated in March 2016, and the terms were modified to reflect Maryland's experience over the past five years - the agency reduced its data cap to 0.75 terabytes of data per year, but retained the limit of 13 million URLs. The data cap and associated costs decreased because the annual data accrual had never approached the original 2 terabyte limit over the first five years of the program.

In fact, since March 2011, only 3.2 terabytes of web data have been archived in total. To break this down more specifically, the Archives collected 10,202,282 documents in 2012, 11,574,831 documents in 2013, 12,467,798 documents in 2014, 14,231,489 documents in 2015, and 11,396,910 documents in 2016. (Since the year designations are based on the contract established in March 2011, the 2016 period represents a range of March 2015 to February 2016.)

Several broad trends can be identified from this data and a closer look at the final results of each individual web crawl conducted by the agency. Overall, documents increased in the period of 2012 to 2015, which reflects an increasing commitment by Maryland's government to presenting information and resources online for citizens. The decrease from 2015 to 2016, however, represents a major improvement in the way that Archive-It harvests websites, since crawls are being more accurately scoped and duplicate data is being excluded from harvest. Web development and web archiving are both rapidly evolving, so it is expected that trends will continue to change in the years ahead.
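The rise and fall in document counts can be checked with a little arithmetic. The following is a minimal sketch in Python that uses only the annual figures reported in this article; the variable and function names are illustrative and are not part of any Archive-It tool or report.

```python
# Document counts per contract year, as reported in this article.
documents_by_year = {
    2012: 10_202_282,
    2013: 11_574_831,
    2014: 12_467_798,
    2015: 14_231_489,
    2016: 11_396_910,
}


def year_over_year_changes(counts):
    """Return {year: (absolute change, percent change)} versus the prior year."""
    years = sorted(counts)
    changes = {}
    for prev, curr in zip(years, years[1:]):
        delta = counts[curr] - counts[prev]
        changes[curr] = (delta, 100.0 * delta / counts[prev])
    return changes


changes = year_over_year_changes(documents_by_year)
total_documents = sum(documents_by_year.values())

for year, (delta, pct) in changes.items():
    print(f"{year}: {delta:+,} documents ({pct:+.1f}%)")
print(f"Total, 2012-2016: {total_documents:,} documents")
```

Run against these figures, the sketch shows three consecutive annual increases followed by a drop of roughly 20 percent from 2015 to 2016, which is the decrease attributed above to more accurate scoping and the exclusion of duplicate data.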

Ongoing Development

After the initial work was completed by Baker and Smith, more staff were brought into the web harvesting program on a part-time basis. While their primary duties were elsewhere in the agency, archivists Megan Craynon, Christopher Schini, and Christian Skipper took over the ongoing monthly maintenance of the program. This involves setting up new URLs or hosts for crawling when needed, and troubleshooting incomplete crawls. Web harvesting is a delicate process, and the barriers to having a fully realized end product are numerous; these include web scripts that intentionally prevent Archive-It from capturing a snapshot of the site, an overabundance of documents that simply takes too long to process, and complex applications like calendars or search engines in which user input is necessary to organize data. These issues are often flagged by the Archive-It software and staff attention is directed accordingly.

[Image: Marylanders Grow Oysters website, April 2011 capture]

Happily, the Maryland State Archives has found the effort to be very worthwhile. In particular, capturing websites offers a unique perspective into the public face of changing government programs. One such government initiative is the Marylanders Grow Oysters project, managed by the state's Department of Natural Resources. In its earliest capture (April 2011), the program was in its infancy - its introduction explains that hundreds of waterfront property owners were participating, and a note at the right of the page indicates that the program was in the process of expanding from 11 to 18 tributaries. In the most recent capture (October 2016), a number of significant changes were immediately apparent: the introduction states that over 1,500 property owners had joined the program, the number of tributaries had grown from 18 to 30, Spanish-language services had been made available, and even the format had been altered - the updated website scales to the size of the device or window being used to view it, reflecting a major push to optimize government content for smartphones and tablets. Even images were captured effectively on these crawls: a map of participating tributaries is available on both iterations of the website, which presents the user with a quick visualization of the program's adoption throughout the region. Not all website changes are so easily identified, but many show similar evidence of an agency or program's evolution over time.


While this article offers only a glimpse into why web harvesting is valuable, it clearly illustrates the Archives' motivation. Government websites function very similarly to government publications, but the richness of interactivity offers an even more complete window into how citizens were served by their government. A brief review of web content can tell us how effective programs were, how those programs were framed to be attractive to a constituency, to whom those programs were being directed, and how users were accessing the information. The history of Maryland in the early 2010s, therefore, is a history represented and reflected by the development of its institutions' web presence. When properly archived and cataloged, web data will allow historians of the future to tell the story of how government agencies evolved to reflect the needs of our changing communities.

