FDLP Web Archive

About

The FDLP Web Archive provides point-in-time captures of U.S. Federal agency websites. Unlike archiving and hosting individual documents, a web archive preserves the functionality of the entire website to the extent possible. The aim is to provide permanent public access to content found on Federal agency websites. GPO harvests and archives the websites with Archive-It, a subscription-based web harvesting and archiving service offered by the Internet Archive.

Ways to Access the Archived Sites

Archive-It Website

Search ‘GPO’ or ‘FDLP’ on the Internet Archive’s Archive-It page to get to the FDLP Web Archive collection. This is the most direct way to search for and access archived websites in the FDLP Web Archive collection. All archived content in the collections is full-text searchable.

Catalog of U.S. Government Publications (CGP)

Bibliographic records for the archived websites describe the sites and link to them via PURL (Persistent URL). The records are searchable and accessible through the Catalog of U.S. Government Publications (CGP) FDLP Web Archive page. A list of all FDLP Web Archive records is also available in the CGP.

Internet Archive’s Wayback Machine

FDLP Web Archive content is discoverable when a URL is searched in the Internet Archive’s Wayback Machine.
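
As a rough illustration of searching a URL in the Wayback Machine, the sketch below builds two lookup addresses using the Wayback Machine's public URL patterns: one that lists all captures of a site, and one that requests the capture nearest a given date. The function names are hypothetical, and cio.gov is used only because it is mentioned elsewhere on this page; whether a given capture belongs to the FDLP Web Archive collection is not something these URLs indicate.

```python
from datetime import datetime


def wayback_calendar_url(site_url: str) -> str:
    """Return the Wayback Machine address that lists all captures of site_url."""
    return "https://web.archive.org/web/*/" + site_url


def wayback_capture_url(site_url: str, when: datetime) -> str:
    """Return the address for the capture closest to `when`.

    The Wayback Machine expects a 14-digit timestamp (YYYYMMDDhhmmss) and
    redirects to the nearest capture it actually holds.
    """
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{timestamp}/{site_url}"


if __name__ == "__main__":
    print(wayback_calendar_url("cio.gov"))
    print(wayback_capture_url("cio.gov", datetime(2017, 1, 20)))
```

Opening either address in a browser shows the captures available for that site.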

Frequently Asked Questions

Archive-It uses a combination of crawling tools it has developed, including Heritrix and Umbra, to gather content. The crawler searches and captures an entire content-rich website, creating a working facsimile of the site as it appeared when it was crawled. This helps preserve the website content as it appeared at a particular point in time. After the first crawl, the website is re-crawled on a scheduled frequency. In that process, the crawler searches and captures the entire website again, creating a new working facsimile of the website as it appeared at the time of the re-crawl.

1. Determine if the website is in the scope of the FDLP.
2. Determine if Archive-It is the best tool for harvesting the website, based on the website’s archivability.
3. Notify the agency of intent to harvest data from their website, for new collections.
4. Review the website to create or edit a seed list of domains that direct the crawl.
5. Run a test crawl and then perform Quality Assurance (QA), looking for any out-of-scope or missing content.
6. Run any additional test crawls as needed to ensure the crawl will be effective and efficient. Multiple test crawls are typically run before a crawl is saved.
7. Save successful test crawl(s) and QA the saved crawl(s).
8. Run patch crawl(s).
9. Create a record for the website in the CGP, for new collections.

Steps 3-7 are repeated for each re-crawl of the website.
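
The QA check for out-of-scope content in the workflow above can be pictured with a small sketch. It assumes a seed list expressed as bare domains and a list of URLs captured by a test crawl, and flags any URL whose host is neither a seed domain nor one of its subdomains. The function names and sample URLs are hypothetical, and this is not Archive-It's actual scoping logic, which is configured within the service itself.

```python
from urllib.parse import urlsplit


def in_scope(url: str, seed_domains: list[str]) -> bool:
    """True if the URL's host is a seed domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == seed or host.endswith("." + seed) for seed in seed_domains)


def out_of_scope(captured_urls: list[str], seed_domains: list[str]) -> list[str]:
    """Return the captured URLs that fall outside every seed domain."""
    seeds = [seed.lower() for seed in seed_domains]
    return [url for url in captured_urls if not in_scope(url, seeds)]


if __name__ == "__main__":
    # Hypothetical test-crawl results for a seed list containing only cio.gov.
    captured = [
        "https://www.cio.gov/about/",
        "https://cio.gov/",
        "https://example.com/tracker.js",
        "https://notcio.gov/",
    ]
    for url in out_of_scope(captured, ["cio.gov"]):
        print("review:", url)
```

Matching on the full host name, rather than on a substring of the URL, keeps look-alike domains from being treated as part of the seed.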

Videos in simpler formats, such as WMV or MPEG4, can easily be captured and played back; results vary with more complex formats. Archive-It crawling technology can capture videos in other formats and on other platforms, such as Flash or Vimeo, but playback can vary depending on the complexity of the video’s make-up or how the video is embedded on a page. Archive-It is continuously working to improve video playback for all formats and platforms, and regular enhancements are made.

The initial collection development strategy was to harvest all websites in the Y3 class of the Superintendent of Documents (SuDocs) classification scheme, which includes commissions, committees, and independent agencies.

From there, the focus moved to a curated selection of non-standard Government sites, such as cio.gov. Topical collections were also introduced, the first being Federal Native American resources on the web. These have since been expanded to other topics of interest to the FDLP community, in collaboration with GPO Collection Development Librarians. Nominations also come from the FDLP community through askGPO.

To avoid duplication of effort, GPO refers to the Federal Web Archiving Interest Group for information about other existing or planned Federal Government web archive collections.

In an attempt to avoid duplicative effort, content found in GPO’s GovInfo is not harvested or archived, nor is anything already archived by other Archive-It partners, or anything already archived by our FDLP partners who are digitizing specific content from their FDLP collections (FDLP Partnerships). Nothing outside the scope of the FDLP is harvested.

Additionally, some websites, such as those with extensive databases or datasets, are difficult to capture and play back. In these instances, partnerships with the providing agencies are sought to ensure permanent public access to their web content.

In the beginning, the focus was on building the web archive. The focus then shifted to maintaining and enhancing what had been built. After a crawl is complete, the site is analyzed to determine a re-crawl frequency based on how often the site is updated: annual, biannual, or quarterly. Re-crawls are not run automatically; they follow a workflow very much like the one used for any new site. For every re-crawl, the site is fully analyzed to evaluate whether it has changed, whether the seed list needs to be updated, and whether any modifications are needed before the new crawls are run.
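
To make the frequency categories concrete, the sketch below computes when a site would next be due for review, given the date of its last crawl. The month intervals are assumptions: in particular, "biannual" is read here as twice a year, which the text above does not confirm. The names are hypothetical, and because re-crawls are not run automatically, the result is only a reminder date for starting the manual workflow.

```python
import calendar
from datetime import date

# Assumed intervals in months; "biannual" is taken to mean twice a year.
MONTHS_BETWEEN_CRAWLS = {"annual": 12, "biannual": 6, "quarterly": 3}


def add_months(start: date, months: int) -> date:
    """Add whole months to a date, clamping to the last day of short months."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def next_review_date(last_crawl: date, frequency: str) -> date:
    """Return the date a site is next due for re-crawl analysis."""
    return add_months(last_crawl, MONTHS_BETWEEN_CRAWLS[frequency])


if __name__ == "__main__":
    print(next_review_date(date(2024, 1, 31), "quarterly"))
```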

The main collection development practice is to archive content that would traditionally be included in the FDLP. As such, only content that is publicly available is sought. GPO never intends to harvest material that is copyrighted, proprietary, or contains PII. If you suspect that such content has been harvested, please contact us through askGPO and provide the relevant information, including the Wayback Machine URL. It will be reviewed for possible removal following Superintendent of Documents policy.

Yes. The websites are classified under the agency’s general publications category from the List of Classes, and then INTERNET is added to the end of the class. An archived website is assigned the regular item number that accompanies the general publications class for each agency.

For example, the SuDocs class for “NARAtions: the blog of the United States National Archives” is AE 1.102:INTERNET, and the Item Number is 0569-B-02 (online).
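
The classification rule above can be expressed as a short sketch: take the agency's general publications class and append INTERNET after a colon. The function name is hypothetical, and the input is assumed to be a class stem as it appears in the List of Classes, with or without a trailing colon.

```python
def web_archive_sudocs_class(general_publications_class: str) -> str:
    """Build the SuDocs class for an archived website.

    Takes the agency's general publications class stem (for example
    "AE 1.102") and appends ":INTERNET", as in the NARAtions example.
    """
    stem = general_publications_class.strip().rstrip(":")
    return f"{stem}:INTERNET"


if __name__ == "__main__":
    print(web_archive_sudocs_class("AE 1.102"))
```

The item number is not derived from the class; it is the regular item number that already accompanies the general publications class.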

The FDLP Web Archive is used for permanent access to entire Federal agency websites. The web archive was created using a variety of crawling technologies used by Archive-It. The harvested sites are then stored on Archive-It’s servers. The archived sites are full-text searchable and accessible through the Archive-It User Interface.

The CGP provides MARC bibliographic records. Records for digital format resources include PURLs, which are links to the digital content stored in repositories or hosted online. GPO began archiving or storing a copy of some web-based resources in 1998. Publications are saved to GPO’s Permanent server, and a PURL to that content is added to the bibliographic record. As needed, a variety of tools are used to capture monographs, serials, and some video and audio recordings.

GovInfo is a searchable content repository comprised of deposited content ingested by agreement with Federal agencies. GovInfo includes resources from all three branches of Government and includes most of the Congressional publications that are cataloged. GovInfo users can search within the repository across the full text and content metadata, as well as browse for content, view it, and download it.

Policies Related to the FDLP Web Archive

Harvesting Digital Federal Government Information Dissemination Products for GPO’s Superintendent of Documents Programs, SOD-PP-2016-5 (effective 12/19/2016)
Withdrawal of Federal information products from the National Collection of U.S. Government Public Information and GPO’s online U.S. Government Bookstore, SOD-PPS-8-2024 (effective 7/8/2024)

Training

Web Archiving for the FDLP (Video, 60 minutes, recorded in 2014)
Archiving & Cataloging Federal Agency Web Sites - GPO's Web Archiving Project (Video, 54 minutes, recorded in 2014)
A Time Machine for Federal Information - Using Web Archive content in government information reference work (Video, 59 minutes, recorded in 2017)
Tangible and Digital Preservation: Bridging the Divide by Preserving Government Information in All Formats (Video, 57 minutes, update on FDLP Web Archive begins 30 minutes in, slides are available, recorded in 2017)

Frequently Asked Questions

Why is GPO archiving Federal websites?

How does web archiving work with Archive-It?

Why Archive-It?

What is the workflow for harvesting and archiving using Archive-It?

Where is the harvested data stored?

Who owns the harvested data?

What file format is used?

How is the FDLP Web Archive data backed-up?

Can Archive-It capture and playback video?

What Federal websites are in the FDLP Web Archive?

Are there any Federal websites GPO will not archive?

Can I recommend a Federal website to be archived?

After a website is harvested and archived, how frequently is it re-crawled for new content?

When websites are harvested for the FDLP Web Archive, does copyrighted, non-government, or other extraneous material ever get captured?

How does GPO handle copyrighted, proprietary, or Personally Identifiable Information (PII) in an archived website?

At what level does GPO catalog the websites?

If I think that a harvested website needs more granular cataloging, can I suggest that?

Does GPO update the catalog/bibliographic record every time a website is re-crawled?

Are websites in the FDLP Web Archive given SuDocs classes and item numbers?

How can libraries obtain the bibliographic records in MARC format for the websites in the FDLP Web Archive?

Why are records for the FDLP Web Archive cataloged?

What do Archive-It’s error messages mean?

What’s the difference between the FDLP Web Archive and GPO’s other archiving and cataloging tools?
