Overview
The Federal Depository Library Program (FDLP) Web Archive comprises a collection of harvested websites in which each website itself is a “collection” (in the parlance of the host repository, Archive-It). The content of a website is harvested by a tool called a “crawler,” and organized into “seeds.” Each seed is a URL that serves as the crawler’s entry point to the website, and the crawler follows links from the seed page to subsequent pages, saving each unique page it comes across (within the scope of the crawl).
The main seed for a collection is the website’s primary domain (for example, gpo.gov). Additional seeds in a collection include subdomains (a URL with an additional word added to the left of the primary domain, for example, bensguide.gpo.gov), and web resources outside of the primary domain, such as social media. Each collection has a page in the FDLP Web Archive listing all of the seeds crawled, known as a “seed list.”
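The distinction between a primary domain and a subdomain seed described above can be sketched in a few lines of Python. This is only an illustration: the function name `is_subdomain` is hypothetical and is not part of any GPO or Archive-It tool, and the check is a simple hostname comparison.

```python
def is_subdomain(host: str, primary_domain: str) -> bool:
    """Return True when host has one or more labels to the left of primary_domain.

    Hypothetical helper for illustration; compares hostnames only.
    """
    host = host.lower().rstrip(".")
    primary = primary_domain.lower().rstrip(".")
    # A subdomain ends with ".<primary domain>" and is not the primary domain itself.
    return host != primary and host.endswith("." + primary)


print(is_subdomain("bensguide.gpo.gov", "gpo.gov"))
print(is_subdomain("gpo.gov", "gpo.gov"))
```

Note that a purely mechanical check like this would also treat a host such as www.gpo.gov as a subdomain; whether a given URL is added as a separate seed remains a collection decision.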
Some seeds are additionally cataloged as “subcollections.” A subcollection is a part of a website that is a standalone web resource in itself and warrants its own record. For example, 911.gov is a subcollection from the National Highway Traffic Safety Administration (nhtsa.gov) collection. Decisions to create a subcollection catalog record are approved by the Archives Specialists and Collection Development Librarian. Subcollection records are distinct from Web Archive analytic records (records for resources that have been harvested as part of the FDLP Web Archive); documentation for Web Archive analytic record cataloging is in process.
All FDLP Web Archive collections and selected subcollections are described in the Catalog of U.S. Government Publications (CGP) and OCLC database records by full-level, collection-level metadata conformant with the MARC 21 structural format and the RDA content standard.
Features of FDLP Web Archive collection records include:
- PURLs that link to:
  - index pages of harvested versions of websites by archived dates
  - lists of seeds
- Abstracts of contents
- Notes indicating provenance of archived sites
- Linking entries to records for current versions
- Superintendent of Documents classification numbers
Subcollection records omit the PURL that links to the seed list.
Like other updating websites, U.S. Government websites are cataloged as integrating resources because they are published over time and changed by seamless integration of additional and edited content. Despite the presence of capture dates for each harvest, which could be interpreted as chronological designations suggesting serial treatment, GPO recognizes the archived versions as representative of seamlessly integrated content; hence their description as integrating resources.
For metadata descriptions, LSCM pursued MARC records via our largest consortial catalog, OCLC, because OCLC provides:
- An efficient mechanism for production of bibliographic records, because it is part of the existing cataloging workflow in LTS
- Continuity for records’ distribution to Federal Depository Libraries
- MARC 21 data fields that accommodate multiple customized data elements that are provided for these collections
Note that some Archive-It collections are cataloged as part of the FDLP General Electronic Collection, but are not harvested by GPO and thus not in the FDLP Web Archive. These collection records follow the MARC field guidelines below, with some exceptions noted where applicable. Examples include the following CGP records:
For more information
- FDLP Web Archive Project documentation: https://www.fdlp.gov/project-list/web-archiving
- FDLP Web Archive in CGP: https://catalog.gpo.gov/F/?func=direct&doc_number=000931763&local_base=GPO01PUB
- Collection and subcollection records in CGP: https://catalog.gpo.gov/F/?func=file&file_name=find-webarch&local_base=WEBARCH
- FDLP Web Archive Archive-It repository: http://purl.fdlp.gov/GPO/gpo50155
- Archive-It homepage: https://archive-it.org/
- FDLP Web Archive analytic records in CGP: https://catalog.gpo.gov/F/?func=find-c&ccl_term=wlts%3D+%28+waanalytic
MARC Fields
This chapter covers only MARC fields where cataloging practice varies from general Integrating Resource cataloging; please see the GPO Cataloging Guidelines on Integrating Resources for more information.
ELvl/Srce
Web Archive records cannot be coded as PCC. Code ELvl as blank and Srce as d. Do not include an 042 field.
040 – Cataloging Source
Web Archive records are not coded for Provider-neutral status (PN).
074/086 – Item Number/Government Document Classification Number
Web Archive resources are typically classed with the General Publications class of the issuing agency/bureau, with the word INTERNET as the Book number value. Follow general SuDoc classification practices for inserting slash numbers when there are multiple collections within the same bureau.
Subcollections are generally given the same classification as the parent collection, with a slash number.
130/240 – Main Entry/Uniform title
Always provide either a 130 or 240 uniform title to distinguish the archived version from the current, live version of the site, whether or not an OCLC record exists, whether or not the AAPs conflict, and whether or not it is held by GPO. Use the qualifier (Archived version). See also instructions for MARC field 775, Other Edition Entry.
If the archived website has the same AAP as another record that describes a resource other than the live website (e.g., a textual monograph with the same title as the website), then add an additional identifying element, generally (Website : Archived version).
Already in database
Naval History and Heritage Command, General publication (CGP 001022134)
110 2_ ǂa Naval History & Heritage Command (U.S.), ǂe author.
245 10 ǂa Naval History and Heritage Command.
Newly cataloged
Naval History and Heritage Command, Web Archive collection (CGP 001107090)
110 2_ ǂa Naval History & Heritage Command (U.S.), ǂe author.
240 10 ǂa Naval History and Heritage Command (Website : Archived version)
245 10 ǂa Naval History and Heritage Command.
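The qualifier rule above can be sketched as a small Python helper. The function name `archived_uniform_title` and its `conflicts_with_other_resource` parameter are hypothetical, introduced only for illustration. The guidance says the longer qualifier is "generally" used in a conflict, so the result should be treated as a default rather than a fixed rule.

```python
def archived_uniform_title(title: str, conflicts_with_other_resource: bool = False) -> str:
    """Append the archived-version qualifier to a title (hypothetical helper).

    conflicts_with_other_resource: True when the AAP matches a record for a
    resource other than the live website, such as a textual monograph.
    """
    if conflicts_with_other_resource:
        qualifier = "Website : Archived version"
    else:
        qualifier = "Archived version"
    return f"{title} ({qualifier})"


# Reproduces the 240 title string from the Naval History and Heritage Command example.
print(archived_uniform_title("Naval History and Heritage Command", True))
```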
246 – Varying Form of Title
If recording a former variant title in 246, include the corresponding Archive-It capture date in subfield ǂf. Optionally add a subfield ǂi to describe the title. If changes are numerous, you may make a general note instead of retaining the previous variant titles.
U.S. International Development Finance Corporation (CGP 000938685)
245 10 U.S. International Development Finance Corporation.
246 1 DFC
246 1 OPIC ǂf <October 27, 2014>
The Networking and Information Technology Research and Development Program (CGP 001086951)
245 14 The Networking and Information Technology Research and Development Program.
246 1 ǂi Also known as: ǂa NITRD
247 – Former Title
If recording a former title found in an earlier FDLP Web Archive capture, record the Archive-It capture date in subfield ǂf.
U.S. International Development Finance Corporation (CGP 000938685)
245 10 U.S. International Development Finance Corporation.
247 10 Overseas Private Investment Corporation ǂf <October 27, 2014>
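As a minimal sketch, the capture-date subfield shown in the 246 and 247 examples could be formatted as follows. The helper name `capture_date_subfield` is hypothetical. The "Month D, YYYY" form in angle brackets is inferred from the examples in this chapter, not from a stated formatting rule.

```python
from datetime import date

# Explicit English month names avoid locale-dependent formatting.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def capture_date_subfield(capture: date, code: str = "f") -> str:
    """Format an Archive-It capture date as a subfield in angle brackets (hypothetical helper)."""
    return f"ǂ{code} <{MONTHS[capture.month - 1]} {capture.day}, {capture.year}>"


print("247 10 Overseas Private Investment Corporation " + capture_date_subfield(date(2014, 10, 27)))
```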
250 – Edition Statement
Supply the edition statement in brackets: [Archived version].
LongTermCare.gov (CGP 001082513)
250 ǂa [Archived version].
264 – Publication/Distribution Statement
Record a publication statement as usual. When recording multiple publication statements, add to each 264 field a subfield ǂ3 with the corresponding capture date in angle brackets.
U.S. International Development Finance Corporation (CGP 000938685)
264 1 ǂ3 <October 27, 2014>: ǂa [Washington, D.C.] : ǂb Overseas Private Investment Corporation
264 31 ǂ3 <January 31, 2020>: ǂa Washington, DC : ǂb U.S. International Development Finance Corporation
Also record a distribution statement; the starting year should be the year of the first harvest. Do not include a distribution statement in records for Archive-It collections that are not part of the FDLP Web Archive; GPO self-identifies as distributor only for documents that it sells or curates.
LongTermCare.gov (CGP 001082513)
264 1 [Washington, D.C.] : ǂb U.S. Department of Health and Human Services, Administration on Aging
264 2 [Washington, D.C.] : ǂb Government Publishing Office, ǂc 2018-
336 – Content Type
Code for predominant content types, not necessarily all content types. For content type terms, see RDA 6.9.1.3.
If the site contains fewer than ten (10) videos, do not record the content type for video. Ensure that captured videos can be played; if they cannot, do not record the content type for video. In addition to videos on the main website seed, potential YouTube seeds can be found in the collection's seed list (see MARC field 856 – Electronic Location & Access).
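The two video conditions above amount to a simple check, sketched here in Python. The function name `video_content_type_allowed` is hypothetical. A True result means only that these two conditions do not rule out the content type; the cataloger still judges whether video is a predominant content type.

```python
def video_content_type_allowed(video_count: int, videos_playable: bool) -> bool:
    """Return False when the video rules exclude the content type (hypothetical helper).

    The threshold of ten comes from the guidance: fewer than ten videos,
    or captured videos that cannot be played, means do not record it.
    """
    return video_count >= 10 and videos_playable


print(video_content_type_allowed(12, True))
print(video_content_type_allowed(9, True))
```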
500 – General Notes
Include the notes below. Do not include them for Archive-It collections not harvested by GPO.
LongTermCare.gov (CGP 001082513)
500 The content is made available by the U.S. Government Publishing Office in accordance with Title 44 of the US Code.
500 Digital collection: Federal Depository Library Program Web Archive.
520 – Abstract
Describe contents generally and specifically, including the intended audience where relevant. Do not limit to a general description of the corporate body's history or functions.
If the scope of the website changes, revise the abstract to account for changes to the website. A brief explanation of the agency or title changes may be necessary. Keep in mind that the collection will contain captures of both the old and new website content.
U.S. International Development Finance Corporation (CGP 000938685)
520 The U.S. International Development Finance Corporation (DFC) was formed in 2019 and combines the functions of the Overseas Private Investment Corporation (OPIC) and USAID's Development Credit Authority. It is the U.S. government's development finance institution, mobilizing private capital to help solve development issues. The DFC website features details on current projects, including an interactive world map. In addition there is extensive media and information on becoming an applicant. The FDLP Web Archive collection contains harvested webpages from both the DFC website and the previous iteration, OPIC.gov.
For further examples, see the CGP records for Farm Credit System Insurance Corporation (CGP 000918898), Distraction.gov (CGP 000919309), and ChooseMyPlate.gov (CGP 000950703).
588 – Source of Description Note
The Description Based On date should be the latest harvest date shown on the Archive-It index page of captures at the time of cataloging.
National Human Genome Research Institute (CGP 001080506)
588 Description based on: archived web page captured August 20, 2018 (page last modified February 16, 2018); title from resource home page (viewed September 19, 2018).
If no statement for “page last modified” or “last updated” or similar exists on the latest captured site, then omit this part of the note.
Oak Ridge National Laboratory (CGP 001102427)
588 Description based on: archived web page captured June 6, 2019; title from resource home page (viewed July 15, 2019).
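The two 588 patterns above (with and without a "page last modified" statement) can be sketched as one Python function. The name `source_of_description` and its parameters are hypothetical. The wording of the note is copied from the examples in this section.

```python
from datetime import date
from typing import Optional

# Explicit English month names avoid locale-dependent formatting.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def long_date(d: date) -> str:
    """Format a date as 'Month D, YYYY'."""
    return f"{MONTHS[d.month - 1]} {d.day}, {d.year}"


def source_of_description(captured: date, viewed: date,
                          last_modified: Optional[date] = None) -> str:
    """Build the 588 note text (hypothetical helper).

    The 'page last modified' clause is omitted when the captured site
    carries no such statement.
    """
    modified = f" (page last modified {long_date(last_modified)})" if last_modified else ""
    return (
        f"Description based on: archived web page captured {long_date(captured)}"
        f"{modified}; title from resource home page (viewed {long_date(viewed)})."
    )


print(source_of_description(date(2019, 6, 6), date(2019, 7, 15)))
```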
710 – Added Entry – Corporate Body
Include responsible corporate bodies not recorded in 110.
For Archive-It sites harvested by GPO, also include added entries for the FDLP as the collector and GPO as distributor.
GlobalChange.gov (CGP 001000554)
710 2 U.S. Global Change Research Program, ǂe issuing body.
710 2 Federal Depository Library Program, ǂe collector.
710 1 United States. ǂb Government Publishing Office, ǂe distributor.
Do not include these fields for Archive-It sites not harvested by GPO. Consider including an added entry for another agency collector, if the agency is not the same as the author and/or issuing body.
773 – Host Item Entry
For Archive-It sites harvested by GPO, include the Host Item Entry field below. Do not include it for collections not harvested by GPO.
GlobalChange.gov (CGP 001000554)
773 0 ǂt Federal Depository Library Program Web Archive ǂw (OCoLC)883856932
775 – Other Edition Entry
If an OCLC record for the current version of the website exists, link to it in a 775 field, regardless of whether it is held by GPO.
Oak Ridge National Laboratory (CGP 001102427)
775 08 ǂi Current version: ǂt ORNL ǂw (OCoLC)34403189
When updating an existing Web Archive record with a 775 field linking to the current version of the website, ensure that the content of the 775 field is accurate.
- If the linked record has not been updated, and it is a GPO record, update the linked record as necessary, as well as the 775 text in this record.
- If the linked record is not a GPO record, you may choose to update the linked record; or to leave the linked record as-is, still retaining the 775 in this record.
856 – Electronic Location & Access
Collections and subcollections will have a PURL pointing to the Wayback URL for the index page of harvested dates for the primary seed. Sample structure of index page target URL: https://wayback.archive-it.org/4225/*/http://www.ncua.gov/
Collections, but not subcollections, will have an additional PURL pointing to the collection page seed list. Sample structure of seed list target URL: https://archive-it.org/collections/4225
Also include a historic URL field for the primary seed URL at time of first capture.
Collection
National Credit Union Administration (CGP 000923496)
856 40 ǂz Home page by archived dates ǂu https://purl.fdlp.gov/GPO/gpo47435
856 40 ǂz Collection page seed list ǂu https://purl.fdlp.gov/GPO/gpo76745
856 4 ǂz URL at time of capture ǂu http://www.ncua.gov/
Subcollection
MyCreditUnion.gov (CGP 000967186)
856 40 ǂz Home page by archived dates ǂu https://purl.fdlp.gov/GPO/gpo63623
856 4 ǂz URL at time of capture ǂu http://www.mycreditunion.gov/
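The sample target URL structures given above for the index page and the seed list can be sketched as two small Python functions. The function names are hypothetical, and the collection identifier 4225 is simply the value that appears in the sample URLs. The PURLs themselves are assigned separately and are not derived by this code.

```python
def wayback_index_url(collection_id: int, seed_url: str) -> str:
    """Return the Wayback index page URL listing capture dates for a seed (hypothetical helper)."""
    return f"https://wayback.archive-it.org/{collection_id}/*/{seed_url}"


def seed_list_url(collection_id: int) -> str:
    """Return the Archive-It collection page URL that carries the seed list (hypothetical helper)."""
    return f"https://archive-it.org/collections/{collection_id}"


print(wayback_index_url(4225, "http://www.ncua.gov/"))
print(seed_list_url(4225))
```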
If the primary seed URL for the resource changes, create a PURL to the new index page. In both the old and new index page PURL fields, add a subfield ǂ3 giving the dates of coverage. Note that there is no need to alter the seed list PURL.
Also add a new 856 field for the URL at time of capture. Indicate the date of the change in angle brackets in subfield ǂz in the “URL at time of capture” fields.
U.S. International Development Finance Corporation (CGP 000938685)
856 40 ǂz Home page by archived dates ǂ3 2014-2019 ǂu http://purl.fdlp.gov/GPO/gpo53359
856 40 ǂz Home page by archived dates ǂ3 2020- ǂu https://purl.fdlp.gov/GPO/gpo132765
856 40 ǂz Collection page seed list ǂu http://purl.fdlp.gov/GPO/gpo76760
856 4 ǂz URL at time of capture, <-2019> ǂu http://www.opic.gov
856 4 ǂz URL at time of capture, <2020-> ǂu http://www.dfc.gov
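When a primary seed URL changes, the paired 856 fields follow a regular pattern, sketched below in Python. The helper names are hypothetical. The field layout, including the open-ended year ranges, is copied from the U.S. International Development Finance Corporation example rather than from a formal specification.

```python
from typing import Optional


def coverage(start: Optional[int] = None, end: Optional[int] = None) -> str:
    """Format a year range such as '2014-2019', '2020-' or '-2019'."""
    left = "" if start is None else str(start)
    right = "" if end is None else str(end)
    return f"{left}-{right}"


def index_page_field(purl: str, start: Optional[int] = None,
                     end: Optional[int] = None) -> str:
    """856 field for an index page PURL, with dates of coverage in subfield ǂ3."""
    return f"856 40 ǂz Home page by archived dates ǂ3 {coverage(start, end)} ǂu {purl}"


def capture_url_field(url: str, start: Optional[int] = None,
                      end: Optional[int] = None) -> str:
    """856 field for a historic URL, with the date of change in angle brackets in ǂz."""
    return f"856 4 ǂz URL at time of capture, <{coverage(start, end)}> ǂu {url}"


print(index_page_field("http://purl.fdlp.gov/GPO/gpo53359", 2014, 2019))
print(capture_url_field("http://www.opic.gov", end=2019))
print(capture_url_field("http://www.dfc.gov", start=2020))
```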
If the archived version lacks any of the functionality of the live version, make a note in subfield ǂz of the 856 field for the index page PURL. Sample notes have been developed in consultation with staff members from the Office of Archival Management; please consult with them for any additional notes that may be required.
Child welfare information gateway (CGP 000961482)
856 40 ǂz Home page by archived dates ǂu http://purl.fdlp.gov/GPO/gpo61833 ǂz Some views may be obscured by embedded survey form; efforts underway for improved capture and playback of the site.
African Development Foundation (CGP 000965849)
856 40 ǂz Home page by archived dates ǂu http://purl.fdlp.gov/GPO/gpo63244 ǂz If viewing this content is difficult, scroll down and click on “Low Bandwidth.”
Oak Ridge National Laboratory (CGP 001102427)
856 40 ǂz Home page by archived dates ǂu https://purl.fdlp.gov/GPO/gpo122996 ǂz Some datasets and dynamic content may not be viewable in archived version.
955 – Local Note
Include a WEBARCH local note, with the date of initial export, for all collection-level and subcollection-level records. Do not include it for Archive-It collections not harvested by GPO.
Peace Corps (CGP 000936041)
955 ǂa WEBARCH 20141001
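A minimal Python sketch of the local note follows. The function name `webarch_local_note` is hypothetical, and the YYYYMMDD date form is an assumption inferred from the Peace Corps example above, not from a stated rule.

```python
from datetime import date


def webarch_local_note(export_date: date) -> str:
    """Build the 955 WEBARCH note from the date of initial export (hypothetical helper).

    Assumes the date is written as YYYYMMDD, as in the Peace Corps example.
    """
    return f"955 ǂa WEBARCH {export_date:%Y%m%d}"


print(webarch_local_note(date(2014, 10, 1)))
```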