The Federal Depository Library Program (FDLP) Web Archive comprises a collection of harvested websites in which each website itself is a “collection” (in the parlance of the host repository, Archive-It). The content of a website is harvested by a tool called a “crawler,” and organized into “seeds.” Each seed is a URL that serves as the crawler’s entry point to the website, and the crawler follows links from the seed page to subsequent pages, saving each unique page it comes across (within the scope of the crawl).
The main seed for a collection is the website’s primary domain (for example, gpo.gov). Additional seeds in a collection include subdomains (a URL with an additional word added to the left of the primary domain, for example, bensguide.gpo.gov), and web resources outside of the primary domain, such as social media. Each collection has a page in the FDLP Web Archive listing all of the seeds crawled, known as a “seed list.”
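The distinction between the primary domain, its subdomains, and outside resources can be illustrated with a short sketch. This is not part of any GPO or Archive-It tooling; the function name `classify_seed`, its return labels, and the choice to treat a leading `www.` as the primary domain itself are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def classify_seed(seed_url: str, primary_domain: str) -> str:
    """Return 'primary', 'subdomain', or 'external' for a seed URL.

    Hypothetical helper: the labels are illustrative, not Archive-It terms.
    """
    # urlsplit only finds a hostname when a scheme or '//' is present,
    # so prepend '//' to bare inputs such as 'gpo.gov'.
    if "//" not in seed_url:
        seed_url = "//" + seed_url
    host = (urlsplit(seed_url).hostname or "").lower()
    primary = primary_domain.lower()

    # Assumption: a leading 'www.' is treated as the primary domain itself.
    if host.startswith("www."):
        host = host[4:]

    if host == primary:
        return "primary"
    # A subdomain has one or more labels to the left of the primary domain.
    if host.endswith("." + primary):
        return "subdomain"
    # Anything else (for example, a social media account) is outside the domain.
    return "external"
```

Under these assumptions, `classify_seed("https://bensguide.gpo.gov/", "gpo.gov")` is classed as a subdomain, while a social media URL for the same agency is classed as external.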
Some seeds are additionally cataloged as “subcollections.” A subcollection is a part of a website that is a standalone web resource in itself and warrants its own record. For example, 911.gov is a subcollection from the National Highway Traffic Safety Administration (nhtsa.gov) collection. Decisions to create a subcollection catalog record are approved by the Archives Specialists and Collection Development Librarian. Subcollection records are distinct from Web Archive analytic records (records for resources that have been harvested as part of the FDLP Web Archive); documentation for Web Archive analytic record cataloging is in process.
All FDLP Web Archive collections and selected subcollections are described in Catalog of U.S. Government Publications (CGP) and OCLC database records with full-level, collection-level metadata that conforms to the MARC 21 structural format and the RDA content standard.
Features of FDLP Web Archive collection records include:
- PURLs that link to:
  - index pages of harvested versions of websites by archived dates
  - lists of seeds
- Abstracts of contents
- Notes indicating provenance of archived sites
- Linking entries to records for current versions
- Superintendent of Documents classification numbers
Subcollection records omit the PURL that links to the seed list.
Like other updating websites, U.S. Government websites are cataloged as integrating resources because they are published over time and changed by seamless integration of additional and edited content. Despite the presence of capture dates for each harvest, which could be interpreted as chronological designations suggesting serial treatment, GPO recognizes the archived versions as representative of seamlessly integrated content; hence their description as integrating resources.
For metadata descriptions, LSCM pursued MARC records via our largest consortial catalog, OCLC, because OCLC provides:
- An efficient mechanism for production of bibliographic records, because it is part of the existing cataloging workflow in LTS
- Continuity for records’ distribution to Federal Depository Libraries
- MARC 21 data fields that accommodate multiple customized data elements that are provided for these collections
Note that some Archive-It collections are cataloged as part of the FDLP General Electronic Collection, but are not harvested by GPO and thus not in the FDLP Web Archive. These collection records follow the MARC field guidelines below, with some exceptions noted where applicable. Examples include the following CGP records:
For more information
- FDLP Web Archive Project documentation: https://www.fdlp.gov/project-list/web-archiving
- FDLP Web Archive in CGP: https://catalog.gpo.gov/F/?func=direct&doc_number=000931763&local_base=GPO01PUB
- Collection and subcollection records in CGP: https://catalog.gpo.gov/F/?func=file&file_name=find-webarch&local_base=WEBARCH
- FDLP Web Archive Archive-It repository: http://purl.fdlp.gov/GPO/gpo50155
- Archive-It homepage: https://archive-it.org/
- FDLP Web Archive analytic records in CGP: https://catalog.gpo.gov/F/?func=find-c&ccl_term=wlts%3D+%28+waanalytic
This chapter covers only MARC fields where cataloging practice varies from general Integrating Resource cataloging; please see the GPO Cataloging Guidelines on Integrating Resources for more information.
Web Archive records cannot be coded as PCC. Code ELvl as blank and Srce as d. Do not include an 042 field.
040 – Cataloging Source
Web Archive records are not coded for Provider-neutral status (PN).
074/086 – Item Number/Government Document Classification Number
Web Archive resources are typically classed with the General Publications class of the issuing agency/bureau, with the word INTERNET as the Book number value. Follow general SuDoc classification practices for inserting slash numbers when there are multiple collections within the same bureau.
Subcollections are generally given the same classification as the parent collection, with a slash number.
130/240 – Main Entry/Uniform title
Always provide either a 130 or 240 uniform title to distinguish the archived version from the current, live version of the site, whether or not an OCLC record exists, whether or not the AAPs conflict, and whether or not it is held by GPO. Use the qualifier (Archived version). See also instructions for MARC field 775, Other Edition Entry.
If the archived website has the same AAP as another record that describes a resource other than the live website (e.g., a textual monograph with the same title as the website), then add another additional identifying element, generally (Website : Archived version).
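The qualifier rule above can be sketched as a small function. This is an illustration only: `archived_title` and its `conflicts_with_non_website` parameter are hypothetical names, and the title used in the usage note is a placeholder rather than a real record.

```python
def archived_title(title: str, conflicts_with_non_website: bool = False) -> str:
    """Append the archived-version qualifier to a preferred title.

    Hypothetical helper. Set conflicts_with_non_website to True when the
    access point would otherwise match a record for a different kind of
    resource (for example, a textual monograph with the same title).
    """
    if conflicts_with_non_website:
        qualifier = "Website : Archived version"
    else:
        qualifier = "Archived version"
    return f"{title} ({qualifier})"
```

For a placeholder title, `archived_title("Example agency website")` yields `Example agency website (Archived version)`, and passing `conflicts_with_non_website=True` yields `Example agency website (Website : Archived version)`.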
246 – Varying Form of Title
If recording a former variant title in 246, include the corresponding Archive-It capture date in subfield ǂf. Optionally add a subfield ǂi to describe the title. If changes are numerous, you may make a general note instead of retaining the previous variant titles.
247 – Former Title
If recording a former title found in an earlier FDLP Web Archive capture, record the Archive-It capture date in subfield ǂf.
250 – Edition Statement
Supply the edition statement in brackets: [Archived version].
264 – Publication/Distribution Statement
Record a publication statement as usual. When recording multiple publication statements, add to each 264 field a subfield ǂ3 with the corresponding capture date in angle brackets.
Also record a distribution statement; the starting year should be the year of the first harvest. Do not include a distribution statement in records for Archive-It collections that are not part of the FDLP Web Archive; GPO self-identifies as distributor only for documents that it sells or curates.
336 – Content Type
Code for predominant content types, not necessarily all content types. For content type terms, see RDA 6.9.1.3.
If videos number fewer than ten (10), do not record a content type for video (two-dimensional moving image). Ensure that the captured videos can be played; if they cannot, do not record this content type. In addition to videos on the main website seed, potential YouTube seeds can be found in the collection's seed list (see MARC field 856 – Electronic Location & Access).
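The two exclusions for video content can be expressed as a simple check. This is a sketch only: `video_type_allowed` is a hypothetical name, and a `True` result means only that neither exclusion applies; the cataloger still decides whether video is a predominant content type.

```python
def video_type_allowed(video_count: int, videos_playable: bool) -> bool:
    """Return True when neither exclusion for video content applies.

    Hypothetical helper. It does not decide predominance; it only checks
    the count threshold and the playability condition.
    """
    # Exclusion 1: fewer than ten videos.
    if video_count < 10:
        return False
    # Exclusion 2: the captured videos cannot be played.
    if not videos_playable:
        return False
    return True
```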
500 – General Notes
Include the below notes. Do not include for Archive-It collections not harvested by GPO.
520 – Abstract
Describe contents generally and specifically, including the intended audience where relevant. Do not limit to a general description of the corporate body's history or functions.
If the scope of the website changes, revise the abstract to account for changes to the website. A brief explanation of the agency or title changes may be necessary. Keep in mind that the collection will contain captures of both the old and new website content.
588 – Source of Description Note
The Description Based On date should be the latest harvest date shown on the Archive-It index page of captures at the time of cataloging.
If no statement for “page last modified” or “last updated” or similar exists on the latest captured site, then omit this part of the note.
710 – Added Entry – Corporate Body
Include responsible corporate bodies not recorded in 110.
For Archive-It sites harvested by GPO, also include added entries for the FDLP as the collector and GPO as distributor.
Do not include these fields for Archive-It sites not harvested by GPO. Consider including an added entry for another agency collector, if the agency is not the same as the author and/or issuing body.
773 – Host Item Entry
For Archive-It sites harvested by GPO, include the below Host Item Entry field. Do not include for collections not harvested by GPO.
775 – Other Edition Entry
If an OCLC record for the current version of the website exists, link to it in a 775 field, regardless of whether it is held by GPO.
When updating an existing Web Archive record with a 775 field linking to the current version of the website, ensure that the content of the 775 field is accurate.
- If the linked record has not been updated, and it is a GPO record, update the linked record as necessary, as well as the 775 text in this record.
- If the linked record is not a GPO record, you may choose to update the linked record; or to leave the linked record as-is, still retaining the 775 in this record.
856 – Electronic Location & Access
Collections and subcollections will have a PURL pointing to the Wayback URL for the index page of harvested dates for the primary seed. Sample structure of index page target URL: https://wayback.archive-it.org/4225/*/http://www.ncua.gov/
Collections, but not subcollections, will have an additional PURL pointing to the collection page seed list. Sample structure of seed list target URL: https://archive-it.org/collections/4225
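The two target URL patterns shown above can be sketched as follows. The function names are hypothetical, and the patterns are inferred solely from the sample URLs in this section; these are the targets that the PURLs point to, not the PURLs themselves.

```python
def index_page_url(collection_id: int, seed_url: str) -> str:
    """Build the Wayback index page URL listing all capture dates for a seed.

    Hypothetical helper; the pattern follows the sample index page URL above.
    """
    # The '*' segment requests the index of all captures for the seed.
    return f"https://wayback.archive-it.org/{collection_id}/*/{seed_url}"


def seed_list_url(collection_id: int) -> str:
    """Build the Archive-It collection page URL, which lists the seeds.

    Hypothetical helper; the pattern follows the sample seed list URL above.
    """
    return f"https://archive-it.org/collections/{collection_id}"
```

With the sample values from this section, `index_page_url(4225, "http://www.ncua.gov/")` reproduces the sample index page URL, and `seed_list_url(4225)` reproduces the sample seed list URL.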
Also include a historic URL field for the primary seed URL at time of first capture.
If the primary seed URL for the resource changes, create a PURL to the new index page. In both the old and new index page PURL fields, add a subfield ǂ3 giving the dates of coverage. Note that there is no need to alter the seed list PURL.
Also add a new 856 field for the URL at time of capture. Indicate the date of the change in angle brackets in subfield ǂz in the “URL at time of capture” fields.
If the archived version lacks any of the functionality of the live version, make a note in subfield ǂz of the 856 field for the index page PURL. Sample notes have been developed in consultation with staff members from the Office of Archival Management; please consult with them for any additional notes that may be required.
955 – Local Note
Include a WEBARCH local note, with the date of initial export, for all collection and subcollection level records. Do not include for Archive-It collections not harvested by GPO.