WARCing up the wrong tree - Part I
Web archiving can be hard. Luckily, these projects exist -
- https://github.com/webrecorder/pywb (a complete web archive replay and recording solution)
- https://github.com/webrecorder/warcio (a library to write/read WARC files)
- https://github.com/chfoo/warcat (a library for handling WARC files)
- https://github.com/machawk1/warcreate (a Chrome extension for capturing webpages)
Code snippets
A couple of code snippets that I’ve put together which I’ll probably refer back to.
Searching WARCs with warcio
The following script uses warcio to read WARC files. A list of URLs is supplied via a .txt file, and the script determines whether each URL exists in the collection - essentially a way to cross-reference what should be in the archive against what actually is. The script would be even more useful if the .txt file were replaced with an XML sitemap, which is easily done.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Searches a collection of WARCs for URLs supplied via .txt file"""
import os
import time

from warcio.archiveiterator import ArchiveIterator

# A .txt file containing URLs used for the search
SEARCH_FILE_PATH = './EXAMPLE.txt'
# Directory containing WARC files
WARC_COLL_DIR = '/home/craiglmccarthy/Documents/Code/pywb/collections/EXAMPLE/archive/'

# Load the .txt file and build list
with open(SEARCH_FILE_PATH, 'r') as f:
    lines_file = [line.strip() for line in f]
print('Number of URLs to search for:', len(lines_file))
time.sleep(3)

# Use a set to remove duplicate URLs across WARC files
warc_set = set()
warc_files = os.listdir(WARC_COLL_DIR)
warc_files_len = len(warc_files)

# Open all WARC files and read
for index, filename in enumerate(warc_files):
    warc_path = os.path.join(WARC_COLL_DIR, filename)
    print('Loading WARCs from: ' + filename + '...' +
          str(index + 1) + '/' + str(warc_files_len))
    # Load WARC and iterate through the records
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                uri = record.rec_headers.get_header('WARC-Target-URI')
                warc_set.add(uri)

print('Searching for URLs...')

# Loop through URL list to search the WARC collection
url_in_warc = []
url_not_in_warc = []
for i in lines_file:
    if i in warc_set:
        url_in_warc.append(i)
    else:
        url_not_in_warc.append(i)

print('--------------------')
print('Number of URLs searched for:', len(lines_file))
print(len(warc_set), 'URLs in the WARC collection')
print('Number of URLs FOUND from list in WARC collection:', len(url_in_warc))
print('Number of URLs MISSING from list in WARC collection:', len(url_not_in_warc))
print('URLs missing:', url_not_in_warc)
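As mentioned above, swapping the .txt file for an XML sitemap is straightforward. A minimal sketch using only the standard library (the function name `urls_from_sitemap` is my own, not from the script above):

```python
import xml.etree.ElementTree as ET


def urls_from_sitemap(path):
    """Extract every <loc> URL from a sitemaps.org-style sitemap."""
    # Sitemaps declare the sitemaps.org namespace, so it must be
    # given explicitly when searching for elements
    ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    tree = ET.parse(path)
    return [loc.text.strip()
            for loc in tree.getroot().findall('sm:url/sm:loc', ns)]
```

The returned list can then replace `lines_file` in the script above, with no other changes needed.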
Browser automation for pywb
Pywb records webpages through the browser. Because it actually loads the DOM, it outputs high-quality WARCs, but the capturing process can be very slow. The following snippet can open hundreds of tabs at once when provided with a list of URLs, and pywb then records all of the incoming HTTP traffic. It's a little unorthodox, but it works.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Loops through .txt file to open URLs in a browser - to be used in
conjunction with pywb in recording mode"""
import time
import webbrowser

# Collection name used in pywb
COLLECTION = 'EXAMPLE'
FILE_PATH = '/home/craiglmccarthy/EXAMPLE.txt'
START = 0
END = 250  # None is until the end

# Open file and add URLs to list
with open(FILE_PATH, 'r') as f:
    lines = [line.strip() for line in f]
print(len(lines), 'lines loaded.')
print(START, END)
time.sleep(2)

# Loop through to open tabs in Firefox
for i in lines[START:END]:
    webbrowser.get('firefox').open(f'http://localhost:8080/{COLLECTION}/record/{i}')
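Opening hundreds of tabs at once can overwhelm both the browser and pywb. A gentler variant is to open the record URLs in batches with a pause between them - the sketch below is my own, and the batch size, pause length, and helper names (`record_urls`, `open_in_batches`) are assumptions, not part of the original snippet:

```python
import time
import webbrowser

COLLECTION = 'EXAMPLE'
BATCH_SIZE = 25   # assumption: tune to what the machine can handle
PAUSE_SECS = 30   # assumption: give pywb time to finish writing records


def record_urls(urls, collection=COLLECTION):
    """Map each target URL onto pywb's /<collection>/record/ endpoint."""
    return [f'http://localhost:8080/{collection}/record/{u}' for u in urls]


def open_in_batches(urls, batch_size=BATCH_SIZE, pause=PAUSE_SECS,
                    browser='firefox'):
    """Open record URLs in batches, pausing between each batch."""
    b = webbrowser.get(browser)
    for start in range(0, len(urls), batch_size):
        for u in record_urls(urls[start:start + batch_size]):
            b.open(u)
        time.sleep(pause)
```

The same .txt-loading code from the snippet above can feed `open_in_batches` directly.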
This post carries on in Part II.