Web archiving can be hard. Luckily, a number of projects exist to make it easier.

Code snippets

Here are a couple of code snippets that I've put together and will probably refer back to.

Searching WARCs with warcio

The following uses warcio to read WARC files. A list of URLs is supplied via a .txt file, and the code checks whether each URL exists in the collection. It's essentially a way of cross-referencing what should be in the archive against what actually is. The code would be even more useful if the .txt file were replaced with an XML sitemap, which is easily done (see the sketch after the snippet below).

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Searches a collection of WARCs for URLs supplied via .txt file"""
import os
import time

from warcio.archiveiterator import ArchiveIterator

# A .txt file containing URLs used for the search
SEARCH_FILE_PATH = './EXAMPLE.txt'
# Directory containing WARC files
WARC_COLL_DIR = '/home/craiglmccarthy/Documents/Code/pywb/collections/EXAMPLE/archive/'

# Load the .txt file and build list
with open(SEARCH_FILE_PATH, 'r') as f:
    lines_file = [line.strip() for line in f]
print('Number of URLs to search for:', len(lines_file))
time.sleep(3)

# Use a set to remove duplicate URLs across WARC files
warc_set = set()
warc_files = os.listdir(WARC_COLL_DIR)
warc_files_len = len(warc_files)
# Open all WARC files and read
for index, filename in enumerate(warc_files):
    warc_path = os.path.join(WARC_COLL_DIR, filename)
    print(f'Loading WARCs from: {filename}... {index + 1}/{warc_files_len}')
    # Load WARC and iterate through the records
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                uri = record.rec_headers.get_header('WARC-Target-URI')
                warc_set.add(uri)

print('Searching for URLs...')
# Loop through URL list to search the WARC collection
url_in_warc = []
url_not_in_warc = []
for url in lines_file:
    if url in warc_set:
        url_in_warc.append(url)
    else:
        url_not_in_warc.append(url)

print('--------------------')
print('Number of URLs searched for:', len(lines_file))
print(len(warc_set), 'URLs in the WARC collection')
print('Number of URLs FOUND from list in WARC collection:', len(url_in_warc))
print('Number of URLs MISSING from list in WARC collection:', len(url_not_in_warc))
print('URLs missing:', url_not_in_warc)
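
As a rough sketch of the sitemap idea mentioned above, the URL list could be built from an XML sitemap instead of a .txt file. The SITEMAP_PATH below is a hypothetical local copy of the sitemap; the namespace is the standard one used by sitemaps, and the resulting lines_file list slots straight into the snippet above.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Builds the URL list from an XML sitemap instead of a .txt file"""
import xml.etree.ElementTree as ET

# Hypothetical path to a local copy of the sitemap
SITEMAP_PATH = './sitemap.xml'
# Standard namespace used by XML sitemaps
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# Each <loc> element in the sitemap holds one URL
tree = ET.parse(SITEMAP_PATH)
lines_file = [loc.text.strip() for loc in tree.getroot().findall('.//sm:loc', NS)]
print('Number of URLs to search for:', len(lines_file))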

Browser automation for pywb

Pywb records webpages through the browser. While this produces high-quality WARCs, because the browser actually loads and renders the DOM, the capturing process can be very slow. The following snippet opens up hundreds of tabs at once when provided with a list of URLs, and pywb then records all of the incoming HTTP traffic. It's a little unorthodox, but it works.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""Loops through .txt file to open URLs in a browser - to be used in conjunction with pywb in recording mode"""

import webbrowser
import time

# Collection name used in pywb
COLLECTION = 'EXAMPLE'
FILE_PATH = '/home/craiglmccarthy/EXAMPLE.txt'
START = 0
END = 250  # Set to None to go to the end of the list

# Open file and add URLs to list
with open(FILE_PATH, 'r') as f:
    lines = [line.strip() for line in f]

print(len(lines), 'lines loaded.')
print(START, END)
time.sleep(2)

# Loop through to open tabs in Firefox
for url in lines[START:END]:
    webbrowser.get('firefox').open(f'http://localhost:8080/{COLLECTION}/record/{url}')
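
If opening every tab in one go is too much for the browser or the machine, a small variation opens them in batches. This is only a sketch: BATCH_SIZE and PAUSE are hypothetical values, and the other names mirror the snippet above.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Opens URLs in batches so the browser and pywb can keep up"""
import time
import webbrowser

COLLECTION = 'EXAMPLE'
FILE_PATH = '/home/craiglmccarthy/EXAMPLE.txt'
BATCH_SIZE = 25  # Hypothetical: tabs to open before pausing
PAUSE = 30       # Hypothetical: seconds to wait between batches

# Open file and add URLs to list
with open(FILE_PATH, 'r') as f:
    lines = [line.strip() for line in f]

for index, url in enumerate(lines):
    webbrowser.get('firefox').open(f'http://localhost:8080/{COLLECTION}/record/{url}')
    # Pause after each batch so pywb has time to record the traffic
    if (index + 1) % BATCH_SIZE == 0:
        time.sleep(PAUSE)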

This post carries on in Part II, here.