I had anticipated a more exciting update on this endeavor today, but I ran into a few problems getting the corpus from HathiTrust. In case anyone uses this as a guide for building a digital corpus, I’m going to document the steps I took.
The HTRC documentation is scattered across GitHub, Atlassian, the HTRC website, the HathiTrust website, and the HTRC analytics subdomain. I couldn't make heads or tails of how these pieces fit together, and I kept running into frustrating elements across these sites that appear to be outdated.
I would’ve preferred to use the HathiTrust Data API, but it has evidently been deprecated.
HathiTrust has a Python library called the HTRC Feature Reader, which was exactly what I was looking for (or so I thought). I'm not sure whether this tool has also been deprecated, but it depends on several outdated libraries. At first I figured I'd just update the offending packages, but that trapped me in dependency hell. I then tried downgrading pip and Python in a virtual environment, which only opened a new series of dependency problems. Ultimately, I abandoned the Feature Reader too, though, as you'll see below, you still need to install it: despite its dependency issues, installing the library gives you an important command-line tool you'll need later.
Just for kicks, I also tried this in the recommended Anaconda environment—no dice.
What I wound up doing was creating a collection in HathiTrust's standard advanced search interface. I saved just over 1,000 texts matching my keywords, setting the date range to 1800–1899 and limiting the results to English-language texts. I called the collection "Imperial Sentiments" and made it public. Then I downloaded the collection's metadata file, which includes each volume's HTID, HathiTrust's unique volume identifier.
Once you download the metadata file as a CSV, copy the volume IDs into a separate text file. Delete the column header and save the file under a simple name, like volume_ids.txt.
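If you'd rather do that in the command line, here's a minimal sketch, assuming the IDs sit in the first column of a file called metadata.csv (check your header row and adjust the field number accordingly):

```bash
# Skip the header row, keep only the ID column, and write it out.
# Assumes a comma-separated file with the volume IDs in column 1.
tail -n +2 metadata.csv | cut -d',' -f1 > volume_ids.txt
```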
If you haven't already done so by this point, make sure you've installed the HTRC Feature Reader. I recommend running `pip install htrc-feature-reader` in the command line.
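Given the dependency issues I described above, you may want to sandbox the library rather than install it globally. A minimal sketch (the environment name here is arbitrary):

```bash
# Create and activate an isolated environment so the Feature Reader's
# older dependencies don't touch your system Python.
python3 -m venv htrc-env
source htrc-env/bin/activate
pip install htrc-feature-reader
```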
Then, run the following:
```bash
htid2rsync --from-file volume_ids.txt >> paths.txt
```
This will take the list of HTIDs from volume_ids.txt, convert them to the paths you’ll need to download the HathiTrust files, and output those paths into a file called paths.txt.
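Before moving on, it's worth a quick sanity check that every ID was converted:

```bash
# The two line counts should match. Note that >> appends, so re-running
# the htid2rsync command will add duplicate paths to paths.txt.
wc -l volume_ids.txt paths.txt
```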
Now you’re ready to start downloading!
To download the Extracted Features files, you provide HathiTrust's server with a list of the files you want, and it copies the matching files to the directory you specify. We'll accomplish this with a utility called rsync, which ships with macOS and Linux and can be downloaded for Windows.
Run the following command:
```bash
rsync -av --no-relative --files-from=paths.txt data.analytics.hathitrust.org::features/ /local/destination/path/
```
A few things are important to note:

- You'll definitely want verbose output (the `v` in `-av`), because the process can sometimes hang, and without it you won't know whether rsync is just downloading a lot at once or has stalled.
- The compressed JSON files are spread across nested directories on the server. There's almost no circumstance in which you'd want to keep that nested file structure, so use `--no-relative` to flatten it.
- Make sure you specify your local path correctly. If you don't, rsync saves to a temp folder you'll have to track down.
After running the command, you should have your corpus!
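One quick way to confirm the download is complete, assuming the flattened files landed directly in your destination folder with their standard .json.bz2 names:

```bash
# The number of downloaded files should match the number of volume IDs.
ls /local/destination/path/*.json.bz2 | wc -l
wc -l volume_ids.txt
```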
Before working with the files, you'll want to decompress them, either in the command line or via a GUI, and then delete the compressed files. It's probably easiest to navigate to the folder in the command line and run something like `rm *.bz2` (but make sure you know what you're doing).
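For what it's worth, bunzip2 (part of the bzip2 package most systems already have) handles both steps at once, since by default it deletes each archive after decompressing it:

```bash
# Decompress every Extracted Features file in place; bunzip2 removes
# each .bz2 archive once its contents are extracted.
cd /local/destination/path/
bunzip2 *.bz2
```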
And that’s it! Hopefully, I’ve saved someone the pain of wading through outdated documentation or tinkering with Python libraries when this is really a relatively painless and simple process.
In the next post, we'll take a look at what exactly these HTRC files contain and determine whether they're fit for different kinds of projects.