I had anticipated a more exciting update on this endeavor today, but I ran into a few problems getting the corpus from HathiTrust. In case anyone uses this as a guide for building a digital corpus, I’m going to document the steps I took.
The HTRC documentation is scattered across GitHub, Atlassian, the HTRC website, the HathiTrust website, and the HTRC analytics subdomain. I couldn't make heads or tails of how these pieces fit together, and I kept running into frustrating elements across these sites that appear to be outdated.
I would’ve preferred to use the HathiTrust Data API, but it has evidently been deprecated.
HathiTrust has a Python library called the HTRC Feature Reader, which was exactly what I was looking for (or so I thought). I'm not sure whether this tool has also been deprecated, but it depends on several outdated libraries. At first I figured I'd just update the offending packages, but that trapped me in dependency hell. I then tried downgrading pip and Python in a virtual environment, which only opened a new series of dependency problems. Ultimately, I abandoned the Feature Reader too, though, as you'll see below, you still need to install it: despite its dependency issues, installing the library gives you an important command-line tool you'll need later.
Just for kicks, I also tried this in the recommended Anaconda environment—no dice.
What I wound up doing was creating a collection in HathiTrust's standard advanced search interface. I saved just over 1,000 texts matching my keywords, setting the date range to 1800–1899 and limiting the results to English-language texts. I called the collection "Imperial Sentiments" and made it public. Then I downloaded the collection's metadata file, which includes each volume's HTID, HathiTrust's unique volume identifier.
Once you download the metadata file as a CSV, copy the volume IDs into a separate text file. Delete the column header and save the file under a simple name, like volume_ids.txt.
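If you'd rather do that in the command line, here's a minimal sketch, assuming the IDs sit in the first column of a file called metadata.csv (check your header row and adjust the field number accordingly):

```bash
# Skip the header row, keep only the ID column, and write it out.
# Assumes a comma-separated file with the volume IDs in column 1.
tail -n +2 metadata.csv | cut -d',' -f1 > volume_ids.txt
```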
If you haven't already done so by this point, make sure you've installed the HTRC Feature Reader. I recommend running `pip install htrc-feature-reader` in the command line.
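Given the dependency issues I described above, you may want to sandbox the library rather than install it globally. A minimal sketch (the environment name here is arbitrary):

```bash
# Create and activate an isolated environment so the Feature Reader's
# older dependencies don't touch your system Python.
python3 -m venv htrc-env
source htrc-env/bin/activate
pip install htrc-feature-reader
```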
Then, run the following:
```bash
htid2rsync --from-file volume_ids.txt >> paths.txt
```
This will take the list of HTIDs from volume_ids.txt, convert them to the paths you’ll need to download the HathiTrust files, and output those paths into a file called paths.txt.
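Before moving on, it's worth a quick sanity check that every ID was converted:

```bash
# The two line counts should match. Note that >> appends, so re-running
# the htid2rsync command will add duplicate paths to paths.txt.
wc -l volume_ids.txt paths.txt
```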
Now you’re ready to start downloading!
To download the Extracted Features files, you provide HathiTrust's server with a list of the files you want, and it copies the matching files to the directory you specify. We'll accomplish this with a utility called rsync, which ships with macOS and Linux and can be downloaded for Windows.
Run the following command:
```bash
rsync -av --no-relative --files-from=paths.txt data.analytics.hathitrust.org::features/ /local/destination/path/
```
A few things are important to note:

- You'll definitely want verbose output (the `v` in `-av`), because the process can sometimes hang, and without it you won't know whether rsync is just downloading a lot at once or has stalled.
- The compressed JSON files are spread across nested directories on the server. There's almost no circumstance in which you'd want to keep that nested file structure, so use `--no-relative` to flatten it.
- Make sure you specify your local path correctly. If you don't, rsync saves to a temp folder you'll have to track down.
After running the command, you should have your corpus!
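One quick way to confirm the download is complete, assuming the flattened files landed directly in your destination folder with their standard .json.bz2 names:

```bash
# The number of downloaded files should match the number of volume IDs.
ls /local/destination/path/*.json.bz2 | wc -l
wc -l volume_ids.txt
```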
Before working with the files, you'll want to decompress them, either in the command line or via a GUI, and then delete the compressed files. It's probably easiest to navigate to the folder in the command line and run something like `rm *.bz2` (but make sure you know what you're doing).
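For what it's worth, bunzip2 (part of the bzip2 package most systems already have) handles both steps at once, since by default it deletes each archive after decompressing it:

```bash
# Decompress every Extracted Features file in place; bunzip2 removes
# each .bz2 archive once its contents are extracted.
cd /local/destination/path/
bunzip2 *.bz2
```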
And that’s it! Hopefully, I’ve saved someone the pain of wading through outdated documentation or tinkering with Python libraries when this is really a relatively painless and simple process.
In the next post, we'll take a look at what exactly these HTRC files contain and determine whether they're fit for different kinds of projects.