beyond the keywords

  • Jastrow search

    I’m a big user (and fan) of Sefaria.org, the online library of Jewish texts. It’s become especially valuable lately as the commentaries on each text seem to proliferate on the platform. See, for example, the number of commentaries on Esther 1:1.

    One of the best things about the site lately is the impressive list of reference works Sefaria has managed to compile. There are quite a few (aside from Jastrow), but here are some of my favorites:

    • Sefer HaShorashim
    • Sefer HeArukh
    • Otzar La’azei Rashi
    • BDB
    • Machberet Menachem

    While I think it would be valuable to search through any (or all) of these, I am most interested in Jastrow. It’s somewhat of a pain to search within a particular dictionary on the website, so I created a web application that uses the Sefaria API to search the Jastrow dictionary. You can check out the application at katzir.xyz/jastrow.

    There are a variety of endpoints on the Sefaria developer site worth exploring, but the one I used for this is under the lexicon menu. The specific call is:

      SEFARIA_API_BASE = "https://www.sefaria.org/api/words"

    Basically, the application accepts a search word, limits the reference parameter to Jastrow, and lists the corresponding definitions. The application is built on Sinatra, and the views are designed to display definition details, citations, and other information.
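
    For readers curious what the underlying lookup looks like, here is a minimal sketch in R (the app itself is Ruby/Sinatra). The endpoint path comes from the constant above; the parent_lexicon field and the “Jastrow Dictionary” value reflect my reading of the API response and should be treated as assumptions:

    library(httr)
    library(jsonlite)

    SEFARIA_API_BASE <- "https://www.sefaria.org/api/words"

    # Look up a word via the lexicon endpoint and keep only Jastrow entries.
    # The parent_lexicon field name and lexicon label are assumptions about the response shape.
    lookup_jastrow <- function(word) {
      resp <- GET(paste0(SEFARIA_API_BASE, "/", URLencode(word, reserved = TRUE)))
      entries <- fromJSON(content(resp, as = "text", encoding = "UTF-8"),
                          simplifyVector = FALSE)
      Filter(function(e) identical(e$parent_lexicon, "Jastrow Dictionary"), entries)
    }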

    While I’ve mostly used the application on my phone while learning Talmud (I have a Hebrew keyboard on mobile), I did create a floating keyboard for the desktop site, an idea I got from the Yiddish Book Center’s keyboard.

    Anyway, check it out:

    Here’s the definition view:

    Eventually I’ll create a page for Jastrow’s preface as well as a list of acronyms and abbreviations from the dictionary.

  • Digging into the British Empire with R: Lexical topics, part 5

    Some of the most powerful (and interesting) aspects of working with R on corpus analysis involve exploring technical features of texts. In this script, I use some well-known R packages for text analysis, like tidyr and tidytext, combined with ggplot2, which I used in the previous post, to analyze and visualize textual data.

    This combination allows us to do things like find word frequencies in particular subsets of the corpus. In this case, I’ve selected for Rudyard Kipling and charted the 10 words he uses most frequently in the corpus:

    To accomplish this, I filtered for texts by Kipling, unnested the tokens, generated a word count, and plotted the words in a bar graph. Here’s the code:

    # Filter for a specific author in the corpus and tokenize text into words. Here
    # I've used Kipling 
    word_freq <- empire_texts %>%
      filter(author == "Kipling, Rudyard") %>%
      unnest_tokens(word, text) %>%
      # Remove stop words and non-alphabetic characters
      anti_join(stop_words, by = "word") %>%
      filter(str_detect(word, "^[a-z]+$")) %>%
      # Count word frequencies
      count(word, sort = TRUE) %>%
      top_n(10, n)
    
    # Create bar graph
    ggplot(word_freq, aes(x = reorder(word, n), y = n)) +
      geom_bar(stat = "identity", fill = "steelblue") +
      coord_flip() +
      labs(title = "Top 10 Words by Rudyard Kipling",
           x = "Words", y = "Frequency") +
      theme_minimal()

    I also selected the top 10 bigrams and filtered for author, in this case H. Rider Haggard:

    In this case, I was just interested in seeing any bigrams, but depending on your analysis, you might want to see, for example, what the first word in a bigram is if the second word is always “land.” You could do that by slightly modifying the script and including a line that filters word2 for “land.” For example:

    bigram_freq <- empire_texts %>%
      filter(author == "Haggard, H. Rider (Henry Rider)") %>%
      unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
      filter(!is.na(bigram)) %>%
      separate(bigram, into = c("word1", "word2"), sep = " ") %>%
      # Filter for bigrams where word2 is "land"
      filter(word2 == "land") %>%
      filter(!word1 %in% stop_words$word) %>%
      unite(bigram, word1, word2, sep = " ") %>%
      count(bigram, sort = TRUE) %>%
      top_n(10, n)

    Finally, I’ve used TF-IDF (Term Frequency-Inverse Document Frequency) to chart the top 5 most distinctive words by title. The titles are randomly sampled from the corpus with a set seed (which I’ve set to 279 in my script), but you can regenerate with different random titles by changing the seed.

    Here’s the script for that:

    # Filter, tokenize, and calculate TF-IDF
    tfidf_data <- empire_texts %>%
      filter(title %in% sample_titles) %>%
      unnest_tokens(word, text) %>%
      anti_join(stop_words, by = "word") %>%
      count(title, word, name = "n") %>%
      bind_tf_idf(word, title, n) %>%
      # Get top 5 words per title by TF-IDF
      group_by(title) %>%
      slice_max(order_by = tf_idf, n = 5, with_ties = FALSE) %>%
      ungroup()
    
    # Create faceted bar plot
    ggplot(tfidf_data, aes(x = reorder_within(word, tf_idf, title), y = tf_idf)) +
      geom_bar(stat = "identity", fill = "purple") +
      facet_wrap(~title, scales = "free_y") +
      coord_flip() +
      scale_x_reordered() +
      labs(title = "Top 5 Distinctive Words by Title (TF-IDF)",
           x = "Words", y = "TF-IDF Score") +
      theme_minimal() +
      theme(axis.text.y = element_text(size = 6))
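
    One note: the pipeline above assumes sample_titles has already been defined, which isn’t shown in the snippet. A minimal sketch of how it could be generated (the number of titles sampled, four here, is my guess) might look like:

    # Randomly sample a handful of titles from the corpus; 279 matches the seed
    # mentioned above, but the number of titles sampled is an assumption
    set.seed(279)
    sample_titles <- sample(unique(empire_texts$title), 4)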

    Needless to say, you could run this with as many titles as you choose, though the graph gets a little wonky if you run too many–and the process can slow down considerably depending on the size of the corpus.

    It’s worth noting that these plots can be embedded in a Quarto document and published on the web (with a free account) via RPubs. This can help if you want to use the charts for a presentation, or even if you just want a chart to display online in a more versatile environment. Maybe I’ll write a series on creating presentations in RStudio at some point.

    That wraps up the third and final script for our British Empire sentiment analysis.

    Stay tuned for the next project.

  • Digging into the British Empire with R: Maps!, part 4

    Probably the coolest thing you can do with a text analysis project in R is to create data visualizations. In this script, I’ve used our corpus to create some interesting maps based on different regions and their prevalence in the corpus (what I’ve called “narrative intensity” in the charts).

    This script takes advantage of the ggplot2 package, which has a lot of flexibility in terms of data display and aesthetic customizations. Here is an example of a map of narrative intensity–in the code, I supply the regions I want to measure intensity for. For this particular map, I also have to define the coordinates, which is kind of a pain, but it also allows you to customize on a granular level. In this map, I’ve used the default coordinates for the middle of each country (the generic coordinates available on most maps). But presumably most of the writing about Australia concerns the west coast of the continent and not its center; this is something I could adjust in the code.
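
    For instance, if I wanted the Australia marker to sit near the west coast rather than the continent’s center, a small hypothetical tweak to the region_coords table defined in the script at the end of this post would do it:

    # Hypothetical adjustment: move the Australia point toward Perth on the west coast
    # (region_coords is the coordinates data frame defined in the script below)
    region_coords$lon[region_coords$region == "Australia"] <- 115.86
    region_coords$lat[region_coords$region == "Australia"] <- -31.95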

    A heat map probably makes more sense for this data, and I can make that too. Below, I’ve used a heat map with a customized aesthetic theme and defined fixed coordinates.

    ggplot2 is a powerful library, and it’s easy to sink lots of time into beautiful visualizations–the first image uses the defaults, while I tweaked the second a bit. Now that I’m uploading it, the colors are probably brighter than I’d normally recommend, and it’s hard to see some of the gradations in the heat map. I think this is a good example of how the best thing about R–seemingly infinite customization–can also be its biggest drawback.

    In any event, hopefully you can see how powerful text mining and visualizations can be–in mining a corpus of 1,000 works on the British Empire, we can see pretty clearly which locations are the object of the most narrative intensity–what places were being talked about the most. One question that arises from this for me is the place (or lack thereof!) of Latin America in this data. Economic interest in Latin American mining was prevalent in the Victorian Era, so it might be interesting to dive into Latin American regions in the corpus and see how they compare to places that do figure in the map, such as Australia, India, or Canada. And of course, by manually reviewing the corpus data, it would be possible to select texts based on genre; I could imagine novels, for example, might be more focused on one region than financial reports or newspaper articles, or vice versa.

    Here’s the script I used for these:

    # In this script, we're going to plot some maps with our data to look at discrete 
    # regions and evaluate the "narrative intensity" of those regions--in other words,
    # how often they were depicted in the corpus 
    
    # First let's install our packages
    install.packages("stringi")
    install.packages("dplyr")
    install.packages("ggplot2")
    install.packages("maps")
    install.packages("stringr")
    install.packages("tidyr")
    install.packages("tidytext")
    
    # Now we'll load them
    library(stringi)
    library(dplyr)
    library(ggplot2)
    library(maps)
    library(stringr)
    library(tidyr)
    library(tidytext)
    
    
    # Alternative approach using base R to handle problematic characters
    empire_texts <- empire_texts %>%
      mutate(
        text_length = sapply(text, function(x) {
          # Try to handle encoding issues safely
          clean_text <- tryCatch({
            clean <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "")
            if (is.na(clean)) return(0)
            return(nchar(clean))
          }, error = function(e) {
            return(0)  # Return 0 for length if all else fails
          })
          return(clean_text)
        })
      )
    
    regions <- c("India", "Africa", "Australia", "Canada", "Caribbean", "Egypt")
    
    
    # Calculates the proportional regional focus for texts in the corpus
    # Important: sub out invalid UTF-8 at the start, or subsequent string operations will fail
    regional_focus <- empire_texts %>%
      mutate(text = iconv(text, to = "UTF-8", sub = "")) %>%  # Clean at the start
      filter(gutenberg_id != 1470, gutenberg_id != 3310, gutenberg_id != 6134, 
             gutenberg_id != 6329, gutenberg_id != 6358, gutenberg_id != 6469) %>%
      group_by(gutenberg_id, title) %>%
      summarize(text = paste(text, collapse = " ")) %>%
      mutate(
        text_length = str_length(text),
        india_focus = str_count(text, regex("India|Indian|Hindustan", ignore_case = TRUE)),
        africa_focus = str_count(text, regex("Africa|African|Cape|Natal|Zulu", ignore_case = TRUE)),
        australia_focus = str_count(text, regex("Australia|Sydney|Melbourne", ignore_case = TRUE)),
        caribbean_focus = str_count(text, regex("Jamaica|Barbados|Caribbean|West Indies", ignore_case = TRUE)),
        egypt_focus = str_count(text, regex("Egypt|Egyptian|Nile|Cairo", ignore_case = TRUE)),
        canada_focus = str_count(text, regex("Canada|Canadian|Ontario|Quebec", ignore_case = TRUE))
      ) %>%
      mutate(
        india_ratio = india_focus / text_length * 10000,
        africa_ratio = africa_focus / text_length * 10000,
        australia_ratio = australia_focus / text_length * 10000,
        caribbean_ratio = caribbean_focus / text_length * 10000,
        egypt_ratio = egypt_focus / text_length * 10000,
        canada_ratio = canada_focus / text_length * 10000
      )
      
    
    # Let's start putting a map together with a focus on different regions 
    regional_summary <- regional_focus %>%
      ungroup() %>%   # Drop the per-book grouping so the sums cover the whole corpus
      summarise(
        India = sum(india_ratio, na.rm = TRUE),
        Africa = sum(africa_ratio, na.rm = TRUE),
        Australia = sum(australia_ratio, na.rm = TRUE),
        Caribbean = sum(caribbean_ratio, na.rm = TRUE),
        Egypt = sum(egypt_ratio, na.rm = TRUE),
        Canada = sum(canada_ratio, na.rm = TRUE)
      ) %>%
      # Reshape to long format for mapping
      pivot_longer(cols = everything(), 
                   names_to = "region", 
                   values_to = "focus_strength")
    
    # Creates a data frame with coordinates for each region prevalent in the data
    region_coords <- data.frame(
      region = c("India", "Africa", "Australia", "Caribbean", "Egypt", "Canada"),
      lon = c(78.9629, 21.0936, 133.7751, -76.8099, 31.2357, -106.3468),
      lat = c(20.5937, 7.1881, -25.2744, 18.7357, 30.0444, 56.1304)
    )
    
    # Combine text focus with region 
    map_data <- left_join(region_coords, regional_summary, by = "region")
    
    # Get world map data
    world <- map_data("world")
    
    # Create the map! 
    ggplot() +
      # World map background
      geom_polygon(data = world, 
                   aes(x = long, y = lat, group = group), 
                   fill = "gray90", color = "gray70", size = 0.1) +
      # Points for each region sized and colored by focus strength
      geom_point(data = map_data, 
                 aes(x = lon, y = lat, 
                     size = focus_strength, 
                     color = focus_strength),
                 alpha = 0.7) +
      # Optional: Add region labels
      geom_text(data = map_data,
                aes(x = lon, y = lat, label = region),
                vjust = -1, size = 3) +
      # Customize the appearance
      scale_size_continuous(range = c(3, 15), name = "Focus Strength") +
      scale_color_gradient(low = "blue", high = "red", name = "Focus Strength") +
      theme_minimal() +
      labs(title = "Regional Focus in Texts about the British Empire",
           subtitle = "Size and color intensity show relative focus strength",
           x = NULL, y = NULL) +
      coord_fixed(1.3) +  # Keeps map proportions reasonable
      theme(legend.position = "bottom")
    
    # This is a bit easier because you don't have to supply the coordinates 
    # but we do need to normalize the values in the focus strength field, so we'll
    # add those values to regional_summary 
    
    # Assuming world_subset is prepared as in previous examples
    # If not, here's a quick setup (replace with your actual data prep):
    world <- map_data("world")
    # Example map_data from earlier
    region_coords <- data.frame(
      region = c("India", "Africa", "Australia", "Caribbean", "Egypt", "Canada"),
      lon = c(78.9629, 21.0936, 133.7751, -76.8099, 31.2357, -106.3468),
      lat = c(20.5937, 7.1881, -25.2744, 18.7357, 30.0444, 56.1304),
      focus_strength = c(50, 30, 20, 15, 25, 40)  # Replace with your actual focus_strength
    )
    country_regions <- data.frame(
      region = c("India", "Africa", "Africa", "Australia", "Caribbean", "Caribbean", "Egypt", "Canada"),
      country = c("India", "South Africa", "Nigeria", "Australia", "Jamaica", "Barbados", "Egypt", "Canada")
    )
    map_data_countries <- left_join(country_regions, region_coords[, c("region", "focus_strength")], by = "region")
    world_subset <- world %>% left_join(map_data_countries, by = c("region" = "country"))
    
    # Create the heatmap
    ggplot() +
      # World map with heatmap fill
      geom_polygon(data = world_subset, 
                   aes(x = long, y = lat, group = group, fill = focus_strength),
                   color = "#2a4d69", size = 0.05) +  # Thin, dark borders for contrast
      # Landmass background (middle)
      geom_polygon(data = world, 
                   aes(x = long, y = lat, group = group), 
                   fill = "#5A5A5A", color = "#B0B0B0", size = 0.3) +  # Even lighter gray
      # Heatmap layer (top, fully opaque)
      geom_polygon(data = world_subset, 
                   aes(x = long, y = lat, group = group, fill = focus_strength),
                   color = "#D0D0D0", size = 0.4, alpha = 1) +  # No transparency
      # Bright, adjusted gradient
      scale_fill_gradientn(
        colors = c("#80CFFF", "#CCFF99", "#FFFF99", "#FFCCCC", "#FF99CC"),  # Super bright palette
        name = "Focus Strength",
        na.value = "transparent",
        limits = c(min(world_subset$focus_strength, na.rm = TRUE), 
                   max(world_subset$focus_strength, na.rm = TRUE)),  # Full data range
        breaks = seq(min(world_subset$focus_strength, na.rm = TRUE), 
                     max(world_subset$focus_strength, na.rm = TRUE), length.out = 5),
        guide = guide_colorbar(barwidth = 15, barheight = 0.5, title.position = "top")
      ) +
      # Theme
      theme_void() +
      theme(
        plot.background = element_rect(fill = "#2A4D69", color = NA),
        panel.background = element_rect(fill = "#2A4D69", color = NA),
        plot.title = element_text(family = "Arial", size = 16, color = "#FFFFFF", 
                                  face = "bold", hjust = 0.5),
        plot.subtitle = element_text(family = "Arial", size = 12, color = "#E0E0E0", 
                                     hjust = 0.5),
        legend.position = "bottom",
        legend.title = element_text(color = "#FFFFFF", size = 10, face = "bold"),
        legend.text = element_text(color = "#FFFFFF", size = 8),
        legend.background = element_rect(fill = "transparent", color = NA)
      ) +
      labs(
        title = "Regional Focus in Texts about the British Empire",
        subtitle = "Heatmap of narrative intensity across colonial regions"
      ) +
      coord_fixed(1.3)
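
    Following up on the Latin America question raised before the script: extending the analysis would mostly mean adding more patterns to the regional_focus step. Here is a minimal sketch; the region list and search terms are only illustrative and would need refinement:

    # Hypothetical additional counts for Latin American regions; the term lists
    # here are illustrative only
    regional_focus <- regional_focus %>%
      mutate(
        mexico_focus = str_count(text, regex("Mexico|Mexican", ignore_case = TRUE)),
        south_america_focus = str_count(text, regex("Peru|Chile|Argentina|Brazil|Andes", ignore_case = TRUE)),
        mexico_ratio = mexico_focus / text_length * 10000,
        south_america_ratio = south_america_focus / text_length * 10000
      )
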
  • Digging into the British Empire with R: Loading a corpus, part 3

    Alright, here’s the R script I used to generate the corpus. As you can see, there are a lot of ways this could be tweaked. I didn’t do this, but it would be possible (and probably even preferable) to run the search, collect the data, and manually sift through the results if you have a relatively small corpus.

    # Install packages  
    install.packages("gutenbergr")
    install.packages("dplyr")
    install.packages("stringr")
    
    # Load required libraries
    library(gutenbergr)
    library(dplyr)
    library(stringr)
    
    # Get the Gutenberg metadata
    gb_metadata <- gutenberg_works()
    
    # Define search terms related to British Empire
    search_terms <- c("india", "colony", "colonial", "empire", "africa", "asia", 
                      "imperial", "natives", "british", "england", "victoria", 
                      "trade", "east india", "conquest")
    
    # First approach: Find works with publication dates in the 19th century where possible
    # Many Gutenberg works have no publication dates
    dated_works <- gb_metadata %>%
      filter(
        language == "en",
        !is.na(gutenberg_author_id),
        !str_detect(title, "Bible|Dictionary|Encyclopedia|Manual|Cookbook")
      ) %>%
      # But some do, so let's make sure we get those 
      filter(
        !is.na(gutenberg_bookshelf),
        str_detect(gutenberg_bookshelf, "1800|19th")
      )
    
    # Second approach: Find popular authors from the 19th century
    empire_authors <- c("Rudyard Kipling", "Joseph Conrad", "Charles Dickens", 
                        "H. Rider Haggard", "Robert Louis Stevenson", "Anthony Trollope", 
                        "E.M. Forster", "John Stuart Mill", "Thomas Macaulay", 
                        "Thomas Babington Macaulay", "James Mill", "George Curzon",
                        "Frederick Lugard", "Richard Burton", "David Livingstone",
                        "Henry Morton Stanley", "Mary Kingsley", "Flora Annie Steel")
    
    author_works <- gb_metadata %>%
      filter(
        language == "en",
        str_detect(author, paste(empire_authors, collapse = "|"))
      )
    
    # Third approach: Keyword search in titles
    keyword_works <- gb_metadata %>%
      filter(
        language == "en",
        str_detect(tolower(title), paste(search_terms, collapse = "|"))
      )
    
    # Combine all of the above approaches into one dataset
    empire_works <- bind_rows(dated_works, author_works, keyword_works) %>% 
      distinct() %>%
      # Use author birth/death dates to try and estimate 19th century works
      left_join(gutenberg_authors, by = "gutenberg_author_id") %>%
      filter(
        # Authors who lived during the 19th century (possibly born earlier)
        (is.na(birthdate) | 
           birthdate <= 1880) & # Born before or during most of the 19th century
          (is.na(deathdate) | 
             deathdate >= 1800)   # Died after the 19th century began
      )
    
    # View how many books we found
    print(paste("Found", nrow(empire_works), "potentially relevant works"))
    
    # Preview the first few works
    head(empire_works %>% select(gutenberg_id, title, author.x), 20)
    
    # If we have more than 1000 works, we can limit to the most relevant
    # I'm going to cap this at 1000 works, but feel free to use a lower number if you prefer
    if(nrow(empire_works) > 1000) {
      # Calculate a relevance score based on how many search terms appear in the title
      empire_works <- empire_works %>%
        mutate(
          relevance_score = sapply(title, function(t) {
            sum(sapply(search_terms, function(term) {
              if(str_detect(tolower(t), term)) 1 else 0
            }))
          })
        ) %>%
        arrange(desc(relevance_score)) %>%
        head(1000)
    }
    
    # Download the corpus (this will take time)
    # The two lines below start the download, so run them when you're ready
    empire_texts <- gutenberg_download(empire_works$gutenberg_id, 
                                     meta_fields = c("title", "author"))
    
    # Take a quick look at the dataset to get a sense of how it's organized  
    View(empire_texts) 
    
    # You might want to save the metadata for future reference. If so, uncomment
    # the following lines 
    
    #write.csv(empire_works %>% select(gutenberg_id, title, author.x), 
    #          "outputs/empire_corpus_metadata.csv", row.names = FALSE)
    
    
  • Digging into the British Empire with R: Swapping out HTRC for Gutenbergr, part 2

    In the last post, I detailed the process for building a corpus using HTRC’s Extracted Features datasets. In this post, I’m going to explain a little bit about what Extracted Features are, whether or not they might be useful for a text mining project, the problems I had working with them, why I decided to use Project Gutenberg instead, and how to go about using HathiTrust instead of Gutenberg if that’s your preference.

    Let’s start with the Extracted Features dataset.

    Each volume in our collection list has a JSON file with page-level features–these include metadata about the volume and about each page, including a word frequency list.

    So what can you do with this dataset?

    Well, you can track things like word frequency across sections for each book. With multiple books, you could track word frequency over decades. Using keywords in titles, you could track topics in publication data across decades (or centuries). If you’re interested in the history of the book or publishing, you could track that information across a large corpus. You could do a sentiment analysis with this data, though since the page-level metadata is only giving you keyword frequency, working with a full-text corpus would be preferable.

    The takeaway is this: HTRC’s Extracted Features let you text mine bibliographic and page-level metadata; they do not really enable you to build a full-text corpus.

    Additionally, I ran into several challenges using the HTRC Extracted Features data. I have not found many examples on the web or in publications of people performing research using these datasets, but it’s possible I’ve overlooked them. These are the problems I had, but your mileage may vary.

    1. The JSON parsing is extremely slow and resource-intensive
      • At first, I assumed this was because of the size of my dataset (800 volumes at the beginning). I cut this in half twice, finally using just a couple hundred volumes. It was still nightmarishly slow. For some specs, I started by trying to parse the files on my local computer (M4 Mac, 16GB RAM), which generally handles data analyses well. When parsing kept stalling out, I switched over to Posit Cloud (formerly RStudio Cloud), used the paid version, maxed out the specs, and tried to parse overnight. Nothing doing
      • After repeated failures, I did some digging into the dataset. Each JSON file in the dataset requires, on average, 1,050 “gathers” from the parser. At 1,050 gathers per file across 800 files, we’re closing in on 850,000 operations (these are averages, of course). If we assume ~1.5KB per gather for the metadata alone, we’re looking at 1.26GB just to process metadata. So far, not bad. But some of the Extracted Features files are ~5MB: at 1,000 files, that inches up toward 4-5GB of memory. Significant, but not so bad
      • To list these parsed elements, R loads all the files in the corpus into memory at once. Most people assume 3x overhead for R processes, so for 4-5GB of data, we’re at 12-15GB of memory, maxing out the 16GB Mac (which, of course, still has to run its own processes)
      • To make matters worse, we also need to run bind_rows() to flatten the JSON into a data frame. Again, R holds the original in memory while copying, putting us at 24-30GB of RAM. Finally, the more nested the structure, the more memory is needed–and here’s the kicker–even when deleting the biggest JSON files in the corpus, even when cutting the corpus to around 150 volumes and then subsetting the data I was trying to parse, I was unable to run anything substantive without maxing out memory, even when using up to 32GB of RAM in a cloud environment, even when using Parallels to divide up my cores, etc.
    2. Metadata vs. full-text: Is the juice worth the squeeze?
      • At some point, I decided it was time to move on–the whole reason I wanted to use the Extracted Features dataset in the first place was so that I could analyze a huge corpus since I was relying on metadata and not full-text. In fact, the only reason I downloaded 800-1000 texts was for a quick example of what could be done; my hope was to eventually start a project where I could analyze 5 or 10x that number
      • Ultimately, if I have to significantly cut down the corpus size and pare down the metadata by subsetting just to analyze a couple hundred texts, I might as well actually analyze them in full-text. I’m not sure why the dataset’s JSON structure is so overwrought, but it seems to me that the point of using a dataset like this would be to process a corpus orders of magnitude larger than you could with full-text. Since that did not turn out to be the case for me, I decided to abandon the dataset for a full-text option

    I will point out that I emailed HathiTrust to ask whether it is possible to bulk download text files for out-of-copyright works, and it is, provided you submit some paperwork. I am going to take them up on that for another project in the future, but for now I’m going to use Project Gutenberg for the sentiment analysis on the British Empire.

    Project Gutenberg is not without its own challenges–first of all, the library itself is not as professionally curated as HathiTrust’s, which derives its catalog data from contributing academic institutions. Second, there’s just not as much there. And finally (I did not realize this until working with the data!), there is no publication date in Project Gutenberg metadata. I’m not sure why this is the case–it’s certainly something that people complain about. So to work on any kind of chronological analysis, you’d need to either 1) supply the dates manually or 2) include relative dates (e.g., second half of the nineteenth century) based on author birth/death dates, which are in the database. Obviously neither is ideal, but with a small enough corpus, option 1 is probably the best bet.

    On the other hand, one real strength of Project Gutenberg is that a corpus can be built simply using the gutenbergr R package–even a corpus of 1,000-2,000 texts can be built easily in under an hour.
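
    To give a sense of scale, a bare-bones version of that workflow looks something like the sketch below (the fuller, more careful script is in the part 3 post above); the title filter here is just an illustration:

    library(gutenbergr)
    library(dplyr)
    library(stringr)

    # Minimal sketch: grab English works with "empire" in the title and download them
    empire_mini <- gutenberg_works(languages = "en") %>%
      filter(str_detect(str_to_lower(title), "empire")) %>%
      gutenberg_download(meta_fields = c("title", "author"))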

  • Digging into the British Empire with R and HathiTrust, step 1

    I had anticipated a more exciting update on this endeavor today, but I ran into a few problems getting the corpus from HathiTrust. In case anyone uses this as a guide for building a digital corpus, I’m going to document the steps I took.

    The HTRC documentation is scattered across GitHub, Atlassian, the HTRC website, the HathiTrust website, and the HTRC analytics subdomain. I couldn’t really make heads or tails of how this documentation works together, and I ran into several frustrating elements across these sites that seem to be outdated.

    I would’ve preferred to use the HathiTrust Data API, but it has evidently been deprecated.

    HathiTrust has a Python library called HTRC Feature Reader—this is exactly what I was looking for (or so I thought). I’m not sure if this tool has been deprecated as well, but the Python library depends on several outdated dependencies. When I started, I figured I’d just update the libraries in question, but that trapped me in dependency hell. I then decided to downgrade Pip and Python in a virtual environment, only to open a new series of dependency problems. Ultimately, I abandoned the Feature Reader too (though, as you’ll see below, you still need to install it because, despite its dependency issues, downloading the library installs an important command-line tool you’ll need later).

    Just for kicks, I also tried this in the recommended Anaconda environment—no dice.

    What I wound up doing was creating a collection in the standard advanced search interface of HathiTrust. I saved just over 1,000 texts with keywords, setting the date range to 1800–1899 and selecting English-language texts. I called the collection “Imperial Sentiments” and made it public. Then, I downloaded the metadata file, which includes the HTID—HathiTrust’s identifier in its metadata.

    Once you download the metadata file as a CSV, copy and paste the volume ID numbers into a separate text file. Delete the column header and name the file something simple, like volume_ids.txt.
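
    If you’d rather script that step than do it by hand, something like the following works in R; the file name and the HTID column name are assumptions, so check them against your actual metadata export:

    # Read the collection metadata and write the HTIDs to a plain text file,
    # one per line, with no header. Adjust the column name to match your CSV.
    meta <- read.csv("imperial_sentiments_metadata.csv", stringsAsFactors = FALSE)
    writeLines(meta$htid, "volume_ids.txt")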

    If you haven’t already done so by this point, make sure you’ve installed the HTRC Feature Reader—I recommend running pip install htrc-feature-reader in the command line.

    Then, run the following:

    htid2rsync --from-file volume_ids.txt >> paths.txt

    This will take the list of HTIDs from volume_ids.txt, convert them to the paths you’ll need to download the HathiTrust files, and output those paths into a file called paths.txt.

    Now you’re ready to start downloading!

    To download the Extracted Features files, you’re providing a list of files you want to HathiTrust’s servers, which then check the paths you provide and download the files to the directory you specify. We’ll accomplish this with a protocol called rsync, which is native to macOS and Linux and can be downloaded for Windows.

    Run the following command:

    rsync -av --no-relative --files-from=paths.txt data.analytics.hathitrust.org::features/ /local/destination/path/

    A couple of things are important to note:

    You’ll definitely want to run this in verbose mode (the -v in -av) because the process can hang sometimes, and without verbose output, you won’t know whether it’s just downloading a lot at once or whether something’s wrong.

    The compressed JSON files are spread across nested directories on the server—there’s almost no circumstance in which you’d want to keep that nested file structure, so use --no-relative to flatten it.

    Make sure you specify your local path correctly. If you don’t, rsync saves to a temp folder you’ll have to track down.

    After running the script, you should have your corpus!

    Before working with the files, you’ll want to extract them, either in the command line or via a GUI, and then delete the zipped files. It’s probably easiest to navigate to the folder in the command line and run something like rm *.bz2 (but make sure you know what you’re doing).

    And that’s it! Hopefully, I’ve saved someone the pain of wading through outdated documentation or tinkering with Python libraries when this is really a relatively painless and simple process.

    In the next post, we’ll take a look at what exactly these HTRC files are comprised of and determine whether or not they’re fit for different kinds of projects.

  • Digging into the British Empire with R and HathiTrust

    What did nineteenth-century texts say about the British Empire, and how can we use code to figure it out? In the next few posts, I’m going to explore some digital humanities ideas, using R to sift through some nineteenth-century texts from the HathiTrust Digital Library. I’ll look at novels, essays, travelogues, etc. I’ll start with HathiTrust’s Extracted Features dataset, which gives word counts for tons of digitized volumes. I’m aiming for 500–1,000 texts with terms like “colony,” “trade,” or “power,” just to keep it manageable. In R, I’ll use tools like tidytext and dplyr to clean things up—removing stop words—and focus on words tied to the empire. My approach will be simple.

    First, I’ll look at which empire-related words show up most, broken down by decade—maybe “conquest” early, “trade” later? Then, I’ll try sentiment analysis with the NRC lexicon to see if I can detect any tonal shifts in the way empire is discussed. After that, I’ll run topic modeling with LDA to spot patterns—military topics, economic topics, whatever emerges. I’ll use R to visualize this data, making it comprehensible and attractive.
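
    As a preview of the sentiment step, the core of it in tidytext looks roughly like the sketch below. It assumes the corpus is already loaded as empire_texts with title and text columns, and that the NRC lexicon is available (tidytext fetches it via the textdata package the first time you call get_sentiments("nrc")):

    library(dplyr)
    library(tidytext)

    # Rough sketch: tag each word with an NRC sentiment and count by title
    nrc_counts <- empire_texts %>%
      unnest_tokens(word, text) %>%
      anti_join(stop_words, by = "word") %>%
      inner_join(get_sentiments("nrc"), by = "word") %>%
      count(title, sentiment, sort = TRUE)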

    If you’re into digital humanities, this might spark ideas about blending tech with texts in the public domain. If you like coding, it’s an interesting exercise in working with messy data in R. Stick around!

  • Using Ruby for Scripting

    In the world of programming languages, Python has long been crowned the king of scripting and data manipulation. But lately I’ve been giving Ruby a closer look, and I’ve come to believe it offers a compelling alternative.

    The Philosophical Underpinnings

    Ruby developers often emphasize Ruby’s “elegance,” a quality that likely stems from its English-like syntax and natural readability. Where Python emphasizes that there should be one obvious way to do things, Ruby embraces the principle that programming can be an art form. This isn’t mere purple prose – Ruby represents a fundamentally different approach to computational thinking, one that privileges linguistic over mathematical form.

    Expressiveness

    Consider a simple data transformation task. In Python, you might write a functional but somewhat verbose script. In Ruby, the same task is characterized by method chaining and block manipulation. Take this example of transforming a collection:

    # Ruby's expressive power
    processed_data = raw_data
      .map { |item| item.transform }
      .select { |item| item.valid? }
      .group_by { |item| item.category }
    

    The code reads almost like natural language, revealing Ruby’s core philosophy: optimize for code readability.

    Performance and Flexibility

    While Python shines in data science and machine learning, Ruby has its own performance strengths, particularly in text processing and complex scripting scenarios. The Ruby ecosystem, powered by tools like RubyGems and frameworks such as Rails, provides robust solutions for rapid script development.

    Domain-Specific Advantages

    1. Text Processing: Ruby’s regex and string manipulation capabilities are arguably more intuitive and powerful than Python’s.
    2. Metaprogramming: Ruby’s dynamic metaprogramming features allow for more flexible code generation and runtime modifications.
    3. Block Manipulation: Ruby’s blocks are more versatile than Python’s lambda functions, enabling more complex functional programming patterns.

    Where Ruby Really Shines

    Certain scenarios reveal Ruby’s unique strengths:

    • Web scraping with more concise, readable code
    • Rapid prototyping of complex data transformation scripts
    • Building domain-specific languages (DSLs)
    • Scenarios requiring high-level abstraction and code that reads like prose

    The Cognitive Dimension

    Beyond technical merits, Ruby represents a different way of thinking about code. It’s less about rigid structure and more about expressing computational logic with grace and creativity.

    A Practical Illustration

    # Ruby's elegant error handling and block usage
    File.open('data.csv', 'r') do |file|
      file.each_line
        .map(&:chomp)
        .reject(&:empty?)
        .map { |line| parse_complex_line(line) }
    end
    

    This snippet demonstrates Ruby’s ability to chain methods, handle file operations, and transform data with remarkable concision and clarity.

    The Ecosystem Consideration

    While Python dominates data science, Ruby has carved out impressive niches. Tools like Faker for data generation, Nokogiri for XML parsing, and the entire Rails ecosystem provide powerful alternatives to Python’s libraries.

    Conclusion

    Choosing between Ruby and Python isn’t about selecting a superior language – it’s about finding the right tool for your specific cognitive and project needs. Ruby offers a less traveled path, one that celebrates creativity and expression.