beyond the keywords

  • Jastrow search

    I’m a big user (and fan) of Sefaria.org, the online library of Jewish texts. It’s become especially valuable lately as the commentaries on each text seem to proliferate on the platform. See, for example, the number of commentaries on Esther 1:1.

    One of the best things about the site lately is the impressive list of reference works Sefaria has managed to compile. There are quite a few (aside from Jastrow), but here are some of my favorites:

    • Sefer HaShorashim
    • Sefer HeArukh
    • Otzar La’azei Rashi
    • BDB
    • Machberet Menachem

    While I think it would be valuable to search through any (or all) of these, I am most interested in Jastrow. It’s somewhat of a pain to search within a particular dictionary on the website, so I created a web application that uses the Sefaria API to search the Jastrow dictionary. You can check out the application at katzir.xyz/jastrow.

    There are a variety of endpoints on the Sefaria developer site worth exploring, but the one I used for this is under the lexicon menu. The specific call is:

      SEFARIA_API_BASE = "https://www.sefaria.org/api/words"

    Basically, the application accepts a search word, limits the reference parameter to Jastrow, and lists the corresponding definitions. The application is built on Sinatra, and the views are designed to display definition details, citations, and other information.
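
    For readers curious what the underlying lookup looks like, here is a minimal sketch in R (the app itself is Ruby/Sinatra). The endpoint path comes from the constant above; the parent_lexicon field and the “Jastrow Dictionary” value reflect my reading of the API response and should be treated as assumptions:

    library(httr)
    library(jsonlite)

    SEFARIA_API_BASE <- "https://www.sefaria.org/api/words"

    # Look up a word via the lexicon endpoint and keep only Jastrow entries.
    # The parent_lexicon field name and lexicon label are assumptions about the response shape.
    lookup_jastrow <- function(word) {
      resp <- GET(paste0(SEFARIA_API_BASE, "/", URLencode(word, reserved = TRUE)))
      entries <- fromJSON(content(resp, as = "text", encoding = "UTF-8"),
                          simplifyVector = FALSE)
      Filter(function(e) identical(e$parent_lexicon, "Jastrow Dictionary"), entries)
    }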

    While I’ve mostly used the application on my phone while learning Talmud (I have a Hebrew keyboard on mobile), I did create a floating keyboard for the desktop site, an idea I got from the Yiddish Book Center’s keyboard.

    Anyway, check it out:

    Here’s the definition view:

    Eventually I’ll create a page for Jastrow’s preface as well as a list of acronyms and abbreviations from the dictionary.

  • Digging into the British Empire with R: Lexical topics, part 5

    Some of the most powerful (and interesting) aspects of working with R on corpus analysis involve exploring technical features of texts. In this script, I use some well-known R packages for text analysis, like tidyr and tidytext, combined with ggplot2, which I used in the previous post, to analyze and visualize textual data.

    This combination allows us to do things like find word frequencies in particular subsets of the corpus. In this case, I’ve selected for Rudyard Kipling and charted the 10 words he uses most frequently in the corpus:

    To accomplish this, I filtered for texts by Kipling, unnested the tokens, generated a word count, and plotted the words in a bar graph. Here’s the code:

    # Filter for a specific author in the corpus and tokenize text into words. Here
    # I've used Kipling 
    word_freq <- empire_texts %>%
      filter(author == "Kipling, Rudyard") %>%
      unnest_tokens(word, text) %>%
      # Remove stop words and non-alphabetic characters
      anti_join(stop_words, by = "word") %>%
      filter(str_detect(word, "^[a-z]+$")) %>%
      # Count word frequencies
      count(word, sort = TRUE) %>%
      top_n(10, n)
    
    # Create bar graph
    ggplot(word_freq, aes(x = reorder(word, n), y = n)) +
      geom_bar(stat = "identity", fill = "steelblue") +
      coord_flip() +
      labs(title = "Top 10 Words by Rudyard Kipling",
           x = "Words", y = "Frequency") +
      theme_minimal()

    I also selected the top 10 bigrams and filtered for author, in this case H. Rider Haggard:

    In this case, I was just interested in seeing any bigrams, but depending on your analysis, you might want to see, for example, what the first word in a bigram is if the second word is always “land.” You could do that by slightly modifying the script and including a line that filters word2 for “land.” For example:

    bigram_freq <- empire_texts %>%
      filter(author == "Haggard, H. Rider (Henry Rider)") %>%
      unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
      filter(!is.na(bigram)) %>%
      separate(bigram, into = c("word1", "word2"), sep = " ") %>%
      # Filter for bigrams where word2 is "land"
      filter(word2 == "land") %>%
      filter(!word1 %in% stop_words$word) %>%
      unite(bigram, word1, word2, sep = " ") %>%
      count(bigram, sort = TRUE) %>%
      top_n(10, n)

    Finally, I’ve used TF-IDF (Term Frequency-Inverse Document Frequency) to chart the top 5 most distinctive words by title. The titles are randomly sampled from the corpus with a set seed (which I’ve set to 279 in my script), but you can regenerate with different random titles by changing the seed.

    Here’s the script for that:

    # Filter, tokenize, and calculate TF-IDF
    tfidf_data <- empire_texts %>%
      filter(title %in% sample_titles) %>%
      unnest_tokens(word, text) %>%
      anti_join(stop_words, by = "word") %>%
      count(title, word, name = "n") %>%
      bind_tf_idf(word, title, n) %>%
      # Get top 5 words per title by TF-IDF
      group_by(title) %>%
      slice_max(order_by = tf_idf, n = 5, with_ties = FALSE) %>%
      ungroup()
    
    # Create faceted bar plot
    ggplot(tfidf_data, aes(x = reorder_within(word, tf_idf, title), y = tf_idf)) +
      geom_bar(stat = "identity", fill = "purple") +
      facet_wrap(~title, scales = "free_y") +
      coord_flip() +
      scale_x_reordered() +
      labs(title = "Top 5 Distinctive Words by Title (TF-IDF)",
           x = "Words", y = "TF-IDF Score") +
      theme_minimal() +
      theme(axis.text.y = element_text(size = 6))
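
    One note: the pipeline above assumes sample_titles has already been defined, which isn’t shown in the snippet. A minimal sketch of how it could be generated (the number of titles sampled, four here, is my guess) might look like:

    # Randomly sample a handful of titles from the corpus; 279 matches the seed
    # mentioned above, but the number of titles sampled is an assumption
    set.seed(279)
    sample_titles <- sample(unique(empire_texts$title), 4)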

    Needless to say, you could run this with as many titles as you choose, though the graph gets a little wonky if you run too many–and the process can slow down considerably depending on the size of the corpus.

    It’s worth noting that these plots can be embedded in a Quarto document and published on the web (with a free account) via RPubs. This can help if you want to use the charts for a presentation, or even if you just want a chart to display online in a more versatile environment. Maybe I’ll write a series on creating presentations in RStudio at some point.

    That wraps up the third and final script for our British Empire sentiment analysis.

    Stay tuned for the next project.

  • Digging into the British Empire with R: Maps!, part 4

    Probably the coolest thing you can do with a text analysis project in R is to create data visualizations. In this script, I’ve used our corpus to create some interesting maps based on different regions and their prevalence in the corpus (what I’ve called “narrative intensity” in the charts).

    This script takes advantage of the ggplot2 package, which has a lot of flexibility in terms of data display and aesthetic customizations. Here is an example of a map of narrative intensity–in the code, I supply the regions I want to measure intensity for. For this particular map, I also have to define the coordinates, which is kind of a pain, but it also allows you to customize on a granular level. In this map, I’ve used the default coordinates for the middle of each country (the generic coordinates available on most maps). But presumably most of the writing about Australia concerns the west coast of the continent and not its center; this is something I could adjust in the code.
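
    For instance, if I wanted the Australia marker to sit near the west coast rather than the continent’s center, a small hypothetical tweak to the region_coords table defined in the script at the end of this post would do it:

    # Hypothetical adjustment: move the Australia point toward Perth on the west coast
    # (region_coords is the coordinates data frame defined in the script below)
    region_coords$lon[region_coords$region == "Australia"] <- 115.86
    region_coords$lat[region_coords$region == "Australia"] <- -31.95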

    A heat map probably makes more sense for this data, and I can make that too. Below, I’ve used a heat map with a customized aesthetic theme and defined fixed coordinates.

    ggplot2 is a powerful library, and it’s easy to sink lots of time into beautiful visualizations–the first image uses the defaults, while I tweaked the second a bit. Now that I’m uploading it, the colors are probably brighter than I’d normally recommend, and it’s hard to see some of the gradations in the heat map. I think this is a good example of how the best thing about R–seemingly infinite customization–can also be its biggest drawback.

    In any event, hopefully you can see how powerful text mining and visualizations can be–in mining a corpus of 1,000 works on the British Empire, we can see pretty clearly which locations are the object of the most narrative intensity–what places were being talked about the most. One question that arises from this for me is the place (or lack thereof!) of Latin America in this data. Economic interest in Latin American mining was prevalent in the Victorian Era, so it might be interesting to dive into Latin American regions in the corpus and see how they compare to places that do figure in the map, such as Australia, India, or Canada. And of course, by manually reviewing the corpus data, it would be possible to select texts based on genre; I could imagine novels, for example, might be more focused on one region than financial reports or newspaper articles, or vice versa.

    Here’s the script I used for these:

    # In this script, we're going to plot some maps with our data to look at discrete 
    # regions and evaluate the "narrative intensity" of those regions--in other words,
    # how often they were depicted in the corpus 
    
    # First let's install our packages
    install.packages("stringi")
    install.packages("dplyr")
    install.packages("ggplot2")
    install.packages("maps")
    install.packages("stringr")
    install.packages("tidyr")
    install.packages("tidytext")
    
    # Now we'll load them
    library(stringi)
    library(dplyr)
    library(ggplot2)
    library(maps)
    library(stringr)
    library(tidyr)
    library(tidytext)
    
    
    # Alternative approach using base R to handle problematic characters
    empire_texts <- empire_texts %>%
      mutate(
        text_length = sapply(text, function(x) {
          # Try to handle encoding issues safely
          clean_text <- tryCatch({
            clean <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "")
            if (is.na(clean)) return(0)
            return(nchar(clean))
          }, error = function(e) {
            return(0)  # Return 0 for length if all else fails
          })
          return(clean_text)
        })
      )
    
    regions <- c("India", "Africa", "Australia", "Canada", "Caribbean", "Egypt")
    
    
    # Calculates the proportional regional focus for texts in the corpus
    # Important: sub out invalid UTF-8 at the start, or subsequent string operations will fail
    regional_focus <- empire_texts %>%
      mutate(text = iconv(text, to = "UTF-8", sub = "")) %>%  # Clean at the start
      filter(gutenberg_id != 1470, gutenberg_id != 3310, gutenberg_id != 6134, 
             gutenberg_id != 6329, gutenberg_id != 6358, gutenberg_id != 6469) %>%
      group_by(gutenberg_id, title) %>%
      summarize(text = paste(text, collapse = " ")) %>%
      mutate(
        text_length = str_length(text),
        india_focus = str_count(text, regex("India|Indian|Hindustan", ignore_case = TRUE)),
        africa_focus = str_count(text, regex("Africa|African|Cape|Natal|Zulu", ignore_case = TRUE)),
        australia_focus = str_count(text, regex("Australia|Sydney|Melbourne", ignore_case = TRUE)),
        caribbean_focus = str_count(text, regex("Jamaica|Barbados|Caribbean|West Indies", ignore_case = TRUE)),
        egypt_focus = str_count(text, regex("Egypt|Egyptian|Nile|Cairo", ignore_case = TRUE)),
        canada_focus = str_count(text, regex("Canada|Canadian|Ontario|Quebec", ignore_case = TRUE))
      ) %>%
      mutate(
        india_ratio = india_focus / text_length * 10000,
        africa_ratio = africa_focus / text_length * 10000,
        australia_ratio = australia_focus / text_length * 10000,
        caribbean_ratio = caribbean_focus / text_length * 10000,
        egypt_ratio = egypt_focus / text_length * 10000,
        canada_ratio = canada_focus / text_length * 10000
      )
      
    
    # Let's start putting a map together with a focus on different regions 
    regional_summary <- regional_focus %>%
      ungroup() %>%   # Drop the per-book grouping so the sums cover the whole corpus
      summarise(
        India = sum(india_ratio, na.rm = TRUE),
        Africa = sum(africa_ratio, na.rm = TRUE),
        Australia = sum(australia_ratio, na.rm = TRUE),
        Caribbean = sum(caribbean_ratio, na.rm = TRUE),
        Egypt = sum(egypt_ratio, na.rm = TRUE),
        Canada = sum(canada_ratio, na.rm = TRUE)
      ) %>%
      # Reshape to long format for mapping
      pivot_longer(cols = everything(), 
                   names_to = "region", 
                   values_to = "focus_strength")
    
    # Creates a data frame with coordinates for each region prevalent in the data
    region_coords <- data.frame(
      region = c("India", "Africa", "Australia", "Caribbean", "Egypt", "Canada"),
      lon = c(78.9629, 21.0936, 133.7751, -76.8099, 31.2357, -106.3468),
      lat = c(20.5937, 7.1881, -25.2744, 18.7357, 30.0444, 56.1304)
    )
    
    # Combine text focus with region 
    map_data <- left_join(region_coords, regional_summary, by = "region")
    
    # Get world map data
    world <- map_data("world")
    
    # Create the map! 
    ggplot() +
      # World map background
      geom_polygon(data = world, 
                   aes(x = long, y = lat, group = group), 
                   fill = "gray90", color = "gray70", size = 0.1) +
      # Points for each region sized and colored by focus strength
      geom_point(data = map_data, 
                 aes(x = lon, y = lat, 
                     size = focus_strength, 
                     color = focus_strength),
                 alpha = 0.7) +
      # Optional: Add region labels
      geom_text(data = map_data,
                aes(x = lon, y = lat, label = region),
                vjust = -1, size = 3) +
      # Customize the appearance
      scale_size_continuous(range = c(3, 15), name = "Focus Strength") +
      scale_color_gradient(low = "blue", high = "red", name = "Focus Strength") +
      theme_minimal() +
      labs(title = "Regional Focus in Texts about the British Empire",
           subtitle = "Size and color intensity show relative focus strength",
           x = NULL, y = NULL) +
      coord_fixed(1.3) +  # Keeps map proportions reasonable
      theme(legend.position = "bottom")
    
    # This is a bit easier because you don't have to supply the coordinates 
    # but we do need to normalize the values in the focus strength field, so we'll
    # add those values to regional_summary 
    
    # Assuming world_subset is prepared as in previous examples
    # If not, here's a quick setup (replace with your actual data prep):
    world <- map_data("world")
    # Example map_data from earlier
    region_coords <- data.frame(
      region = c("India", "Africa", "Australia", "Caribbean", "Egypt", "Canada"),
      lon = c(78.9629, 21.0936, 133.7751, -76.8099, 31.2357, -106.3468),
      lat = c(20.5937, 7.1881, -25.2744, 18.7357, 30.0444, 56.1304),
      focus_strength = c(50, 30, 20, 15, 25, 40)  # Replace with your actual focus_strength
    )
    country_regions <- data.frame(
      region = c("India", "Africa", "Africa", "Australia", "Caribbean", "Caribbean", "Egypt", "Canada"),
      country = c("India", "South Africa", "Nigeria", "Australia", "Jamaica", "Barbados", "Egypt", "Canada")
    )
    map_data_countries <- left_join(country_regions, region_coords[, c("region", "focus_strength")], by = "region")
    world_subset <- world %>% left_join(map_data_countries, by = c("region" = "country"))
    
    # Create the heatmap
    ggplot() +
      # World map with heatmap fill
      geom_polygon(data = world_subset, 
                   aes(x = long, y = lat, group = group, fill = focus_strength),
                   color = "#2a4d69", size = 0.05) +  # Thin, dark borders for contrast
      # Landmass background (middle)
      geom_polygon(data = world, 
                   aes(x = long, y = lat, group = group), 
                   fill = "#5A5A5A", color = "#B0B0B0", size = 0.3) +  # Even lighter gray
      # Heatmap layer (top, fully opaque)
      geom_polygon(data = world_subset, 
                   aes(x = long, y = lat, group = group, fill = focus_strength),
                   color = "#D0D0D0", size = 0.4, alpha = 1) +  # No transparency
      # Bright, adjusted gradient
      scale_fill_gradientn(
        colors = c("#80CFFF", "#CCFF99", "#FFFF99", "#FFCCCC", "#FF99CC"),  # Super bright palette
        name = "Focus Strength",
        na.value = "transparent",
        limits = c(min(world_subset$focus_strength, na.rm = TRUE), 
                   max(world_subset$focus_strength, na.rm = TRUE)),  # Full data range
        breaks = seq(min(world_subset$focus_strength, na.rm = TRUE), 
                     max(world_subset$focus_strength, na.rm = TRUE), length.out = 5),
        guide = guide_colorbar(barwidth = 15, barheight = 0.5, title.position = "top")
      ) +
      # Theme
      theme_void() +
      theme(
        plot.background = element_rect(fill = "#2A4D69", color = NA),
        panel.background = element_rect(fill = "#2A4D69", color = NA),
        plot.title = element_text(family = "Arial", size = 16, color = "#FFFFFF", 
                                  face = "bold", hjust = 0.5),
        plot.subtitle = element_text(family = "Arial", size = 12, color = "#E0E0E0", 
                                     hjust = 0.5),
        legend.position = "bottom",
        legend.title = element_text(color = "#FFFFFF", size = 10, face = "bold"),
        legend.text = element_text(color = "#FFFFFF", size = 8),
        legend.background = element_rect(fill = "transparent", color = NA)
      ) +
      labs(
        title = "Regional Focus in Texts about the British Empire",
        subtitle = "Heatmap of narrative intensity across colonial regions"
      ) +
      coord_fixed(1.3)
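
    Following up on the Latin America question raised before the script: extending the analysis would mostly mean adding more patterns to the regional_focus step. Here is a minimal sketch; the region list and search terms are only illustrative and would need refinement:

    # Hypothetical additional counts for Latin American regions; the term lists
    # here are illustrative only
    regional_focus <- regional_focus %>%
      mutate(
        mexico_focus = str_count(text, regex("Mexico|Mexican", ignore_case = TRUE)),
        south_america_focus = str_count(text, regex("Peru|Chile|Argentina|Brazil|Andes", ignore_case = TRUE)),
        mexico_ratio = mexico_focus / text_length * 10000,
        south_america_ratio = south_america_focus / text_length * 10000
      )
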
  • Digging into the British Empire with R: Loading a corpus, part 3

    Alright, here’s the R script I used to generate the corpus. As you can see, there are a lot of ways this could be tweaked. I didn’t do this, but it would be possible (and probably even preferable) to run the search, collect the data, and manually sift through the results if you have a relatively small corpus.

    # Install packages  
    install.packages("gutenbergr")
    install.packages("dplyr")
    install.packages("stringr")
    
    # Load required libraries
    library(gutenbergr)
    library(dplyr)
    library(stringr)
    
    # Get the Gutenberg metadata
    gb_metadata <- gutenberg_works()
    
    # Define search terms related to British Empire
    search_terms <- c("india", "colony", "colonial", "empire", "africa", "asia", 
                      "imperial", "natives", "british", "england", "victoria", 
                      "trade", "east india", "conquest")
    
    # First approach: Find works with publication dates in the 19th century where possible
    # Many Gutenberg works have no publication dates
    dated_works <- gb_metadata %>%
      filter(
        language == "en",
        !is.na(gutenberg_author_id),
        !str_detect(title, "Bible|Dictionary|Encyclopedia|Manual|Cookbook")
      ) %>%
      # But some do, so let's make sure we get those 
      filter(
        !is.na(gutenberg_bookshelf),
        str_detect(gutenberg_bookshelf, "1800|19th")
      )
    
    # Second approach: Find popular authors from the 19th century
    empire_authors <- c("Rudyard Kipling", "Joseph Conrad", "Charles Dickens", 
                        "H. Rider Haggard", "Robert Louis Stevenson", "Anthony Trollope", 
                        "E.M. Forster", "John Stuart Mill", "Thomas Macaulay", 
                        "Thomas Babington Macaulay", "James Mill", "George Curzon",
                        "Frederick Lugard", "Richard Burton", "David Livingstone",
                        "Henry Morton Stanley", "Mary Kingsley", "Flora Annie Steel")
    
    author_works <- gb_metadata %>%
      filter(
        language == "en",
        str_detect(author, paste(empire_authors, collapse = "|"))
      )
    
    # Third approach: Keyword search in titles
    keyword_works <- gb_metadata %>%
      filter(
        language == "en",
        str_detect(tolower(title), paste(search_terms, collapse = "|"))
      )
    
    # Combine all of the above approaches into one dataset
    empire_works <- bind_rows(dated_works, author_works, keyword_works) %>% 
      distinct() %>%
      # Use author birth/death dates to try and estimate 19th century works
      left_join(gutenberg_authors, by = "gutenberg_author_id") %>%
      filter(
        # Authors who lived during the 19th century (possibly born earlier)
        (is.na(birthdate) | 
           birthdate <= 1880) & # Born before or during most of the 19th century
          (is.na(deathdate) | 
             deathdate >= 1800)   # Died after the 19th century began
      )
    
    # View how many books we found
    print(paste("Found", nrow(empire_works), "potentially relevant works"))
    
    # Preview the first few works
    head(empire_works %>% select(gutenberg_id, title, author.x), 20)
    
    # If we have more than 1000 works, we can limit to the most relevant
    # I'm going to cap this at 1000 works, but feel free to use a lower number if you prefer
    if(nrow(empire_works) > 1000) {
      # Calculate a relevance score based on how many search terms appear in the title
      empire_works <- empire_works %>%
        mutate(
          relevance_score = sapply(title, function(t) {
            sum(sapply(search_terms, function(term) {
              if(str_detect(tolower(t), term)) 1 else 0
            }))
          })
        ) %>%
        arrange(desc(relevance_score)) %>%
        head(1000)
    }
    
    # Download the corpus (this will take time)
    # The two lines below start the download, so run them when you're ready
    empire_texts <- gutenberg_download(empire_works$gutenberg_id, 
                                     meta_fields = c("title", "author"))
    
    # Take a quick look at the dataset to get a sense of how it's organized  
    View(empire_texts) 
    
    # You might want to save the metadata for future reference. If so, uncomment
    # the following lines 
    
    #write.csv(empire_works %>% select(gutenberg_id, title, author.x), 
    #          "outputs/empire_corpus_metadata.csv", row.names = FALSE)
    
    
  • Digging into the British Empire with R: Swapping out HTRC for Gutenbergr, part 2

    In the last post, I detailed the process for building a corpus using HTRC’s Extracted Features datasets. In this post, I’m going to explain a little bit about what Extracted Features are, whether or not they might be useful for a text mining project, the problems I had working with them, why I decided to use Project Gutenberg instead, and how to go about using HathiTrust instead of Gutenberg if that’s your preference.

    Let’s start with the Extracted Features dataset.

    Each volume in our collection list has a JSON file with page-level features–these include metadata about the volume and about each page, including a word frequency list.

    So what can you do with this dataset?

    Well, you can track things like word frequency across sections for each book. With multiple books, you could track word frequency over decades. Using keywords in titles, you could track topics in publication data across decades (or centuries). If you’re interested in the history of the book or publishing, you could track that information across a large corpus. You could do a sentiment analysis with this data, though since the page-level metadata is only giving you keyword frequency, working with a full-text corpus would be preferable.

    The takeaway is this: HTRC’s Extracted Features let you text mine bibliographic and page-level metadata; they do not really enable you to build a full-text corpus.

    Additionally, I ran into several challenges using the HTRC Extracted Features data. I have not found many examples on the web or in publications of people performing research using these datasets, but it’s possible I’ve overlooked them. These are the problems I had, but your mileage may vary.

    1. The JSON parsing is extremely slow and resource-intensive
      • At first, I assumed this was because of the size of my dataset (800 volumes at the beginning). I cut this in half twice, finally using just a couple hundred volumes. It was still nightmarishly slow. For some specs, I started by trying to parse the files on my local computer (M4 Mac, 16GB RAM), which generally handles data analyses well. When parsing kept stalling out, I switched over to Posit Cloud (formerly RStudio Cloud), used the paid version, maxed out the specs, and tried to parse overnight. Nothing doing
      • After repeated failures, I did some digging into the dataset. Each JSON file in the dataset requires, on average, 1,050 “gathers” from the parser. At 1,050 gathers per file across 800 files, we’re closing in on 850,000 operations (these are averages, of course). If we assume ~1.5KB per gather for the metadata alone, we’re looking at 1.26GB just to process metadata. So far, not bad. But some of the Extracted Features files are ~5MB: at 1,000 files, that inches up toward 4-5GB of memory. Significant, but not so bad
      • To list these parsed elements, R loads all the files in the corpus into memory at once. Most people assume 3x overhead for R processes, so for 4-5GB of data, we’re at 12-15GB of memory, maxing out the 16GB Mac (which, of course, still has to run its own processes)
      • To make matters worse, we also need to run bind_rows() to flatten the JSON into a data frame. Again, R holds the original in memory while copying, putting us at 24-30GB of RAM. Finally, the more nested the structure, the more memory is needed–and here’s the kicker–even when deleting the biggest JSON files in the corpus, even when cutting the corpus to around 150 volumes and then subsetting the data I was trying to parse, I was unable to run anything substantive without maxing out memory, even when using up to 32GB of RAM in a cloud environment, even when using Parallels to divide up my cores, etc.
    2. Metadata vs. full-text: Is the juice worth the squeeze?
      • At some point, I decided it was time to move on–the whole reason I wanted to use the Extracted Features dataset in the first place was so that I could analyze a huge corpus since I was relying on metadata and not full-text. In fact, the only reason I downloaded 800-1000 texts was for a quick example of what could be done; my hope was to eventually start a project where I could analyze 5 or 10x that number
      • Ultimately, if I have to significantly cut down the corpus size and pare down the metadata by subsetting just to analyze a couple hundred texts, I might as well actually analyze them in full-text. I’m not sure why the dataset’s JSON structure is so overwrought, but it seems to me that the point of using a dataset like this would be to process a corpus orders of magnitude larger than you could with full-text. Since that did not turn out to be the case for me, I decided to abandon the dataset for a full-text option

    I will point out that I emailed HathiTrust to ask whether it is possible to bulk download text files for out-of-copyright works, and it is, provided you submit some paperwork. I am going to take them up on that for another project in the future, but for now I’m going to use Project Gutenberg for the sentiment analysis on the British Empire.

    Project Gutenberg is not without its own challenges–first of all, the library itself is not as professionally curated as HathiTrust’s, which derives its catalog data from contributing academic institutions. Second, there’s just not as much there. And finally (I did not realize this until working with the data!), there is no publication date in Project Gutenberg metadata. I’m not sure why this is the case–it’s certainly something that people complain about. So to work on any kind of chronological analysis, you’d need to either 1) supply the dates manually or 2) include relative dates (e.g., second half of the nineteenth century) based on author birth/death dates, which are in the database. Obviously neither is ideal, but with a small enough corpus, option 1 is probably the best bet.

    On the other hand, one real strength of Project Gutenberg is that a corpus can be built simply using the gutenbergr R package–even a corpus of 1,000-2,000 texts can be built easily in under an hour.
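
    To give a sense of scale, a bare-bones version of that workflow looks something like the sketch below (the fuller, more careful script is in the part 3 post above); the title filter here is just an illustration:

    library(gutenbergr)
    library(dplyr)
    library(stringr)

    # Minimal sketch: grab English works with "empire" in the title and download them
    empire_mini <- gutenberg_works(languages = "en") %>%
      filter(str_detect(str_to_lower(title), "empire")) %>%
      gutenberg_download(meta_fields = c("title", "author"))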

  • Digging into the British Empire with R and HathiTrust, step 1

    I had anticipated a more exciting update on this endeavor today, but I ran into a few problems getting the corpus from HathiTrust. In case anyone uses this as a guide for building a digital corpus, I’m going to document the steps I took.

    The HTRC documentation is scattered across GitHub, Atlassian, the HTRC website, the HathiTrust website, and the HTRC analytics subdomain. I couldn’t really make heads or tails of how this documentation works together, and I ran into several frustrating elements across these sites that seem to be outdated.

    I would’ve preferred to use the HathiTrust Data API, but it has evidently been deprecated.

    HathiTrust has a Python library called HTRC Feature Reader—this is exactly what I was looking for (or so I thought). I’m not sure if this tool has been deprecated as well, but the Python library depends on several outdated dependencies. When I started, I figured I’d just update the libraries in question, but that trapped me in dependency hell. I then decided to downgrade Pip and Python in a virtual environment, only to open a new series of dependency problems. Ultimately, I abandoned the Feature Reader too (though, as you’ll see below, you still need to install it because, despite its dependency issues, downloading the library installs an important command-line tool you’ll need later).

    Just for kicks, I also tried this in the recommended Anaconda environment—no dice.

    What I wound up doing was creating a collection in the standard advanced search interface of HathiTrust. I saved just over 1,000 texts with keywords, setting the date range to 1800–1899 and selecting English-language texts. I called the collection “Imperial Sentiments” and made it public. Then, I downloaded the metadata file, which includes the HTID—HathiTrust’s identifier in its metadata.

    Once you download the metadata file as a CSV, copy and paste the volume ID numbers into a separate text file. Delete the column header and name the file something simple, like volume_ids.txt.
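
    If you’d rather script that step than do it by hand, something like the following works in R; the file name and the HTID column name are assumptions, so check them against your actual metadata export:

    # Read the collection metadata and write the HTIDs to a plain text file,
    # one per line, with no header. Adjust the column name to match your CSV.
    meta <- read.csv("imperial_sentiments_metadata.csv", stringsAsFactors = FALSE)
    writeLines(meta$htid, "volume_ids.txt")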

    If you haven’t already done so by this point, make sure you’ve installed the HTRC Feature Reader—I recommend running pip install htrc-feature-reader in the command line.

    Then, run the following:

    htid2rsync --from-file volume_ids.txt >> paths.txt

    This will take the list of HTIDs from volume_ids.txt, convert them to the paths you’ll need to download the HathiTrust files, and output those paths into a file called paths.txt.

    Now you’re ready to start downloading!

    To download the Extracted Features files, you’re providing a list of files you want to HathiTrust’s servers, which then check the paths you provide and download the files to the directory you specify. We’ll accomplish this with a protocol called rsync, which is native to macOS and Linux and can be downloaded for Windows.

    Run the following command:

    rsync -av --no-relative --files-from=paths.txt data.analytics.hathitrust.org::features/ /local/destination/path/

    A couple of things are important to note:

    You’ll definitely want to run this in verbose mode (the -v in -av) because the process can hang sometimes, and without verbose output, you won’t know whether it’s just downloading a lot at once or whether something’s wrong.

    The compressed JSON files are spread across nested directories on the server—there’s almost no circumstance in which you’d want to keep that nested file structure, so use --no-relative to flatten it.

    Make sure you specify your local path correctly. If you don’t, rsync saves to a temp folder you’ll have to track down.

    After running the script, you should have your corpus!

    Before working with the files, you’ll want to extract them, either in the command line or via a GUI, and then delete the zipped files. It’s probably easiest to navigate to the folder in the command line and run something like rm *.bz2 (but make sure you know what you’re doing).

    And that’s it! Hopefully, I’ve saved someone the pain of wading through outdated documentation or tinkering with Python libraries when this is really a relatively painless and simple process.

    In the next post, we’ll take a look at what exactly these HTRC files are comprised of and determine whether or not they’re fit for different kinds of projects.

  • Digging into the British Empire with R and HathiTrust

    What did nineteenth-century texts say about the British Empire, and how can we use code to figure it out? In the next few posts, I’m going to explore some digital humanities ideas, using R to sift through some nineteenth-century texts from the HathiTrust Digital Library. I’ll look at novels, essays, travelogues, etc. I’ll start with HathiTrust’s Extracted Features dataset, which gives word counts for tons of digitized volumes. I’m aiming for 500–1,000 texts with terms like “colony,” “trade,” or “power,” just to keep it manageable. In R, I’ll use tools like tidytext and dplyr to clean things up—removing stop words—and focus on words tied to the empire. My approach will be simple.

    First, I’ll look at which empire-related words show up most, broken down by decade—maybe “conquest” early, “trade” later? Then, I’ll try sentiment analysis with the NRC lexicon to see if I can detect any tonal shifts in the way empire is discussed. After that, I’ll run topic modeling with LDA to spot patterns—military topics, economic topics, whatever emerges. I’ll use R to visualize this data, making it comprehensible and attractive.
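
    As a preview of the sentiment step, the core of it in tidytext looks roughly like the sketch below. It assumes the corpus is already loaded as empire_texts with title and text columns, and that the NRC lexicon is available (tidytext fetches it via the textdata package the first time you call get_sentiments("nrc")):

    library(dplyr)
    library(tidytext)

    # Rough sketch: tag each word with an NRC sentiment and count by title
    nrc_counts <- empire_texts %>%
      unnest_tokens(word, text) %>%
      anti_join(stop_words, by = "word") %>%
      inner_join(get_sentiments("nrc"), by = "word") %>%
      count(title, sentiment, sort = TRUE)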

    If you’re into digital humanities, this might spark ideas about blending tech with texts in the public domain. If you like coding, it’s an interesting exercise in working with messy data in R. Stick around!

  • Using Ruby for Scripting

    In the world of programming languages, Python has long been crowned the king of scripting and data manipulation. But lately I’ve been giving Ruby a closer look, and I’ve come to believe it offers a compelling alternative.

    The Philosophical Underpinnings

    Ruby developers often emphasize Ruby’s “elegance,” a quality that likely stems from its English-like syntax and natural readability. Where Python emphasizes that there should be one obvious way to do things, Ruby embraces the principle that programming can be an art form. This isn’t mere purple prose – Ruby represents a fundamentally different approach to computational thinking, one that privileges linguistic over mathematical form.

    Expressiveness

    Consider a simple data transformation task. In Python, you might write a functional but somewhat verbose script. In Ruby, the same task is characterized by method chaining and block manipulation. Take this example of transforming a collection:

    # Ruby's expressive power
    processed_data = raw_data
      .map { |item| item.transform }
      .select { |item| item.valid? }
      .group_by { |item| item.category }
    

    The code reads almost like natural language, revealing Ruby’s core philosophy: optimize for code readability.

    Performance and Flexibility

    While Python shines in data science and machine learning, Ruby has its own performance strengths, particularly in text processing and complex scripting scenarios. The Ruby ecosystem, powered by tools like RubyGems and frameworks such as Rails, provides robust solutions for rapid script development.

    Domain-Specific Advantages

    1. Text Processing: Ruby’s regex and string manipulation capabilities are arguably more intuitive and powerful than Python’s.
    2. Metaprogramming: Ruby’s dynamic metaprogramming features allow for more flexible code generation and runtime modifications.
    3. Block Manipulation: Ruby’s blocks are more versatile than Python’s lambda functions, enabling more complex functional programming patterns.

    Where Ruby Really Shines

    Certain scenarios reveal Ruby’s unique strengths:

    • Web scraping with more concise, readable code
    • Rapid prototyping of complex data transformation scripts
    • Building domain-specific languages (DSLs)
    • Scenarios requiring high-level abstraction and code that reads like prose

    The Cognitive Dimension

    Beyond technical merits, Ruby represents a different way of thinking about code. It’s less about rigid structure and more about expressing computational logic with grace and creativity.

    A Practical Illustration

    # Ruby's elegant error handling and block usage
    File.open('data.csv', 'r') do |file|
      file.each_line
        .map(&:chomp)
        .reject(&:empty?)
        .map { |line| parse_complex_line(line) }
    end
    

    This snippet demonstrates Ruby’s ability to chain methods, handle file operations, and transform data with remarkable concision and clarity.

    The Ecosystem Consideration

    While Python dominates data science, Ruby has carved out impressive niches. Tools like Faker for data generation, Nokogiri for XML parsing, and the entire Rails ecosystem provide powerful alternatives to Python’s libraries.

    Conclusion

    Choosing between Ruby and Python isn’t about selecting a superior language – it’s about finding the right tool for your specific cognitive and project needs. Ruby offers a less traveled path, one that celebrates creativity and expression.