In the last post, I detailed the process for building a corpus using HTRC’s Extracted Features datasets. In this post, I’m going to explain a little bit about what Extracted Features are, whether or not they might be useful for a text mining project, the problems I had working with them, why I decided to use Project Gutenberg instead, and how to go about using HathiTrust instead of Gutenberg if that’s your preference.
Let’s start with the Extracted Features dataset.
Each volume in our collection list comes as a single JSON file containing page-level features: metadata about the volume as a whole plus metadata about each page, including a word frequency list.
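To make that concrete, here is a minimal sketch (not code from my actual pipeline) of reading one Extracted Features file with jsonlite and pulling out page-level token counts. The file path is a placeholder, and the field names (features, pages, body, tokenPosCount) reflect my reading of the EF schema, so check them against your own files and schema version.

```r
# Minimal sketch: read one Extracted Features file and tidy its token counts.
# Path and field names are assumptions; verify against your EF schema version.
library(jsonlite)
library(purrr)
library(dplyr)

ef <- fromJSON("ef_corpus/example_volume.json", simplifyVector = FALSE)

page_tokens <- map_dfr(ef$features$pages, function(page) {
  counts <- page$body$tokenPosCount
  if (is.null(counts)) return(NULL)
  tibble(
    seq   = page$seq,
    token = names(counts),
    # each token maps to per-part-of-speech counts; sum them per token
    count = map_dbl(counts, ~ sum(unlist(.x)))
  )
})

head(page_tokens)
```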
So what can you do with this dataset?
Well, you can track things like word frequency across the pages of each book. With multiple books, you could track word frequency over decades. Using keywords in titles, you could track topics in publication data across decades (or centuries). If you're interested in the history of the book or of publishing, you could track that information across a large corpus. You could even attempt a sentiment analysis with this data, though since the page-level metadata only gives you word frequencies, a full-text corpus would serve you better there.
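As a rough illustration of the word-frequency-over-decades idea: assuming you have already wrangled the token counts into a tidy table (call it `tokens`) and put publication years in a volume table (`vols`), the aggregation itself is a small dplyr job. Both table names and their columns are hypothetical here.

```r
# Hypothetical sketch: `tokens` holds per-volume counts (htid, token, count)
# and `vols` holds volume metadata (htid, pub_year). Neither exists until you
# build it; the names are placeholders.
library(dplyr)

empire_by_decade <- tokens %>%
  filter(tolower(token) == "empire") %>%
  left_join(vols, by = "htid") %>%
  mutate(decade = (pub_year %/% 10) * 10) %>%
  group_by(decade) %>%
  summarise(mentions = sum(count), .groups = "drop")
```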
The takeaway is this: HTRC's Extracted Features let you text mine bibliographic metadata and word counts, but they do not really give you a full-text corpus to work with.
Additionally, I ran into several challenges using the HTRC Extracted Features data. I have not found many examples on the web or in publications of people performing research using these datasets, but it’s possible I’ve overlooked them. These are the problems I had, but your mileage may vary.
- The JSON parsing is extremely difficult
  - At first, I assumed this was because of the size of my dataset (800 volumes to start). I cut that in half twice, finally working with just a couple hundred volumes. It was still nightmarishly slow. To give you some specs: I started by trying to parse the files on my local computer (M4 Mac, 16GB RAM), which generally handles data analysis well. When parsing kept stalling out, I switched over to Posit Cloud (formerly RStudio Cloud), used the paid version, maxed out the specs, and tried to parse overnight. Nothing doing.
  - After repeated failures, I did some digging into the dataset. Each JSON file requires, on average, about 1,050 "gathers" from the parser. At 1,050 gathers across 800 files, that is roughly 840,000 operations (these are averages, of course). If we assume ~1.5KB per gather for the metadata alone, we're looking at about 1.26GB just to process metadata. So far, not bad. But some Extracted Features files run to ~5MB, so across 800-1,000 files we're inching up on 4-5GB in memory. Significant, but still not terrible.
  - To list these parsed elements, R loads every file in the corpus into memory at once. The usual rule of thumb is 3x overhead for R processes, so 4-5GB of data becomes 12-15GB, maxing out the 16GB Mac (which, of course, still has to run its own processes).
  - To make matters worse, we also need to run `bind_rows()` to flatten the JSON into a data frame, and R holds the original in memory while copying, putting us at 24-30GB of RAM. And the more nested the structure, the more memory it takes. Here's the kicker: even after deleting the biggest JSON files in the corpus, even after cutting the corpus to around 150 volumes and subsetting the data I was trying to parse, I could not run anything substantive without maxing out memory, even with up to 32GB of RAM in a cloud environment, even when using Parallels to divide up my cores, and so on. (If you want to attempt it anyway, there's a sketch of a less memory-hungry, file-at-a-time approach after this list.)
- Metadata vs. full-text: Is the juice worth the squeeze?
  - At some point, I decided it was time to move on. The whole reason I wanted to use the Extracted Features dataset in the first place was to analyze a huge corpus, since I would be relying on metadata rather than full text. In fact, the only reason I downloaded 800-1,000 texts was for a quick example of what could be done; my hope was eventually to start a project analyzing 5-10x that number.
  - Ultimately, if I have to significantly cut down the corpus and pare down the metadata by subsetting just to analyze a couple hundred texts, I might as well analyze those texts in full. I'm not sure why the dataset's JSON structure is so overwrought, but the point of a dataset like this, it seems to me, is to let you process a corpus orders of magnitude larger than you could manage with full text. Since that did not turn out to be the case for me, I decided to abandon the dataset for a full-text option.
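If you do want to attempt the parsing anyway, here is the sketch I promised above: handle one file at a time, keep only the fields you need, and append results to disk instead of holding every parsed list in memory for one giant `bind_rows()`. It's a sketch under those assumptions (paths and field names are placeholders), not something I benchmarked at scale.

```r
# Sketch: parse EF files one at a time, keep only token counts, and append
# each volume's result to a CSV on disk, so the nested JSON for the whole
# corpus is never in memory at once. Paths and field names are placeholders.
library(jsonlite)
library(purrr)
library(dplyr)
library(readr)

files <- list.files("ef_corpus", pattern = "\\.json$", full.names = TRUE)
out   <- "token_counts.csv"

for (f in files) {
  ef <- fromJSON(f, simplifyVector = FALSE)

  vol_tokens <- map_dfr(ef$features$pages, function(page) {
    counts <- page$body$tokenPosCount
    if (is.null(counts)) return(NULL)
    tibble(token = names(counts),
           count = map_dbl(counts, ~ sum(unlist(.x))))
  }) %>%
    group_by(token) %>%
    summarise(count = sum(count), .groups = "drop") %>%
    mutate(htid = ef$htid, .before = 1)

  # append rather than accumulate, so memory stays at roughly one volume's worth
  write_csv(vol_tokens, out, append = file.exists(out))

  rm(ef, vol_tokens)
  gc()  # encourage R to release the parsed list before the next file
}
```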
I will point out that I emailed HathiTrust to ask whether it is possible to bulk download plain-text files for out-of-copyright works, and it is, provided you submit some paperwork. I'm going to take them up on that for another project in the future, but for now I'm turning to Project Gutenberg for the sentiment analysis on the British Empire.
Project Gutenberg is not without its own challenges. First, the library itself is not as professionally curated as HathiTrust's, which derives its catalog data from contributing academic institutions. Second, there's simply not as much there. And finally (I did not realize this until I started working with the data!), there is no publication date in Project Gutenberg's metadata. I'm not sure why that's the case, and it's certainly something people complain about. So for any kind of chronological analysis, you'd need to either 1) supply the dates manually or 2) work with relative dates (e.g., second half of the nineteenth century) based on the author birth/death dates that are in the database. Neither is ideal, but with a small enough corpus, option 1 is probably the best bet.
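Here is roughly what option 1 (and, for comparison, option 2) looks like with gutenbergr. The IDs and years in the lookup table are illustrative placeholders you'd verify yourself, and the `birthdate`/`deathdate` columns are the ones I believe ship with gutenbergr's bundled `gutenberg_authors` table.

```r
# Option 1: supply publication dates by hand and join them onto the
# gutenbergr metadata. The IDs and years below are illustrative placeholders.
library(gutenbergr)
library(dplyr)
library(tibble)

manual_dates <- tribble(
  ~gutenberg_id, ~pub_year,
  2701,          1851,   # Moby-Dick (verify against a library catalog)
  158,           1815    # Emma (verify against a library catalog)
)

dated <- gutenberg_metadata %>%
  inner_join(manual_dates, by = "gutenberg_id")

# Option 2: fall back on author birth/death dates for a rough period label
rough_period <- gutenberg_metadata %>%
  left_join(gutenberg_authors, by = "gutenberg_author_id") %>%
  mutate(period = case_when(
    deathdate < 1850  ~ "first half of the nineteenth century or earlier",
    birthdate >= 1850 ~ "second half of the nineteenth century or later",
    TRUE              ~ "mid-nineteenth century (approximate)"
  ))
```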
A real strength of Project Gutenberg, on the other hand, is that a corpus can be built simply using the `gutenbergr` R package: even a corpus of 1,000-2,000 texts can be built easily in under an hour.
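For what it's worth, the corpus-building step really is only a few lines. The title keyword below is a placeholder; swap in whatever selection logic fits your project.

```r
# Sketch of building a small full-text corpus with gutenbergr.
library(gutenbergr)
library(dplyr)
library(stringr)

ids <- gutenberg_works(languages = "en") %>%   # works with downloadable text
  filter(str_detect(title, regex("empire", ignore_case = TRUE))) %>%
  pull(gutenberg_id)

corpus <- gutenberg_download(ids, meta_fields = c("title", "author"))
```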