This week was all about bandwidth and RAM, both literal and metaphorical. I had trouble connecting to our class’s Tuesday Zoom session—certainly a matter of too many devices and too many programs demanding too much of a limited Internet connection in my getaway in Kentucky. I also had trouble working on the visualizations this week—certainly a matter of too many worries and too many plans to make demanding too much of a brain that would ordinarily be on spring break, my annual reboot.
So, eager not to hold my team back with my distractedness, I turned to low-level but time-consuming tasks: scraping and cleaning data. Our team is interested in comparing our current data set of Newbery authors and protagonists to data about other awards. So, I scraped the basics on the next most recognizable children’s book honor: the Caldecotts, which recognize excellence in picture books.
The scraping experience reminded me that most online Python tutorials work with best-case scenarios. The videos that taught me how to scrape earlier this semester drew from huge, well-established websites (the New York Times and monster.com) to demonstrate the power of the code. The few sites offering a full list of Caldecott winners were less established, and their HTML was erratic at best. The site from which I scraped the Caldecott honorees fortunately organized those winners in lists, so I could find all the <li> elements, but within those lists, the title was only sometimes embedded in anchor tags, and the entries often included manual spacing and tabs for no apparent reason. About half of the time, the books were illustrated by someone other than the author, so splitting the results on the word “by,” as I had done for the Newberys, got tricky. So, my code just grabbed the titles and attempted to grab authors, resulting in a two-column .csv file in desperate need of real cleaning, a far cry from the comparatively tidy results of the Newbery scrape. But that kind of cleaning, split-screening the data against the original site and checking manually for errors, was exactly the kind of mindless labor I needed. Now we’ve got the years, titles, authors, and illustrators, and already, without further research, we can see that the Caldecotts are much more diverse than the Newberys. But of course, further research is what we now need as we identify author and protagonist gender and race/ethnicity more precisely.
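For the curious, the strategy can be sketched roughly like this. The HTML sample and the `LiScraper` class below are invented for illustration (the real Caldecott page’s markup differed, and my actual scrape followed whatever the tutorials taught), but the core idea is the same: prefer the anchor text for the title when an anchor exists, and otherwise fall back to splitting on the first “ by ”.

```python
import re
from html.parser import HTMLParser

class LiScraper(HTMLParser):
    """Collect (title, remainder) pairs from erratically formatted <li> items."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_li = False
        self._in_a = False
        self._text = []
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li, self._text, self._anchor = True, [], []
        elif tag == "a" and self._in_li:
            self._in_a = True

    def handle_data(self, data):
        if self._in_li:
            self._text.append(data)
            if self._in_a:
                self._anchor.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_a = False
        elif tag == "li" and self._in_li:
            self._in_li = False
            # Collapse the manual spacing and tabs into single spaces.
            text = re.sub(r"\s+", " ", "".join(self._text)).strip()
            anchor = re.sub(r"\s+", " ", "".join(self._anchor)).strip()
            if anchor:
                # The title is reliably whatever sat inside the anchor tag.
                title, rest = anchor, text[len(anchor):].strip(" ,")
                if rest.startswith("by "):
                    rest = rest[3:]
            else:
                # No anchor: attempt a split on the first " by ".
                title, _, rest = text.partition(" by ")
                title = title.strip(" ,")
            self.rows.append((title, rest))

# Invented sample imitating the inconsistencies: anchored and bare titles,
# stray whitespace, and an "illustrated by" credit that defeats a naive split.
sample = """
<ul>
  <li><a href="#">Make Way for Ducklings</a> by Robert McCloskey</li>
  <li>   The Little House,\t by Virginia Lee Burton</li>
  <li><a href="#">Prayer for a Child</a>, illustrated by Elizabeth Orton Jones</li>
</ul>
"""

scraper = LiScraper()
scraper.feed(sample)
```

From here, `scraper.rows` can go straight to `csv.writer`, producing the two-column file that then needs the manual, split-screen check against the site: notice how the third row’s second column still carries “illustrated by,” exactly the kind of mess that resists automation.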
Another great boon this week: Meg reminded us that what we’re doing matters. She crafted an initial blog post for our website reminding our users that as children consume books in isolation, away from school and peers and the outside world, parents need to make sure that those pages reflect themselves and others. If children only read from a small slice of the literature available, they will be isolated indeed. We hope our project can help parents make well-informed decisions this spring, when reading might be, in some ways, the only contact kids have with the outside world.
(On a side note, I’ve remembered this week what we learned last term in the Intro to DH class: that the data infrastructure in the US allows us access to our jobs and each other in this time of crisis in a way that few other countries’ infrastructure can. I’m wondering how we might use that access to support those without it. Thoughts?)