This week for me was all about orienting myself in greater detail to the scope of the project and my role in it. As our team edited the proposal via Google Docs, I gained a greater appreciation for the potential good our project could do. In the proposal, Georgette had referenced a great data repository that could expand the scope of our work: The Database of Award-Winning Children’s Literature (DAWCL). Her find reminds me of one of the many benefits of teamwork. As a non-librarian, I might easily have overlooked the site myself, as its design is a bit dated and amateurish by today’s standards. Yet a little research confirmed that the information is thorough, accurate, and up-to-date—a real treasure.
As the programmer for our team, I wanted to figure out this week how we might automate data gathering with Python. Georgette had already gathered some initial data on the 98 Newbery Medal winners themselves. DAWCL offers additional data we hope to investigate, such as the 400+ titles designated as Newbery Honorees, as well as avenues for broader analysis: the other awards Newbery books have won, and award-winning titles that haven’t received Newbery recognition.
New to scraping websites with Python, I failed spectacularly with the DAWCL. But I learned a ton in the process, and I hope to make some breakthroughs in the coming week. The DAWCL makes it easy to sift through thousands of children’s books by award. Yet it returns its search results via an ASP, so Python can’t simply request the page contents by URL; the results only appear in response to the search form’s submission. After combing the web for help, I learned to leverage Chrome’s developer tools to dig beyond the first layer of code usually revealed by the Inspect command. I was able, ultimately, to follow the network requests made by the search form as I performed the search for Newbery winners (very cool), and I finally found the HTML behind the displayed search results. That HTML is not terribly sophisticated, which actually isn’t great as far as scraping goes: I’d have preferred designated classes for titles, authors, and other awards, rather than just paragraph and break tags. So, this week, I’ll need to get creative, either treating the code I found like text and using Python to parse it, or changing tools to scrape from the ASP. A rough sketch of the first option follows.
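To make that first option concrete, here’s a minimal sketch of what replicating the form’s request in Python might look like, using requests and BeautifulSoup. The endpoint path and form field below are placeholders, not the DAWCL’s actual parameters; the real values would come from the network request I captured in Chrome’s developer tools.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical endpoint and form field -- the real names would come from
# the network request captured in Chrome's developer tools.
SEARCH_URL = "http://www.dawcl.com/searchresults.asp"  # placeholder path
form_data = {"award": "Newbery Medal"}                 # placeholder field

# The ASP returns results in response to the form's POST request,
# not a plain GET of the URL.
response = requests.post(SEARCH_URL, data=form_data)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# With no designated classes, each result is just paragraph text broken
# up by <br> tags, so treat each <p> as one record and split its lines.
records = []
for paragraph in soup.find_all("p"):
    lines = [line.strip()
             for line in paragraph.get_text("\n").split("\n")
             if line.strip()]
    if lines:
        records.append(lines)  # e.g., [title, author, other awards, ...]

for record in records[:5]:
    print(record)
```

If replicating the POST proves too brittle, the tool-change option could mean something like Selenium, which drives a real browser and fills out the search form directly.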
Lest I hold the project up with that exploration, I turned to data cleaning, making some tweaks to the original set that Georgette posted, in hopes of showing up to our first Skype session with something concrete to offer. First, I split author names using Excel’s text-to-columns feature, so that ultimately our users could investigate by an author’s first or last name. Second, using what I learned from a data viz investigation of authors of New York Times bestsellers, I made a note of data that may cause trouble down the road, such as the tilde in Matt de la Peña’s name or the grave accent in William Pène du Bois’s. I also noted ethical issues we’ll need to tackle: keeping track of our data sources and having discussions about what constitutes ethnicity. Not only are these decisions essential to make in the early stages, but they are important to document and convey to our ultimate users.
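For when the dataset outgrows manual Excel passes, the same two steps are easy to reproduce in Python. Here’s a minimal sketch assuming a CSV with a single author column; the file and column names are placeholders, not our actual ones.

```python
import unicodedata
import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("newbery_winners.csv")

# Mirror Excel's text-to-columns: split on the first space so that
# multi-word surnames ("de la Peña", "du Bois") stay intact.
df[["author_first", "author_last"]] = df["author"].str.split(" ", n=1, expand=True)

# Flag names with non-ASCII characters (the ñ in Peña, the è in Pène)
# so we can watch for encoding trouble downstream.
df["needs_encoding_check"] = df["author"].apply(
    lambda name: any(ord(ch) > 127 for ch in str(name))
)

# An ASCII-folded version can double as a search key if diacritics ever
# trip up a user's query: NFKD decomposition, then drop combining marks.
df["author_ascii"] = df["author"].apply(
    lambda name: unicodedata.normalize("NFKD", str(name))
                            .encode("ascii", "ignore")
                            .decode()
)

print(df.loc[df["needs_encoding_check"], ["author", "author_ascii"]])
```

Keeping the original column untouched and adding derived ones makes the cleaning reversible, which should help when we document our sources and decisions for our eventual users.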
Having wrapped up our first Skype session earlier this evening, I’m heading into the second week with great optimism. Each member of our team is elevating my thinking about our approach and is spurring me on to find an expedient way to get our data!