Data Management Plan Draft
Our data will be collected with web scraping programs that we write in Python and stored in .csv files. In addition, we will manually collect diversity data from Wikipedia and author pages. Our data is replicable: it can be collected again should it become lost or unusable. Our dataset is temporally restricted to 1922 through 2020 and changes incrementally only once a year, so the data we gather this spring will remain up to date until January 2021. We will store the data files on our laptops, share them through Asana, and expand them in Google Sheets. We will analyze the data in Tableau, saving our work both locally and on Tableau Public.
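As a rough illustration of the storage step, the sketch below serializes collected records to CSV text and sets aside any record outside the 1922 to 2020 range. It is not our actual scraper: the column names, the `records_to_csv` helper, and the sample records are hypothetical placeholders.

```python
import csv
import io

FIRST_YEAR, LAST_YEAR = 1922, 2020    # temporal range stated in this plan
FIELDS = ["year", "title", "author"]  # hypothetical column names


def records_to_csv(records):
    """Serialize in-range records to CSV text; return (csv_text, rejected)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    rejected = []
    for record in records:
        if FIRST_YEAR <= int(record["year"]) <= LAST_YEAR:
            writer.writerow(record)
        else:
            # Keep out-of-range records so they can be logged as data issues.
            rejected.append(record)
    return buffer.getvalue(), rejected


# Placeholder records, not real data.
sample_records = [
    {"year": "1922", "title": "example title a", "author": "example author a"},
    {"year": "2021", "title": "example title b", "author": "example author b"},
]
csv_text, rejected = records_to_csv(sample_records)
```

Returning the rejected records, rather than silently dropping them, would let us note them in the shared data issues document.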
We will document our data collection procedures by keeping a log of data issues that is available to our group through Google Docs and our project management software, Asana. This shared document will also serve as the group's reference for project and data documentation. We will share the Python code we use for scraping so that it is available to the public. We will state the source of any additional data that was not part of the initial scrape, and for information taken from Wikipedia we will also record the period during which we collected it. If we use any new sites, we will add them to this list of sources. Kelly and Emily will be responsible for implementing our data management plan. We will give our files clear, descriptive names, and we will review our column headings before bringing the data into Tableau. We will follow community standards when defining race and ethnicity in our data. We will enter data in lowercase characters to keep it readable and uniform.
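The lowercase standard and the heading review described above could be checked with a short script before the data goes into Tableau. The following is a minimal sketch using only the Python standard library; the expected headings and the helper names `review_headers` and `normalize_rows` are assumptions for illustration, not our final column list.

```python
import csv
import io

# Hypothetical headings; the real list would come from our data document.
EXPECTED_HEADERS = ["year", "title", "author", "award_type"]


def review_headers(headers, expected=EXPECTED_HEADERS):
    """Return (missing, unexpected) headings after trimming and lowercasing."""
    cleaned = [h.strip().lower() for h in headers]
    missing = [h for h in expected if h not in cleaned]
    unexpected = [h for h in cleaned if h not in expected]
    return missing, unexpected


def normalize_rows(csv_text):
    """Trim whitespace and lowercase every heading and value in CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    return [[cell.strip().lower() for cell in row] for row in reader]


# Illustrative input with inconsistent casing and stray spaces.
sample = "Year,Title,Author\n1922, The Story of Mankind ,Hendrik Willem van Loon\n"
rows = normalize_rows(sample)
missing, unexpected = review_headers(rows[0])
```

In this sample, the review would report `award_type` as a missing heading, which is the kind of issue we would record in the shared data issues document.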
Our data is taken from already public sources and does not require any steps to ensure privacy or confidentiality. While our course requires us to share this data, we also feel ethically bound to share our work with our audience, which we imagine will include librarians, educators, the American Library Association (which grants the Newbery Award), parents, and researchers like us who are interested in the diversity of this influential honor. As a result, we will include a page on our site that openly shares our data in two formats: .csv, to promote longevity and open access, and .xls, to help audience members, such as parents, who may be less familiar with .csv files.
Our data will be permanently retained in an academic repository, CUNY Academic Works. The data will be available in .csv, a format that will remain sustainably accessible because it is open and non-proprietary. We will also provide the data in .xls for those unfamiliar with .csv. We understand that .xls is proprietary, which is why the .csv version will always be available alongside it. CUNY Academic Works will maintain our data long term, and our data is appropriate for this repository.