LISTSERV - LIBLICENSE-L Archives

From: Greg Cram

Date: Tue, Aug 6, 2019 at 11:25 AM

I thought I might add a little background. We've been interested in these Copyright Office records for some time for at least two reasons. First, we're interested in these because we use them every day in our effort to make more of our collections available to the public to inspire the creation of new knowledge. Second, we're interested in them because of the research possibilities. These paper records are one of the best records of American creativity. Making the data searchable and usable could produce scholarship on a range of topics unrelated to copyright.

The problem is that although the Office photographed the records, that photography didn't produce a reliably searchable database. We've decided to invest time and resources into extracting the data from these records by doing a highly accurate transcription and parsing of the records. Although we've been crunching on the data, the dataset for books is not yet complete. We've received a grant from IMLS that will allow us to complete the dataset, but that work will take another few months. Our internal analysis tells us that the average renewal rate for books subject to the renewal requirement is between 25% and 35%. Before we're willing to publish a definitive number, we need to complete the work now funded by IMLS.

Books are only a portion of the overall universe of works registered with the Copyright Office. It is my goal to go beyond books and extract data from the complete set of 450,000 pages of records. We estimate that Class A books make up at least 1/3 of the total dataset, so we have lots of categories of works left to go. That means lots of fundraising in my future--highly accurate transcription and parsing at scale comes with a decent price tag.

I'm happy to answer any questions you might have about our work. You can read more about the project in this blog post:

<https://www.nypl.org/blog/2018/03/30/unlocking-record-american-creativity>,

or check out the project repository

<https://github.com/NYPL/catalog_of_copyright_entries_project>

where you can access the raw XML data. We haven't built a front end to facilitate searching the data yet, but part of the IMLS funding is to develop a list of requirements for the search interface.

-Greg Cram

Associate Director of Copyright and Information Policy

The New York Public Library