LISTSERV - LIBLICENSE-L Archives

From: Greg Cram
Date: Tue, Aug 6, 2019 at 11:25 AM

I thought I might add a little background. We've been interested in these
Copyright Office records for some time, for at least two reasons. First, we
use them every day in our effort to make more of our collections available
to the public to inspire the creation of new knowledge. Second, they open
up research possibilities. These paper records are among the best records of
American creativity. Making the data searchable and usable could produce
scholarship on a range of topics unrelated to copyright.

The problem is that although the Office photographed the records, that
photography didn't produce a reliably searchable database. We've decided to
invest time and resources into extracting the data from these records by
doing a highly accurate transcription and parsing of the records. Although
we've been crunching on the data, the dataset for books is not yet
complete. We've received a grant from IMLS that will allow us to complete
the dataset, but that work will take another few months. Our internal
analysis tells us that the average renewal rate for books subject to the
renewal requirement is between 25% and 35%. Before we're willing to publish a
definitive number, we need to complete the work now funded by IMLS.

Books are only a portion of the overall universe of works registered with
the Copyright Office. It is my goal to go beyond books and extract data
from the complete set of 450,000 pages of records. We estimate that Class A
books make up at least 1/3 of the total dataset, so we have lots of
categories of works left to go. That means lots of fundraising in my
future--highly accurate transcription and parsing at scale comes with a
decent price tag.

I'm happy to answer any questions you might have about our work. You can
read more about the project in this blog post:

<https://www.nypl.org/blog/2018/03/30/unlocking-record-american-creativity>,

or check out the project repository

<https://github.com/NYPL/catalog_of_copyright_entries_project>

where you can access the raw XML data. We haven't built a front end to
facilitate searching the data yet, but part of the IMLS funding is to
develop a list of requirements for the search interface.

-Greg Cram
Associate Director of Copyright and Information Policy
The New York Public Library