LIBLICENSE-L Archives

LibLicense-L Discussion Forum

LIBLICENSE-L@LISTSERV.CRL.EDU

Subject:
From: LIBLICENSE <[log in to unmask]>
Reply-To: LibLicense-L Discussion Forum <[log in to unmask]>
Date: Tue, 6 Nov 2012 15:55:30 -0500
Content-Type: text/plain
Parts/Attachments: text/plain (167 lines)
From: Sandy Thatcher <[log in to unmask]>
Date: Tue, 6 Nov 2012 09:44:15 -0600

The amicus brief cited here makes a reasonable case for treating mass
digitization as fair use.  It is not too much of a stretch to see the
kinds of uses scholars make of the data generated by mass
digitization as "transformative" in the sense in which that concept
was introduced into copyright law by the pioneering work of Judge
Pierre Leval (whom the authors cite).

I would urge these caveats, however:

First, the authors of the brief make much of the fact/expression
dichotomy that has come to be embedded in copyright jurisprudence and
is explicitly sanctioned in Sec. 102 of the Copyright Act of 1976. But
this concept can be pushed too far, making all scholarly writing
seem more akin to the "factual" data of a telephone directory (Feist
1991) than to the "expressive" prose of fiction.  Should scholars be
deprived of all copyright protection just because their work is more
factual than expressive? And I daresay some scholarly writing is more
creatively expressive than some dull fiction writing. (See point #3
below.)

Second, they cite the line of cases interpreting fair use by the Ninth
Circuit and (building on the Ninth's interpretation) the Fourth
Circuit.  But the Ninth Circuit cases involving the digitization of
images, especially Perfect 10, do not work so comfortably as
precedents as the authors of the brief seem to think.  They argue that
"Allowing Intermediate Copying in Order to Enable Nonexpressive Uses
Does Not Harm the Market for the Original Works in a Legally
Cognizable Manner, As The Practice Does Not Implicate the Works'
Expressive Aspects in Any Way." But the court in the Perfect 10 case
chose to ignore a real market for thumbnail images that Perfect 10 was
already developing by licensing their use on cell phones. That use
would have been for exactly the same "expressive" purpose as the
original, not just for indexing. By allowing Google to assemble a
collection of such images, the court effectively killed Perfect 10's
licensing business.  Even the Fourth Circuit's decision about Turnitin
can be questioned in this manner. While Turnitin can generate a
finding of possible plagiarism from its database of student papers,
any examination of actual infringement of any particular paper would
require reading it in detail and making a line-by-line comparison with
the allegedly infringing paper, hence making the same use of it as the
original (though admittedly for a somewhat different purpose).

Third, many people, including most reporters, have thought that the
objection of publishers to Google's library project had something to
do with the "snippets" that Google allowed users to see. It did not.
The primary objection was to Google's delivery of a digital file of
each copied book back to the library that provided it (as well as
Google's effort to substitute "opt out" for "opt in" as the standard
approach to copyright). That had a direct impact on the potential
market for digitized copies that publishers could have sold to
libraries.  Now, admittedly, a lot depends on what the libraries felt
they could do with those copies. But, as we have seen with the
HathiTrust case,
libraries are expanding their ideas of what uses they can make under
fair use, including, potentially, uses of orphan works of the same
kind that are made of the originals in their expressive capacity, as
the brief's authors would put it.  As we move along this
slippery slope, we eventually get to the position enunciated in the
ARL's Code of Best Practices in Fair Use for Academic and Research
Libraries where uses of scholarly monographs (and journal articles)
through e-reserves are to be considered fair use because these works
are being used, so it is claimed, for a purpose different from the use
originally intended by the authors, even though the kind of use--the
reading of the actual content line by line--is exactly the same!  The
result of this kind of approach would be the destruction of almost
the entire market for paperbacks issued by academic publishers for
course use. While few authors of monographs make much, if any, money
from the sale of their books in hardback to libraries, quite a few of
them derive significant income from the sale of their books in
paperback for course use (some of them even earning amounts into six
figures), and I would be surprised if they would be happy about an
interpretation of fair use that deprived them of such income--not to
mention the publishers who depend on this income to sustain the whole
system of scholarly publishing.

Fourth, rather than head down this slippery slope and turn fair use in
general, and "transformative use" in particular, into a completely
muddled and all-expansive umbrella concept for justifying just about
every type of copying imaginable, a saner approach would be to do what
Public Knowledge has recommended in its Copyright Reform Act project,
viz., urging Congress to amend the law by explicitly sanctioning
certain limited but important kinds of transient or incidental
copying. Here is an excerpt from a white paper written by people
associated with the Berkeley Law group that summarizes this approach:

Specifically, the proposed reform provides an exemption to the
exclusive right of reproduction provided to copyright owners under §
106 of the Copyright Act for some incidental copies. Not all
intermediate copies are covered by the reform; there are three
targeted limitations that ensure that the reform effectively protects
the interests of copyright owners. First, the exemption is limited to
incidental or transient copies. This restriction prevents potential
infringers from creating copies, such as permanent or secondary
duplications, that possess substantial value outside of their
necessity to a particular end use. Second, these copies must be an
integral and essential part of a technological process. This condition
prevents copyists from circumventing copyright protection by
secondarily attaching incidental or transient copies to some
technological process. Finally, the primary purpose of the copy must
be to enable a lawful use. This restriction forces evaluation of the
end use that the copy facilitates, requiring that the end use be
evaluated in light of the property rights of copyright owners. By
limiting the exemption in this fashion, Congress can protect the
interests of both copyright holders and consumers.

Dena Chen et al., "Providing an Incidental Copies Exemption for
Service Providers and End-Users," March 31, 2011. Click on Report 5
here: http://www.publicknowledge.org/cra/

Sandy Thatcher



From: Ann Shumelda Okerson <[log in to unmask]>
Date: Mon, 5 Nov 2012 04:54:29 -0500

Below is a message from Glen Worthey, Stanford's Digital Humanities
Librarian, forwarded by Paul Zarins of Stanford University Library.
________________________________

From: "Glen Worthey" <[log in to unmask]>
To: "Pavils Zarins" <[log in to unmask]>
Sent: Thursday, November 1, 2012 5:02:54 PM
Subject: Re: Fwd: Suggested Readings in Text Mining?

My bias will be pretty obvious to you -- but as far as I'm concerned,
regarding text mining specifically for humanities research, Matt Jockers
is the very best.  Here is a set of several highly relevant blog posts
from him:

http://www.matthewjockers.net/category/tm/

the best and most entertaining of which is basically a chapter from his
book /Macroanalysis: Digital Methods and Literary History/ (due out
early next year):

http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/

I suspect that Ann (and others on the Liblicense list) may be especially
interested in this: Matt was also co-author (on behalf of digital
humanities and legal scholars) of an amicus brief that was filed in the
Authors Guild v. HathiTrust case:
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2102542

and which was frequently cited by the judge in his decision.  Obviously,
text mining is not the main focus of this brief, but it does play a
strikingly prominent role in what turned out to be a very important
legal document.

Finally, as just a portal into the huge world of text mining for
humanities research, see this very helpful "progressive" (that is,
progressing from "beginner" to "expert" level) review article with links
aplenty:  "Topic Modeling for Humanists: A Guided Tour"
http://www.scottbot.net/HIAL/?p=19113

(Note that, for some purposes -- though not all! -- "topic modeling" is
a rough synonym for "text mining."  It's probably better characterized
as a subset of text mining, but I believe at the moment it's one of the
more actively pursued subsets, at least in digital humanities.)
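
If it helps to see what "topic modeling" actually looks like in
practice, here is a minimal sketch in Python using the gensim library
(the library choice and the toy documents are just for illustration;
Matt's posts above explain the method itself far better):

    from gensim import corpora, models

    # A toy "corpus" of three very short, already-tokenized documents
    # (purely illustrative).
    texts = [
        ["whale", "ship", "sea", "captain"],
        ["ship", "sea", "voyage", "harbor"],
        ["love", "marriage", "estate", "letter"],
    ]

    # Map each word to an integer id and convert each document to
    # bag-of-words counts.
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(doc) for doc in texts]

    # Fit a two-topic LDA model and print the most probable words
    # for each topic.
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                          passes=10)
    for topic_id, words in lda.print_topics(num_words=4):
        print(topic_id, words)

On a real corpus the "documents" would be full texts or chapters, and
the interesting (and humanistic) part is reading the word lists the
model produces and deciding what, if anything, each "topic" means.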

Hope this helps,

Glen
