Abstract

In a report outlining the library's work thus far on the project, officials note their frustration with the tools available on the market for managing such big data dumps. "It is clear that technology to allow for scholarship access to large data sets is not nearly as advanced as the technology for creating and distributing that data," the library says. "Even the private sector has not yet implemented cost-effective commercial solutions because of the complexity and resource requirements of such a task."

The first data dump came in the form of a 20TB file of 21 billion tweets posted between 2006, when Twitter was founded, and 2010, complete with metadata showing the place and description of tweets. More recently, the library received its second installment, covering all the tweets since 2010. Together, the two copies of the compressed files come to 133.2TB. Since then, the library has been collecting new tweets on an hourly basis through its partner company Gnip. In February 2011 that amounted to about 140 million new tweets each day; by October of last year, it had grown to nearly a half-billion tweets per day.
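As a rough back-of-the-envelope check on those figures, the sketch below derives an average stored size per tweet from the 20TB, 21-billion-tweet first dump and applies it to the daily rates quoted above. The per-tweet size is an approximation inferred from the article's numbers, not a figure published by the library.

```python
# Back-of-the-envelope estimate of archive growth, using the figures quoted above.
# Assumption (not from the Library of Congress): the average stored size per tweet
# is derived from the 20TB / 21-billion-tweet figure for the 2006-2010 dump.

TB = 10**12  # bytes per terabyte (decimal)

first_dump_bytes = 20 * TB
first_dump_tweets = 21_000_000_000
bytes_per_tweet = first_dump_bytes / first_dump_tweets  # roughly 1KB per tweet with metadata

def daily_growth_tb(tweets_per_day: float) -> float:
    """Estimated archive growth per day, in TB, at a given tweet rate."""
    return tweets_per_day * bytes_per_tweet / TB

# Daily rates mentioned in the article.
print(f"~{bytes_per_tweet:.0f} bytes per stored tweet")
print(f"Feb 2011 (~140M tweets/day): {daily_growth_tb(140e6):.2f} TB/day")
print(f"Oct 2012 (~500M tweets/day): {daily_growth_tb(500e6):.2f} TB/day")
```

At those rates the single-copy archive would grow by roughly 0.1TB to 0.5TB per day, which is consistent with the library's point that the collection is expanding at an increasing velocity.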

The Library of Congress is no stranger to managing big data: Since 2000, it has been collecting archives of websites containing government data, a repository already 300TB in size, it says. But the Twitter archives pose a new problem, officials say, because the library wants to make the information easily searchable. In its current tape repository form, a single search of the 2006-2010 archive alone -- which is just one-eighth the size of the entire volume -- can take up to 24 hours. "The Twitter collection is not only very large, it also is expanding daily, and at a rapidly increasing velocity," the library notes. "The variety of tweets is also high, considering distinctions between original tweets, re-tweets using the Twitter software, re-tweets that are manually designated as such, tweets with embedded links or pictures and other varieties."
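The tweet varieties the library describes map loosely onto fields in Twitter's public API JSON. The sketch below is one minimal, illustrative way to label a single record by those distinctions; the field names (text, entities, retweeted_status) are assumptions drawn from the public API format, not the library's actual archive schema or indexing method.

```python
# A minimal sketch of distinguishing the tweet "varieties" the library describes.
# Assumption: records resemble Twitter's public API JSON; the library's real
# archive format and processing pipeline are not specified in the article.

def classify_tweet(tweet: dict) -> set:
    """Return the variety labels that apply to a single tweet record."""
    labels = set()
    text = tweet.get("text", "")
    entities = tweet.get("entities", {})

    if "retweeted_status" in tweet:
        labels.add("native_retweet")     # re-tweet made with the Twitter software
    elif text.startswith("RT @"):
        labels.add("manual_retweet")     # re-tweet manually designated as such
    else:
        labels.add("original")

    if entities.get("urls") or entities.get("media"):
        labels.add("has_link_or_media")  # embedded links or pictures

    return labels

# Example with a made-up record:
print(classify_tweet({"text": "RT @loc: archiving tweets", "entities": {"urls": []}}))
# {'manual_retweet'}
```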

Details

Title
Talk about big data: How the Library of Congress can index all 170 billion tweets ever posted: Library of Congress has a 133TB file of every update ever posted to Twitter; now it has to figure out how to manage it
Publication title
Publication year
2013
Publication date
Jan 8, 2013
Publisher
Foundry
Place of publication
Southborough
Country of publication
United States
e-ISSN
19447655
Source type
Trade Journal
Language of publication
English
Document type
News
ProQuest document ID
1269123966
Document URL
https://www.proquest.com/trade-journals/talk-about-big-data-how-library-congress-can/docview/1269123966/se-2?accountid=208611
Copyright
Copyright 2013 Network World, Inc. All Rights Reserved.
Last updated
2016-03-13
Database
ProQuest One Academic