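Talk about big data: How the Library of Congress can index all 170 billion tweets ever posted: Library of Congress has a 133TB file of every update ever posted to Twitter; now it has to figure out how to manage it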

The Library of Congress has received a 133TB file containing 170 billion tweets -- every single post that's been shared on the social networking site -- and now it has to figure out how to index it for researchers.

In a report outlining the library's work thus far on the project, officials note their frustration regarding tools available on the market for managing such big data dumps. "It is clear that technology to allow for scholarship access to large data sets is not nearly as advanced as the technology for creating and distributing that data," the library says. "Even the private sector has not yet implemented cost-effective commercial solutions because of the complexity and resource requirements of such a task."

If private organizations are having trouble managing big data, how is a budget-strapped, publicly funded institution -- even if it is the largest library in the world -- supposed to create a practical, affordable and easily accessible system to index 170 billion tweets, and counting?

Twitter signed an agreement allowing the nation's library access to the full trove of updates posted on the social media site.

Library officials say creating a system to allow researchers to access the data is critical since social media interactions are supplanting traditional forms of communication, such as journals and publications.

The first data dump came in the form of a 20TB file of 21 billion tweets posted between 2006, when Twitter was founded, and 2010, complete with metadata showing the place and description of each tweet. More recently,...
