Abstract

Although researchers have reported many interesting results using numeric data mining methods, questions remain to be answered before textual data mining tools will be considered generally useful, given the effort needed to learn and use them. In 2011, we generated a dataset from the legal statements (mainly privacy policies and terms of use) on the websites of 475 of the US Fortune 500 companies and used it as input to a textual data mining tool to see what we could detect about the organizational relationships between the companies. We hoped that the tool would cluster similar corporations into the same industrial sector, as validated by each company's self-reported North American Industry Classification System (NAICS) code. Unfortunately, this proved only marginally successful, leading us to ask why and to pose our research question: What problems occur when a data-mining tool is used to analyze large textual datasets that are unstructured, complex, duplicative, and contain many homonyms and synonyms? In analyzing our large dataset we learned a great deal about the problem and, after significant effort, determined how to "massage" the raw dataset to improve the process and learned how the tool can be better used in research situations. We also found that NAICS codes, as self-reported by companies, are of dubious value to a researcher -- a matter briefly discussed. [PUBLICATION ABSTRACT]

Details

Business indexing term
Title
A research case study: Difficulties and recommendations when using a textual data mining tool
Publication title
Volume
50
Issue
7
First page
540
Publication year
2013
Publication date
Nov 2013
Publisher
Elsevier Sequoia S.A.
Place of publication
Amsterdam
Country of publication
Switzerland
ISSN
03787206
e-ISSN
18727530
CODEN
IMANDC
Source type
Scholarly Journal
Language of publication
English
Document type
Case Study, Feature
ProQuest document ID
1449192480
Document URL
https://www.proquest.com/scholarly-journals/research-case-study-difficulties-recommendations/docview/1449192480/se-2?accountid=208611
Copyright
Copyright Elsevier Sequoia S.A. Nov 2013
Last updated
2024-11-22
Database
ProQuest One Academic