John Wiley & Sons, 2003. — 304 p. — ISBN 0471228524.
Traditionally, analysts have performed the task of extracting useful information from recorded data. But the increasing volume of data in modern business and science calls for computer-based approaches. As data sets have grown in size and complexity, there has been an inevitable shift away from direct hands-on data analysis toward indirect, automatic data analysis using more complex and sophisticated tools. The modern technologies of computers, networks, and sensors have made data collection and organization an almost effortless task. However, the captured data needs to be converted into information and knowledge to become useful. Data mining is the entire process of applying computer-based methodology, including new techniques for knowledge discovery, to data.
The modern world is a data-driven one. We are surrounded by data, numerical and otherwise, which must be analyzed and processed to convert it into information that informs, instructs, answers, or otherwise aids understanding and decision-making. This is the age of the Internet, intranets, data warehouses and data marts, and the fundamental paradigms of classical data analysis are ripe for change. Very large collections of data, sometimes hundreds of millions of individual records, are being stored in centralized data warehouses, allowing analysts to use more comprehensive, powerful data mining methods. The quantity of data is huge and growing, the number of sources is effectively unlimited, and the range of areas covered is vast: industrial, commercial, financial, and scientific.
In recent years there has been explosive growth in methods for discovering new knowledge from raw data. In response, a new discipline of data mining has been developed specifically to extract valuable information from such huge data sets. Given the proliferation of low-cost computers (for software implementation), low-cost sensors, communications, and database technology (to collect and store data), and of computer-literate application experts who can pose "interesting" and "useful" application problems, this is not surprising.
Data mining technology has recently become a hot topic for decision-makers because it provides valuable, hidden business and scientific "intelligence" from a large amount of historical data. Fundamentally, however, data mining is not a new technology. Extracting information and knowledge from recorded data is a well-established concept in scientific and medical studies. What is new is the convergence of several disciplines and corresponding technologies that has created a unique opportunity for data mining in the scientific and corporate world.
Originally, this book was intended to fulfill a wish for a single, introductory source to direct students to. However, it soon became apparent that people from a wide variety of backgrounds and positions, confronted by the need to make sense of large amounts of raw data, would also appreciate a compilation of some of the most important methods, tools, and algorithms in data mining. Thus, this book was written for a range of readers: from students wishing to learn about basic processes and techniques in data mining, to analysts and programmers who will be engaged directly in interdisciplinary teams for selected data mining applications. This book reviews state-of-the-art techniques for analyzing enormous quantities of raw data in high-dimensional data spaces to extract new information useful to the decision-making process. Most of the definitions, classifications, and explanations of the techniques covered in this book are not new, and they are presented in references at the end of the book. One of my main goals was to concentrate on a systematic and balanced approach to all phases of a data mining process, and to present them with enough illustrative examples. I expect that carefully prepared examples should give the reader additional arguments and guidelines in the selection and structure of techniques and tools for their own data mining application. A better understanding of implementation details for most of the introduced techniques challenges the reader to build their own tools or to improve the applied methods and techniques.
To teach data mining, one has to emphasize the concepts and properties of the applied methods, rather than the mechanical details of applying different data mining tools. Despite all of their attractive bells and whistles, computer-based tools alone will never replace the practitioner who makes important decisions on how the process will be designed, and on which tools will be employed and how. A deeper understanding of methods and models, how they behave, and why, is a prerequisite for efficient and successful application of data mining technology. Any researcher or practitioner in this field needs to be aware of these issues in order to successfully apply a particular methodology, understand a method's limitations, or develop new techniques. This book is an attempt to present and discuss such issues and principles, and then describe representative and popular methods originating from statistics, machine learning, computer graphics, databases, information retrieval, neural networks, fuzzy logic, and evolutionary computation. It discusses approaches that have proven critical in revealing important patterns, trends, and models in large data sets.
Although it is easy to focus on technologies, as you read through the book keep in mind that technology alone does not provide the entire solution. One of my goals in writing this book was to minimize the hype associated with data mining. Rather than making false promises about what can reasonably be expected, I have tried to take a more objective approach. I describe the processes and algorithms that are necessary to produce reliable, useful results in data mining applications.
I do not advocate the use of any particular product or technique over another; the designer of a data mining process has to have enough background to select the appropriate methodologies and software tools. I expect that once a reader has completed this text, he or she will be able to initiate and perform basic activities in all phases of a data mining process successfully and effectively.
Data-Mining Concepts
Preparing the Data
Data Reduction
Learning from Data
Statistical Methods
Cluster Analysis
Decision Trees and Decision Rules
Association Rules
Artificial Neural Networks
Genetic Algorithms
Fuzzy Sets and Fuzzy Logic
Visualization Methods
Appendix A: Data-Mining Tools
Appendix B: Data-Mining Applications