Data mining

What does it mean?

Data mining aims to discover valuable, non-obvious information in large databases in order to support decisions. In other words, data mining is a process of selecting, exploring and modeling large amounts of data in order to find previously unknown relationships.

The amount of data collected and stored in companies is constantly increasing. It is estimated that the amount of information available worldwide doubles every 20 months. Electronic recording systems such as scanner cash registers and increasingly powerful storage media promote this development. However, as the amount of data increases, it becomes more difficult to find useful information. The huge amounts of data therefore have to be analyzed to determine their meaning.

Fig. 1: The need for data mining

Definition and classification

The term data mining was given its widely cited definition in 1996 by Fayyad, Piatetsky-Shapiro and Smyth. According to this definition, data mining is part of Knowledge Discovery in Databases (KDD). KDD encompasses the entire process of (semi-)automatic extraction of knowledge from databases, while data mining is the sub-process that deals with the evaluation and analysis of the data.

Fig. 2: Process model Knowledge Discovery in Databases (KDD)

Significance and practical examples

Data mining is becoming increasingly important in marketing. By analyzing and interpreting customer data (age, gender, address, occupation, leisure activities, number and type of products and services purchased, etc.), companies can develop highly effective advertising strategies and identify market segments. Mainly for this reason, the number of bonus and loyalty card programs is also increasing sharply. In addition to customer loyalty, programs such as HappyDigits, Pay Back, etc. offer participating companies the benefit of receiving customer-related data with every purchase. The cash register provides the item-related data and the customer card the customer-related data.

This allows individual data items, which on their own have little or limited informational value, to be merged and related to one another so that conclusions can be drawn about purchasing behavior and detailed customer profiles can be created.

By analyzing these data relationships, a supermarket could, for example, determine that 80% of women between the ages of 25 and 35 buy chips or similar snacks at the same time as they buy a magazine. This information could be used to optimize both target group-specific advertising and product placement.
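The supermarket example can be sketched as a simple rule check. The following is a minimal illustration, assuming a hypothetical list of transactions in which cash register data (items) and customer card data (gender, age) have already been merged; the records, the function name `rule_confidence` and the segment function are invented for illustration.

```python
# Hypothetical merged records: (gender, age, set of purchased items).
transactions = [
    ("f", 28, {"magazine", "chips"}),
    ("f", 31, {"magazine", "chips", "milk"}),
    ("f", 34, {"magazine", "bread"}),
    ("f", 26, {"magazine", "chips"}),
    ("f", 30, {"magazine", "chips", "water"}),
    ("m", 29, {"magazine"}),
    ("f", 52, {"magazine", "chips"}),
    ("f", 27, {"bread"}),
]


def rule_confidence(records, antecedent, consequent, segment):
    """Share of baskets in the segment that contain `antecedent`
    and also contain `consequent`. Returns None if no basket qualifies."""
    with_antecedent = [
        items
        for gender, age, items in records
        if segment(gender, age) and antecedent in items
    ]
    if not with_antecedent:
        return None
    hits = sum(1 for items in with_antecedent if consequent in items)
    return hits / len(with_antecedent)


def women_25_to_35(gender, age):
    """Segment definition taken from the example in the text."""
    return gender == "f" and 25 <= age <= 35


confidence = rule_confidence(transactions, "magazine", "chips", women_25_to_35)
print(f"magazine -> chips, women 25-35: {confidence:.0%}")
```

With real data, the same calculation would be run over many item pairs and segments, and only rules above a chosen threshold would be reported.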

Insurance companies use data mining to analyze the likelihood of cross-selling among customer groups. What is the probability that men between the ages of 30 and 40 will take out life insurance in addition to disability insurance? If the probability is high enough, coordinated sales activities can be started. Furthermore, predictions can also be made about the future value of a customer (customer lifetime value).

Outlier detection can be used, for example, to detect fraud. What do customers who use their auto insurance for fraud have in common? Telecommunications companies analyze their data to find out which customer groups are most interesting for new services and products. Is there a relationship between the number of SMS messages a customer sends per month and their willingness to buy a camera phone?
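The text does not name a specific outlier detection method. As one possible illustration, the sketch below uses the modified z-score based on the median absolute deviation (MAD), a common robust choice; the claim counts, the function name `mad_outliers` and the threshold of 3.5 are assumptions made for this example.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds `threshold`.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median of the absolute deviations from the median.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All deviations are (almost) identical; no meaningful score.
        return []
    return [
        i
        for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical number of auto insurance claims per customer and year.
claims_per_year = [0, 1, 0, 2, 1, 0, 1, 9, 1, 0]
suspicious = mad_outliers(claims_per_year)
print(suspicious)
```

A flagged customer is only a candidate for closer inspection; whether the case is actually fraud still has to be checked by a person.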

Data mining is also increasingly used in the technical field. A system developed at the University of Helsinki analyzes the chronological sequence of alarms in a telecommunications network. Each of the numerous components of such a network can raise an alarm in certain situations, so that 200 to 10,000 alarms can occur per day. The Telecommunication Network Alarm Sequence Analyzer (TASA) system searches for rules that can predict the occurrence of further alarms from the sequence of alarms.
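The idea of deriving rules from alarm sequences can be illustrated with a much simplified sketch. This is not the algorithm used by TASA; it only counts how often one alarm type is followed by another within a time window. The alarm names, the log and the function name `follow_rules` are hypothetical.

```python
from collections import Counter


def follow_rules(events, window, min_confidence):
    """Find rules of the form "alarm a is followed by alarm b".

    events: list of (timestamp_in_seconds, alarm_type), sorted by time.
    Returns {(a, b): confidence}, where confidence is the share of `a`
    alarms followed by at least one `b` alarm within `window` seconds.
    """
    occurrences = Counter(alarm for _, alarm in events)
    followed = Counter()
    for i, (t_a, a) in enumerate(events):
        seen = set()
        for t_b, b in events[i + 1:]:
            if t_b - t_a > window:
                break  # events are sorted, so later ones are out of range too
            if b != a:
                seen.add(b)
        for b in seen:
            followed[(a, b)] += 1
    rules = {}
    for (a, b), count in followed.items():
        confidence = count / occurrences[a]
        if confidence >= min_confidence:
            rules[(a, b)] = confidence
    return rules


# Hypothetical alarm log of a small network.
alarm_log = [
    (0, "link_down"), (5, "high_error_rate"),
    (60, "link_down"), (63, "high_error_rate"),
    (120, "fan_failure"),
    (200, "link_down"), (290, "high_error_rate"),
]
rules = follow_rules(alarm_log, window=30, min_confidence=0.6)
for (a, b), confidence in rules.items():
    print(f"{a} -> {b} within 30 s: {confidence:.0%}")
```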

Data mining can also support the acquisition of knowledge from texts or documents on the Internet or on internal servers. The documents can thus be classified automatically. In this context, one also speaks of text mining or web mining.
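Automatic document classification can be sketched with a very simple word-count model. The example below is a toy illustration under stated assumptions, not a production text mining method: the training documents, the categories and the functions `train` and `classify` are invented for this example.

```python
from collections import Counter


def train(labeled_docs):
    """Count word frequencies per category from (text, label) pairs."""
    model = {}
    for text, label in labeled_docs:
        model.setdefault(label, Counter()).update(text.lower().split())
    return model


def classify(model, text):
    """Pick the category whose training vocabulary overlaps most with the text."""
    words = text.lower().split()
    scores = {
        label: sum(counts[word] for word in words)
        for label, counts in model.items()
    }
    return max(scores, key=scores.get)


# Hypothetical labeled documents from an internal server.
training_docs = [
    ("invoice payment due amount", "finance"),
    ("payment reminder invoice", "finance"),
    ("server outage network alarm", "operations"),
    ("network maintenance server restart", "operations"),
]
model = train(training_docs)
print(classify(model, "unpaid invoice amount"))
print(classify(model, "network alarm on server"))
```

Real text mining systems typically add steps such as stop-word removal, weighting of rare terms and a probabilistic model, which are omitted here.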

Practical tip

Data mining must always be viewed from a cost-benefit perspective. The value of the information obtained must clearly exceed the costs incurred. Only then is this information valuable. Try to evaluate the information in order to assess the profitability of data mining projects.
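A rough cost-benefit check can be written down in a few lines. The figures and the function name `project_roi` below are purely hypothetical; in practice, estimating the expected benefit is the difficult part.

```python
def project_roi(expected_benefit, costs):
    """Return (net_benefit, roi), where roi = net benefit / total cost."""
    total_cost = sum(costs.values())
    net_benefit = expected_benefit - total_cost
    return net_benefit, net_benefit / total_cost


# Hypothetical cost items of a data mining project.
costs = {"software": 40_000, "staff": 60_000, "data_preparation": 20_000}
net, roi = project_roi(expected_benefit=150_000, costs=costs)
print(f"net benefit: {net}, ROI: {roi:.0%}")
```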

The application of patterns

The aim of data mining is to gain knowledge from the available data. In the context of data mining, knowledge is generally understood to mean:

Patterns that have certain additional properties and are represented in a formal language.

The most common patterns are:

  • Clusters
  • Rules
  • Classifications
  • Dependency patterns
  • Connection patterns
  • Temporal patterns
  • Formulas and laws.

Fig. 3: Examples of patterns
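To make the first pattern type, clusters, more concrete: the sketch below groups customers by monthly spending with a plain one-dimensional k-means procedure. The data, the starting centers and the function name `kmeans_1d` are assumptions made for illustration; the text itself does not prescribe a clustering algorithm.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on one-dimensional data, starting from the given centers.

    Returns the final centers and the groups of values assigned to them.
    """
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical monthly spending per customer.
monthly_spend = [12, 15, 14, 80, 85, 90, 13, 88]
centers, groups = kmeans_1d(monthly_spend, centers=[0, 100])
print(centers)
print(groups)
```

The two resulting groups correspond to low-spending and high-spending customers, which is the kind of segment a marketing department could then address separately.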

Methods and Techniques

There are a variety of methods, techniques and algorithms for finding such patterns in databases. Many methods originally come from the field of machine learning, but statistical methods and interactive analyses using visualization methods are also used.

Very frequently used methods are, for example:

  • Regression models
  • Decision trees
  • Neural networks
  • Factor analysis
  • Time series forecasting
  • Connection analysis.

Which method is used depends very much on the type of pattern to be found. Standard software tools such as Enterprise Miner from SAS now offer a wide variety of these methods.
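As an illustration of the first method in the list, a regression model with a single explanatory variable can be fitted by ordinary least squares in a few lines. The data and the function name `fit_line` are hypothetical; commercial tools wrap the same idea in a much more comprehensive workflow.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: advertising spend and revenue, both in thousands.
ad_spend = [1, 2, 3, 4, 5]
revenue = [12, 14, 16, 18, 20]
intercept, slope = fit_line(ad_spend, revenue)
print(f"revenue = {intercept:.1f} + {slope:.1f} * ad_spend")
```

A fitted line like this corresponds to the pattern type "formulas and laws": it expresses a relationship found in the data as an explicit equation.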

Practical tip

Despite all these technical possibilities and methods, people remain at the center of data mining. It takes great skill and knowledge on the part of the data mining expert to decide which method to use in which situation and how to assess differences between the results of different methods.

Literature tips

Ester, M.; Sander, J.: Knowledge Discovery in Databases: Techniques and Applications, Springer Verlag, 2000.

Alpar, P.; Niedereichholz, J.: Data mining in practical use, Vieweg Verlagsgesellschaft, 2000.

Otte, R. et al.: Data mining for industrial practice, Hanser Fachbuchverlag, 2003.



First author

Stefan Heindl