Data Mining – Automated Pattern Discovery
ABSTRACT
This paper discusses Data Mining: the methods in use, how it links with Data Warehouse activity to ferret out hitherto unknown patterns in statistical data, its objectives and benefits, and its distinction from and interdependency with OLAP and query/reporting.
Data Mining methods and systems are based on years of classical work on statistics, pattern recognition and information theory. The availability of on-line, enterprise-wide historical data (and therefore frequently huge volumes of it), together with increasingly inexpensive computing power and access, has facilitated the process of probing the statistical data mine in search of new perspectives and patterns that were previously hidden.
The proper use of Data Mining methods is focused on finding the effects that exist in the data so that their causes can be reasoned from them. The trends thus exposed can be analyzed and exploited to beneficial effect, thereby improving the quality of decision making to meet single or multiple objectives.
INTRODUCTION
What is a Data Warehouse?
What do companies hope to gain from them?
Companies hope to find information that will help them remain viable in the business market by providing better services, more timely product introductions, better customer relations, or improved business processes. A business would be able to discern customers' tastes and predict trends in future buying decisions. It could analyze customer data, determine which customers are the most profitable and concentrate on them. Services such as travel packages, telephone banking, and others could be customized to specific customers to help keep their goodwill. Using buying-pattern data, a company could direct mailing, telephone, and other advertising campaigns at specific groups. This would help companies get a better return for each dollar spent and remain competitive.
Decision Support – Key components
Decision Support is a broad term referring to the use of information as a strategic corporate asset, enabling companies to utilize their databases to make better decisions.
There are three separate components to an enterprise-wide decision support system:
- The Data Warehouse, where the mountain of corporate data is stored all in one place. Here data volumes are very high, as multi-terabyte data warehouses are beginning to appear more frequently. These designs are usually either star schemas (or snowflakes, etc.) or highly normalized data structures.
- The Data Mart, where departmental data is stored, and often various external data items are added. The data volumes are usually 15% to 30% of warehouse sizes, and the envelope is being pushed towards the terabyte limit. These databases are also usually either based on star schemas or in a normalized form. They mostly deal with the data space, but at times some multi-dimensional analysis is performed.
- The Data Mine, where the data is re-organized for analysis and information is extracted from the data. The data volumes here are the same as in the Data Mart, but the data is much more richly structured and is no longer just departmental. The data here refers to a specific business objective and is analyzed for the purpose of information extraction. While the data structures used within the warehouse and the mart may be similar, the data structures used within the data mine are significantly different. The data mine differs from the data warehouse not just in the size of data it manages, but in the structure of the data. The content of the data in the mine is also often different from the data in the warehouse, because it is often enriched by additional external data not found within the warehouse. However, content aside, the key issue about data mining architecture is that the existing theories of data structuring do not apply to it.
Decision support systems have traditionally relied on three types of analyses:
- Query and Reporting: where a user asks a question, e.g. "what were the sales of motorcycles in Bangalore?"
- OLAP (On-Line Analytical Processing): which amounts to the processing of queries along multiple dimensions such as state, month, etc. For instance: "categorize sales by month, by brand, by store-chain, by salesperson".
- Data Mining: which provides influence factors and relationships in data, e.g. "what impacts sales in Bangalore for motorcycles?"
Data warehouse query and reporting, OLAP and data mining have often been viewed as related activities. However, these take place in conceptually distinct computational spaces. Data access operations such as query and reporting deal with the data space, OLAP uses the multi-dimensional space, and data mining takes place in the influence space.
With statistics and reports, only summaries of data were available to business users, and the data could only be obtained by request from an analyst, e.g. sales summaries per quarter. With data warehouses, some query and reporting could be performed by business users on their own, e.g. product and store performance reports. With OLAP, multi-dimensional summary questions could be addressed by business users, e.g. the total of sales by product, by channel, by month. With data mining, analysts and a sophisticated subset of business users could gain insight into the influence factors and trends in data, but often significant analysis was needed before key questions could be answered. With knowledge access, almost all the relevant patterns in the data are found beforehand and stored for use by business users such as marketing analysts, bank branch managers, store managers, etc. Business users get the interesting patterns of change every week or month, or can query the pattern-base at will.
Because large databases often provide too much of a good thing, approaches based on Query and OLAP usually encounter a problem known as "The Maze of a Million Graphs": a user can build a million pie charts and yet not see the forest for the trees because there is so much data. Data mining, on the other hand, draws its power from the ability to search through the data on its own initiative, discovering key patterns by itself.
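The three spaces described above can be contrasted on a toy data set. The following is only a rough sketch: the records, the field names and especially the crude "influence" score (the spread between group averages) are assumptions made for illustration, not a method taken from the text.

```python
from collections import defaultdict

# Hypothetical toy sales records: (city, month, product, promoted, units)
SALES = [
    ("Bangalore", "Jan", "motorcycle", True, 30),
    ("Bangalore", "Jan", "scooter", False, 12),
    ("Bangalore", "Feb", "motorcycle", False, 10),
    ("Mysore", "Jan", "motorcycle", True, 25),
    ("Mysore", "Feb", "motorcycle", False, 9),
    ("Mysore", "Feb", "scooter", False, 11),
]
FIELDS = ("city", "month", "product", "promoted", "units")


def query_total(city, product):
    """Data space: answer one specific, pre-formulated question."""
    return sum(u for c, _, p, _, u in SALES if c == city and p == product)


def olap_rollup(*dims):
    """Multi-dimensional space: total units grouped by the chosen dimensions."""
    idx = [FIELDS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in SALES:
        totals[tuple(row[i] for i in idx)] += row[4]
    return dict(totals)


def influence_ranking():
    """Influence space (crude stand-in): rank attributes by how far apart
    the average units are across the attribute's values."""
    ranking = []
    for i, name in enumerate(FIELDS[:4]):
        groups = defaultdict(list)
        for row in SALES:
            groups[row[i]].append(row[4])
        means = [sum(v) / len(v) for v in groups.values()]
        ranking.append((max(means) - min(means), name))
    return [name for _, name in sorted(ranking, reverse=True)]


print(query_total("Bangalore", "motorcycle"))
print(olap_rollup("city", "month"))
print(influence_ranking())
```

The first two functions only return what the user asked for; the third looks across every attribute without being told which one matters, which is the distinction the text draws.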
Function                                 Asks the question
Querying and Reporting                   What happened?
On-line Analytical Processing (OLAP)     What happened and why?
Executive Information Systems (EIS)      What do I need to know now?
Data Mining                              What's interesting? What might happen?
Source: Simon, Alan. "Better Clients, Better Decisions." Byte, January 1997: 91.
DATA MINING
Data Mining is the "automatic" extraction of patterns of information from historical data, enabling companies to focus on the most important aspects of their business -- telling them what they did not know and had not even thought of asking. Industry surveys clearly indicate that over 80% of Fortune 500 companies view data mining as a critical factor for business success by the year 2000. Most such companies now collect and refine massive quantities of data in data warehouses.
These companies realize that to succeed in a fast-paced world, business users need to be able to get information on demand. And they need to be pleasantly surprised by unexpected, but useful, information. There is never enough time to think of all the important questions -- the computer should do this itself. It can provide the winning edge in business by exploring the database itself and bringing back invaluable nuggets of information. Many organizations now view information as one of their most valuable assets, and data mining allows a company to make full use of these information assets.
Data mining employs techniques such as rule and tree induction, laws of probability, and neural networks. Even with the application of these techniques, spurious results can often arise. For example, one data mining company found a connection between dog food sales and cold drinks in a supermarket database. The result was written off as a quirk, but not all connections may be dismissed so easily.
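One common way to judge whether a connection like the dog food and cold drinks example is a quirk is to compare a rule's confidence with how often the consequent occurs anyway (its lift). The sketch below is illustrative only: the baskets are invented, and the text does not say which measure the company actually used.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    n = len(baskets)
    n_a = sum(1 for b in baskets if antecedent in b)
    n_c = sum(1 for b in baskets if consequent in b)
    n_both = sum(1 for b in baskets if antecedent in b and consequent in b)
    if n_a == 0 or n_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = n_both / n              # share of baskets containing both items
    confidence = n_both / n_a         # P(consequent | antecedent)
    lift = confidence / (n_c / n)     # values close to 1.0 suggest no real link
    return support, confidence, lift


# Hypothetical supermarket baskets.
BASKETS = [
    {"dog food", "cold drinks"},
    {"dog food", "cold drinks", "bread"},
    {"cold drinks", "bread"},
    {"cold drinks"},
    {"dog food", "cold drinks"},
    {"bread"},
    {"cold drinks", "milk"},
    {"cold drinks", "dog food", "milk"},
]

support, confidence, lift = rule_metrics(BASKETS, "dog food", "cold drinks")
print(support, confidence, round(lift, 3))
```

In this made-up data every dog food buyer also buys cold drinks, which looks striking, yet the lift is barely above 1 because cold drinks are in almost every basket regardless.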
Data Mining Above, Beside and Within the Warehouse
Once we accept the fact that the data mine is distinct from the data warehouse, the next logical question is: "Where does the data mine actually exist? Is it a separate repository next to the warehouse, a set of views above the warehouse, or just part of the warehouse?" We can answer this question in each of these three ways and get a different architecture for the data mine.
The data mine can exist in three basic forms:
- Above the warehouse, as a set of conceptual views.
- Beside the warehouse, as a separate repository.
- Within the warehouse, as a distinct set of resources.
Data mining "above the warehouse" provides a minimal architecture for discovery and analysis. It is suitable only in cases where data mining is not a key objective for the warehouse. In this approach SQL statements are used to build a set of conceptual views above the warehouse tables, and additional external data from other tables may be merged as part of the views. The views built above the warehouse may either be materialized (i.e. saved to disk as new tables) or not. Therein lies the fundamental problem (if not contradiction) built into this approach. If the views are not of significant size, then serious data mining cannot take place. However, if the views are of a significant size, then without materialization the effort of computing them again and again will require very large amounts of processing power, in some cases significantly affecting the availability of the warehouse resources and interfering with other applications performing indexed retrievals.
On the other hand, if the views are of significant size and they are materialized, we are no longer data mining "above" the warehouse and will be using a disorganized form of the third approach, i.e. data mining within the warehouse. If the views are materialized, the third approach will almost always work better, because it can utilize a suitable data distribution approach and a specific processor allocation strategy, as well as different data structures for data mining. Without these precautions, the number of potential pitfalls increases rapidly, sacrificing both performance and functionality. Hence data mining above the warehouse should be restricted to applications in which data mining is only of peripheral business interest, and not a key objective. However, holding this view is often a big business mistake in itself -- i.e. why have so much data in a warehouse and not understand it?
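The difference between a plain view and a materialized one can be seen in a few lines. This is a minimal sketch using Python's built-in sqlite3 module; the table and column names are hypothetical, and because SQLite has no native materialized views, a table created from a query stands in for one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (customer_id INTEGER, product TEXT, amount REAL);
    INSERT INTO sales VALUES (1, 'card', 100.0), (1, 'loan', 250.0), (2, 'card', 40.0);

    -- Non-materialized: the aggregation is recomputed every time the view is read.
    CREATE VIEW customer_totals_view AS
        SELECT customer_id, SUM(amount) AS total FROM sales GROUP BY customer_id;

    -- "Materialized": the result is saved as a new table, i.e. a snapshot.
    CREATE TABLE customer_totals_mat AS
        SELECT customer_id, SUM(amount) AS total FROM sales GROUP BY customer_id;
""")

# New data arrives in the warehouse table after both were created.
conn.execute("INSERT INTO sales VALUES (2, 'loan', 60.0)")

view_rows = conn.execute(
    "SELECT * FROM customer_totals_view ORDER BY customer_id").fetchall()
mat_rows = conn.execute(
    "SELECT * FROM customer_totals_mat ORDER BY customer_id").fetchall()

print(view_rows)  # reflects the new row, at the cost of recomputation on each read
print(mat_rows)   # cheap to read, but stale until it is rebuilt
```

The view is always current but pays the aggregation cost on every read, while the saved table is cheap to read but must be refreshed, which is the trade-off the two paragraphs above describe.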
In most cases, data mining is effectively performed beside the warehouse, with data structures that lend themselves to detailed analyses. And, in most cases additional data suitable for the analyses is merged with the warehoused data in order to perform specific analyses for focused business needs.
Data mining beside the warehouse both overcomes and sidesteps several problems at once. To begin with, it allows data mining to be done with the right data structures, avoiding the problems associated with the structures of the data space. Moreover, the paradox of warehouse patterns can be avoided by selecting specific data segments, corresponding to specific business objectives. And the interactive exploratory analyses that are often performed in the data mine with "wild rides" through the data no longer interfere with the warehouse resources that are responsible for routine processes such as query and reporting. In fact, different business departments can use their own data mines that address their specific needs, e.g. direct marketing vs. claim analysis. The data is then moved from the large warehouse to the mine, restructured during the transformation, and analyzed. It is, however, important to design the transfer and transformation methods carefully, in order to allow for optimal "refresh methods" that require minimal computing. For instance, as we bring new data into the data mine every day or every week, the overhead for re-aggregation should be minimized.
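One way to keep re-aggregation overhead low is to fold each new batch into running totals instead of recomputing from all history. The sketch below assumes simple per-customer count and sum aggregates; the class and key names are hypothetical and are not taken from any particular product.

```python
from collections import defaultdict


class RunningAggregates:
    """Per-key count and sum that can be refreshed from new rows only."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def refresh(self, new_rows):
        """Fold a new batch of (key, amount) rows into the aggregates."""
        for key, amount in new_rows:
            self.count[key] += 1
            self.total[key] += amount

    def mean(self, key):
        """Average amount for a key that has been seen at least once."""
        return self.total[key] / self.count[key]


agg = RunningAggregates()
agg.refresh([("C1", 100.0), ("C2", 40.0), ("C1", 50.0)])  # initial load
agg.refresh([("C1", 30.0), ("C3", 10.0)])                 # next day's batch only

print(agg.count["C1"], agg.total["C1"], agg.mean("C1"))
```

Each refresh touches only the new rows, so its cost grows with the size of the batch and not with the size of the mine. Aggregates that cannot be updated this way (medians, for example) would need a different refresh design.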
In some cases, where the warehouse is a very large, massively parallel processing (MPP) computer, the data mine may actually reside as a separate repository within the large warehouse. This is very similar to a data mine beside the warehouse, where the mine uses a portion of the physical warehouse but is independent of the warehouse structures, in effect being a "republic within a republic". In this scenario, the disk space and the processors for the data mine are specifically allocated and separately managed. For instance, on a "shared nothing" MPP machine with 32 processors, the disk space for the data mine is separately allocated on 8 of the 32 nodes and 8 processors are dedicated to data mining, while the other 24 processors manage the rest of the warehouse. And, when needed, additional processing capability may be directed towards the needs of data mining.
Although this idea may sound attractive based on arguments for centralization and scalability, in practice it usually leads to loss of flexibility, without providing any significant benefits for data mining. In most cases, when we consider the technical, marketing and business issues, the problems with mining within the warehouse multiply quite rapidly, and the data mines planned for use within the warehouse will eventually find themselves beside it.
The key point is that the likelihood of serving the needs of many people within the data space is much higher than the likelihood of serving their needs within the multi-dimensional and influence spaces. While the data elements may be almost the same for several departments, the dimensions, the influence relationships and the predictive models they all need will vary far more than their simple data needs. Hence, the data mine within the warehouse will soon become the lowest common denominator of all designs. Therefore, while the design of the data space may be subject to compromises to please the various user groups, there should be no compromises in the design of the data mine, where serious and detailed analyses take place. The data mine should be optimized to deliver effective results by focusing on specific business needs, because influence analysis is so much harder than data access.
Steps in Data Mining
Data Mining may involve some or all of the following steps, depending on the amount of data, the condition of the data and the results desired.
Data Selection
This involves choosing the types of data to be used. A database may contain data about customer purchases, lifestyles, demographics, census, state taxes, etc. A retailer deciding how to lay out the display shelves in a store may only need to use purchase and demographic data.
Data Transformation
Once the data has been selected, it often needs to be cleaned up or transformed into values that can be operated on by the type of data mining operation to be performed and the technique to be used. Data may need to be converted into numeric values to be used in a neural network, and new attributes may need to be defined or derived. In one case the database included 500 different ways of identifying which state of the U.S. the information came from.
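The kinds of clean-up mentioned above might look like the following sketch. The variant table is a tiny hypothetical stand-in for the "500 different ways" of writing a state, and the one-hot and min-max encodings are just two common ways of turning values into numbers for a neural network; the text does not prescribe either.

```python
# Hypothetical lookup from spelling variants to a canonical state code.
STATE_VARIANTS = {
    "california": "CA", "calif": "CA", "ca": "CA",
    "new york": "NY", "n y": "NY", "ny": "NY",
}


def normalize_state(raw):
    """Map a free-text state value to its code; None marks a value needing review."""
    key = raw.strip().lower().replace(".", "").replace(",", "")
    key = " ".join(key.split())  # collapse repeated whitespace
    return STATE_VARIANTS.get(key)


def one_hot(value, categories):
    """Encode a categorical value as a list of 0.0/1.0 indicator inputs."""
    return [1.0 if value == c else 0.0 for c in categories]


def min_max_scale(values):
    """Rescale numeric values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print([normalize_state(s) for s in [" Calif. ", "N.Y.", "N. Y.", "Texas"]])
print(one_hot("NY", ["CA", "NY"]))
print(min_max_scale([20, 30, 40]))
```

Returning None for an unknown spelling, instead of guessing, keeps unrecognized values visible so they can be added to the lookup during data preparation.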
Data Mining
The data is then mined using the desired technique in an effort to extract the information.
There are many methods of mining for data. The method used is often based on the type of information you are seeking and the type of data that you have. Some of the methods are: association, sequence-based analysis, clustering, classification, estimation, fuzzy logic, neural networks, fractal-based transforms, and genetic algorithms.
Methods are often combined. For example, to develop a symbolic classification model that predicts whether a magazine subscriber will renew their subscription, one approach is to first use clustering to segment the database and then apply rule induction to create a classification model for each desired cluster.
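The cluster-then-induce idea in the magazine example can be sketched at toy scale. Everything here is an assumption made for illustration: the subscriber records are invented, the clustering is a bare-bones two-cluster k-means on one attribute, and the "rule induction" is a single-threshold search, far simpler than a real rule learner.

```python
def two_means_1d(values, iterations=10):
    """Tiny 1-D k-means with k=2; returns a cluster label (0 or 1) per value."""
    centers = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels


def induce_threshold_rule(rows):
    """Find t so that 'renews if issues_read >= t' fits rows best.
    rows: list of (issues_read, renewed). Returns (t, accuracy)."""
    best_t, best_correct = None, -1
    for t in sorted({x for x, _ in rows}):
        correct = sum((x >= t) == renewed for x, renewed in rows)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t, best_correct / len(rows)


# Hypothetical subscribers: (years_subscribed, issues_read_per_year, renewed)
SUBSCRIBERS = [
    (1, 3, False), (1, 9, True), (2, 5, False), (2, 10, True), (1, 8, True),
    (10, 1, False), (11, 2, True), (12, 4, True), (9, 3, True), (10, 6, True),
]

labels = two_means_1d([s[0] for s in SUBSCRIBERS])
rules = {}
for k in (0, 1):
    rows = [(s[1], s[2]) for s, lab in zip(SUBSCRIBERS, labels) if lab == k]
    rules[k] = induce_threshold_rule(rows)

global_rule = induce_threshold_rule([(s[1], s[2]) for s in SUBSCRIBERS])
print(rules, global_rule)
```

In this made-up data, new and long-standing subscribers renew at different reading thresholds, so a rule per cluster fits each group exactly while one rule over the whole set does not.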
Data mining can also be:
- Multilevel - able to compare between texts of any size.
- Multimedia - supports analysis of text and images.
- Multimode - operation includes interactive client-side and automatic server-side processing.
- Multilingual - analysis is designed to support direct comparison of texts in different languages.
Result Interpretation
Once the information has been extracted, it is analyzed based on the end user's requirements, and the best information is identified and presented to the decision maker via the decision support system. The purpose of interpretation is to visualize the output (logically or graphically) and filter the information to be presented to the decision maker. It is not uncommon to find during the interpretation step that the rules of data selection need to be modified. Some of the decisions to be made may involve large amounts of money, and management is not very enthusiastic about embracing ideas that they cannot understand or analyze for themselves. If management cannot understand the rules, it is hard to explain to a client how the decision was reached.
Who is using Data Mining?
The Washington State Dept. of Social and Health Services, the National Basketball Association, the U.S. Treasury, scientists, banks, retailers, and service providers are some of the organizations that have implemented Data Mining and hope to benefit from its use. Because the cost of implementing Data Mining systems with massively parallel computers can run from $350,000 to a few million dollars, some companies like IBM are offering data mining services. A company sends its data to the vendor, who then cleans and processes the data. The results could then be sent back to the company or posted on the Internet where the company can access them. Service fees have not been set yet, but the software costs are expected to be around $40,000. Data mining is not just for large corporations; it can be powerful for small businesses as well. A restaurant owner who serves 500 meals a week may want to know what new dish to recommend to a customer based on past selections.
What have they actually found?
Washington State Dept. of Social and Health Services
They had no idea of how many children were in the foster care system, how long they were in the system, or if there was any abuse. Their concern was in finding the answers to these questions without basic data or adequate analysis. One of the data analysts posed the question to the KDD Nuggets electronic newsletter and was then given a contact at Columbia University. Working with the researchers at Columbia University and a 500-MB database of foster caregiver payment data, they were able to find out how many children stay in long-term care and who they are. They also found that 30% of the children who leave the system come back. Using this new information, the State of Washington is working to improve resources so that children who enter the foster care system as infants get adopted into permanent homes rather than staying in the foster care system. One of their other goals is to put in place support services for children when they go home, to help keep them from returning to the foster care system. They continue to look for more information to help improve the foster care system.
National Basketball Association
The NBA has teamed up with IBM to deploy a KDD system, called Advanced Scout, that helps coaches strategize for each upcoming game. Last season 14 of the 29 teams used Advanced Scout and the rest were slated to be given copies for this season. During games the NBA collects all of the data on shots, fouls, free throws, turnovers, and other game activities. This data is then posted to a bulletin board so that each team can have access to it and download it. The Orlando Magic used it to better understand rebounding problems and find the best team combination for rebounding. They then used this new information when substituting players. It used to take days to interpret data that can now be analyzed in minutes. In some cases just analyzing the data can lead to misinterpretation, but when tied in with viewing the game videos, it can be used to make strategic decisions. The Mavericks and Magic are both working on systems that will tie the data to digital video with some form of time stamping. This will allow the coaching staff to view the footage instantly. Their plan is to implement this on a laptop computer which will make it very easy to analyze.
U.S. Treasury
The Treasury's Financial Crimes Enforcement Network (FinCEN) uses an Artificial Intelligence Tracking System (AITS), a data mining application, to detect suspicious monetary transactions which could point to money laundering. AITS processes about 200,000 large cash transactions (over $10,000) reported by casinos, banks and other organizations each week under federal laws. The results have been turned over to various local, state and federal agencies to assist in thousands of investigations. Because of the volume of data, they currently use large computer systems to break down the data, and data analysts then do further extraction to drill down further. They are working on a next-generation system with an automated explorer. Because the designer of the system made it user friendly, new workers are able to become productive within a couple of days, and what used to be impossible to analyze can now be searched in times running from a few seconds to 30 minutes.
California Institute of Technology
The Palomar Observatory (San Diego County, Calif.) is taking pictures that will result in a 3-terabyte data set. Knowing that the amount of data would be so large, about the equivalent of six million books, George Djorgovski and his team decided to digitize the pictures. They realized that with all of this data they would need to use new software to turn the raw pixel data into stars, galaxies and other objects. He and one of his doctoral students teamed up with JPL. This collaboration resulted in Skicat, a KDD system that automatically measures and classifies the millions of objects detected in the survey images. Because of the new system they have been able to catalog about 250,000 stars, where the typical study covers about 2,000. Djorgovski commented that they now have a living, breathing catalog and that with the huge databases they need to explore new ways to structure, store and use them. He feels that where astronomy used to be data poor, it is now becoming data rich.
Watson Research Center
Using the same 40 financial indicators as those used by stockbrokers at the American Securities Market, researchers at the Watson Research Center tried to find the simplest rules for spotting the best investment bet each month. The researchers used a technique known as disjunctive normal logic, which is a way of connecting the descriptions of data so that contradictions can be quickly found. By following the simple rules they were able to average a return of 270 per cent over a five-year period, compared to the market average of 110 per cent.
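The text does not give the rules the researchers found. As a generic illustration of a rule in disjunctive normal form (an OR of AND-clauses, which is one reading of "disjunctive normal logic"), the sketch below uses invented indicator names and thresholds; none of them come from the study.

```python
# A rule in disjunctive normal form: the rule fires if ANY clause holds,
# and a clause holds if ALL of its tests hold.
# Indicator names and thresholds are made up for illustration.
BUY_RULE = [
    {"pe_ratio": lambda v: v < 12, "earnings_growth": lambda v: v > 0.10},
    {"dividend_yield": lambda v: v > 0.05},
]


def matches(rule, record):
    """Evaluate a DNF rule against one record of indicator values."""
    return any(
        all(test(record[name]) for name, test in clause.items())
        for clause in rule
    )


cheap_grower = {"pe_ratio": 10, "earnings_growth": 0.20, "dividend_yield": 0.01}
high_yield = {"pe_ratio": 20, "earnings_growth": 0.02, "dividend_yield": 0.06}
neither = {"pe_ratio": 20, "earnings_growth": 0.02, "dividend_yield": 0.01}

print(matches(BUY_RULE, cheap_grower), matches(BUY_RULE, high_yield),
      matches(BUY_RULE, neither))
```

Because every clause is a short list of readable conditions, rules in this form are easy for an analyst to inspect, which fits the stated aim of finding the simplest rules.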
The Paradox of Warehouse Patterns
A data warehouse is a natural place for storing "data"; a data mine is the natural place for performing influence-related analyses. To understand the paradox, let us note that the concepts of "large warehouse" and "useful pattern" often interact in a seemingly contradictory way. On one hand, the larger a warehouse, the richer its pattern content, i.e. as the warehouse grows, the more patterns it includes. On the other hand, after a point, if we analyze "too large" a portion of a warehouse, patterns from different data segments begin to dilute each other and the number of useful patterns begins to decrease! So the paradox may be stated as follows: "The more data in the warehouse, the more patterns there are, and after a point the more data we analyze the fewer patterns we find!"
A few simple examples easily clarify this further. First, consider a vehicle warranty database. In order to find patterns for customer claims it is essential to store details of each claim in a large data warehouse. But does it make sense to analyze all of the warehouse at the same time? Does it make sense to ask: "what causes brake problems?" No. In practice, cars are built at different plants and different models of cars use different parts -- and some parts are now discontinued. Moreover, over the course of years the parts used in cars change, so analyzing the entire warehouse may tell us less than analyzing part of it. What works best in practice is to analyze the claims for a given model year for cars built at a given plant, which is again a segmentation task. Once again, the paradox of the warehouse comes into play here in that by analyzing all of the warehouse at once we reduce the number of useful patterns we are likely to get!
As another example, consider a large data warehouse that includes details of a bank's customer accounts, marketing promotions, etc. There can be several business objectives for mining this data, including campaign analysis, customer retention, profitability, risk assessment, etc. To begin with, these are distinct business tasks and it does not make sense to mix the analyses -- hence each of the data mining exercises needs to be performed separately, and will require different data structures as well, because some are association analyses, some are clustering, etc.
However, even the campaign analysis task itself should often not be performed on the entire warehouse. The bank may have undertaken 30 different marketing campaigns over the years, and these campaigns will usually have involved different products and gone to different customer segments -- some of the products are even discontinued now. To understand who responds best to marketing promotions, we need to analyze each campaign (or group of campaigns) separately because each case will involve patterns with distinct signatures. Mixing the analyses into one data mining exercise will simply dilute the differences between these signatures. And the campaigns are often different enough that mixing them simply may not make sense. So we need to have a separate "Analysis Session" for each group of campaigns. To demonstrate this with a simple example, let us assume that those customers who are over 40 years old and have more than 2 children have a high response rate to credit card promotions. Now, let us also assume that customers who are less than 40 years old and have only 1 child are good prospects for new checking accounts. If we combine these campaigns within the same data mining study and simply look for customers who have a high response rate, these two patterns will dilute each other.
Of course, we can get a rule that separates these campaigns and still display the patterns, but in a large warehouse so many of these rules will appear that they will overwhelm the user.
Thus, the smaller patterns may be found in the warehouse if we are prepared to accept large amounts of conditional segment information, e.g. "If Campaign = C12 and ... Then ...". However, in a large warehouse, there are so many of these that the user will be overloaded with them. The best way is to analyze each group of campaigns separately.
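The dilution in the credit card and checking account example can be checked with a few made-up rows. This is only a sketch: the customers, ages and responses below are invented to mirror the two assumed patterns.

```python
# Hypothetical rows: (campaign, age, children, responded)
ROWS = [
    ("card", 45, 3, True), ("card", 50, 4, True), ("card", 42, 3, True),
    ("card", 30, 1, False), ("card", 35, 1, False), ("card", 28, 1, False),
    ("checking", 30, 1, True), ("checking", 25, 1, True), ("checking", 33, 1, True),
    ("checking", 48, 3, False), ("checking", 55, 4, False), ("checking", 41, 3, False),
]


def older_large_family(row):
    """Segment: over 40 years old with more than 2 children."""
    return row[1] > 40 and row[2] > 2


def response_rate(rows, segment, campaign=None):
    """Response rate of a customer segment, optionally within one campaign."""
    picked = [r for r in rows
              if segment(r) and (campaign is None or r[0] == campaign)]
    return sum(r[3] for r in picked) / len(picked)


def everyone(row):
    return True


print(response_rate(ROWS, older_large_family, "card"))      # within the card campaign
print(response_rate(ROWS, older_large_family, "checking"))  # within the checking campaign
print(response_rate(ROWS, older_large_family))              # both campaigns mixed
print(response_rate(ROWS, everyone))                        # overall base rate
```

Analyzed per campaign, the segment stands out sharply in each case; analyzed together, its response rate collapses to the overall base rate and the pattern is no longer visible.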
The need for segmentation is even clearer when we consider predictive modeling. When trying to predict the response to a new campaign, it simply does not make sense to base the predictions on all previous campaigns that have ever taken place; they should be based on those campaigns which are most similar to the one being considered. For instance, responses to campaigns for a new checking account may have little bearing on responses to campaigns for a new credit card or for refinancing a home. In this case, the paradox of the warehouse patterns comes into play in that by considering more data, we lose accuracy. This is, of course, because some of the data will not be relevant to the task we are considering. But what happens if there are one or two key indicators that are common to all of the campaigns? Will they be lost if we just analyze the campaigns a few at a time? Of course not. If a pattern holds strongly enough in the entire database, it will also hold in the segments.
For instance, if the people with more than 5 children never respond to campaigns, this fact will also be true in each individual campaign. Hence, most of the time it does not make sense to analyze all of a large warehouse because patterns are lost through dilution. To find useful patterns in a large warehouse, we usually have to select a segment (and not a sample) of data that fits a business objective, prepare it for analysis and then perform data mining. Looking at all of the data at once often hides the patterns, because the factors that apply to distinct business objectives often dilute each other. Hence, the thirst for information can go unquenched by looking at too much data.
The Concept of an Analysis Session
When using a data mine, we bring a segment (and not a sample) of data from a warehouse (or other sources) to the data mine and perform discovery or prediction. The process of mining this data segment is called an Analysis Session. For example, we may want to predict the response to a proposed direct mail campaign by analyzing previous campaigns for similar products, or we may want to know how customer retention has varied over various geographic regions, etc.
An analysis session may be either "structured" or "unstructured". A structured session is a more formal activity in which we set out with a specific task, e.g. analyzing profitability by customer segments and/or products. In fact, structured sessions are often performed in a routine manner, e.g. we may analyze costs, revenues or expenses every quarter, and understand the reasons for the trends. Or we may routinely perform forecasting for various items such as product demand in various markets. Or we may look for unusual transactions that have taken place in the past 30 days. In fact, a structured analysis session usually takes one of three forms: a discovery, prediction or forensic analysis activity in which we perform a specific task.
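A routine forensic session such as "look for unusual transactions in the past 30 days" could be sketched as below. The z-score test and its cutoff of 2.0 are assumptions chosen for illustration (the text does not say how "unusual" is defined), and the transactions are invented.

```python
from datetime import date, timedelta
from statistics import mean, pstdev


def unusual_recent(rows, today, days=30, z_cutoff=2.0):
    """Return (date, amount) rows from the last `days` days whose amount lies
    more than z_cutoff standard deviations from the mean of all rows."""
    amounts = [a for _, a in rows]
    mu, sigma = mean(amounts), pstdev(amounts)
    if sigma == 0:
        return []  # all amounts identical, nothing can stand out
    cutoff = today - timedelta(days=days)
    return [(d, a) for d, a in rows
            if d >= cutoff and abs(a - mu) / sigma > z_cutoff]


# Hypothetical transaction history: (date, amount)
HISTORY = [
    (date(2024, 1, 5), 100), (date(2024, 1, 20), 102), (date(2024, 2, 3), 98),
    (date(2024, 2, 18), 101), (date(2024, 3, 2), 99), (date(2024, 3, 10), 100),
    (date(2024, 3, 15), 500),
]

flagged = unusual_recent(HISTORY, today=date(2024, 3, 20))
print(flagged)
```

Because the session is structured, the same function can be rerun every month with a new `today`; a more robust measure than the plain standard deviation would be preferable when a few extreme values dominate the history.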
An unstructured session is a "wild ride" through the database, where the user wanders around without a goal, hoping to uncover something of interest by serendipity -- or with help from an "exploration agent". This type of abstract wild ride usually uncovers some very wild facts hidden in the data. And the mine is a natural place for this activity, because the unexpected nature of the queries may interfere with the more routine tasks for which the warehouse was designed, e.g. looking up the history of a specific claim.
The data in the data mine often needs to be enriched with aggregations. Again, let me emphasize that these are not just summaries, but additional elements added to the data. How these aggregations are built is partly decided by a business analysis. For instance, we may need to look at the number of credit cards a customer has as an item. And we may want to look at the "volume" of transactions the customer has had. We may also want to look at the number of claims a customer has had in an insurance setting, etc. These aggregations enrich the data and co-exist with the atomic-level data in the mine.
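Enrichment with aggregations that co-exist with the atomic data can be sketched as follows. The transaction rows and field names are hypothetical; the two aggregations are the ones named above, the number of credit cards per customer and the customer's transaction volume.

```python
from collections import defaultdict

# Hypothetical atomic-level rows: (customer_id, card_id, amount)
TRANSACTIONS = [
    ("C1", "card-A", 120.0),
    ("C1", "card-B", 30.0),
    ("C1", "card-A", 50.0),
    ("C2", "card-C", 75.0),
]


def enrich(transactions):
    """Attach per-customer aggregations to every atomic row."""
    cards = defaultdict(set)
    volume = defaultdict(float)
    for customer, card, amount in transactions:
        cards[customer].add(card)
        volume[customer] += amount
    return [
        {"customer": c, "card": k, "amount": a,
         "num_cards": len(cards[c]), "txn_volume": volume[c]}
        for c, k, a in transactions
    ]


enriched = enrich(TRANSACTIONS)
for row in enriched:
    print(row)
```

The atomic rows are kept as they are and simply gain two extra fields, so an analysis can use the detail and the aggregate side by side.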
CONCLUSION
Data Mining offers great promise to organizations by helping uncover hidden patterns in their historical data and using the patterns discovered to analyze trends that can be exploited to support executive decision making.
However, the proper use of data mining tools has to be ensured by involving users who understand the business, the data, the pros and cons of the analytical methods used and the advantages and disadvantages of the tools used.
Building the right model is one step, and it makes sense to try out various models and tools and choose the tool that suits the specific business problem as well as one that matches the skill set of the end user. It is equally important to begin early and ensure that the data warehouse is built to facilitate data mining. The data has to be properly collected, cleaned and prepared. With this in place, one can expect rewarding results, from improving revenues to reducing costs.