School of Information Systems

Data Mining and Predictive Analytics with Databases

Introduction

In today’s data-driven landscape, data mining techniques, predictive analytics, and robust databases together form the bedrock of informed decision-making. This article examines the relationship between data mining and predictive analytics, and highlights the role that databases play in both.

It gives an overview of data mining, explains predictive analytics and its common model types, and surveys the types of databases suited to these tasks, with the aim of showing how these pieces work together to support data-driven decision-making.

Data Mining Overview

Data mining is the process of extracting meaningful patterns and knowledge from large volumes of data. It typically draws on approaches such as statistical analysis, machine learning, and pattern recognition to uncover hidden correlations and insights in the data.

The main purpose of data mining is to gain information and insights that support decision-making, prediction, and optimisation in fields such as business, healthcare, and science.

The data mining process can be divided into four primary stages:

  1. Data gathering. This step involves identifying and compiling data. The data could be stored in several source systems, a data warehouse, or a data lake.
  2. Data preparation. This stage involves exploring, profiling, and preprocessing the data. The data then undergoes cleansing, which corrects errors and improves data quality. Additional steps such as data transformation may be applied for consistency, ensuring the datasets are suitable for analysis.
  3. Data mining. Once the data is prepared, the data scientist selects a suitable data mining approach and implements one or more algorithms to do the mining. In machine learning applications, these algorithms are first tested on smaller datasets to check that they find the sought-after information, and are then applied to the full dataset.
  4. Data analysis and interpretation. The results of the data mining are used to develop analytical models that can help with decision-making and other business operations. Data scientists may use data visualisation and data storytelling to convey the findings to the stakeholders responsible for decisions. This step ensures that the insights gained are not only understandable but also actionable, so that they can support strategic decisions and improve overall business operations.
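The four stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names, and the functions prepare, mine, and interpret are hypothetical, and the "mining" step is a plain frequency count rather than a real mining algorithm.

```python
from collections import Counter

# Stage 1 (gathering): hypothetical raw records standing in for data
# compiled from source systems.
RAW_RECORDS = [
    {"customer": "a01", "product": "Laptop "},
    {"customer": "a02", "product": "laptop"},
    {"customer": "a03", "product": None},      # incomplete record
    {"customer": "a04", "product": "Phone"},
    {"customer": "a02", "product": "laptop"},  # duplicate of the second row
]


def prepare(records):
    """Stage 2: drop incomplete and duplicate rows, normalise the text."""
    seen = set()
    cleaned = []
    for row in records:
        if row["product"] is None:
            continue
        item = (row["customer"], row["product"].strip().lower())
        if item not in seen:
            seen.add(item)
            cleaned.append({"customer": item[0], "product": item[1]})
    return cleaned


def mine(records):
    """Stage 3: a deliberately simple pattern, purchases per product."""
    return Counter(row["product"] for row in records)


def interpret(counts):
    """Stage 4: turn the result into a statement a stakeholder can act on."""
    product, count = counts.most_common(1)[0]
    return f"Most purchased product: {product} ({count} purchases)"


cleaned = prepare(RAW_RECORDS)
counts = mine(cleaned)
summary = interpret(counts)
print(summary)
```

In practice each stage would be far larger, but the hand-off between stages (raw data, cleaned data, patterns, findings) follows the same order.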

Understanding Predictive Analytics

Predictive analytics involves the use of statistical algorithms, machine learning, and data mining techniques to analyse past data and identify patterns that can be used to predict future outcomes or trends. It seeks to estimate the likelihood of certain occurrences, behaviours, or trends by analysing patterns and relationships within the data. From these patterns, predictive analytics creates predictive models that can be applied to new data to make predictions.

In terms of business applications, predictive analytics is widely used across industries to optimise decision-making and improve operational efficiency. For example, an entertainment application such as Netflix collects data on its customers’ behaviour and past viewing patterns. This information is then used to recommend content that matches their preferences, which improves the customer experience through personalisation.

Types of predictive analytics models:

  1. Decision Trees

This model splits data into subsets based on the values of specific factors. It resembles a tree: the branches represent the possible options, and each leaf represents a specific decision. Decision trees are straightforward to understand and interpret, and can be useful when a decision needs to be made quickly.
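As a minimal sketch of the idea, the Python below learns the smallest possible tree: one split with two leaves, sometimes called a decision stump. The viewing-hours data, the labels, and the function names are hypothetical, and a real decision tree would repeat this splitting step on each branch.

```python
def majority(labels):
    """Most common label; ties are broken alphabetically for determinism."""
    return max(sorted(set(labels)), key=labels.count)


def fit_stump(values, labels):
    """Try each observed value as a threshold and keep the split with fewest errors."""
    best = None
    for threshold in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        if not right:
            continue  # a split must leave data on both branches
        left_leaf, right_leaf = majority(left), majority(right)
        errors = sum(lab != left_leaf for lab in left)
        errors += sum(lab != right_leaf for lab in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_leaf, right_leaf)
    return best[1:]


def predict(stump, value):
    """Follow the branch for this value and return the decision at its leaf."""
    threshold, left_leaf, right_leaf = stump
    return left_leaf if value <= threshold else right_leaf


# Hypothetical data: weekly viewing hours and whether the subscriber stayed.
hours = [1, 2, 3, 8, 9, 12]
outcome = ["cancel", "cancel", "cancel", "stay", "stay", "stay"]

stump = fit_stump(hours, outcome)
print(stump)
print(predict(stump, 2), predict(stump, 10))
```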

  2. Regression

Mostly used in statistical analysis, this model is applied when patterns need to be found in large amounts of data. It works by fitting a formula that describes the relationship between the inputs in the dataset.
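The simplest case is a straight line fitted by ordinary least squares. The sketch below assumes a single input and a single output; the advertising and sales figures and the function name fit_line are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend and resulting sales, in the same units.
spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = fit_line(spend, sales)
forecast = slope * 6 + intercept  # apply the fitted formula to a new input
print(slope, intercept, forecast)
```

The fitted slope and intercept are the "formula" the model produces, and applying it to an unseen input is the prediction step.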

  3. Neural Networks

Neural networks are loosely modelled on the way the human brain functions. They use artificial intelligence and pattern recognition to handle complex relationships in data. This type of model is typically used when there is too much data to analyse by other means, when no known formula relates the inputs to the outputs, or when the priority is making predictions rather than producing explanations.
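A full neural network is beyond a short example, but its basic building block can be sketched. The Python below trains a single artificial neuron (a perceptron) to reproduce the logical OR function; it is not a complete network, and the names and training settings are assumptions chosen for illustration.

```python
# Training data for logical OR: inputs and the target output.
SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]


def predict(weights, bias, inputs):
    """Output 1 if the weighted sum of the inputs exceeds zero, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train(samples, epochs=10, learning_rate=1):
    """Perceptron learning rule: adjust weights only when a prediction is wrong."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Nudge each weight in the direction that reduces the error.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


weights, bias = train(SAMPLES)
predictions = [predict(weights, bias, inputs) for inputs, _ in SAMPLES]
print(weights, bias, predictions)
```

Real networks stack many such units in layers and use smoother activation functions, which is what lets them capture the complicated interactions described above.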

  4. Cluster Models

Cluster models group data points that share similar characteristics or features. They uncover natural groups within a dataset, offering insight into its underlying structure and the relationships between data points. These clusters help segment the data for subsequent analysis, allowing a more nuanced understanding of the patterns or behaviours within each segment.
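One widely used clustering technique is k-means. The sketch below runs it on one-dimensional data; the spending figures, the choice of two clusters, and the starting centroids (the smallest and largest values) are all assumptions made for illustration.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical monthly spend per customer.
spend = [10, 12, 11, 50, 52, 49]

centroids, clusters = kmeans_1d(spend, centroids=[min(spend), max(spend)])
print(centroids)
print(clusters)
```

Each resulting cluster can then be treated as a segment, for example low-spend and high-spend customers, and analysed separately.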

  5. Time Series Modelling

Data can sometimes be related to time, and specialised predictive algorithms rely on this link. These models evaluate inputs at regular intervals, such as weekly or monthly, and look for seasonality, trends, or behavioural patterns over time. This type of model may be used to forecast peak customer service hours or when a certain sale will occur.
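Two basic building blocks of time series analysis, a moving average for the trend and a per-period average for seasonality, can be sketched as follows. The daily call counts are hypothetical, and real forecasting models are considerably more elaborate.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values (smooths out noise)."""
    return [
        sum(series[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def seasonal_profile(series, period):
    """Average value at each position within the repeating period."""
    return [
        sum(series[i::period]) / len(series[i::period])
        for i in range(period)
    ]


# Hypothetical customer service calls per day, Monday to Sunday, for two weeks.
calls = [120, 100, 90, 95, 110, 60, 50,
         130, 104, 94, 97, 118, 64, 54]

trend = moving_average(calls, window=7)
profile = seasonal_profile(calls, period=7)
peak_day = profile.index(max(profile))  # 0 = Monday

print(profile)
print("peak day index:", peak_day)
print("weekly average rising:", trend[-1] > trend[0])
```

The seasonal profile shows which day of the week tends to be busiest, while the moving average shows whether overall volume is growing.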

Role of Databases in Data Mining and Predictive Analytics

In data mining and predictive analytics, databases serve as the repository: they hold the large volumes of structured and unstructured data that are mined and processed to create predictive models. They provide centralised, organised storage that enables data scientists and analysts to access, retrieve, and modify data effectively. This organised form makes it easier to apply data mining methods and to find relevant patterns, correlations, and trends in the data.

However, handling large datasets poses significant storage, processing, and retrieval challenges. Databases address these with efficient storage systems, indexing, and query optimisation techniques. Many databases can also scale both vertically and horizontally, which lets them accommodate larger datasets while maintaining performance. This is critical for data mining and predictive analytics because it helps ensure that analyses are completed on time and that predictive models can be trained and deployed effectively.
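The effect of indexing can be observed with SQLite, which ships with Python. In this sketch the table views, its columns, and the index name idx_views_user are hypothetical; the code asks the database how it plans to run the same query before and after the index is created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE views (user_id INTEGER, title TEXT, minutes INTEGER)")
conn.executemany(
    "INSERT INTO views VALUES (?, ?, ?)",
    [(i % 100, f"title-{i}", i % 60) for i in range(1000)],
)

QUERY = "SELECT title FROM views WHERE user_id = ?"


def query_plan(connection):
    """Return SQLite's description of how it would execute QUERY."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, (42,)).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


plan_before = query_plan(conn)  # without an index: a scan of the whole table
conn.execute("CREATE INDEX idx_views_user ON views (user_id)")
plan_after = query_plan(conn)   # with the index: a search that uses it

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan differs between SQLite versions, but the shift from scanning every row to searching through an index is what keeps retrieval fast as tables grow.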

Databases Suitable for Data Mining

Not all types of database are equally suitable for data mining. Each has its own strengths and suits particular data mining tasks.

Relational databases such as MySQL and PostgreSQL are well suited to structured data and support SQL queries, which makes them appropriate when the data follows a preset schema. NoSQL databases such as MongoDB and Cassandra, on the other hand, are built to manage unstructured and semi-structured data, which makes them flexible in storing various sorts of information.

No single database is the right choice for every data mining project. What works for one may not work for another, because the choice depends on the nature of the data and the specific requirements of the mining tasks. In some cases, a hybrid approach that combines relational and NoSQL databases may be used to take advantage of the strengths of each, depending on the features of the data being analysed.

Relational databases are more suitable for structured data that adheres to a predefined schema. This type of database is a good fit for data mining tasks when the data is well organised into tables with specified relationships.

NoSQL databases are more suited for dealing with heterogeneous or unstructured data. This is because they provide greater flexibility in dealing with various data types and are frequently more scalable for huge amounts of unstructured data.
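The contrast can be illustrated with Python's standard library. The sketch below uses SQLite as the relational database and a plain list of dictionaries as a stand-in for a document collection; it does not use MongoDB or Cassandra themselves, and all table, field, and customer names are hypothetical.

```python
import sqlite3

# Relational side: every row must fit the schema declared up front.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ana", "Jakarta"), ("Budi", "Bandung"), ("Citra", "Jakarta")],
)
jakarta_count = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE city = ?", ("Jakarta",)
).fetchone()[0]

# A row that breaks the schema (missing city) is rejected.
try:
    conn.execute(
        "INSERT INTO customers (name, city) VALUES (?, ?)", ("Dewi", None)
    )
    schema_enforced = False
except sqlite3.IntegrityError:
    schema_enforced = True

# Document side: each record may carry different, nested fields.
documents = [
    {"name": "Ana", "city": "Jakarta", "reviews": ["Great service"]},
    {"name": "Budi", "devices": [{"type": "tv"}, {"type": "phone"}]},
]
with_devices = [doc["name"] for doc in documents if "devices" in doc]

print(jakarta_count, schema_enforced, with_devices)
```

The fixed schema makes the relational data easy to query and keeps it consistent, while the document form accepts records of varying shape at the cost of checking for fields at read time.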

Case Study: Netflix’s Predictive Analytics for Content Recommendation

Netflix, a leading streaming platform, faces the challenge of keeping subscribers engaged. To address this, the company collects extensive data from its subscribers, including viewing history, search queries, user ratings, time spent on each title, and even device information. This data is stored in its massive data warehouse.

Netflix then combines predictive analytics models, including collaborative filtering, which identifies patterns by comparing a user’s behaviour with that of other users, and content-based filtering, which recommends content based on specific attributes and user preferences.
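The idea behind collaborative filtering can be sketched in Python. This is a minimal user-based version using cosine similarity, not Netflix's actual system; the users, titles, ratings, and function names are hypothetical, and similarity is computed only over titles that both users have rated.

```python
from math import sqrt

# Hypothetical ratings on a 1 to 5 scale.
ratings = {
    "alice": {"Drama A": 5, "Comedy B": 1, "Thriller C": 4},
    "bob": {"Drama A": 4, "Comedy B": 1, "Thriller C": 5, "Sci-Fi D": 5},
    "carol": {"Drama A": 1, "Comedy B": 5, "Romance E": 4},
}


def cosine(u, v):
    """Cosine similarity over the titles both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[t] * v[t] for t in shared)
    norm_u = sqrt(sum(u[t] ** 2 for t in shared))
    norm_v = sqrt(sum(v[t] ** 2 for t in shared))
    return dot / (norm_u * norm_v)


def recommend(user, all_ratings):
    """Rank unseen titles by other users' ratings, weighted by similarity."""
    scores = {}
    for other, other_ratings in all_ratings.items():
        if other == user:
            continue
        similarity = cosine(all_ratings[user], other_ratings)
        for title, rating in other_ratings.items():
            if title not in all_ratings[user]:
                scores[title] = scores.get(title, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)


recommendations = recommend("alice", ratings)
print(recommendations)
```

Because alice's ratings resemble bob's far more than carol's, the title that bob rated highly is ranked first for her.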

The company uses complex tools to provide each user with personalised recommendations. It also runs A/B tests to inform decisions and continuously improve its products (Netflix Tech Blog, 2022). A/B testing allows Netflix to compare the performance of different recommendation models or user interface elements by dividing its user base into two or more groups (A and B) and exposing each group to a different variation. For example, Netflix could use an A/B test to compare two algorithms for recommending movies based on a user’s viewing history. The algorithm with the higher click-through rate and user satisfaction in the test might then be implemented as the primary recommendation model for all users.
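Deciding whether one group really outperformed the other usually involves a statistical test. The sketch below applies a two-proportion z-test to click-through counts; the figures are invented, the function name is hypothetical, and this is a textbook method rather than a description of how Netflix analyses its experiments.

```python
from math import erf, sqrt


def two_proportion_z_test(clicks_a, users_a, clicks_b, users_b):
    """Compare two click-through rates; return both rates, z, and a two-sided p-value."""
    rate_a = clicks_a / users_a
    rate_b = clicks_b / users_b
    # Pooled rate under the assumption that both groups behave the same.
    pooled = (clicks_a + clicks_b) / (users_a + users_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return rate_a, rate_b, z, p_value


# Hypothetical experiment: algorithm A versus algorithm B.
rate_a, rate_b, z, p_value = two_proportion_z_test(
    clicks_a=1000, users_a=10000, clicks_b=1150, users_b=10000
)
print(f"A: {rate_a:.3f}  B: {rate_b:.3f}  z: {z:.2f}  p: {p_value:.4f}")
```

A small p-value suggests the difference in click-through rate is unlikely to be due to chance alone, which supports rolling out the better-performing variation.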

The platform’s predictive analytics implementation can be considered successful: according to Netflix, about 75% of viewer activity is driven by personalised recommendations (Love, 2012). Netflix is also able to predict when subscribers are at risk of cancelling their subscription. Once such patterns are detected, it can intervene proactively with targeted discounts, personalised suggestions, or other retention techniques to reduce churn and retain users.

Challenges Faced and Solutions

Working with data presents data scientists with many challenges that, if not addressed properly, can impede the analytical process. Some of these challenges, and possible solutions to each, are outlined below.

  • Inaccurate or incomplete data

This problem can significantly reduce the effectiveness of data mining and predictive analytics. Implementing robust data cleansing processes and regularly auditing and improving data quality can help solve it.
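A data quality audit can start very simply, by measuring how often each required field is missing. The sketch below is one such check; the records, field names, and the function audit_missing are hypothetical.

```python
def audit_missing(records, required_fields):
    """Share of records in which each required field is absent or empty."""
    report = {}
    for field in required_fields:
        missing = sum(1 for row in records if row.get(field) in (None, ""))
        report[field] = missing / len(records)
    return report


# Hypothetical subscriber records with gaps.
subscribers = [
    {"id": 1, "email": "ana@example.com", "age": 34},
    {"id": 2, "email": "", "age": None},
    {"id": 3, "email": "citra@example.com"},
    {"id": 4, "email": "dewi@example.com", "age": 27},
]

report = audit_missing(subscribers, ["email", "age"])
print(report)
```

Running a check like this on a schedule makes it possible to track whether data quality is improving or degrading over time.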

  • Data security

Protecting sensitive information is crucial, especially when dealing with personal or proprietary data in predictive analytics applications. To address this challenge, it is essential to apply tight security measures such as encryption, access controls, and access monitoring. Complying with relevant data protection regulations and industry standards further helps to protect sensitive data.

  • Unstructured data

Handling unstructured data such as text or images is a further challenge for traditional databases and analytical models. Possible solutions include using NoSQL databases, applying advanced preprocessing techniques, and using specialised algorithms designed for unstructured data types.

Aside from the challenges mentioned above, continuous monitoring is required to maintain both the security and the quality of the data. On the security side, it helps discover and repair flaws so that data stays protected against emerging threats. On the quality side, it allows changes in data quality to be detected early, so that corrections can be made promptly.
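For text, preprocessing often begins by turning free-form sentences into countable terms. The sketch below lowercases hypothetical review text, splits it into words, removes a small assumed list of stop words, and counts what remains; real pipelines use much richer techniques.

```python
import re
from collections import Counter

# A small, assumed list of common words that carry little meaning.
STOP_WORDS = {"the", "a", "is", "was", "and", "but", "it", "to"}


def tokenize(text):
    """Lowercase the text, keep alphabetic words, drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


# Hypothetical unstructured input: free-text viewer reviews.
reviews = [
    "The plot was slow, but the acting is great!",
    "Great acting and a great soundtrack.",
]

term_counts = Counter(word for review in reviews for word in tokenize(review))
print(term_counts.most_common(3))
```

Once text is reduced to structured counts like these, it can be fed into the same analytical models used for other data.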

References:

Stedman, C., & Hughes, A. (2021, September 7). What is data mining? Business Analytics. https://www.techtarget.com/searchbusinessanalytics/definition/data-mining

Halton, C. (2023, January 30). Predictive analytics: Definition, model types, and uses. Investopedia. https://www.investopedia.com/terms/p/predictive-analytics.asp

Lutkevich, B., & Hughes, A. (2023, February 23). What is a database?: Definition from TechTarget. Data Management. https://www.techtarget.com/searchdatamanagement/definition/database

Relational vs. non-relational databases. MongoDB. (n.d.). https://www.mongodb.com/compare/relational-vs-non-relational-databases

Compare relational and NoSQL databases. OpenClassrooms. (2023, June 2). https://openclassrooms.com/en/courses/5671741-design-the-logical-model-of-your-relational-database/6255746-compare-relational-and-nosql-databases

Netflix Tech Blog. (2022, January 10). What is an A/B test? Medium. https://netflixtechblog.com/what-is-an-a-b-test-b08cc1b57962

Love, D. (2012, April 9). Netflix’s recommendation engine drives 75% of viewership. Business Insider. https://www.businessinsider.com/netflixs-recommendation-engine-drives-75-of-viewership-2012-4

Ankleshwariya, Y. (2023, June 28). How predictive analytics enhance various aspects of business. LinkedIn. https://www.linkedin.com/pulse/how-predictive-analytics-enhance-various-aspects-yash-ankleshwariya/

10, J. B. M., Bilham, J., H., B., H., P., Hong, B., Hong, S., & Bilham, B. (2023, March 13). Data Analytics in design from Airbnb, Netflix, Amazon & Spotify. Raw.Studio. https://raw.studio/blog/data-analytics-airbnb-netflix-amazon-spotify/

VivekR. (2023, June 18). How did Netflix use Big Data to transform their company and dominate the streaming industry? Medium. https://vivekjadhavr.medium.com/how-did-netflix-use-big-data-to-transform-their-company-and-dominate-the-streaming-industry-a93f90ae8dad

Devyano Luhukay & Vanessa Elizabeth Harianto