The Importance of Data Preprocessing for Machine Learning in the E-Commerce Industry
Abstract
Big data, as the name suggests, refers to large volumes of varied data that arrive at high velocity. Because it is collected raw and unprocessed from many sources, big data is bound to contain dirty data. Data preprocessing is the process of transforming raw data into an understandable format that is ready for analytical use. Machine learning is a subset of artificial intelligence and an analytical application that makes decisions by receiving and analyzing data, without being explicitly programmed for the task. The e-commerce industry revolves around applying technology to commercial business. This article reviews the relationship between data preprocessing and machine learning in the e-commerce industry, and why preprocessing matters.
Introduction
In this era of rapid development, technology cannot be separated from everyday life, because it advances in parallel with people’s growing needs. This progress has produced many advanced and useful innovations. Companies and industries are likewise pushed to develop their own technology in order to facilitate the development of newer technologies. Companies in the e-commerce industry are an example, because their operations revolve mainly around technology such as the internet, servers, and more. In e-commerce companies, data and machine learning are crucial because they inform decision making. However, according to Downs (2018), companies around the world believe that around 26% of their data is dirty, which may cause them large losses.
Big Data
Big data is data that is large in volume, varied in type, and generated at high velocity (Oracle, 2016). It is also a combination of structured, semi-structured, and unstructured data collected by organizations that can be mined for information and used in analytics applications such as machine learning and predictive modeling (Botelho & Bigelow, 2021). Because of its volume and variety, big data often contains imperfect, or dirty, data. Examples of dirty data are records with missing values, outlier values, or duplicates (García, Ramírez-Gallego, Luengo, Benítez, & Herrera, 2016).
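To make the three kinds of dirty data concrete, the following is a minimal Python sketch that flags missing values, exact duplicates, and outliers in a small list of order records. The records, the field names, and the outlier rule (a value more than `k` median absolute deviations from the median, with `k = 5` chosen arbitrarily) are assumptions made for illustration; they do not come from the cited sources.

```python
from statistics import median

# Hypothetical order records; None marks a missing value.
orders = [
    {"order_id": 1, "user": "ana", "price": 20.0},
    {"order_id": 2, "user": "budi", "price": None},       # missing value
    {"order_id": 3, "user": "citra", "price": 22.0},
    {"order_id": 3, "user": "citra", "price": 22.0},      # exact duplicate
    {"order_id": 4, "user": "dewi", "price": 19.0},
    {"order_id": 5, "user": "eko", "price": 21.0},
    {"order_id": 6, "user": "fajar", "price": 5000.0},    # outlier
]


def find_missing(rows):
    """Return the order_id of every row that has at least one None field."""
    return [r["order_id"] for r in rows if any(v is None for v in r.values())]


def find_duplicates(rows):
    """Return the order_id of every row that repeats an earlier row exactly."""
    seen, dupes = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen:
            dupes.append(r["order_id"])
        seen.add(key)
    return dupes


def find_outliers(rows, field="price", k=5.0):
    """Flag rows whose value is more than k MADs away from the median."""
    values = [r[field] for r in rows if r[field] is not None]
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [
        r["order_id"]
        for r in rows
        if r[field] is not None and abs(r[field] - center) > k * mad
    ]


print(find_missing(orders))     # [2]
print(find_duplicates(orders))  # [3]
print(find_outliers(orders))    # [6]
```

In practice the threshold and the choice of outlier rule depend on the data, so the value of `k` here should be read as a placeholder rather than a recommendation.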
Data Preprocessing
Data preprocessing is the transformation in which raw data drawn from big data is checked, filtered, and processed into an understandable format that analytical applications such as machine learning can use. Preprocessing is needed to assess data quality by checking accuracy, completeness, consistency, timeliness, believability, and interpretability (Anunaya, 2021). Data quality matters because the quality of an analytical application’s output depends on the quality of its input: good input data makes good output possible, while poor input data undermines it (Komarraju, 2021).
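As an illustration of the check, filter, and process steps, here is a minimal Python sketch of a preprocessing function that addresses two of the quality dimensions listed above: consistency (normalizing product names) and completeness (filling missing prices). The record layout, the sample values, and the choice of median imputation are assumptions for the example, not a method prescribed by the cited sources.

```python
from statistics import median


def preprocess(rows):
    """Return cleaned copies of rows: normalized, de-duplicated, imputed."""
    # 1. Consistency: trim whitespace and lowercase the product name.
    normalized = [{**r, "product": r["product"].strip().lower()} for r in rows]

    # 2. Filter: drop exact duplicates, keeping the first occurrence.
    seen, unique = set(), []
    for r in normalized:
        key = (r["user"], r["product"], r["price"])
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # 3. Completeness: fill missing prices with the median of known prices.
    known = [r["price"] for r in unique if r["price"] is not None]
    fill = median(known) if known else None
    return [
        {**r, "price": fill if r["price"] is None else r["price"]}
        for r in unique
    ]


# Hypothetical raw records with inconsistent names, a duplicate, and a gap.
raw = [
    {"user": "ana", "product": " iPhone 12 ", "price": 100.0},
    {"user": "ana", "product": "iphone 12", "price": 100.0},
    {"user": "budi", "product": "iPhone 12 Case", "price": None},
    {"user": "citra", "product": "iPhone 12 Mini", "price": 80.0},
]

clean = preprocess(raw)
for row in clean:
    print(row)
```

The first two raw records become identical once the name is normalized, so only one survives, and the missing price is replaced by the median of the two known prices. Whether imputing or dropping a record is the better choice depends on the application.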
Machine Learning
Machine learning is a type of artificial intelligence that builds a mathematical model from sample data, known as training data, in order to make decisions or predictions without being explicitly programmed to perform the task (Zhang, 2020). In other words, the model becomes more accurate at predicting outcomes by learning automatically from past data. For example, if a user frequently browses for an iPhone 12, the website will automatically suggest other iPhone 12 options to that user.
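The idea of learning suggestions from past data rather than hand-written rules can be sketched with a very small co-view counter in Python. This is a deliberately simple, count-based stand-in and not the kind of trained model a real e-commerce platform would use; the session data and the function names `fit_co_views` and `recommend` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def fit_co_views(sessions):
    """Count how often each pair of products appears in the same session."""
    co_views = defaultdict(Counter)
    for session in sessions:
        # sorted(set(...)) removes repeats and keeps the order deterministic.
        for a, b in permutations(sorted(set(session)), 2):
            co_views[a][b] += 1
    return co_views


def recommend(co_views, product, top_n=1):
    """Return the top_n products most often viewed alongside `product`."""
    return [p for p, _ in co_views.get(product, Counter()).most_common(top_n)]


# Hypothetical browsing sessions (the "past data" the counts are built from).
sessions = [
    ["iphone 12", "iphone 12 case"],
    ["iphone 12", "iphone 12 case", "charger"],
    ["iphone 12", "iphone 12 mini"],
    ["iphone 12", "iphone 12 case"],
    ["charger", "laptop"],
]

model = fit_co_views(sessions)
print(recommend(model, "iphone 12"))  # ['iphone 12 case']
```

No rule in the code says that a phone case goes with an iPhone 12; the suggestion comes entirely from the recorded sessions, so different sessions would yield different suggestions.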
Relationship of Data Preprocessing, Machine Learning, and the E-Commerce Industry
Because companies in the e-commerce industry rely heavily on technological innovations such as data and machine learning, the reliability and accuracy of the data they process is important. Machine learning in this industry receives data about a user’s activities on the company’s website or platform and automatically suggests products or services based on those activities. Data preprocessing is crucial here because the data fed to the machine learning model must be of good quality; the raw data from big data must therefore be processed properly to prevent mistaken assumptions or incorrect output from the model.
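How dirty input can lead to incorrect output can be seen in a small Python sketch that picks a user's most-viewed product from raw and from cleaned browsing events. The event log, its field names, and the cleaning rules (normalize the product name, drop repeated `event_id` entries) are assumptions made for this illustration.

```python
from collections import Counter


def top_product(events):
    """Return (product, views) for the most-viewed product."""
    counts = Counter(e["product"] for e in events)
    return counts.most_common(1)[0]


def clean(events):
    """Normalize product names and drop repeated log entries."""
    seen, cleaned = set(), []
    for e in events:
        product = e["product"].strip().lower()
        key = (e["event_id"], product)
        if key in seen:
            continue  # the same event was logged twice
        seen.add(key)
        cleaned.append({**e, "product": product})
    return cleaned


# Hypothetical raw log: one product spelled three ways, one duplicated event.
raw_events = [
    {"event_id": 1, "product": "iPhone 12"},
    {"event_id": 2, "product": "iphone 12 "},
    {"event_id": 3, "product": "IPHONE 12"},
    {"event_id": 4, "product": "laptop"},
    {"event_id": 4, "product": "laptop"},
]

print(top_product(raw_events))         # ('laptop', 2)
print(top_product(clean(raw_events)))  # ('iphone 12', 3)
```

On the raw log the inconsistent spellings split the iPhone 12 views into three separate products and the duplicated entry inflates the laptop count, so a recommender built on it would favor the wrong product. After cleaning, the counts reflect what the user actually browsed.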
Conclusion
In conclusion, data preprocessing, machine learning, and the e-commerce industry are related as follows: data preprocessing provides good-quality data to the machine learning application, which uses that data to produce the desired output, namely accurate product and service recommendations for users on an e-commerce company’s platform based on activities such as their search history. E-commerce companies then analyze and use the data for performance and analytical dashboards and for their decision making.
References
Journal
García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J. M., & Herrera, F. (2016). Big data preprocessing: methods and prospects. Big Data Analytics.
Zhang, X.-D. (2020). Machine Learning. A Matrix Algebra Approach to Artificial Intelligence, 223-240.
Website
Anunaya, S. (2021, August 10). Data Preprocessing in Data Mining -A Hands On Guide. Retrieved from Analytics Vidhya: https://www.analyticsvidhya.com/blog/2021/08/data-preprocessing-in-data-mining-a-hands-on-guide/
Botelho, B., & Bigelow, S. J. (2021, May 27). big data. Retrieved from SearchDataManagement: https://searchdatamanagement.techtarget.com/definition/big-data
Downs, E. (2018, August 21). The Staggering Impact of Dirty Data. Retrieved from MarkLogic: https://www.marklogic.com/blog/the-staggering-impact-of-dirty-data/
Komarraju, A. (2021, February 11). How Important is Data Quality in Machine Learning. Retrieved from Analytics Insight: https://www.analyticsinsight.net/how-important-is-data-quality-in-machine-learning/
Oracle. (2016, May 13). What is Big Data? Retrieved from Oracle: https://www.oracle.com/in/big-data/what-is-big-data/