Data Mining Techniques for Managing Large Data Sets
Every industry relies on large amounts of data, which must be converted into useful information before it can generate profit for the business. This data and information is also needed to provide a comprehensive view of operations (Wu et al. 2014). It therefore has to be managed properly so that records are kept accurately.
This report discusses the need for data mining in managing the large amount of data and information collected in the data set. It also discusses the classification techniques used in data mining.
This report outlines the use of Bayes theorem for analyzing the monthly sales in the data set and justifies that choice. The analysis is broken down by geographic region, and on that basis recommendations are provided for improving the business in the market.
The research follows the classification technique of data mining, which helps to classify the data set into groups based on quality, price and category. This function of data mining also helps in maintaining and enhancing search over the data set (Roiger 2017). Bayes theorem is used to carry out the classification. Its probabilistic approach helps to provide a prior classification of the products and data. As commented by Horner and Richard (2016), Bayes theorem is based on the concept of conditional probability.
\[ P(A_i \mid A) = \frac{P(A_i)\,P(A \mid A_i)}{\sum_{j=1}^{N} P(A_j)\,P(A \mid A_j)}, \]
where $P(A_i)$ is the probability of an event $A_i$, $P(A_i \mid A)$ is the conditional probability of $A_i$ given that $A$ has already occurred, the events $A_i$ are mutually disjoint, and together they cover $A$, that is, $A \subseteq \bigcup_{i=1}^{N} A_i$.
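As a worked illustration of the formula, the sketch below applies it to the region-by-review counts reported in the cross tables of this report. Here A_i is "the record comes from region i" and A is "the product was rated 2". The function name and the dictionary layout are assumptions made for this example and are not taken from the original analysis.

```python
# Counts of products by region and review rating, copied from the
# region-by-review cross table in this report.
counts = {
    "NSW": {2: 154, 3: 134, 4: 144},
    "QLD": {2: 151, 3: 127, 4: 142},
    "SA":  {2: 134, 3: 131, 4: 153},
    "TAS": {2: 164, 3: 143, 4: 133},
    "VIC": {2: 128, 3: 131, 4: 140},
    "WA":  {2: 124, 3: 136, 4: 131},
}


def posterior_region_given_review(counts, review):
    """Return P(region | review) for every region using Bayes theorem."""
    total = sum(sum(row.values()) for row in counts.values())
    # Prior P(A_i): share of all records that come from each region.
    prior = {r: sum(row.values()) / total for r, row in counts.items()}
    # Likelihood P(A | A_i): share of a region's records with this rating.
    likelihood = {r: row[review] / sum(row.values()) for r, row in counts.items()}
    # Denominator of the formula: sum over j of P(A_j) * P(A | A_j).
    evidence = sum(prior[r] * likelihood[r] for r in counts)
    return {r: prior[r] * likelihood[r] / evidence for r in counts}


posterior = posterior_region_given_review(counts, review=2)
for region, p in sorted(posterior.items()):
    print(f"P({region} | review=2) = {p:.3f}")
```

Because the regions are disjoint and cover every record, the posteriors sum to one, and each posterior reduces to the region's count of rating-2 products divided by the column total.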
As mentioned by Zheng (2015), Bayes theorem gives the probability of an event based on prior knowledge of conditions related to that event. As argued by Aggarwal (2015), a decision tree is not applicable to the analysis of this data set. A decision tree consists of a root node, internal nodes, branches and leaf nodes, and these components are not required in this analysis (Lu, Setiono and Liu 2017). The decision tree algorithm is better suited to projects that have several interrelated steps.
Bayes Theorem for Analyzing Monthly Sales
The research methodology uses Bayes theorem to analyze the data set. The process followed in the analysis is described below, together with the techniques applied (Shmueli et al. 2017). The graphs and charts obtained from the analysis are also discussed.
The analysis of the data set shows 412 customers from the NSW region, 432 from QLD, 436 from SA, 402 from TAS, 413 from VIC and 405 from WA. In total there are 838 existing customers, 853 loyal customers and 809 new customers in the data set.
The summary statistics of the data set are as follows. The mean price is $67 (maximum $119, minimum $15). The mean review rating is 2.9952 (maximum 4, minimum 2). The mean number of customers who bought is approximately 39. The sales of the company have been increasing in recent years, and the mean monthly sales value is $500.8912. It can therefore be concluded that the company is progressing at a good pace (Olson and Wu 2017). However, the company needs to make some changes to how it markets its products; a marketing strategy and plan might help it to maintain a good position in the market (Linchangco, Jay and Brouwer 2017).
From the above graph, it is evident that the TAS region has the maximum number of sales in the data set, at 440. Further analysis of the data set using a group-by clause shows 420 sales in QLD and 418 sales in SA. In addition, there are 391 sales from the WA region, 398 sales from VIC and 399 sales from WA.
The above graph depicts the rating of the products according to the reviews provided by customers. It is observed that the largest number of products are rated two stars and the smallest number are rated three stars. Using the same process as above, we found 803 products with rating “2”, 908 products with rating “4” and 789 products with rating “3”.
In total, 526 cheddar cheese packets, 495 JM shirts, 484 pasta packets, 496 pens and 499 Reebok limited edition shoes were sold.
Recommendations for Business Optimization based on Regional Analysis
Scatter plot for sale of products in regions
Box plot for the Price of the Products
For the products and prices, the above box plot shows that the mean price lies between $60 and $80.
Cross table 1
CustomerType | Review 2 | Review 3 | Review 4 | All
EXISTING | 274 | 274 | 282 | 830
LOYAL | 286 | 269 | 280 | 835
NEW | 295 | 259 | 281 | 835
All | 855 | 802 | 843 | 2500
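The table above has the shape of a cross tabulation with row and column totals. A minimal sketch of how such a table can be computed is given here, using only the Python standard library. The function name cross_table and the five sample records are hypothetical and serve only to illustrate the counting; they are not taken from the data set or from the original analysis code.

```python
from collections import Counter


def cross_table(records, row_key, col_key):
    """Count records per (row, column) pair and add 'All' totals."""
    cells = Counter((rec[row_key], rec[col_key]) for rec in records)
    rows = sorted({r for r, _ in cells})
    cols = sorted({c for _, c in cells})
    table = {}
    for r in rows:
        # A Counter returns 0 for pairs that never occur.
        table[r] = {c: cells[(r, c)] for c in cols}
        table[r]["All"] = sum(cells[(r, c)] for c in cols)
    # Bottom margin: column totals and the grand total.
    table["All"] = {c: sum(cells[(r, c)] for r in rows) for c in cols}
    table["All"]["All"] = sum(cells.values())
    return table


# Hypothetical sample records, for illustration only.
records = [
    {"CustomerType": "NEW", "Review": 2},
    {"CustomerType": "NEW", "Review": 4},
    {"CustomerType": "LOYAL", "Review": 2},
    {"CustomerType": "EXISTING", "Review": 3},
    {"CustomerType": "LOYAL", "Review": 2},
]

table = cross_table(records, "CustomerType", "Review")
for name, row in table.items():
    print(name, row)
```

The same function could be applied with a region field as the row key to obtain a region-by-review table.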
Existing customers gave 248 products a rating of 2, 268 products a rating of 3 and 322 products a rating of 4. The ratings given by customers in the different regions are as follows:
Region | Review 2 | Review 3 | Review 4 | All
NSW | 154 | 134 | 144 | 432
QLD | 151 | 127 | 142 | 420
SA | 134 | 131 | 153 | 418
TAS | 164 | 143 | 133 | 440
VIC | 128 | 131 | 140 | 399
WA | 124 | 136 | 131 | 391
All | 855 | 802 | 843 | 2500
Cross table 2
From the NSW region, 121 products are rated 2, 145 products are rated 3 and 146 products are rated 4. The ratings from the other regions are shown below:
Total sales in the given data set amount to $166,702.
Product | NSW | QLD | SA | TAS | VIC | WA | All
Cheddar cheese | 78 | 91 | 79 | 105 | 86 | 87 | 526
JM SHIRTS | 87 | 93 | 84 | 70 | 82 | 79 | 495
Pasta | 89 | 74 | 83 | 79 | 76 | 83 | 484
Pens | 89 | 81 | 93 | 93 | 72 | 68 | 496
Reebok Limited Edition Shoes | 89 | 81 | 79 | 93 | 83 | 74 | 499
All | 432 | 420 | 418 | 440 | 399 | 391 | 2500
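One possible way to read this table when deciding where to promote a product is to find, for each product, the region with the fewest units sold. The sketch below does this with the counts copied from the table. The helper name weakest_region is an assumption made for illustration, and when two regions tie the first one in the listed order is returned.

```python
# Units sold per product, in the region order used by the table.
regions = ["NSW", "QLD", "SA", "TAS", "VIC", "WA"]
sales = {
    "Cheddar cheese": [78, 91, 79, 105, 86, 87],
    "JM SHIRTS": [87, 93, 84, 70, 82, 79],
    "Pasta": [89, 74, 83, 79, 76, 83],
    "Pens": [89, 81, 93, 93, 72, 68],
    "Reebok Limited Edition Shoes": [89, 81, 79, 93, 83, 74],
}


def weakest_region(units):
    """Return (region, units, share of product total) for the lowest count."""
    lowest = min(units)
    # list.index returns the first position, so ties go to the earlier region.
    region = regions[units.index(lowest)]
    return region, lowest, lowest / sum(units)


weakest = {product: weakest_region(units) for product, units in sales.items()}
for product, (region, units, share) in weakest.items():
    print(f"{product}: lowest in {region} ({units} units, {share:.1%} of product sales)")
```

A similar pass with max instead of min would give the strongest region for each product.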
After fitting the Bayes model, the monthly mean revenue is predicted to be 500.8912.
Conclusion
It can be concluded that the data set used in this report has been analyzed using various theorems and procedures, including Python programming, and that the graphs and charts included in the report present the results clearly. The data in the data set has been processed using data mining classification. The use of various techniques with the analytical tool has been discussed in the report. Data mining techniques have helped in managing the large data set and in handling the different kinds of data needed for the analysis.
From the above analysis of the data set, it can be seen that product sales vary across different parts of the country. Proper management is therefore required in order to increase the sales of products in the market. Strategies such as market planning and targeting products at the right places might help in increasing the sales of different products in the different regions where the company operates. As per the analysis, the VIC region has the lowest number of new customers at 115, QLD has 135 new customers and SA has the maximum at 144. The product that needs to be promoted more is pens. From the analysis, it is found that pens sell less in the region. The demand for the product has to be increased through proper advertising, which might help in promoting it in the market. Pens have to be promoted in the NSW region, as the number of new customers from that region is the minimum.
Classification Techniques for Data Mining
Promotion strategies might help in developing customers' interest in the pens. The company might offer free shipping to customers. In several cases, customers are not willing to buy a product because of high shipping charges. The company might therefore bear the cost of shipping and make it free for customers, which would increase their interest in the product. The company has to restart the advertisement program, reducing the cost of the pens by dropping the shipping cost. The delivery time of products needs to be minimized, which helps in maintaining the interest of customers. Customer feedback plays an important role in increasing the demand for the product in the market. The company has to focus on customer satisfaction, since a satisfied customer is capable of bringing another customer to the company. Staff training might be another strategy for initiating the marketing plan of the company. Training sessions need to be initiated for enhancing the skills and knowledge of employees. The customer relationship network of the company needs to be enhanced to support its development. Various campaigns and seminars need to be conducted by the company to provide rewards and prizes to employees with good performance. This helps in enhancing the confidence of employees and can increase the production level of the company. The sales of the company depend entirely on its brand image in the market. Most of the products are rated 3. Most of the products sold are priced at $60-$80.
References
Aggarwal, C.C., 2015. Data mining: the textbook. Springer.
Chaurasia, V. and Pal, S., 2017. Data mining techniques: To predict and resolve breast cancer survivability.
Gholizadeh, A., Carmon, N., Klement, A., Ben-Dor, E. and Borůvka, L., 2017. Agricultural Soil Spectral Response and Properties Assessment: Effects of Measurement Protocol and Data Mining Technique. Remote Sensing, 9(10), p.1078.
Horner, M.W. and Richard, A., 2016. Social data mining for understanding public perceptions of autonomous vehicles: National trends and the case of Florida (No. 16-3786).
Larose, D.T., 2014. Discovering knowledge in data: an introduction to data mining. John Wiley & Sons.
Linchangco, R., Jay, J.J. and Brouwer, C., 2017. Linking Nutrition and Molecular Biology Using Data Mining and Graph Theory. The FASEB Journal, 31(1 Supplement), pp.lb134-lb134.
Lior, R., 2014. Data mining with decision trees: theory and applications (Vol. 81). World scientific.
Lu, H., Setiono, R. and Liu, H., 2017. Neurorule: A connectionist approach to data mining. arXiv preprint arXiv:1701.01358.
Olson, D.L. and Wu, D.D., 2017. Data Mining Models and Enterprise Risk Management. In Enterprise Risk Management Models (pp. 119-132). Springer, Berlin, Heidelberg.
Roiger, R.J., 2017. Data mining: a tutorial-based primer. CRC Press.
Shmueli, G., Bruce, P.C., Yahav, I., Patel, N.R. and Lichtendahl Jr, K.C., 2017. Data Mining for Business Analytics: Concepts, Techniques, and Applications in R. John Wiley & Sons.
Witten, I.H., Frank, E., Hall, M.A. and Pal, C.J., 2016. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann.
Wu, X., Zhu, X., Wu, G.Q. and Ding, W., 2014. Data mining with big data. IEEE transactions on knowledge and data engineering, 26(1), pp.97-107.
Zheng, Y., 2015. Trajectory data mining: an overview. ACM Transactions on Intelligent Systems and Technology (TIST), 6(3), p.29.