- Synthesize the application of software used in data science environments.
- Explain data storage processes and database management systems.
- Explain statistical techniques used in data science.
- Explain the use of classification analysis in data science.
- Explain the use of cluster analysis in data science.
- Describe the data science project lifecycle.
After working in the industry for a number of years, you have decided to become a full-time, self-employed consultant. William Cogswell, President of Cogswell Cogs, works with highly proprietary information, but has some sample data that he is familiar with. He requests that you perform a quick proof of concept with this sample data to showcase your skills and show William and his leadership team what you can offer them. If he and his team at Cogswell Cogs like what they see, they will likely offer you a long-term consulting contract for their data analysis business needs, at which time you would be allowed to access their proprietary data and information.
In a comprehensive presentation to William Cogswell and the Leadership Team at Cogswell Cogs, address the following items. Include all code, screenshots, explanations, and other information necessary to prove that you will be a worthwhile hire as their consultant.
Present a statistical overview of the Sales Forecasting Data file and the following data:
1. Using the R programming language, complete the following tasks:
- Generate the mean and standard deviation of the weekly sales.
- Generate a histogram for the weekly sales.
- Using the ‘cor’ function, generate individual correlations between “Weekly Sales” and each of the following parameters: “store”, “dept”, “Date” (broken out by month and year), and “Holiday”.
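The task above must be completed in R. As a language-neutral cross-check of what the mean, standard deviation, and ‘cor’ results should look like, the following is a minimal Python sketch. The four rows are invented placeholder values, not records from the Sales Forecasting Data file, and the names `rows`, `pearson`, and `correlations` are hypothetical. It uses the sample standard deviation (the same n - 1 definition as R's `sd`) and breaks the date out into separate month and year columns before correlating.

```python
from datetime import date
from math import sqrt
from statistics import mean, stdev

# Invented placeholder rows: (store, dept, date, weekly_sales, is_holiday).
# These are NOT values from the Sales Forecasting Data file.
rows = [
    (1, 1, date(2010, 2, 5), 100.0, False),
    (1, 2, date(2010, 2, 12), 200.0, True),
    (2, 1, date(2010, 11, 26), 300.0, True),
    (2, 2, date(2011, 1, 7), 400.0, False),
]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

sales = [r[3] for r in rows]
sales_mean = mean(sales)
sales_sd = stdev(sales)  # sample standard deviation (n - 1 denominator)

# Break the date out into month and year so each can be correlated separately.
columns = {
    "store": [r[0] for r in rows],
    "dept": [r[1] for r in rows],
    "month": [r[2].month for r in rows],
    "year": [r[2].year for r in rows],
    "holiday": [int(r[4]) for r in rows],
}
correlations = {name: pearson(values, sales) for name, values in columns.items()}

print(f"mean={sales_mean:.2f} sd={sales_sd:.2f}")
for name, r in correlations.items():
    print(f"{name}: {r:.4f}")
```

Comparing a handful of rows by hand in this way can help confirm that the month and year break-out in the R solution was applied as intended.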
2. Using the “R” statistical package, complete the following task:
- Perform a multiple regression modeling “Weekly Sales” against the following parameters: “store”, “dept”, “Date” (broken out by month and year), and “Holiday”.
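The regression itself must be run in R. To illustrate what a multiple regression computes, the following is a minimal Python sketch of ordinary least squares solved through the normal equations. The data are invented so that the target is exactly 10 + 2·x1 + 3·x2, and the helper names `solve` and `ols` are hypothetical. It treats both predictors as plain numeric columns, which is an assumption of this sketch only.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ols(features, target):
    """Return [intercept, b1, b2, ...] minimizing the squared error."""
    X = [[1.0] + list(f) for f in features]  # prepend the intercept column
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(X, target)) for i in range(p)]
    return solve(xtx, xty)

# Invented predictors (x1, x2) and a target built as 10 + 2*x1 + 3*x2.
features = [(1, 0), (2, 1), (3, 0), (4, 1), (5, 1)]
target = [10 + 2 * x1 + 3 * x2 for x1, x2 in features]

coefficients = ols(features, target)
print([round(c, 6) for c in coefficients])
```

Because the target was constructed without noise, the fitted coefficients should match the generating values up to floating-point error; real sales data will leave residuals to inspect.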
3. Using the rpart function in R, complete the following task:
- Generate a decision tree model of “Weekly Sales” using the following parameters: “store”, “dept”, “Date” (broken out by month and year), and “Holiday”. Prune the tree appropriately to support a concise description that can lead to actionable results.
4. Complete the following tasks:
- Use the clusters.py Python module from the Programming Collective Intelligence text to build a hierarchical clustering model.
- Generate a cluster representation (image). You may wish to explore a subset of your data in order to support a smaller cluster representation.
- Use the same module to build a k-means clustering model. For this model you are not required to draw the clusters; instead, print the cluster groups (which rows are clustered together). Again, you may use a subset of the data to keep the output tractable.
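To make the two clustering tasks above concrete, the following is a minimal, self-contained Python sketch of both techniques. It does not use clusters.py or reproduce that module's interface: the functions `hierarchical` and `kmeans`, the centroid-based merge rule, the Euclidean distance, the evenly spaced starting centroids, and the six two-feature rows are all assumptions made for illustration. It prints the merge tree and the k-means groups as row indices rather than drawing an image.

```python
from math import dist  # Euclidean distance, Python 3.8+

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def hierarchical(rows):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    Returns a nested tuple of row indices describing the merge tree.
    """
    clusters = [(i, [i]) for i in range(len(rows))]  # (tree, member indices)
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = centroid([rows[i] for i in clusters[a][1]])
                cb = centroid([rows[i] for i in clusters[b][1]])
                d = dist(ca, cb)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = ((clusters[a][0], clusters[b][0]), clusters[a][1] + clusters[b][1])
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters[0][0]

def kmeans(rows, k, max_iter=100):
    """Return k lists of row indices; stops when assignments no longer change."""
    n = len(rows)
    centroids = [rows[i * n // k] for i in range(k)]  # evenly spaced starting rows
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: dist(row, centroids[j])) for row in rows
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(k):
            members = [rows[i] for i in range(n) if assignment[i] == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = centroid(members)
    return [[i for i in range(n) if assignment[i] == j] for j in range(k)]

# Invented two-feature rows forming two obvious groups (not the assignment data).
rows = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

tree = hierarchical(rows)
groups = kmeans(rows, k=2)
print("merge tree:", tree)
print("k-means groups:", groups)
```

The pairwise search in `hierarchical` recomputes every centroid on each pass, which is one reason to work with a subset of the data as the task suggests.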
5. Provide a summary recommending the tools that you think are the best fit for establishing a complete, institutionalized data pipeline for data analysis and presentation. Address your recommendations in terms of Big Data (extremely large data sets), as William Cogswell has expressed that his proprietary data sets are extremely large.
Include the following topic areas, stating the advantages and disadvantages of the packages described and your recommendation. Note: your packages may overlap, as one package can support more than one need. Again, state the advantages and disadvantages of each in the context of extremely large data sets (Big Data).
- Programming Languages (e.g. R, Python)
- Machine Learning Libraries (e.g. Anaconda)
- Extract-Transform-Load Utilities (e.g. Pentaho, Alteryx)
- Graphic Support/Dashboard Analytics (e.g. Tableau, QlikView)
- BI Software and Big Data (e.g. Hadoop, Apache Spark)