Programming Project
Please choose one of the provided topics and hand in your answers to the programming project by uploading a PDF to Ilias before the deadline expires. The PDF should contain the theoretical background, the complete and documented implementation, explanations, and all code (in the appendix) required to reproduce your results. The PDF must be produced using RMarkdown. This project is an individual task; it will be graded and determines your final grade. The deadline for submitting your solution is 21 June 2020, 23:59.
Topic 1: Web Scraping
Data scraping, data harvesting, or data extraction is the process of extracting information from websites in an automated fashion. The program that downloads the content and extracts the desired pieces of information is usually referred to as a bot or crawler. The scraping process consists of two parts: fetching and extracting. The first step, fetching, is the downloading of a page; web crawling is therefore the component of web scraping that fetches pages for later extraction and processing. Once a page has been downloaded (and saved), the second step, information extraction, can take place. This involves taking selected parts out of the downloaded page and preparing them so that they can be used for another purpose, e.g. an analysis.
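As an illustration, both steps can be carried out in R, for example with the rvest package. The following is only a minimal sketch; the URL and the CSS selector are hypothetical placeholders and not taken from any of the listed sites.

```r
# Minimal sketch of fetching and extracting with rvest (placeholder URL/selector)
library(rvest)

url  <- "https://example.com/listing"   # hypothetical page
page <- read_html(url)                  # step 1: fetching (download the page)

titles <- page %>%                      # step 2: extraction
  html_nodes(".title") %>%              # select elements via a CSS selector
  html_text()                           # keep only the text content

head(titles)
```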
The goal of this task is to scrape data such as cryptocurrency time series, job descriptions, or car offers from a given website and to calculate, present, and interpret descriptive statistics in an appealing way (for example using Shiny). The key learning is to develop a crawling algorithm that is capable of downloading content from the given site (a minimal sketch of such a loop follows the table below). Each of the following websites can be used only once in the course. If you are interested in a listed website, please send me a message (benjamin.buchwitz@ku.de) stating the website and your identifier/matriculation number. Websites are assigned on a first-come, first-served basis. However, you can also propose a website, which will then be added to the list. When coding, please keep in mind that you cannot and should not crawl the entire content of your chosen page. Please limit the number of crawled instances to an absolute maximum of 100,000 entries (fewer is also fine; if in doubt, talk to me).
Site | Type | Student |
---|---|---|
https://guide.michelin.co.jp/ | Restaurants | wws22603 |
https://www.rottentomatoes.com/ | Movies | wws23848 |
https://www.stepstone.de/ | Jobs | |
https://www.mobile.de/ | Car / Motorbike | |
https://www.kickstarter.com/ | Crowdfunding | |
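The sketch below illustrates one possible way to structure the crawling loop mentioned above: it fetches a fixed, limited number of pages, pauses between requests, and collects the extracted entries. The URL pattern, the selector, and the page limit are hypothetical placeholders, not part of the assignment.

```r
# Sketch of a crawling loop with a hard page limit and a pause between requests
# (base_url, ".entry" and max_pages are illustrative placeholders)
library(rvest)

base_url  <- "https://example.com/listing?page=%d"
max_pages <- 50
results   <- vector("list", max_pages)

for (i in seq_len(max_pages)) {
  page <- read_html(sprintf(base_url, i))   # fetch page i
  results[[i]] <- page %>%
    html_nodes(".entry") %>%
    html_text()
  Sys.sleep(2)                              # be polite between requests
}

entries <- unlist(results)
length(entries)                             # stays well below the 100,000 limit
```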
Getting started Literature:
Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R. A Practical Guide to Web Scraping and Text Mining. Chichester: John Wiley & Sons. Accompanying website: http://r-datacollection.com/.
Topic 2: Forecasting
In autoregression models, the variable of interest is forecast using a linear combination of past values of the variable itself. The term autoregression indicates that it is a regression of the variable against itself. The \(AR(p)\) model for a time series \(\{y_t\}\) is given by:
\[ y_t = c + \phi_1 \cdot y_{t-1} + \ldots + \phi_p \cdot y_{t-p} + \epsilon_t \]

If \(\{y_t\}\) is non-stationary, it may be preferable to estimate the model on the basis of the first differences of the time series, given by \(z_t = y_t - y_{t-1}\) (details in the referenced literature). The goal of this task is to compare the performance of the \(AR(p)\) model on the 100,000 series of the M4 Time Series Dataset. This requires developing a function that estimates a model of a given order and computes the required number of forecasts. These forecasts should be compared against the real data from the M4 holdout by means of the Mean Absolute Scaled Error (MASE). Please report results for \(AR(p=1)\) to \(AR(p=10)\) models estimated on both the original (levels) and the differenced series (20 models in total). Additionally, you can include an \(AR(p=0)\) model for both variants as a simple and trivial benchmark for the more complex models (yielding 22 models in total).
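For reference, the MASE is commonly defined as the mean absolute forecast error scaled by the in-sample mean absolute error of the (seasonal) naive method,

\[ \text{MASE} = \frac{\frac{1}{h}\sum_{i=1}^{h} |y_{n+i} - \hat{y}_{n+i}|}{\frac{1}{n-m}\sum_{t=m+1}^{n} |y_t - y_{t-m}|}, \]

where \(n\) is the length of the training series, \(h\) the forecast horizon, and \(m\) the seasonal period (\(m = 1\) for non-seasonal series); see the referenced literature for details. The following is a minimal sketch of how estimation, forecasting, and evaluation for a single series could look in R; `y_train` and `y_test` are placeholders for one M4 series and its holdout, and the OLS-based estimation is only one possible implementation.

```r
# Minimal sketch: fit an AR(p) by OLS, produce h iterated forecasts, compute MASE
# (y_train, y_test and all function names are illustrative placeholders)
fit_ar <- function(y, p) {
  if (p == 0) return(list(coef = mean(y), p = 0))  # AR(0): constant-only model
  X <- stats::embed(y, p + 1)                      # columns: y_t, y_{t-1}, ..., y_{t-p}
  fit <- lm(X[, 1] ~ X[, -1, drop = FALSE])
  list(coef = unname(coef(fit)), p = p)
}

forecast_ar <- function(model, y, h) {
  if (model$p == 0) return(rep(model$coef, h))
  fc <- numeric(h)
  hist <- y
  for (i in seq_len(h)) {
    lags  <- rev(tail(hist, model$p))              # y_t, y_{t-1}, ..., y_{t-p+1}
    fc[i] <- model$coef[1] + sum(model$coef[-1] * lags)
    hist  <- c(hist, fc[i])                        # iterate one-step forecasts
  }
  fc
}

mase <- function(actual, forecast, train, m = 1) {
  scale <- mean(abs(diff(train, lag = m)))         # in-sample naive MAE
  mean(abs(actual - forecast)) / scale
}

# Example usage for one series:
# model <- fit_ar(y_train, p = 3)
# fc    <- forecast_ar(model, y_train, h = length(y_test))
# mase(y_test, fc, y_train)
```

For the differenced variant, the same functions can be applied to `diff(y_train)`; the resulting forecasts are then integrated back to the original scale via the last observed level and a cumulative sum before computing the MASE.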
Dataset:
M4 Competition Website: https://www.mcompetitions.unic.ac.cy/the-dataset/
Getting started Literature:
Hyndman, R.J., & Athanasopoulos, G. (2018) Forecasting: principles and practice, 2nd edition, OTexts: Melbourne, Australia. Accompanying website: http://OTexts.com/fpp2.
Topic 3: Algorithm Implementation & Data Analysis
This topic is a more flexible choice. The basic idea is to select a dataset and a corresponding method to model and predict the outputs. The dataset and the methodology are free to choose (in consultation with me via benjamin.buchwitz@ku.de). The goal of this task is to implement the method on your own (not using a package) and to develop a deeper understanding of the method while performing an analysis with the selected dataset. I strongly recommend choosing a simple method, which is often challenging enough (e.g. a simple regression tree for Boston Housing or a simple classification tree for Twitter Classification; a sketch of the core splitting step follows the table below). Each dataset-method pair can only be chosen once and is allocated on a first-come, first-served basis.
Dataset | Methodology | Student |
---|---|---|
Red Wine Quality | tbd | wws16885 |
Twitter Classification Dataset | tbd | ww |
Boston Housing Dataset | Regression Tree | |
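To give an idea of the expected depth of an own implementation, the sketch below shows, under assumptions, the core step of a regression tree: searching a single numeric predictor for the split point that minimizes the residual sum of squares. The function name and the toy data are illustrative placeholders; a full tree would apply this search recursively to all predictors and to the resulting subsets.

```r
# Minimal sketch of the split search at the heart of a regression tree
# (function name and toy data are illustrative, not a complete implementation)
best_split <- function(x, y) {
  best <- list(value = NA, rss = Inf)
  for (s in sort(unique(x))) {
    left  <- y[x <= s]
    right <- y[x >  s]
    if (length(left) == 0 || length(right) == 0) next
    rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (rss < best$rss) best <- list(value = s, rss = rss)
  }
  best
}

# Toy example: the split should be found near 0.5
set.seed(1)
x <- runif(50)
y <- ifelse(x < 0.5, 1, 3) + rnorm(50, sd = 0.2)
best_split(x, y)
```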
Getting started Literature:
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction. 2nd ed. New York: Springer.
Topic 4: Custom Topic
You are still free to suggest a custom topic that you are interested in. If you want to do so, please reach out to me to clarify the details and get listed below.
Project Description | Student |
---|---|
COVID-19 Visualization + Forecasting (https://data.europa.eu) | wws22488 |