Return the assignments by email to tada14@mpi-inf.mpg.de by 3 July, 1600 hours. The subject of the email must start with [TADA]. The assignment must be returned as a PDF and the PDF must contain your name, matriculation number, and e-mail address together with the exact topic of the assignment.
Topic 4 is hard. Grading for this topic takes this into account.
You will need a username and password to access the papers outside the MPI network. Contact the lecturers if you don't know the username or password.
For the topic of the assignment, choose one of the following:
The Minimum Description Length (MDL) principle [1] states that for meaningful inference from data, your encoding should be lossless and efficient. That is, you should be able to completely reconstruct your data, and the encoding you use should not waste any bits. Why would loss be bad? What are the pitfalls (if any) of the 'generalized' MDL proposed by Lakshmanan et al. [2]?
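To make 'lossless and efficient' concrete, here is a minimal sketch of a two-part MDL score, my own illustration rather than anything taken from [1] or [2]: the total cost is the number of bits needed to encode the model plus the bits needed to encode the data given that model, and the data can be reconstructed exactly from the two parts.

```python
import math

def two_part_mdl_bits(data, precision_bits=8):
    """Two-part MDL code length, in bits, of a 0/1 sequence under a
    Bernoulli model: L(M) + L(D | M). Illustrative sketch only; the
    parameter is charged a fixed precision_bits and used unquantised."""
    n = len(data)
    p = sum(data) / n
    p = min(max(p, 1e-12), 1 - 1e-12)           # avoid log(0)
    model_bits = precision_bits                 # L(M): cost of stating p
    data_bits = -sum(math.log2(p) if x else math.log2(1 - p) for x in data)
    return model_bits + data_bits               # lossless: D is recoverable from M and D given M

# A heavily skewed sequence gets a much shorter code than a balanced one.
skewed = [0] * 90 + [1] * 10
balanced = [0, 1] * 50
print(two_part_mdl_bits(skewed), two_part_mdl_bits(balanced))
```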
Many authors who claim to use MDL in fact use neither a lossless nor an efficient encoding. Read Lucchese et al. [3] and closely investigate the encoding they use. Reason about the effects of the choices made in the encoding, what these mean for the results, and whether the results correspond with the overall goals. If you find deficiencies, try to propose improvements. (Bonus: also read and consider Chakrabarti et al. [4]; they get other things right and other things wrong.)
One of the advantages of using Minimum Description Length (MDL) [1] is that it allows one to define parameter-free methods [5]. Some people go as far as claiming this is the future of data mining [6], as it allows for true exploratory data mining. Is this so? Where did the parameters go? Is MDL a magic wand? What if we don't like what MDL says is the best? Discuss.
Clustering is one of the key tasks in data mining. The task is easy: group things that are similar. Formally defining a meaningful measure of similarity is not so simple, as we typically have no idea what we are looking for, and even less of an idea of how things may be similar in the data we have. As a result, there exist godzillian many similarity measures and clustering algorithms, each of which is somehow much, much 'better' than anything that already existed.
Cilibrasi and Vitanyi define the ultimate clustering using Kolmogorov complexity [7], and it can be approximated using off-the-shelf compression algorithms: you only need a compressor that suits your data. Is the framework they propose as powerful as they claim? Does this make sense at all? Why and when (not)? Are there any hidden assumptions? Can you think of examples where this framework would not be so easy to apply, or would not work at all?
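For concreteness, the sketch below computes the Normalized Compression Distance of [7] with Python's built-in bz2 standing in for the Kolmogorov complexity; this is my own minimal illustration, and the choice of compressor (and whether it really suits the data) is exactly one of the assumptions worth examining.

```python
import bz2

def C(x: bytes) -> int:
    """Approximate the Kolmogorov complexity of x by its compressed size."""
    return len(bz2.compress(x))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox jumps over the lazy cat " * 20
c = bytes(range(256)) * 4
print(ncd(a, b))  # small: a and b share almost all of their structure
print(ncd(a, c))  # much larger: a and c share essentially nothing
```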
Read [9]. What is the novelty of this paper with regard to [7]? Are there key differences in the scores? What are the implications? How would you rate the novelty of this paper, and what would you say is its key contribution?
(Bonus, and fun) Cilibrasi and Vitanyi wrote a follow-up paper [8] to Clustering by Compression [7], where (I claim) they essentially use Google as the compression algorithm. Does this claim make sense? Argue why (not).
In data mining the goal is to find interesting structure in your data, things that somehow stand out. There are many ways to define 'standing out'. Significance testing based on background knowledge is one of them, but it can, again, be done in different ways. There are two main schools: Gionis et al. [10] propose to measure significance using (swap) randomization, whereas De Bie argues for using Maximum Entropy (MaxEnt) modelling [11]. (Variants exist for all sorts of data types.) What are the key differences between the two approaches? What are the differences in the background knowledge that can be incorporated? What are the key differences in the assumptions about the background knowledge? What will the effect be in practice? Which do you think is more practical? Why?
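As a concrete point of reference, here is a minimal sketch of swap randomization for a 0/1 dataset (my own illustration, not the algorithm exactly as given in [10]): repeated local swaps produce random matrices with exactly the same row and column margins as the original, and an observed statistic is then compared against its distribution over such samples.

```python
import random

def swap_randomize(matrix, num_swaps, seed=0):
    """Return a 0/1 matrix with the same row and column sums as `matrix`,
    obtained by repeated 'checkerboard' swaps: whenever M[i1][j1] = M[i2][j2] = 1
    and M[i1][j2] = M[i2][j1] = 0, flipping the four cells preserves all margins."""
    rng = random.Random(seed)
    M = [row[:] for row in matrix]
    n, m = len(M), len(M[0])
    for _ in range(num_swaps):
        i1, i2 = rng.randrange(n), rng.randrange(n)
        j1, j2 = rng.randrange(m), rng.randrange(m)
        if M[i1][j1] == 1 and M[i2][j2] == 1 and M[i1][j2] == 0 and M[i2][j1] == 0:
            M[i1][j1] = M[i2][j2] = 0
            M[i1][j2] = M[i2][j1] = 1
    return M

def cooccurrences(M):
    """Test statistic: number of rows in which the first two columns are both 1."""
    return sum(1 for row in M if row[0] == 1 and row[1] == 1)

data = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 0]]
observed = cooccurrences(data)
samples = [cooccurrences(swap_randomize(data, 200, seed=s)) for s in range(1000)]
p_value = (1 + sum(s >= observed for s in samples)) / (1 + len(samples))
print(observed, round(p_value, 3))
```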
Within the MaxEnt school, there exist two sub-schools. In addition to [11], read [12]. What are the key differences between the models? What are the differences in the type of knowledge they can incorporate? Can we use both models to test the significance of the same types of structures/patterns? Can the two approaches be united? Does it make any sense to incorporate the type of background information of [11] into [12], and how about vice versa? If it does, sketch how you think this would work.
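To have something concrete to compare the randomization approach against, the sketch below fits a maximum entropy model for 0/1 data under row- and column-margin background knowledge, in the spirit of [11] but purely as my own illustration: under these constraints the cells are independent Bernoullis with P(D_ij = 1) = sigmoid(lambda_i + mu_j), so expectations of a statistic can be computed directly rather than by sampling.

```python
import math

def fit_maxent_margins(matrix, iters=2000, lr=0.1):
    """Fit the MaxEnt distribution over 0/1 matrices whose expected row and
    column sums equal the observed ones. The solution has independent cells
    with P(D_ij = 1) = sigmoid(lam[i] + mu[j]); the parameters are found here
    by plain gradient ascent on the (concave) log-likelihood."""
    n, m = len(matrix), len(matrix[0])
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(m)]
    lam, mu = [0.0] * n, [0.0] * m
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    for _ in range(iters):
        P = [[sigmoid(lam[i] + mu[j]) for j in range(m)] for i in range(n)]
        # Gradient: observed margin minus expected margin under the current model.
        for i in range(n):
            lam[i] += lr * (row_sums[i] - sum(P[i]))
        for j in range(m):
            mu[j] += lr * (col_sums[j] - sum(P[i][j] for i in range(n)))
    return [[sigmoid(lam[i] + mu[j]) for j in range(m)] for i in range(n)]

data = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 0]]
P = fit_maxent_margins(data)
# Expected number of rows with a 1 in both of the first two columns,
# versus the observed count in the data.
expected = sum(P[i][0] * P[i][1] for i in range(len(data)))
observed = sum(1 for row in data if row[0] == 1 and row[1] == 1)
print(round(expected, 2), observed)
```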