3rd assignment (DL 3 July)

Return the assignments by email to tada14@mpi-inf.mpg.de by 3 July, 1600 hours. The subject of the email must start with [TADA]. The assignment must be returned as a PDF and the PDF must contain your name, matriculation number, and e-mail address together with the exact topic of the assignment.

Topic 4 is hard. Grading of this topic takes this hardness into account.

You will need a username and password to access the papers outside the MPI network. Contact the lecturers if you don't know the username or password.

For the topic of the assignment, choose one of the following:

  1. What's loss got to do with it?

    The Minimum Description Length (MDL) principle [1] states that for meaningful inference from data your encoding should be lossless and efficient. That is, you should be able to completely reconstruct your data, and the encoding you use should not waste any bits. Why would loss be bad? What are the pitfalls (if any) for the 'generalized' MDL proposed by Lakshmanan et al. [2]?

    Many authors claiming to use MDL in fact use neither a lossless nor an efficient encoding. Read Lucchese et al. [3]. Closely investigate the encoding they use. Reason about the effect of the choices made in the encoding, what this means for the results, and whether these correspond with the overall goals. If you find deficiencies, try to propose improvements. (Bonus: also read and consider Chakrabarti et al. [4], who get other things right and wrong.)
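
    To make the lossless-and-efficient requirement concrete, here is a minimal two-part MDL sketch. This is my own toy illustration, not taken from [1] or [2]: a Bernoulli model for a binary string, with the common 0.5 * log2(n) cost for encoding one real-valued parameter.

```python
import math

def two_part_length(data: str) -> float:
    """Two-part MDL code length in bits, L(M) + L(D | M), for a binary
    string under a Bernoulli model. The code is lossless: given the
    parameter and the code words, the data is reconstructed exactly."""
    n, k = len(data), data.count("1")
    p = k / n
    if p in (0.0, 1.0):
        l_data = 0.0  # degenerate model: the string is fully determined
    else:
        l_data = -(k * math.log2(p) + (n - k) * math.log2(1 - p))
    l_model = 0.5 * math.log2(n)  # one standard cost for one real parameter
    return l_model + l_data

print(two_part_length("1" * 64))   # 3.0  -- highly regular, very cheap
print(two_part_length("01" * 32))  # 67.0 -- balanced, ~1 bit per symbol
```

    A lossy variant could report shorter code lengths simply by discarding part of the data, which is exactly why the comparison between models would then no longer be fair.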

  2. Look, Ma, no hands!

    One of the advantages of using Minimum Description Length (MDL) [1] is that it allows one to define parameter-free methods [5]. Some people go as far as claiming this is the future of data mining [6], as it allows for true exploratory data mining. Is this so? Where did the parameters go? Is MDL a magic wand? What if we don't like what MDL says is the best? Discuss.

  3. Sounds... familiar...

    Clustering is one of the key tasks in data mining. The task is easy: group things that are similar. Formally defining a meaningful measure for similarity is not so simple, as we typically have no idea what we are looking for, and even less of an idea of how things may be similar in the data we have. As a result there exist godzillians of similarity measures and clustering algorithms, each of which is somehow much, much 'better' than anything that already existed.

    Cilibrasi and Vitanyi define the ultimate clustering using Kolmogorov complexity [7], which can be approximated using off-the-shelf compression algorithms: you only need a data compression algorithm that suits your data. Is the framework they propose as powerful as they claim? Does this make sense at all? Why and when (not)? Are there any hidden assumptions? Can you think of examples where this framework would not be so easy to apply, or not work at all?

    Read [9]. What is the novelty of this paper with regard to [7]? Are there key differences in the scores? What are the implications? How would you rate the novelty of this paper, and what would you say is its key contribution?

    (Bonus, and fun) Cilibrasi and Vitanyi wrote a follow-up paper [8] to Clustering by Compression [7], where (I claim) they essentially use Google as the compression algorithm. Does this claim make sense? Argue why (not).
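
    As a warm-up for this topic, the Normalized Compression Distance of [7] can be sketched in a few lines of Python, using zlib as the off-the-shelf compressor. The example strings below are my own; do not read anything into the exact values.

```python
import zlib

def C(x: bytes) -> int:
    # Approximate Kolmogorov complexity by compressed length (zlib, level 9).
    return len(zlib.compress(x, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance [7]:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog " * 20
b_ = b"lorem ipsum dolor sit amet, consectetur adipiscing elit " * 20
# An object should be (near-)identical to itself, and unrelated
# objects should end up further apart:
print(ncd(a, a), ncd(a, b_))
```

    Note that ncd(a, a) is small but not exactly zero, since real compressors are only approximations of Kolmogorov complexity; that gap is a good starting point for the discussion questions above.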

  4. Elementary, my dear Watson — (Hard)

    In data mining the goal is to find interesting structure in your data, things that somehow stand out. There are many ways to define 'standing out'. Significance testing based on background knowledge is one of them, but can, again, be done in different ways. There are two main schools. Gionis et al. [10] propose to measure significance using (swap) randomization, whereas De Bie argues to use Maximum Entropy (MaxEnt) modelling [11]. (Variants exist for all sorts of data types.) What are the key differences between the two approaches? What are the differences in background knowledge that can be incorporated? What are the key differences in the assumptions about the background knowledge? What will the effect be in practice? Which do you think is more practical? Why?

    Within the MaxEnt school, there exist two sub-schools. In addition to [11], read [12]. What are the key differences between the models? What are the differences in the type of knowledge they can incorporate? Can we use both models to test significance of the same types of structures/patterns? Can the two approaches be united? Does it make any sense to have the type of background information of [11] incorporated into [12], and how about vice versa? If it does, sketch how you think this would work.
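
To build intuition for the randomization school, here is a minimal sketch of swap randomization in the spirit of Gionis et al. [10]. The data representation (each row as a set of column indices) and the toy dataset are my own choices, not from the paper.

```python
import random

def swap_randomize(rows, n_swaps, seed=0):
    """Randomize a binary dataset while preserving all row and column
    margins, via repeated swaps as in Gionis et al. [10].
    `rows` is a list of sets of column indices (the 1-entries per row)."""
    rng = random.Random(seed)
    rows = [set(r) for r in rows]  # work on a copy
    n = len(rows)
    done = 0
    while done < n_swaps:
        i, j = rng.randrange(n), rng.randrange(n)
        if i == j or not rows[i] or not rows[j]:
            continue
        a = rng.choice(sorted(rows[i]))  # a 1-entry of row i
        b = rng.choice(sorted(rows[j]))  # a 1-entry of row j
        # A swap is valid only on a 2x2 'checkerboard' submatrix:
        # (i,a)=1, (i,b)=0, (j,b)=1, (j,a)=0. Swapping flips all four
        # entries, leaving every row and column sum untouched.
        if a != b and a not in rows[j] and b not in rows[i]:
            rows[i].remove(a); rows[i].add(b)
            rows[j].remove(b); rows[j].add(a)
            done += 1
    return rows

# Toy dataset: 4 rows over 4 columns, every row and column margin equals 2.
data = [{0, 1}, {1, 2}, {2, 3}, {0, 3}]
randomized = swap_randomize(data, 50, seed=1)
```

A statistic computed on the real data (say, the number of frequent itemsets) can then be compared against its distribution over many such randomized copies to obtain an empirical p-value, which is the core of the approach of [10].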

References

  1. Grünwald, Peter D. 2004. A Tutorial Introduction to the Minimum Description Length Principle. PDF (short version) PDF (longer version, optional, only if you want to know more)
  2. Lakshmanan, Laks V.S., Ng, Raymond T., Wang, Christine X., Zhou, Xiaodong and Johnson, Theodore J. 2002. A Generalized MDL Approach for Summarization. In: Proceedings of VLDB 2002 PDF
  3. Lucchese, Claudio, Orlando, Salvatore and Perego, Raffaele. 2010. In Proceedings of SDM 2010. PDF
  4. Chakrabarti, Deepayan, Papadimitriou, Spiros, Modha, Dharmendra S., Faloutsos, Christos. 2004. In Proceedings of KDD 2004. PDF
  5. Mampaey, Michael and Vreeken, Jilles. 2010. Summarising Data by Clustering Items. In Proceedings of ECMLPKDD 2010 PDF
  6. Keogh, Eamonn, Lonardi, Stefano and Ratanamahatana, Chotirat A. 2004. Towards Parameter-Free Data Mining. In Proceedings of KDD 2004. PDF
  7. Cilibrasi, Rudi L. and Vitanyi, Paul M.B. 2005. Clustering by Compression. IEEE Transactions on Information Theory 51(4):1523–1545 PDF
  8. Cilibrasi, Rudi L. and Vitanyi, Paul M.B. 2007. The Google Similarity Distance. IEEE Transactions on Knowledge and Data Engineering 19(3):370–383. PDF
  9. Campana, Bilson J.L., Keogh, Eamonn. 2010. A Compression Based Distance Measure for Texture. In Proceedings of SDM 2010. PDF
  10. Gionis, Aristides, Mannila, Heikki, Mielikäinen, Taneli and Tsaparas, Panayiotis. 2006. Assessing Data Mining Results via Swap Randomization. In Proceedings of KDD 2006. PDF
  11. De Bie, Tijl. 2011. Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Mining and Knowledge Discovery 23:407–446. PDF
  12. Mampaey, Michael, Tatti, Nikolaj, and Vreeken, Jilles. 2011. Tell Me What I Need to Know: Succinctly Summarizing Data with Itemsets. In Proceedings of KDD 2011. PDF