1st assignment (DL 8 May)

Return your assignment by email to tada14@mpi-inf.mpg.de by 8 May, 1600 hours. The subject of the email must start with [TADA]. Submit the assignment as a PDF that contains your name, matriculation number, and e-mail address, together with the exact topic of the assignment.

Topic 4 is hard and contains an optional extra question. Grading of this topic takes this difficulty into account.

You will need a username and password to access the papers outside the MPI network. Contact the lecturers if you don't know the username or password.

For the topic of the assignment, choose one of the following:

  1. Did Tukey invent Data Mining?

    Read [1] and discuss how exploratory data analysis relates to data mining.

  2. (Don't) Believe the Hype

    Read [2]. The authors introduce a method for detecting correlation in data, and they present their approach very confidently. How does it relate to data mining? How strong are their claims? Is the method earth-shattering or not? Read [3]. Try to identify and discuss as many (practical and theoretical) strong and weak points of [2] as you can find.

  3. Big Data: The Best Thing Since Sliced Bread or Just Another Bottle of Snake Oil?

    Read [4, 5, 6, 7, 8]. Is Big Data worth all the hype? What are the prospects? What are the (potential) problems? Are these problems insurmountable? What are your opinions about Big Data?

  4. Where did the candidates go? (Hard)

    The standard approach to mine frequent itemsets is to

    1. generate a set of candidate itemsets,
    2. test which are frequent, and
    3. use those to generate new candidates,
    and iterate until done; a minimal sketch of this generate-and-test loop is given below. Eclat [9], proposed in 1997, is an example of a simple yet very efficient algorithm for mining frequent itemsets that follows this principle in a depth-first search.
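
    For illustration, here is a minimal Python sketch of this candidate-generate-and-test loop in the Apriori style (breadth-first). The names and the toy data are ours, not taken from [9] or [10], and the actual algorithms add pruning and clever data structures on top of this basic scheme:

      from itertools import chain

      def frequent_itemsets(transactions, minsup):
          """Return all itemsets contained in at least `minsup` transactions."""
          transactions = [frozenset(t) for t in transactions]

          def support(itemset):
              # Count the transactions that contain the candidate itemset.
              return sum(1 for t in transactions if itemset <= t)

          # Level 1: start from the frequent single items.
          items = set(chain.from_iterable(transactions))
          frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}
          result = set(frequent)

          while frequent:
              # 1. Generate candidates one item larger by joining frequent itemsets.
              candidates = {a | b for a in frequent for b in frequent
                            if len(a | b) == len(a) + 1}
              # 2. Test which candidates are actually frequent in the data.
              frequent = {c for c in candidates if support(c) >= minsup}
              # 3. Keep them; they seed the next round of candidates.
              result |= frequent
          return result

      # Toy usage: with minsup = 3, all single items and pairs are frequent,
      # but {a, b, c} is not.
      data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
      print(sorted(map(sorted, frequent_itemsets(data, minsup=3))))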

    The authors of [10] claim that their method can mine frequent itemsets without candidate generation. This raises the question: where did the candidates go? Discuss whether this claim is valid or not, and why.

    (optional) TreeProjection [11] was proposed before [10]. The authors of [10] argue, almost aggressively, that FP-growth is really different from TreeProjection. Are the two really different? Why (not)? Discuss, and if possible, give an example where they are (not) different.

References

  1. J.W. Tukey. We Need Both Exploratory and Confirmatory. American Statistician, Vol. 34(1), pp. 23–25, 1980 PDF
  2. D.N. Reshef et al. Detecting Novel Associations in Large Data Sets. Science, Vol. 334, pp. 1518–1524, 2011 PDF
  3. N. Simon and R. Tibshirani. Comment on "Detecting Novel Associations in Large Data Sets" by Reshef et al, Science Dec 16, 2011. arXiv, 1401.7645, January 2014 PDF
  4. T. Harford. Big Data: are we making a big mistake? Financial Times, 28 March 2014 PDF
  5. D. Lazer et al. The Parable of Google Flu: Traps in Big Data Analysis. Science, Vol. 343, 2014 PDF
  6. M. White. How Big Data is Changing Science (and Society). Pacific Standard, 8 November 2013 PDF
  7. J. Manyika et al. Big data: The next frontier for innovation, competition, and productivity (executive summary), McKinsey Global Institute, May 2011 PDF
  8. D. Boyd and K. Crawford. Critical Questions for Big Data. Inform. Comm. Soc., Vol. 15(5), pp. 662–679, June 2012 PDF
  9. M.J. Zaki et al. New algorithms for fast discovery of association rules. In 3rd International Conference on Knowledge Discovery and Data Mining (KDD), 1997 PDF
  10. J. Han et al. Mining Frequent Patterns without Candidate Generation. In ACM SIGMOD International Conference on Management of Data (SIGMOD), 2000 PDF
  11. R.C. Agarwal et al. A tree projection algorithm for generation of frequent item sets. J. Parallel Distr. Com. Vol. 61(3), pp. 350–371, 2001 PDF