Lecture "Discrete Topics in Data Mining" WS 2012/13, 3 ECTS credits

Lecturer

Dr Pauli Miettinen

News

The exam and re-exam will be in room 021 at the MPII building (E1.4), ground floor.
There is no lecture on 27 November. The course schedule has been updated.
Students must register for the final exam in HISPOS by 4th of November.
The final exam will be held on 19th of February; the re-exam will be on 19th of March.

Lectures

Two hours per week, on Tuesdays, from noon till 2 pm. Place: room 007 in building E2 1 (bioinformatics).

Schedule & Slides

The schedule of the course and the slides are here.

The articles related to each topic are listed below. To access the PDFs, you need the username and password.

Essay Topics

The essay topics are here.

Guidelines to Return the Essays

The essays must be returned in PDF format via e-mail to the lecturer (see slides or lecturer's home page for the address). Unless otherwise specified, the deadline for an essay is 14:00 (2 pm) on the day two weeks after its topics were given. Failure to submit the essay on time will result in a failed grade. The time of the submission is the timestamp of the mail as shown by the lecturer's e-mail system. It is advisable to send the essays before noon, so that if you have not received a response, you can ask about the status of your essay before the lecture.

Every essay you return must have the following information:

Your name
Your matriculation number
Your e-mail address
The topic of the essay (even if there was only one given)

In addition, it is advisable to start the subject line of the mail with "DTDM" and have the word "essay" somewhere in the subject. This helps me to notice the purpose of the mail and (hopefully) prevents the spam filters from filtering the mails.

There are no page limits for the essays, but I expect a good essay to take between two and five A4 pages in 10pt font with 2.5cm margins all around (you are free to use other font sizes and margins as long as the text stays legible).

The essays must follow the normal scientific citation practices. Substantial failure to do so will result in a failed grade for the essay. The essays may contain (numbered) section and subsection headings if the author so prefers.

Content

The course will provide an overview of some important topics in data mining. The purpose of the course is to concentrate on the ideas and intuition behind these topics, with the aim that after the course, the students can follow the current research on the topics.

The exact topics covered in this lecture will be announced later (and students' preferences can be considered), but tentatively we will cover at least pattern set mining, graph mining, and significance testing (in pattern set mining).

The course will have two hours of lectures every week. There will not be any homework sessions. Instead, students have to write longer essays/reports on the topics covered in the lectures.

Prerequisites

Students are expected to have passed either the Information Retrieval & Data Mining or the Machine Learning core lecture, or to have equivalent knowledge.

Requirements for Passing the Course

At least four (4) passing grades from the essays (out of five essays)
Final exam

Grading System

The essays are graded as failed, passed, or excellent. Out of the five essays, you need a passing grade in at least four to be allowed to take the final exam. If you take and pass the final exam, each excellent essay grade improves your final grade by one third of a grade (one grade step) relative to your exam grade. For example, if you got 2.0 in the final exam and have one excellent grade, your final grade will be 1.7; if you have three or more excellent grades, it will be 1.0.
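The grading rule above can be sketched in Python. This is only an illustration, not the official computation: the grade scale in `GRADE_STEPS`, the function name `final_grade`, and the choice to count an excellent essay as a passing one are assumptions (the page itself only mentions the grades 2.0, 1.7 and 1.0).

```python
# Assumed grade scale, best to worst; the page only shows 2.0, 1.7 and 1.0.
GRADE_STEPS = [1.0, 1.3, 1.7, 2.0, 2.3, 2.7, 3.0, 3.3, 3.7, 4.0]


def final_grade(exam_grade, essay_grades):
    """Return the final grade, or None if the course is not passed.

    essay_grades is a list of "failed", "passed" or "excellent" strings.
    """
    # Assumption: an "excellent" essay also counts as a passing essay.
    passing = sum(g in ("passed", "excellent") for g in essay_grades)
    if passing < 4:
        return None  # not allowed to take the final exam
    if exam_grade not in GRADE_STEPS:
        return None  # assumption: a grade outside the scale is a failed exam
    excellent = essay_grades.count("excellent")
    # Each excellent essay moves the grade one step up, but never past 1.0.
    index = max(0, GRADE_STEPS.index(exam_grade) - excellent)
    return GRADE_STEPS[index]


# The two cases from the text: 2.0 in the exam with one or three excellent essays.
one_excellent = final_grade(2.0, ["excellent", "passed", "passed", "passed", "failed"])
three_excellent = final_grade(2.0, ["excellent"] * 3 + ["passed", "failed"])
```

Under these assumptions, `one_excellent` is 1.7 and `three_excellent` is 1.0, matching the example in the text.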

Articles for the Topics

Topic I, Pattern Set Mining

Geerts, F., Goethals, B. & Mielikäinen, T., 2004. Tiling databases. In 7th International Conference on Discovery Science. pp. 278–289. PDF
Gionis, A., Mannila, H. & Seppänen, J.K., 2004. Geometric and Combinatorial Tiles in 0–1 Data. In 8th European Conference on Principles and Practice of Knowledge Discovery in Databases. pp. 173–184. PDF
Tatti, N. & Vreeken, J., 2012. Discovering Descriptive Tile Trees By Mining Optimal Geometric Subtiles. In 2012 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. pp. 9–24. PDF
Vreeken, J., van Leeuwen, M. & Siebes, A., 2011. Krimp: mining itemsets that compress. Data Mining and Knowledge Discovery, 23(1), pp. 169–214. PDF

Topic II, Graph Mining

Inokuchi, A., Washio, T. & Motoda, H., 2002. An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data. In 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, pp. 13–23. PDF
Yan, X. & Han, J., 2002. gSpan: Graph-Based Substructure Pattern Mining. In 2nd IEEE International Conference on Data Mining, pp. 721–724. PDF (extended technical report PDF)
Shahaf, D. & Guestrin, C., 2010. Connecting the Dots Between News Articles. In 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 623–632. PDF
Shahaf, D. & Guestrin, C., 2012. Connecting Two (or Less) Dots: Discovering Structure in News Articles. ACM Transactions on Knowledge Discovery from Data, 5(4), article 24. PDF
Shahaf, D., Guestrin, C. & Horvitz, E., 2012a. Trains of Thought: Generating Information Maps. In 21st International World Wide Web Conference, pp. 899–908. PDF
Shahaf, D., Guestrin, C. & Horvitz, E., 2012b. Metro Maps of Science. In 18th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1122–1130. PDF

Topic III, Significance Testing

Kirsch, A., Mitzenmacher, M., Pietracaprina, A., Pucci, G., Upfal, E. & Vandin, F., 2012. An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets. Journal of the ACM, 59(3), article 12. PDF
Gionis, A., Mannila, H., Mielikäinen, T. & Tsaparas, P., 2007. Assessing data mining results via swap randomization. ACM Transactions on Knowledge Discovery from Data, 1(3), article 14. PDF
Ojala, M., Vuokko, N., Kallio, A., Haiminen, N. & Mannila, H., 2009. Randomization methods for assessing data analysis results on real-valued matrices. Statistical Analysis and Data Mining, 2, pp. 209–230. PDF
Ojala, M., 2010. Assessing Data Mining Results on Matrices with Randomization. In 10th IEEE International Conference on Data Mining, pp. 959–964. PDF
De Bie, T., 2010. Maximum entropy models and subjective interestingness: An application to tiles in binary databases. Data Mining and Knowledge Discovery, 23(3), pp. 407–446. PDF
Kontonasios, K.-N. & De Bie, T., 2010. An information-theoretic approach to finding informative noisy tiles in binary databases. In 2010 SIAM International Conference on Data Mining, pp. 153–164. PDF
Kontonasios, K.-N., Vreeken, J. & De Bie, T., 2011. Maximum Entropy Modelling for Assessing Results on Real-Valued Data. In 11th IEEE International Conference on Data Mining, pp. 350–359. PDF

Topic IV, Tensors

Kolda, T. G. & Bader, B. W., 2009. Tensor Decompositions and Applications. SIAM Review, 51(3), pp. 455–500. PDF
Cerf, L., Besson, J., Robardet, C. & Boulicaut, J.-F., 2009. Closed patterns meet n-ary relations. ACM Transactions on Knowledge Discovery from Data, 3(1), article 3. PDF
Cerf, L., Besson, J., Nguyen, K.-N. T. & Boulicaut, J.-F., 2013. Closed and noise-tolerant patterns in n-ary relations. Data Mining and Knowledge Discovery, 26(3), pp. 574–619. PDF
Miettinen, P., 2011. Boolean Tensor Factorizations. In 11th IEEE International Conference on Data Mining. pp. 447–456. PDF
Nickel, M., Tresp, V. & Kriegel, H.-P., 2011. A Three-Way Model for Collective Learning on Multi-Relational Data. In 28th International Conference on Machine Learning. pp. 809–816. PDF
Nickel, M., Tresp, V. & Kriegel, H.-P., 2012. Factorizing YAGO: Scalable Machine Learning for Linked Data. In 21st International World Wide Web Conference. pp. 271–280. PDF

Background Literature

Mohammed J. Zaki, Wagner Meira Jr. Fundamentals of Data Mining Algorithms, manuscript (pdf, requires username and password)
Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining, Addison-Wesley, 2006. (Website)
Jiawei Han, Micheline Kamber, Jian Pei. Data Mining - Concepts and Techniques, 3rd ed., Morgan Kaufmann, 2011. (Website)