Lecture "Discrete Topics in Data Mining" WS 2012/13, 3 ECTS credits
- The exam and re-exam will be in room 021 on the ground floor of
the MPII building (E1.4).
- There is no lecture on 27 November.
The course schedule is updated.
- You must register for the final exam in HISPOS by 4th of November.
- The final exam will be held on 19th of February; the re-exam will be on
19th of March.
Two hours per week, on Tuesdays, from noon till 2 pm. Place: room 007 in
building E2 1 (bioinformatics).
Schedule & Slides
The schedule of the course and the slides are here.
The articles related to each topic are listed
below. To access the PDFs, you need the username
The essay topics are here.
Guidelines to Return the Essays
The essays must be returned in PDF format via e-mail to the lecturer (see
slides or lecturer's home page for the address). Unless otherwise specified,
the deadline for an essay is 14:00 (2 pm) two weeks after the date the topics
were given. Failure to submit the essay on time will result in a failed grade.
The time of submission is the timestamp of the mail as shown by the lecturer's
e-mail system. It is advisable to send the essays before noon, so that if you
have not received a response, you can ask about the status of your essay
before the lecture.
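The deadline rule above can be sketched in Python as follows. This is only an
illustration: the function names and the example date are hypothetical, and
treating a mail stamped exactly 14:00 as on time is an assumption, not
something the rules state.

```python
from datetime import date, datetime, time, timedelta


def essay_deadline(topics_given: date) -> datetime:
    """Return 14:00 on the day two weeks after the topics were given."""
    return datetime.combine(topics_given + timedelta(weeks=2), time(14, 0))


def is_on_time(mail_timestamp: datetime, topics_given: date) -> bool:
    """Check a mail timestamp against the deadline.

    Assumption: a timestamp of exactly 14:00 still counts as on time.
    """
    return mail_timestamp <= essay_deadline(topics_given)


# Hypothetical example: topics handed out on 30 October 2012.
given = date(2012, 10, 30)
print(essay_deadline(given))                               # 2012-11-13 14:00:00
print(is_on_time(datetime(2012, 11, 13, 11, 45), given))   # True
print(is_on_time(datetime(2012, 11, 13, 14, 1), given))    # False
```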
Every essay you return must have the following information:
- Your name
- Your matriculation number
- Your e-mail address
- The topic of the essay (even if there was only one given)
In addition, it is advisable to start the subject line of
the mail with "DTDM" and have the word "essay" somewhere in the subject. This
helps me to notice the purpose of the mail and (hopefully) prevents the spam
filters from filtering the mails.
There are no page limits for the essays, but I expect a good essay to take
between two and five A4 pages in a 10pt font with 2.5cm margins all around (you
are free to use other font sizes and margins as long as the text stays legible).
The essays must follow the normal scientific citation practices. Substantial
failure to do so will result in a failed grade for the essay. The essays may
contain (numbered) section and subsection headings if the author so prefers.
The course will provide an overview of some important
topics in data mining. The purpose of the course is to concentrate on the
ideas and intuition behind these topics, with the aim that after the course,
the students can follow the current research on the topics.
The exact topics covered in this lecture will be announced later (and students'
preferences can be considered), but tentatively we will cover at least
pattern set mining, graph mining, and significance testing (in pattern set
The course will have two hours of lectures every week. There will not be any
homework sessions. Instead, students have to write longer essays/reports
on the topics covered in the lectures.
Students are expected to have passed either Information Retrieval &
Data Mining or Machine Learning core lectures, or hold equivalent
Requirements for Passing the Course
- At least four (4) passing grades from the essays (out of five essays)
- Final exam
The essays are graded as failed, passed, or excellent. Out of the five essays,
you need a passing grade in at least four to be allowed to take the
final exam. If you take and pass the final exam, each excellent essay grade
improves your final grade by one grade step (1/3 of a grade) relative to your
exam grade. That is, if you got 2.0 in the final exam and you have one
excellent grade, your final grade will be 1.7; if you have three or more
excellent grades, it will be 1.0.
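As a rough sketch of this bonus rule, the Python snippet below reads the
example above as one step on the grade scale per excellent essay. The scale
itself (1.0, 1.3, 1.7, ..., 4.0) and the function name are assumptions made for
illustration; the course rules above remain the authoritative statement.

```python
# Assumed scale of passing grades, best to worst (not stated on this page).
GRADE_SCALE = [1.0, 1.3, 1.7, 2.0, 2.3, 2.7, 3.0, 3.3, 3.7, 4.0]


def final_grade(exam_grade: float, excellent_essays: int) -> float:
    """Improve a passing exam grade by one scale step per excellent essay.

    The result never goes below the best grade, 1.0.
    """
    if exam_grade not in GRADE_SCALE:
        raise ValueError("exam_grade must be a passing grade on the assumed scale")
    if excellent_essays < 0:
        raise ValueError("excellent_essays must be non-negative")
    index = GRADE_SCALE.index(exam_grade)
    return GRADE_SCALE[max(0, index - excellent_essays)]


print(final_grade(2.0, 1))  # 1.7, as in the example above
print(final_grade(2.0, 3))  # 1.0
print(final_grade(2.0, 5))  # 1.0, the bonus is capped at the best grade
```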
Articles for the Topics
Topic I, Pattern Set Mining
- Geerts, F., Goethals, B. & Mielikäinen, T., 2004. Tiling databases.
In 7th International Conference on Discovery
Science. pp. 77–122.
- Gionis, A., Mannila, H. & Seppänen, J.K., 2004. Geometric and
Combinatorial Tiles in 0–1 Data. In 8th
European Conference on
Principles and Practice of Knowledge Discovery in Databases.
- Tatti, N. & Vreeken, J., 2012. Discovering Descriptive Tile Trees
By Mining Optimal Geometric Subtiles. In 2012
European Conference on Machine
Learning and Principles and Practice of Knowledge Discovery in Databases.
- Vreeken, J., van Leeuwen, M. & Siebes, A., 2011. Krimp:
mining itemsets that compress. Data Mining and Knowledge Discovery
Topic II, Graph Mining
- Inokuchi, A., Washio, T. & Motoda, H., 2002. An Apriori-Based
Algorithm for Mining Frequent Substructures from Graph Data. In
4th European Conference on Principles and
Practice of Knowledge Discovery in Databases, pp. 13–23.
- Yan, X. & Han, J., 2002. gSpan: Graph-Based Substructure Pattern
Mining. In 2nd IEEE International Conference on
Data Mining, pp. 721–724.
(extended technical report PDF)
- Shahaf, D. & Guestrin, C., 2010. Connecting the Dots Between News
Articles. In 16th ACM SIGKDD Conference on
Knowledge Discovery and Data Mining, pp. 623–632.
- Shahaf, D. & Guestrin, C., 2012. Connecting
Two (or Less) Dots: Discovering Structure in News Articles.
ACM Transactions on Knowledge Discovery from Data 5(4), article 24.
- Shahaf, D., Guestrin, C. & Horvitz, E., 2012a. Trains of Thought:
Generating Information Maps. In 21st International
World Wide Web Conference, pp. 899–908.
- Shahaf, D., Guestrin, C. & Horvitz, E., 2012b. Metro Maps of Science.
In 18th ACM SIGKDD Conference on Knowledge
Discovery and Data Mining, pp. 1122–1130.
Topic III, Significance Testing
- Kirsch, A., Mitzenmacher, M., Pietracaprina, A., Pucci, G., Upfal, E.
& Vandin, F., 2012. An Efficient Rigorous
Approach for Identifying Statistically Significant Frequent
Itemsets. Journal of the ACM, 59(3), article 12.
- Gionis, A., Mannila, H., Mielikäinen, T. & Tsaparas, P., 2007.
Assessing data mining results via swap randomization.
ACM Transactions on Knowledge Discovery from Data, 1(3), article
- Ojala, M., Vuokko, N., Kallio, A., Haiminen, N. & Mannila, H., 2009.
Randomization methods for assessing data analysis
results on real-valued matrices. Statistical Analysis and Data
Mining, 2, pp. 209–230.
- Ojala, M., 2010. Assessing Data Mining Results on Matrices with
Randomization. In 10th IEEE International
Conference on Data Mining, pp. 959–964.
- De Bie, T., 2010. Maximum entropy models and
subjective interestingness: An application to tiles in binary databases.
Data Mining and Knowledge Discovery, 23(3), pp. 407–446.
- Kontonasios, K.-N. & De Bie, T., 2010. An information-theoretic
approach to finding informative noisy tiles in binary databases.
In 2010 SIAM International Conference on Data
Mining, pp. 153–164.
- Kontonasios, K.-N., Vreeken, J. & De Bie, T., 2011. Maximum Entropy
Modelling for Assessing Results on Real-Valued Data. In
11th IEEE International Conference on Data Mining,
Topic IV, Tensors
- Kolda, T. G. & Bader, B. W., 2009. Tensor
Decompositions and Applications. SIAM Review, 51(3), pp.
- Cerf, L., Besson, J., Robardet, C. & Boulicaut, J.-F., 2009.
Closed patterns meet n-ary relations.
ACM Transactions on Knowledge Discovery from Data, 3(1), article 3.
- Cerf, L., Besson, J., Nguyen, K.-N. T. & Boulicaut, J.-F., 2013.
Closed and noise-tolerant patterns in n-ary
relations. Data Mining and Knowledge Discovery, 26(3),
- Miettinen, P., 2011. Boolean Tensor Factorizations. In
11th IEEE International Conference on Data
Mining. pp. 447–456.
- Nickel, M., Tresp, V. & Kriegel, H.-P., 2011.
A Three-Way Model for Collective Learning on Multi-Relational Data.
In 28th International Conference on Machine
Learning. pp. 809–816.
- Nickel, M., Tresp, V. & Kriegel, H.-P., 2012.
Factorizing YAGO: Scalable Machine Learning for
Linked Data. In 21st International World Wide
Web Conference. pp. 271–280.
- Mohammed J. Zaki, Wagner Meira Jr.
Fundamentals of Data Mining Algorithms,
manuscript (pdf, requires username and password)
- Pang-Ning Tan, Michael Steinbach, Vipin Kumar.
Introduction to Data Mining,
- Jiawei Han, Micheline Kamber, Jian Pei. Data Mining - Concepts and
Techniques, 3rd ed., Morgan Kaufmann, 2011. (Website)