General Guidelines on Topics

Select one of the given topics for each part of the course. The questions below each topic are meant to give you an idea of what to study. You do not need to answer every one of them, and you can (and should) consider other questions as well.

The essays are not about the answers; they are about justifying the answers. A simple list of answers with no justification is not enough (even with a citation). Rather, you must explain why you think the answer is what you say it is: for example, explain in your own words how your sources justify the answer or, if there is no source, explain your own reasoning. For some questions, the justification can be as simple as a one-line equation; for others, more argumentation is required.

Essay Topics

Warm-up Essay (DL 30 October)

What is Data Mining?
Is Data Mining a Science?

Pattern Set Mining (DL 20 November)

0/1 Tiling versus Density Tiling

Pros and cons of both methods?
When can they be used?
When should they be used?
When is one better than the other?
Can we use one to get the other?
Better algorithms?

0/1 Tiling versus Krimp

Same questions as above
Can we use parts of one in the other (e.g. MDL in tiling or Set Cover in Krimp)? Would that be useful?

MDL versus Bayesian Information Criterion (BIC)

This topic requires reading outside the scope of the lectures
Differences/similarities?
Pros and cons?
When is one better than the other?
Which one should I use?
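
As a starting point for the comparison, the two criteria can be written side by side in their textbook forms. This is only a sketch: the symbols below are assumptions introduced here and may differ from the notation in the lectures or in your sources. Here $k$ is the number of free parameters of the model $M$, $n$ is the number of data points, $\hat{L}$ is the maximized likelihood, and $L(\cdot)$ is a code length in bits. In both cases the model with the smallest score is preferred.

```latex
\begin{align*}
  \mathrm{BIC}(M) &= k \ln n - 2 \ln \hat{L} \\
  L_{\mathrm{MDL}}(D, M) &= L(M) + L(D \mid M)
    \qquad \text{(two-part, or crude, MDL)}
\end{align*}
```

A good essay would discuss, among other things, how the choice of encoding for $L(M)$ compares with the fixed penalty $k \ln n$.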

Graph Mining (DL 18 December)

Applications of Frequent Subgraph Mining

Explore some applications studied in scientific literature
What is the data and how is it modelled as a graph?
What are the frequent subgraphs? Why are they interesting?
Are there restrictions on the type of subgraphs (trees, DAGs, etc.)? Why?

Metro Maps of Science

Read Metro Maps of Science (PDF)
Explain the work
Relations to other work?
Your opinion about it (Interesting? Useful? Boring?)

Parameters in Connecting the Dots and Trains of Thought

What are the user-supplied parameters?
What do they do?
Are their effects intuitive?
How to select good values for them?
Too many? Too few?
Your opinion about user-supplied parameters in general

Significance Testing (DL 29 January)

Swap-based methods vs. maximum entropy methods

What are they and how do they work?
What are their similarities and differences?
Is one clearly better than the other? In some special application? If yes, when?
Consider both binary and continuous data.
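
To make the first question concrete, the basic idea behind swap-based methods for binary data can be sketched as follows. This is a minimal illustration, not the algorithm from any particular paper: the function name `swap_randomize` and its parameters are hypothetical, and the sketch covers binary matrices only. It repeatedly picks two rows and two columns and, if the corresponding 2x2 submatrix has 1s on one diagonal and 0s on the other, flips it. Every such swap keeps all row and column sums unchanged.

```python
import random


def swap_randomize(matrix, attempts, seed=None):
    """Return a randomized copy of a 0/1 matrix with the same row and column sums.

    `matrix` is a list of equal-length lists of 0s and 1s.
    `attempts` is the number of swap attempts (not all of them succeed).
    """
    rng = random.Random(seed)
    m = [row[:] for row in matrix]  # work on a copy, leave the input intact
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(attempts):
        r1, r2 = rng.randrange(n_rows), rng.randrange(n_rows)
        c1, c2 = rng.randrange(n_cols), rng.randrange(n_cols)
        # A swap is possible only if (r1, c1) and (r2, c2) hold 1s
        # while (r1, c2) and (r2, c1) hold 0s.
        if (m[r1][c1] == 1 and m[r2][c2] == 1
                and m[r1][c2] == 0 and m[r2][c1] == 0):
            m[r1][c1] = m[r2][c2] = 0
            m[r1][c2] = m[r2][c1] = 1
    return m


data = [[1, 0, 1, 0],
        [0, 1, 0, 1],
        [1, 1, 0, 0]]
randomized = swap_randomize(data, attempts=1000, seed=0)
print([sum(row) for row in randomized])        # row sums match those of data
print([sum(col) for col in zip(*randomized)])  # column sums match those of data
```

A statistic computed on the original data can then be compared against its values over many such randomized copies. How many swap attempts are needed, and how this idea carries over to continuous data, are questions for the essay.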

Method for finding a frequency threshold for significant itemsets vs. other methods for significance testing

Consider the method of Kirsch et al. (2012) (PDF)
How does it relate to swap-based methods?
What about MaxEnt methods?
Only binary data.

Tensors (DL 12 February)

N-way itemset mining vs. normal itemset mining

What's so hard with tensors?
Why not use N-way Apriori?
- How would that work?
Do the definitions of maximal and non-derivable itemsets also generalize to N modes?

Noise-tolerant N-way itemsets

Consider the method by Cerf et al. (2013) (PDF)
Explain the (main) ideas and algorithm.
Can this method be used to compute Boolean CP decomposition? How?

Will the BU problem be a problem?

Applications of tensor decompositions in data mining

Present some work that applies tensor decompositions in data mining.
Explain the ideas.
Are tensors necessary for the idea?
Is the work good/relevant/interesting?
Kolda & Bader (2009) list a number of applied works (PDF); more recent work can be found, for example, using Google.