Final exam and assignment of the course on advanced database systems, 2013-2014
Grading
Grades are based on a final exam (%) and a final assignment.
Final exam
The final exam will be oral. The teacher's questions may cover any topic
among those discussed in the course slides. Students are required
to have understood the main concepts behind the technologies and
the methods presented during the lectures.
Remembering all the details and all the exercises of the more complex topics is not mandatory,
but students should be able to reconstruct most of those details and exercises from their
knowledge of the main concepts.
Dates
Students can take the final examination and submit the assignment on official exam
days or (upon request) during office hours.
Final assignment
The goal of the final assignment is to study a topic in data mining
or information retrieval in depth.
Students can choose between two types of final assignment:
To study a research paper.
To carry out a small project which includes an experimental part.
The final assignment can be carried out in groups of up to 2 students.
While the students are carrying out the assignment,
the teacher is available for hints and to check preliminary versions of the assignment.
The topic of the project must be discussed in advance with the teacher.
Some examples are listed below. Students may also propose their own topics.
Research papers
Students must produce a short slide presentation (10-15 minutes)
of a research manuscript (obviously, the presentation must be produced autonomously
and cannot be copied from any existing one).
Web spam detection is the problem of recognizing pages that contain spam.
A survey of spam detection methods is given in [1].
The two papers [2], [3] demonstrate the importance of using web connectivity
to detect spam.
Learning to rank is the application of machine learning to the construction
of ranking models for information retrieval systems.
The work [8] provides a general description of how such a problem can be faced in search engines.
Several tutorials about learning to rank are available
here.
Three of the best known
methods are:
RankNet [4] (which is said to be used in Microsoft's Bing),
AdaRank [5] and ListNet [6].
Finally, the method [9] was developed at our department.
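As an illustration of the pairwise idea behind methods such as RankNet, the following sketch learns a linear scoring function from preference pairs with a perceptron-style update. It is a toy example with made-up features, not an implementation of any of the cited algorithms:

```python
# Minimal pairwise learning-to-rank sketch (illustrative only).
# Each document is a feature vector; we learn weights w so that more
# relevant documents receive a higher score w . x than less relevant ones.

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_pairwise(pairs, n_features, epochs=50, lr=0.1):
    """pairs: list of (x_relevant, x_less_relevant) feature-vector pairs."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            if score(w, better) <= score(w, worse):  # ranking mistake
                for i in range(n_features):
                    w[i] += lr * (better[i] - worse[i])
    return w

# Toy data: feature 0 = term match, feature 1 = document length (noise).
relevant   = [1.0, 0.2]
irrelevant = [0.1, 0.9]
w = train_pairwise([(relevant, irrelevant)], n_features=2)
docs = {"doc_a": irrelevant, "doc_b": relevant}
ranking = sorted(docs, key=lambda d: score(w, docs[d]), reverse=True)
print(ranking)  # the relevant document ("doc_b") comes first
```

The published methods differ mainly in the loss they minimize over such pairs (or over whole ranked lists, as in ListNet), but the pairwise preference data is the common starting point.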
Text categorization is the problem of classifying a text according to its content.
A survey of the methods used for text categorization is given in [12].
The most common approaches are based on naive Bayes, e.g. [10], and on support vector machines, e.g.
[11].
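As a concrete illustration of the first family, a much-simplified multinomial naive Bayes classifier with Laplace smoothing can be sketched as follows (the toy documents and labels are made up; real systems add feature selection, term weighting, and larger vocabularies):

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (word_list, label). Returns the model parameters."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab, len(docs)

def classify(model, words):
    label_counts, word_counts, vocab, n_docs = model
    best_label, best_logp = None, float("-inf")
    for label in label_counts:
        # log prior + sum of log likelihoods with Laplace (add-one) smoothing
        logp = math.log(label_counts[label] / n_docs)
        total = sum(word_counts[label].values())
        for w in words:
            logp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

model = train([
    (["cheap", "pills", "buy"], "spam"),
    (["meeting", "agenda", "notes"], "ham"),
])
print(classify(model, ["buy", "cheap"]))  # "spam"
```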
Advertising on the web makes it possible to deliver a customized advertisement to each user.
However, matching users and advertisements is a difficult and resource-consuming task.
An introduction to this topic is given in [19].
Predicting user behaviours allows the implementation of tools that help users
by suggesting products and, more generally, useful information.
These tools are called recommender systems and have been widely used on the web.
This topic is introduced in [20].
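A very rough sketch of the underlying idea is to recommend items that frequently co-occur with the user's past choices. The data below is made up, and this is a much-simplified version of the collaborative filtering methods surveyed in the literature, not any specific published algorithm:

```python
from collections import Counter

def recommend(histories, user_items, n=1):
    """Recommend up to n items that co-occur with the user's items.

    histories: past item lists of other users; user_items: set of items
    the current user has already chosen."""
    co = Counter()
    for history in histories:
        if user_items & set(history):       # history overlaps with the user
            for item in history:
                if item not in user_items:  # count the unseen items
                    co[item] += 1
    return [item for item, _ in co.most_common(n)]

histories = [["bread", "milk"], ["bread", "milk", "eggs"], ["beer", "chips"]]
print(recommend(histories, {"bread"}))  # ["milk"]
```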
Detecting frauds is a crucial activity for several companies and organizations, e.g.
insurance companies, telephone companies and banks. Fraud detection is a difficult task,
since fraudulent behaviours are often generated by anomalous activities,
for which a sufficient number of examples is not available.
Fraud detection techniques are reviewed in [13] and [14].
Medical informatics is a field in which
data mining techniques play an important role.
A wide variety of applications are encountered in medical informatics.
A survey of those applications is given in [15].
Bioinformatics is the application of information technology to the field of molecular biology.
Such a field is characterized by huge biological datasets, whose analysis involves advanced
data mining tools using techniques from databases, statistics, computer science
and machine learning. Chapter [16] includes a survey on this topic.
Modelling future customer behaviour is
very important for customer relationship management (e.g. to manage call centers,
to improve marketing and sales efforts). Customer behaviour can be modelled
by exploiting both past behaviours and demographic information. A survey on this topic is given in [17].
The goal of feature selection methods is to reduce the domain dimension in data mining applications,
by distinguishing important features from unimportant ones.
In fact, data mining techniques (e.g., methods for classification, clustering and association rule
extraction) often must cope with domains containing a large number of features, most
of which are not useful to solve the problem at hand.
An introduction to this topic is given in [21].
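A minimal filter-style sketch of the idea: score each feature by how well it separates two classes and keep the top-k. The mean-difference criterion and the toy data below are illustrative choices; the methods in the literature use criteria such as mutual information or chi-square:

```python
def select_features(samples, labels, k):
    """Return the indices of the k features that best separate two classes,
    scored by the absolute difference of the class means."""
    classes = sorted(set(labels))
    assert len(classes) == 2, "this sketch handles the two-class case"

    def mean(cls, j):
        vals = [x[j] for x, y in zip(samples, labels) if y == cls]
        return sum(vals) / len(vals)

    n_features = len(samples[0])
    scores = [abs(mean(classes[0], j) - mean(classes[1], j))
              for j in range(n_features)]
    # indices of the k features with the largest separation between classes
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]

# Toy data: feature 0 discriminates, features 1 and 2 are uninformative.
X = [[1.0, 0.5, 0.1],
     [0.9, 0.5, 0.2],
     [0.1, 0.5, 0.1],
     [0.2, 0.5, 0.2]]
y = ["pos", "pos", "neg", "neg"]
print(select_features(X, y, k=1))  # [0]
```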
Deep neural networks are a particular class of artificial neural networks characterized by being composed
of several hidden layers. Recently, they have been applied to several problems, such as
image classification [22,23] and speech recognition [24]: the success
has been so evident that several companies, including Google, Microsoft and Facebook, immediately started to use
such technology in their products.
Experimental projects
The following is a list of examples of experimental projects. The list will be modified in the future
on the basis of project assignments and as new benchmarks become available.
Students can propose their own projects,
which can also be related to a project in another course or to an internship.
Students should discuss the project in advance with the teacher.
If the project is particularly demanding, the student may be granted a reduction
in the topics to study for the oral examination.
Students must present the results of their experimental activity to the teacher, via a short slide presentation
or by directly showing the results produced by the software.
WEBSPAM-UK2006 is a benchmark that has been used in one of the
competitions whose goal was to compare web spam detection algorithms (see
this site
for information about the competition).
The dataset
contains a snapshot of the web with 18 million pages extracted from 11,400 hosts.
Such a benchmark can be used to experiment with spam detection methods.
LETOR has been extensively used to test query ranking
algorithms. LETOR includes several benchmarks, each one containing information
about a set of queries and a set of documents. The purpose is to design
algorithms that are able to sort the documents according to their relevance
to the queries.
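The evaluation loop of such an experiment can be sketched as follows: sort the documents of one query by the scores a ranker produced and compare the top of the ranking against the relevance labels, here with precision at k. The scores and labels below are made up for illustration:

```python
def precision_at_k(scores, relevance, k):
    """scores/relevance: parallel lists over the documents of one query.
    Returns the fraction of relevant documents among the top k."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = ranked[:k]
    return sum(1 for i in top_k if relevance[i] > 0) / k

# Hypothetical ranker output for four documents of one query,
# with binary relevance labels from the benchmark.
scores    = [0.9, 0.1, 0.7, 0.3]
relevance = [1,   0,   1,   0]
print(precision_at_k(scores, relevance, k=2))  # 1.0
```

LETOR's official evaluation also reports graded-relevance measures such as NDCG, but the loop has the same shape: score, sort, compare against the labels.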