Final exam and assignment of the course on Advanced Database Systems, 2013-2014
Grading
Grades are based on a final exam and a final assignment.
Final exam
The final exam will be oral. The teacher's questions may cover any topic
discussed in the course slides. Students are expected
to have understood the main concepts underlying the technologies and
the methods presented during the lectures.
Remembering all the details and all the exercises of the more complex topics is not mandatory,
but students should be able to reconstruct most of those details and exercises from their
knowledge.
Dates
Students can take the final examination and submit the assignment on official exam days
or (upon request) during office hours (Wednesday, 10:30-12:30).
Final assignment
The goal of the final assignment is to explore in depth a topic in data mining
or information retrieval.
Students can choose between two types of final assignment:
studying a research paper, or carrying out a small project that includes an
experimental part.
In both cases, students should prepare a short slide presentation
(10-15 minutes) describing their activity (obviously, the presentation
must be produced independently and cannot be copied
from any existing one).
The final assignment can be carried out
in groups of up to 2 students. While students are working on the assignment,
the teacher is available for hints and to review preliminary versions of the assignment.
The topic of the project must be discussed with the teacher.
Some examples are listed below. Students may also propose their own topics.
Research papers (provisional list: it will be extended soon)
Web spam detection is the problem of recognizing pages that contain spam.
A survey of spam detection methods can be found in [1].
The two papers [2], [3] demonstrate the importance of using web connectivity
to detect spam.
Learning to rank is the application of machine learning to the construction
of ranking models for information retrieval systems.
The work [8] provides a general description of how such a problem can be addressed in search engines.
Several tutorials about learning to rank are available
here.
Three of the best-known
methods are:
RankNet [4] (which is said to be used in Microsoft's Bing),
AdaRank [5], and ListNet [6].
Finally, the method [9] has been developed at our department.
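To give a flavour of the pairwise approach used by methods such as RankNet (this is a simplified sketch, not the actual algorithm of [4]), the following toy code trains a linear scoring function so that, for each (better, worse) document pair, the better document receives the higher score. The features, pairs, and hyperparameters are invented for illustration.

```python
# Minimal pairwise learning-to-rank sketch: a linear scoring function
# trained with a logistic loss on score differences between document pairs.
import math

def train_pairwise(pairs, num_features, lr=0.1, epochs=200):
    """pairs: list of (better_features, worse_features) tuples."""
    w = [0.0] * num_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            s = sum(wi * di for wi, di in zip(w, diff))
            # gradient of the pairwise logistic loss log(1 + exp(-s))
            g = -1.0 / (1.0 + math.exp(s))
            for i in range(num_features):
                w[i] -= lr * g * diff[i]
    return w

def score(w, features):
    return sum(wi * fi for wi, fi in zip(w, features))

# Each document is a feature vector, e.g. (term frequency, link count).
pairs = [((3.0, 1.0), (1.0, 0.5)), ((2.5, 2.0), (0.5, 1.0))]
w = train_pairwise(pairs, 2)
# After training, the "better" document of each pair scores higher.
```

Real learning-to-rank systems use many more features per query-document pair and nonlinear models, but the core idea of optimizing over ordered pairs is the same.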
Text categorization is the problem of classifying a text according to its content.
A survey of the methods used for text categorization can be found in [12].
The most common approaches are based on naive Bayes, e.g. [10], and on support vector machines, e.g.
[11].
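As a concrete illustration of the naive Bayes approach (a toy sketch, not the method of [10]), the following code builds a multinomial naive Bayes classifier with Laplace smoothing; the training documents and the two categories are made up.

```python
# Toy multinomial naive Bayes text classifier with Laplace smoothing.
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (text, label). Returns priors and per-class word counts."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_docs[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_docs, word_counts, vocab

def classify(text, class_docs, word_counts, vocab):
    total_docs = sum(class_docs.values())
    best, best_score = None, float("-inf")
    for label in class_docs:
        log_score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in text.lower().split():
            # Laplace smoothing avoids zero probabilities for unseen words.
            p = (word_counts[label][w] + 1) / (total_words + len(vocab))
            log_score += math.log(p)
        if log_score > best_score:
            best, best_score = label, log_score
    return best

training = [("cheap pills buy now", "spam"),
            ("meeting agenda attached", "ham"),
            ("buy cheap watches", "spam"),
            ("project meeting tomorrow", "ham")]
model = train(training)
print(classify("buy cheap pills", *model))   # prints "spam"
```

Despite its independence assumption over words, this simple model is a surprisingly strong baseline for text categorization.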
Web link analysis allows search engines to measure the importance of pages.
PageRank was introduced in [13]. A deep analysis of PageRank's properties can be found in [14].
Extensions of PageRank include ranks based on document contents [15], [16],
HITS [17], and TrustRank (currently used by Google).
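To make the idea concrete, here is a minimal power-iteration sketch of PageRank on a toy link graph (the graph, the damping factor 0.85, and the iteration count are illustrative choices, not the exact formulation of [13]):

```python
# Minimal PageRank sketch: power iteration on a toy link graph.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outgoing in links.items():
            if not outgoing:          # dangling page: spread its rank uniformly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                for q in outgoing:
                    new_rank[q] += damping * rank[p] / len(outgoing)
        rank = new_rank
    return rank

toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_graph)
# The ranks sum to 1; pages with more incoming links score higher,
# so here C outranks B.
```

On real web graphs the same iteration is run over millions of pages with sparse data structures, but the fixed-point computation is identical in spirit.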
Advertising on the web makes it possible to deliver a customized advertisement to each user.
However, matching users and advertisements is a difficult and resource-consuming task.
An introduction to this topic can be found in [19].
Predicting user behaviour makes it possible to implement tools that help users
by suggesting products and, more generally, useful information.
These tools are called recommender systems and have been widely used on the web.
This topic is introduced in [20].
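A minimal sketch of the collaborative-filtering idea behind many recommender systems (the users, items, and the overlap-based similarity below are invented for illustration): recommend to a user the items liked by users with similar tastes.

```python
# Toy user-based collaborative filtering: items liked by similar users
# are ranked by the size of the taste overlap with the target user.

def recommend(target, ratings):
    """ratings: {user: set of liked items}. Returns candidate items for
    the target user, ranked by how similar their fans are to the target."""
    scores = {}
    target_items = ratings[target]
    for user, items in ratings.items():
        if user == target:
            continue
        overlap = len(items & target_items)   # crude similarity measure
        for item in items - target_items:
            scores[item] = scores.get(item, 0) + overlap
    return sorted(scores, key=scores.get, reverse=True)

ratings = {"alice": {"matrix", "inception"},
           "bob":   {"matrix", "inception", "memento"},
           "carol": {"titanic"}}
# "memento" ranks first for alice, since bob shares two of her likes.
suggestions = recommend("alice", ratings)
```

Production systems replace the overlap count with better similarity measures (cosine, Pearson) or with matrix factorization, but the neighbourhood intuition is the same.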
Data mining techniques (e.g., methods for classification, clustering, and association rule
extraction) often must cope with domains containing a large number of features, most
of which are not useful for solving the problem at hand. Several methods have been designed
to reduce the domain dimension by distinguishing important features from unimportant ones.
An introduction to this topic can be found in [21].
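As a toy illustration of filter-style feature selection (a deliberately crude score, not a method from [21]): rank each feature by how far apart its per-class means are, so that noisy features can be discarded. The data below is made up.

```python
# Minimal filter-style feature selection sketch: score each feature by the
# distance between its per-class means (a crude separability measure).

def feature_scores(X, y):
    """X: list of feature vectors, y: list of 0/1 labels.
    Returns one score per feature: |mean in class 1 - mean in class 0|."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        vals0 = [x[j] for x, label in zip(X, y) if label == 0]
        vals1 = [x[j] for x, label in zip(X, y) if label == 1]
        scores.append(abs(sum(vals1) / len(vals1) - sum(vals0) / len(vals0)))
    return scores

# Feature 0 separates the two classes; feature 1 is pure noise.
X = [(0.1, 5.0), (0.2, 4.9), (0.9, 5.1), (1.0, 5.0)]
y = [0, 0, 1, 1]
scores = feature_scores(X, y)
# Feature 0 receives a much higher score than feature 1.
```

Practical filter methods use statistics such as mutual information or the chi-squared test instead of raw mean differences, but the ranking-and-pruning workflow is the same.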
Deep neural networks are a particular class of artificial neural networks characterized by being composed
of several hidden layers. Recently, they have been applied to several problems, such as
image classification [22,23] and speech recognition [24]: their success
has been so evident that several companies, including Google, Microsoft, and Facebook, immediately started to use
this technology in their products.
Experimental projects (provisional list: it will be extended soon)
WEBSPAM-UK2006 is a benchmark that has been used in one of the
competitions whose goal was to compare web spam detection algorithms (see
this site
for information about the competition).
The dataset
contains a snapshot of the web with 18 million pages extracted from 11,400 hosts.
Such a benchmark can be used to experiment with spam detection methods.
LETOR has been extensively used to test query ranking
algorithms. LETOR includes several benchmarks, each one containing information
about a set of queries and a set of documents. The purpose is to design
algorithms that sort the documents according to their relevance
to the queries.
Recently, a research group at our university has constructed a large dataset of about 160,000 proteins.
In the dataset, each pattern describes the amino acid distribution in the core of a protein
and the classification of the protein. Such a dataset can be used to study the
relationships between the characteristics of the proteins and the characteristics
of their cores.