Final exam and assignment for the course on advanced database systems, 2013-2014
Grading
Grades are based on a final exam (75%) and a final assignment (25%).
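As a minimal illustration of the weighting, the sketch below assumes that the two parts are marked on the same scale and combined as a plain weighted average. The function name and the sample marks are hypothetical.

```python
def final_grade(exam: float, assignment: float) -> float:
    """Combine the two marks with the stated weights (75% exam, 25% assignment).

    Assumption: both marks are expressed on the same scale.
    """
    return 0.75 * exam + 0.25 * assignment


# Hypothetical marks, for illustration only.
print(final_grade(28, 30))
```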
Final exam
The final exam is oral. The teacher's questions may concern any topic
discussed in the course slides. Students are required
to have understood the main concepts underlying the technologies and
the methods presented during the lectures.
Remembering every detail and every exercise of the more complex topics is not mandatory,
but students should be able to reconstruct most of those details and exercises from their
own knowledge.
Dates
Students can take the final exam and submit the assignment on the official
exam days or (upon request) during office hours (Wednesday, 14:30-16:30).
Final assignment
The goal of the final assignment is to explore in depth a topic in data mining
or information retrieval.
Students can choose between two types of final assignment:
studying a research paper, or carrying out a small project that includes an
experimental part.
In both cases, students should prepare a short slide presentation
(10-15 minutes) describing their work.
The final assignment can be carried out
in groups of up to 3 students. While students are working on the assignment,
the teacher is available for hints and to review preliminary versions of the assignment.
The topic of the assignment must be discussed with the teacher.
Some example topics are listed here. Students may also propose their own topics.
- Research papers
- Web spam detection is the problem of recognizing pages that contain spam.
A survey of spam detection methods is given in [1].
The two papers [2], [3] demonstrate the importance of using web connectivity
to detect spam.
- Learning to rank is the application of machine learning to the construction
of ranking models for information retrieval systems.
The work [8] gives a general description of how this problem can be addressed in search engines.
Several tutorials about learning to rank are available
here.
Three of the best-known methods are
RankNet [4] (which is said to be used in Microsoft's Bing),
AdaRank [5] and ListNet [6].
Finally, the method [9] was developed at our department.
- Computer clusters are required when an information retrieval system
has a huge number of users. MapReduce, a programming model
for processing and generating large data sets, is used by Google and
described in [7].
- Text categorization is the problem of classifying a text according to its content.
A survey of the methods used for text categorization is given in [12].
The most common approaches are based on naive Bayes, e.g. [10], and on support vector machines, e.g. [11].
- Web link analysis allows search engines to measure the importance of pages.
PageRank was introduced in [13]. An in-depth analysis of PageRank properties is given in [14].
Extensions of PageRank include ranks based on document contents [15], [16],
HITS [17] and TrustRank [18] (reportedly used by Google).
- Experimental projects
- WEBSPAM-UK2006 is a benchmark that was used in one of the
competitions aimed at comparing web spam detection algorithms (see
this site
for information about the competition).
The dataset
contains a snapshot of the web comprising 18 million pages extracted from 11,400 hosts.
Such a benchmark can be used to experiment with spam detection methods.
- LETOR has been extensively used to test query ranking
algorithms. LETOR includes several benchmarks, each containing information
about a set of queries and a set of documents. The goal is to design
algorithms that can sort the documents according to their relevance
to the queries.
- Recently, a research group at our university constructed a large dataset of about 160,000 proteins.
In the dataset, each pattern describes the amino acid distribution in the core of a protein
and the classification of the protein. Such a dataset can be used to study the
relationships between the characteristics of the proteins and the characteristics
of their cores.
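As a starting point for a ranking project such as the LETOR one, the following minimal sketch groups documents by query and sorts them by a score. It assumes the SVMlight-style line layout `<label> qid:<id> <feature>:<value> ... # comment`, which should be checked against the files actually downloaded. The sample lines, the feature weights and the function names are hypothetical, and the linear score is only a placeholder for a learned ranking model.

```python
from collections import defaultdict


def parse_letor_line(line):
    """Parse '<label> qid:<id> <fid>:<value> ... # comment' (assumed layout)."""
    body = line.split("#", 1)[0].split()  # drop the trailing comment
    label = int(body[0])
    qid = body[1].split(":", 1)[1]
    features = {}
    for token in body[2:]:
        fid, value = token.split(":", 1)
        features[int(fid)] = float(value)
    return label, qid, features


def rank_by_query(lines, weights):
    """Group documents by query and sort them by a linear score, best first."""
    per_query = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        label, qid, features = parse_letor_line(line)
        score = sum(weights.get(fid, 0.0) * v for fid, v in features.items())
        per_query[qid].append((score, label))
    return {
        qid: sorted(docs, key=lambda d: d[0], reverse=True)
        for qid, docs in per_query.items()
    }


# Hypothetical data: three documents for one query, two features each.
sample = [
    "2 qid:1 1:0.9 2:0.1 # doc-a",
    "0 qid:1 1:0.2 2:0.8 # doc-b",
    "1 qid:1 1:0.5 2:0.5 # doc-c",
]
weights = {1: 1.0, 2: -0.5}  # placeholder for a learned model
ranking = rank_by_query(sample, weights)
print([label for _, label in ranking["1"]])  # relevance labels in ranked order
```

A good ranking places the documents with the highest relevance labels first. A project would replace the fixed weights with a model trained on the benchmark and evaluate it with a standard ranking measure.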
Bibliography
- [1]
Spirin, N., & Han, J. (2012). Survey on web spam detection: principles and algorithms. ACM SIGKDD Explorations Newsletter, 13(2), 50-64.
- [2]
Castillo, C., Donato, D., Gionis, A., Murdock, V., & Silvestri, F. (2007, July). Know your neighbors: Web spam detection using the web topology. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 423-430). ACM.
- [3]
Becchetti, L., Castillo, C., Donato, D., Baeza-Yates, R., & Leonardi, S. (2008). Link analysis for web spam detection. ACM Transactions on the Web (TWEB), 2(1), 2.
- [4]
Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., & Hullender, G. (2005, August). Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning (pp. 89-96). ACM.
- [5]
Xu, J., & Li, H. (2007, July). Adarank: a boosting algorithm for information retrieval. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 391-398). ACM.
- [6]
Cao, Z., Qin, T., Liu, T. Y., Tsai, M. F., & Li, H. (2007, June). Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning (pp. 129-136). ACM.
- [7]
Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.
- [8]
Joachims, T., & Radlinski, F. (2007). Search engines that learn from implicit feedback. IEEE Computer, 40(8), 34-40.
- [9]
Rigutini, L., Papini, T., Maggini, M., & Scarselli, F. (2011). SortNet: Learning to rank by a neural preference function. Neural Networks, IEEE Transactions on, 22(9), 1368-1380.
- [10]
McCallum, A., & Nigam, K. (1998, July). A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization (Vol. 752, pp. 41-48).
- [11]
Drucker, H., Wu, S., & Vapnik, V. N. (1999). Support vector machines for spam categorization. Neural Networks, IEEE Transactions on, 10(5), 1048-1054.
- [12]
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1), 1-47.
- [13]
Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank Citation Ranking: Bringing Order to the Web. Technical report of Stanford InfoLab.
- [14]
Bianchini, M., Gori, M., & Scarselli, F. (2005). Inside pagerank. ACM Transactions on Internet Technology (TOIT), 5(1), 92-128.
- [15]
Haveliwala, T. H. (2002, May). Topic-sensitive pagerank. In Proceedings of the 11th international conference on World Wide Web (pp. 517-526). ACM.
- [16]
Diligenti, M., Gori, M., & Maggini, M. (2004). A unified probabilistic framework for web page scoring systems. Knowledge and Data Engineering, IEEE Transactions on, 16(1), 4-16.
- [17]
Liben-Nowell, D., & Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American society for information science and technology, 58(7), 1019-1031.
- [18]
Gyöngyi, Z., Garcia-Molina, H., & Pedersen, J. (2004, August). Combating web spam with trustrank. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30 (pp. 576-587). VLDB Endowment.