Information Retrieval (IR) systems are the prominent means for searching and accessing huge amounts of unstructured information on the Web and elsewhere. They are complex systems, composed of many different components interacting with each other, and evaluation is crucial to both tune and improve them. Nevertheless, in the current evaluation methodology, there is still no way to determine how much each component contributes to the overall performance and how the components interact with each other. This hampers a deep understanding of IR system behaviour and, in turn, prevents us from deciding in advance which components are best suited to work together for a specific search task.
In this paper, we move the evaluation methodology one step forward by overcoming these barriers and beginning to devise an “anatomy” of IR systems and their internals. In particular, we propose a methodology based on the General Linear Mixed Model (GLMM) and ANalysis Of VAriance (ANOVA) to develop statistical models able to isolate system variance and component effects as well as their interaction, by relying on a Grid of Points (GoP) containing all the combinations of the analysed components. We apply the proposed methodology to the analysis of two relevant search tasks – news search and Web search – by using standard TREC collections. We analyse the basic set of components typically part of an IR system, namely stop lists, stemmers and n-grams, and IR models. In this way, we derive insights about English text retrieval.
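Only as an indicative sketch of the kind of model the methodology relies on (the factor naming here is ours and simplified with respect to the paper), the performance score Y_{ij} of the system obtained by combining stop list k, stemmer/n-grams l and IR model m on topic i can be decomposed as

    Y_{ij} = \mu + \tau_i + \sigma_k + \varsigma_l + \rho_m + (\sigma\varsigma)_{kl} + (\sigma\rho)_{km} + (\varsigma\rho)_{lm} + \varepsilon_{ij}

where \tau_i is the (random) topic effect, \sigma_k, \varsigma_l and \rho_m are the main effects of the components, the bracketed terms are their two-way interactions, and \varepsilon_{ij} is the error term; ANOVA then attributes a share of the total variance to each of these factors over the GoP.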
Citations are the cornerstone of knowledge propagation and the primary means of assessing the quality of research, as well as directing investments in science. Science is increasingly becoming “data-intensive”, where large volumes of data are collected and analyzed to discover complex patterns through simulations and experiments, and most scientific reference works have been replaced by online curated datasets. Yet, given a dataset, there is no quantitative, consistent and established way of knowing how it has been used over time, who contributed to its curation, what results have been yielded or what value it has.
The development of a theory and practice of data citation is fundamental for considering data as first-class research objects with the same relevance and centrality as traditional scientific products. Many works in recent years have discussed data citation from different viewpoints: illustrating why data citation is needed, defining principles and outlining recommendations for data citation systems, and providing computational methods for addressing specific issues of data citation.
The current panorama is many-faceted, and an overall view that brings together the diverse aspects of this topic is still missing. Therefore, this paper aims to describe the lay of the land for data citation, from both the theoretical (the why and what) and the practical (the how) angles.
Statistical stemmers are important components of Information Retrieval (IR) systems, especially for text search over languages with few linguistic resources. In recent years, research on stemmers has produced relevant results; in particular, in 2011 three language-independent stemmers were published in major venues. In this paper, we describe our efforts to reproduce these three stemmers. We also share the code as open source, together with an extended version of the Terrier system integrating the developed stemmers.
Digital libraries and digital archives are the information management systems for storing, indexing, searching, accessing, curating and preserving the digital resources which constitute our cultural and scientific knowledge heritage (KH). They act as the main conduits for widespread access to and exploitation of KH-related digital resources, engaging many different types of users, ranging from generic and leisure users to students and professionals.
In this chapter, we describe the evolution of digital libraries and archives over the years, starting from the Online Public Access Catalog (OPAC), passing through monolithic and domain-specific systems, up to service-oriented and component-based architectures. In particular, we present some specific achievements in the field: the DELOS Reference Model and the DelosDLMS, which provide a conceptual reference and a reference implementation for digital libraries; the FAST annotation service, which defines a formal model for representing and searching annotations over digital resources as well as a RESTful Web service implementation of it; the NESTOR model for digital archives, which introduces an alternative model for representing and managing archival resources in order to enhance interoperability among archives and make access to them faster; and the CULTURA environment, which fosters user engagement with multimedia digital resources.
Finally, we discuss how digital libraries and archives are a key technology for facing upcoming challenges in data sharing and re-use. Indeed, due to the rapid evolution of research and scientific publishing, which are increasingly data-driven, digital libraries and archives are also progressively addressing the issues of managing scientific data. In this respect, we focus on some key building blocks of this new vision: data citation, to foster accessibility to scientific data as well as transparency and verifiability of scientific claims; reproducibility in science, as an exemplar showcase of how all these methods are indispensable for addressing fundamental challenges; and keyword-based search over relational/structured data, to empower natural language access to scientific data.
We propose the Assessor-driven Weighted Averages for Retrieval Evaluation (AWARE) probabilistic framework, a novel methodology for dealing with multiple crowd assessors, who may be contradictory and/or noisy. By modeling relevance judgements and crowd assessors as sources of uncertainty, AWARE takes the expectation of a generic performance measure, like Average Precision (AP), composed with these random variables. In this way, it approaches the problem of aggregating different crowd assessors from a new perspective, i.e. directly combining the performance measures computed on the ground truth generated by each crowd assessor instead of adopting some classification technique to merge the labels produced by them. We propose several unsupervised estimators that instantiate the AWARE framework and we compare them with state-of-the-art approaches, i.e. Majority Vote (MV) and Expectation Maximization (EM), on TREC collections. We found that AWARE approaches improve over these baselines in their capability of correctly ranking systems and predicting their actual performance scores.
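To convey the core idea in a simplified form (the notation is ours), given m crowd assessors inducing ground truths \widehat{GT}_1, \ldots, \widehat{GT}_m and a measure M such as AP, an AWARE score of a system s is a weighted expectation of M over those ground truths,

    \mathrm{aware}_M(s) = \mathbb{E}[M(s)] \approx \sum_{k=1}^{m} w_k \, M(s; \widehat{GT}_k),

where the weights w_k sum to one; in this simplified view, the unsupervised estimators correspond to different ways of estimating how much each assessor should be trusted, i.e. of computing the w_k.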
Data citation is an interesting computational challenge, whose solution draws on several well-studied problems in database theory: query answering using views, and provenance. We describe the problem, suggest an approach to its solution, and highlight several open research problems, both practical and theoretical.
Data citation is of growing concern for owners of curated databases, who wish to give credit to the contributors and curators responsible for portions of the dataset and enable the data retrieved by a query to be later examined. While several databases specify how data should be cited, they leave it to users to manually construct the citations and do not generate them automatically.
We report our experiences in automating data citation for an RDF dataset called eagle-i, and discuss how to generalize this to a citation framework that can work across a variety of different types of databases (e.g. relational, XML, and RDF). We also describe how a database administrator would use this framework to automate citation for a particular dataset.
An increasing amount of information is being collected in structured, evolving, curated databases, driving the question of how information extracted from such datasets via queries should be cited. Unlike traditional research products, such as books and journals, which have a fixed granularity, data citation is a challenge because the granularity varies: different portions of the database, with varying granularity, may have different citations.
Furthermore, there is an infinite number of queries over a database, each accessing and generating different subsets of the database, so we cannot hope to explicitly attach a citation to every possible result set and/or query. We present the novel problem of automatically generating citations for general queries over a relational database, and explore a solution based on a set of citation views, each of which attaches a citation to a view of the database. Citation views are then used to automatically construct citations for general queries. Our approach draws inspiration from results in two areas, query rewriting using views and database provenance, and combines them in a robust model. We then discuss open issues in developing a practical solution to this challenging problem.
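As a toy illustration of the intuition behind citation views, and under the strong simplification that both views and queries are described just by the attributes they cover (the actual model relies on query rewriting using views and on provenance, not on this naive matching), a hypothetical sketch could look like this:

    # Hypothetical citation views: each one covers a portion of the database
    # and carries the citation snippet owed to its curators.
    CITATION_VIEWS = {
        "gene_summary":  {"covers": {"gene", "symbol", "organism"},
                          "snippet": "Gene summaries, curated by Lab A (2016)"},
        "gene_variants": {"covers": {"gene", "variant", "phenotype"},
                          "snippet": "Variant annotations, curated by Lab B (2017)"},
    }

    def cite(query_attributes):
        """Combine the snippets of all views that overlap the query."""
        return [v["snippet"] for v in CITATION_VIEWS.values()
                if v["covers"] & set(query_attributes)]

    # A query touching genes and variants draws on both curated portions.
    print(cite({"gene", "variant"}))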
The practice of citation is foundational for the propagation of knowledge along with scientific development and it is one of the core aspects on which scholarship and scientific publishing rely.
Within the broad context of data citation, we focus on the problem of automatically constructing citations for hierarchically structured data. We present the “learning to cite” framework, which enables the automatic construction of human- and machine-readable citations with different levels of coarseness. The main goal is to reduce human intervention on the data to a minimum and to provide a citation system general enough to work on heterogeneous and complex XML datasets. We describe how this framework can be realized by a system for creating citations to single nodes within an XML dataset and, as a use case, show how it can be applied in the context of digital archives.
We conduct an extensive evaluation of the proposed citation system by analyzing its effectiveness from the correctness and completeness viewpoints, showing that it represents a suitable solution that can be easily employed in real-world environments and that reduces human intervention on data to a minimum.
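Just to give a flavour of what an automatically built citation for a node may look like, here is a minimal, hypothetical sketch; the XML and the choice of citation-worthy attributes are invented, and the framework learns where such metadata lives rather than hard-coding it as done here:

    import xml.etree.ElementTree as ET

    # Tiny, invented EAD-like document used only for illustration.
    XML = """
    <ead>
      <archdesc title="Fondo Rossi">
        <c01 title="Series 1: Correspondence">
          <c02 id="c02-17" title="Letter, 1871"/>
        </c01>
      </archdesc>
    </ead>
    """

    def cite_node(root, node_id):
        """Human-readable citation: titles of the node and of its ancestors."""
        parent_of = {child: parent for parent in root.iter() for child in parent}
        node = next(e for e in root.iter() if e.get("id") == node_id)
        parts = []
        while node is not None:
            if node.get("title"):
                parts.append(node.get("title"))
            node = parent_of.get(node)
        return ", ".join(reversed(parts))

    root = ET.fromstring(XML)
    print(cite_node(root, "c02-17"))
    # Fondo Rossi, Series 1: Correspondence, Letter, 1871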
The aim of LEARNER@ICTIR2017 is to investigate new solutions for Learning to Rank (LtR). In detail, we identify some research areas related to LtR which are of current interest and which have not been fully explored yet. We solicit the submission of position papers on novel LtR algorithms, on the evaluation of LtR algorithms, on dataset creation and curation, and on domain-specific applications of LtR. LEARNER@ICTIR2017 will be a gathering of academic people interested in IR, ML and related application areas. We believe that the proposed workshop is relevant to ICTIR since we look for novel contributions to LtR focused on foundational and conceptual aspects, which need to be properly framed and modeled.
The Web has created a global marketplace for e-Commerce as well as for talent. Online employment marketplaces provide an effective channel to facilitate the matching between job seekers and hirers. This paper presents an initial exploration of user behavior in job and talent search using query and click logs from a popular employment marketplace. The observations suggest that the understanding of users’ search behavior in this scenario is still in its infancy and that some of the assumptions made in general web search may not hold true. The open challenges identified so far are presented.
Ranking query results effectively by considering users’ past behaviour and preferences is a primary concern for IR researchers both in academia and industry. In this context, LtR is widely believed to be the most effective solution for designing ranking models that account for user-interaction features, which have been shown to remarkably impact IR effectiveness. In this paper, we explore the possibility of integrating the user dynamic directly into LtR algorithms. Specifically, we model with Markov chains the behaviour of users in scanning a ranked result list and we modify LambdaMart, a state-of-the-art LtR algorithm, to exploit a new discount loss function calibrated on the proposed Markovian model of user dynamic. We evaluate the performance of the proposed approach on publicly available LtR datasets, finding that the improvements measured over the standard algorithm are statistically significant.
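As a rough sketch of the underlying idea (the transition probabilities below are purely illustrative and the sketch does not show how the discount is plugged into LambdaMart), the discount associated with each rank position can be derived from the stationary distribution of a Markov chain modelling how users move along the result list:

    import numpy as np

    def stationary_distribution(P):
        """Stationary distribution of a row-stochastic transition matrix P."""
        eigvals, eigvecs = np.linalg.eig(P.T)
        pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
        return pi / pi.sum()

    # Toy user model over 5 rank positions: mostly move down one position,
    # sometimes jump back to the top (numbers are purely illustrative).
    P = np.array([
        [0.10, 0.90, 0.00, 0.00, 0.00],
        [0.20, 0.00, 0.80, 0.00, 0.00],
        [0.30, 0.00, 0.00, 0.70, 0.00],
        [0.40, 0.00, 0.00, 0.00, 0.60],
        [0.50, 0.00, 0.00, 0.00, 0.50],
    ])

    discount = stationary_distribution(P)   # probability of being at each rank
    gains = np.array([3, 0, 2, 1, 0])       # graded relevance of the ranked list
    print("Markovian discount:", np.round(discount, 3))
    print("Discounted gain   :", float(gains @ discount))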
In this paper, we describe a set of experiments that turn the machine learning classification task into a game, through gamification techniques, and let non-expert users perform text classification without even knowing the underlying problem. The application is implemented in R using the Shiny package for interactive graphics. We present the outcome of three different experiments: a pilot experiment with PhD and post-doc students, and two experiments carried out with primary and secondary school students. The results show that the human-aided classifier performs similarly to, and sometimes even better than, state-of-the-art classifiers.
To address the challenge of adapting Information Retrieval (IR) to constantly evolving user tasks and needs, and to adjust it to user interactions and preferences, we develop a new model of user behavior based on Markov chains. We aim at integrating the proposed model into several aspects of IR, i.e. evaluation measures, systems and collections. Firstly, we study IR evaluation measures and propose a theoretical framework to describe their properties. Then, we present a new family of evaluation measures, called Markov Precision (MP), based on the proposed model and able to explicitly link lab-style and on-line evaluation metrics. Future work will integrate the proposed model into Learning to Rank (LtR) algorithms and will define a collection for the evaluation and comparison of Personalized Information Retrieval (PIR) systems.
Multilingual information access and retrieval is a key concern in today’s global society and, despite the considerable achievements of the past years, it still presents many challenges. In this context, experimental evaluation represents a key driver of innovation, and multilinguality is tackled in several evaluation initiatives worldwide, such as CLEF in Europe, NTCIR in Japan and Asia, and FIRE in India. All these activities have run several evaluation cycles and there is a general consensus about their strong and positive impact on the development of multilingual information access systems. However, a systematic and quantitative assessment of the impact of evaluation initiatives on multilingual information access and retrieval over the long period is still missing.
Therefore, in this paper we conduct the first systematic and large-scale longitudinal study on several CLEF Adhoc-ish tasks – namely the Adhoc, Robust, TEL, and GeoCLEF labs – in order to gain insights on the performance trends of monolingual, bilingual and multilingual information access systems, spanning several European and non-European languages, over a range of 10 years.
We learned that monolingual retrieval exhibits a stable positive trend for many of the languages analyzed, even though the performance increase is not always steady from year to year due to the varying interests of the participants, who may not always be focused on just increasing performances. Bilingual retrieval demonstrates higher improvements in recent years – probably due to the better language resources now available – and it also outperforms monolingual retrieval in several cases. Multilingual retrieval shows improvements over the years and performances are comparable to those of bilingual and monolingual retrieval, and sometimes even better. Moreover, we have found evidence that the rule-of-thumb of a 3-year duration for an evaluation task is typically enough since top performances are usually reached by the third year and sometimes even by the second year, which then leaves room for research groups to investigate relevant research issues other than top performances.
Overall, this study provides quantitative evidence that CLEF has achieved the objective which led to its establishment, i.e. making multilingual information access a reality for European languages. However, the outcomes of this paper not only indicate that CLEF has steered the community in the right direction, but they also highlight the many open challenges for multilinguality. For instance, multilingual technologies greatly depend on language resources and targeted evaluation cycles help not only in developing and improving them, but also in devising methodologies which are more and more language-independent. Another key aspect concerns multimodality, intended not only as the capability of providing access to information in multiple media, but also as the ability of integrating access and retrieval over different media and languages in a way that best fits with user needs and tasks.
Experimental evaluation carried out in international large-scale campaigns is a fundamental pillar of the scientific and technological advancement of Information Retrieval (IR) systems. Such evaluation activities produce a large quantity of scientific and experimental data, which are the foundation for all the subsequent scientific production and development of new systems. In this work, we discuss how to semantically annotate and interlink this data, with the goal of enhancing its interpretation, sharing, and reuse. We discuss the underlying evaluation workflow and propose a Resource Description Framework (RDF) model for those workflow parts. We use expertise retrieval as a case study to demonstrate the benefits of our semantic representation approach. We employ this model as a means for exposing experimental data as Linked Open Data (LOD) on the Web and as a basis for enriching and automatically connecting this data with expertise topics and expert profiles.
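A minimal sketch of what exposing a piece of evaluation data as RDF may look like; the vocabulary below is a placeholder, since the actual model proposed in the paper defines its own classes and properties:

    from rdflib import Graph, Literal, Namespace, RDF
    from rdflib.namespace import XSD

    # Hypothetical vocabulary, used only for illustration.
    EX = Namespace("http://example.org/ir-eval#")

    g = Graph()
    g.bind("ex", EX)

    run = EX["run/UniPD-01"]
    g.add((run, RDF.type, EX.Run))
    g.add((run, EX.submittedTo, EX["task/AdHoc-2004"]))
    g.add((run, EX.evaluatedWith, EX["measure/MAP"]))
    g.add((run, EX.score, Literal(0.2875, datatype=XSD.double)))

    print(g.serialize(format="turtle"))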
In this context, a topic-centric approach for expert search is proposed, addressing the extraction of expertise topics, their semantic grounding with the LOD cloud, and their connection to IR experimental data. Several methods for expert profiling and expert finding are analysed and evaluated. Our results show that it is possible to construct expert profiles starting from automatically extracted expertise topics and that topic-centric approaches outperform state-of-the-art language modelling approaches for expert finding.
In this paper, we run a systematic series of experiments to create a grid of points where many combinations of retrieval methods and components adopted by MultiLingual Information Access (MLIA) systems are represented. The goal of this grid of points is to provide insights about the effectiveness of the different components and their interaction, and to identify suitable baselines with respect to which all the comparisons can be made.
We publicly release a large grid of points comprising more than 4K runs, obtained by testing 160 IR systems combining different stop lists, stemmers, n-grams components and retrieval models on CLEF monolingual tasks for eight European languages. Furthermore, we evaluate this grid of points by employing four different effectiveness measures and provide some insights about the quality of the created grid of points and the behaviour of the different systems.
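To make the notion of grid of points concrete, a GoP is essentially the Cartesian product of the alternatives chosen for each component; the lists below are illustrative and smaller than the actual ones used in the experiments:

    from itertools import product

    # Illustrative component alternatives (the actual grid is larger and
    # language-dependent, covering eight CLEF monolingual tasks).
    stop_lists = ["nostop", "standard"]
    stemmers   = ["nolug", "porter", "4grams"]
    models     = ["BM25", "TFIDF", "LMDirichlet", "PL2"]

    grid_of_points = list(product(stop_lists, stemmers, models))
    print(len(grid_of_points), "system configurations, e.g.", grid_of_points[0])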
This is the introduction to the special issue on data citation of the Bulletin of IEEE Technical Committee on Digital Libraries. In this introduction we state the “lay of the land” of research on data citation, we discuss some open issues and possible research directions and present the main contributions provided by the papers of the special issue.
Digital archives are one of the pillars of our cultural heritage and they are increasingly opening up to end-users by focusing on the accessibility of their resources. Moreover, digital archives are complex and distributed systems where interoperability plays a central role, and efficient access to and exchange of resources is a challenge. In this paper, we investigate user and interoperability requirements in the archival realm and we discuss how next-generation archival systems should operate a paradigm shift towards a new model of access to archival resources which allows these needs to be better addressed. To this end, we employ the data structures and query primitives based on the NEsted SeTs for Object hieRarchies (NESTOR) model to efficiently access archival data, overcoming the identified barriers and limitations.
Topic variance has a greater effect on performances than system variance but it cannot be controlled by system developers who can only try to cope with it. On the other hand, system variance is important on its own, since it is what system developers may affect directly by changing system components and it determines the differences among systems.
In this paper, we face the problem of studying system variance in order to better understand how much system components contribute to overall performances. To this end, we propose a methodology based on the General Linear Mixed Model (GLMM) to develop statistical models able to isolate system variance and component effects, as well as their interaction. We apply the proposed methodology to the analysis of TREC Ad-hoc data in order to show how it works and discuss some interesting outcomes of this new kind of analysis. Finally, we extend the analysis to different evaluation measures, showing how they impact the sources of variance.
We present the innovative visual analytics approach of the VATE2 system, which makes the experimental evaluation process easier and more effective by introducing what-if analysis. The what-if analysis is aimed at estimating the possible effects of a modification to an IR system, in order to select the most promising fixes before implementing them, thus saving a considerable amount of effort. VATE2 builds on an analytical framework which models the behavior of the systems in order to make estimations, and integrates this analytical framework into a visual part which, via proper interaction and animations, receives input and provides feedback to the user.
The participation of the Information Management System (IMS) Group of the University of Padua in the Total Recall track at TREC 2016 consisted of a set of fully automated experiments based on the two-dimensional probabilistic model. We trained the model in two ways that tried to mimic a real user, and we compared it to two versions of the BM25 model with different parameter settings. This initial set of experiments lays the ground for a wider study that will explore a gamification approach in the context of high-recall situations.
The creation of a labelled dataset for Information Retrieval (IR) purposes is a costly process. For this reason, a mix of crowd-sourcing and active learning approaches has been proposed in the literature in order to assess the relevance of the documents of a collection for a particular query at an affordable cost. In this paper, we present the design of the gamification of this interactive process, drawing inspiration from recent works in the area of gamification for IR. In particular, we focus on three main points: i) we want to create a set of relevance judgements with the least effort by human assessors, ii) we use interactive search interfaces that employ game mechanics, and iii) we use Natural Language Processing (NLP) to collect different aspects of a query.
The creation of a labelled dataset for machine learning purposes is a costly process. In recent works, it has been shown that a mix of crowd-sourcing and active learning approaches can be used to annotate objects at an affordable cost. In this paper, we study the gamification of machine learning techniques; in particular, the problem of classification of objects. In this first pilot study, we designed a simple game, based on a visual interpretation of probabilistic classifiers, that consists in separating two sets of coloured points on a two-dimensional plane by means of a straight line. We present the current results of this first experiment, which we used to collect the requirements for the next version of the game and to analyze i) what is the 'price' of building a reasonably accurate classifier with a small amount of labelled objects, and ii) how the accuracy of the player compares to state-of-the-art classification algorithms.
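A minimal sketch of the kind of comparison carried out in the pilot, with synthetic data and a hand-picked 'player' line standing in for the human input collected through the game:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(42)

    # Two synthetic clouds of coloured points on the plane.
    X = np.vstack([rng.normal([-1, -1], 0.8, size=(100, 2)),
                   rng.normal([+1, +1], 0.8, size=(100, 2))])
    y = np.array([0] * 100 + [1] * 100)

    # A "player" separating line: points above x2 = -x1 are assigned to class 1.
    player_pred = (X[:, 0] + X[:, 1] > 0).astype(int)
    print("player accuracy    :", (player_pred == y).mean())

    # Baseline classifier for the same toy task.
    clf = LogisticRegression().fit(X, y)
    print("logistic regression:", clf.score(X, y))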
In this paper, we present a formal framework based on the representational theory of measurement, and we define and study the properties of utility-oriented measurements of retrieval effectiveness, such as AP, RBP, ERR and many other popular IR evaluation measures.
XML is a pervasive technology for representing and accessing semi-structured data. XPath is the standard language for navigational queries on XML documents and there is a growing demand for its efficient processing.
In order to increase the efficiency of executing four navigational XML query primitives, namely descendants, ancestors, children and parent, we introduce a new paradigm: traditional approaches, based on efficiently traversing nodes and edges to reconstruct the requested subtrees, are replaced by a brand new one based on basic set operations which allow us to return the desired subtree directly, avoiding building it by passing through nodes and edges.
Our solution stems from the NEsted SeTs for Object hieRarchies (NESTOR) formal model, which makes use of set-inclusion relations for representing and providing access to hierarchical data. We define efficient in-memory data structures to implement NESTOR, we develop algorithms to perform the descendants, ancestors, children and parent query primitives, and we study their computational complexity.
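A minimal in-memory sketch of the set-based idea (a deliberate simplification, not the actual data structures evaluated in the paper): if each node is mapped to the set of its proper descendants, the four primitives reduce to set lookups, membership tests and comparisons of set sizes.

    # Toy hierarchy: each node is mapped to the set of its proper descendants.
    DESC = {
        "A": {"B", "C", "D", "E"},
        "B": {"D", "E"},
        "C": set(),
        "D": set(),
        "E": set(),
    }

    def descendants(n):            # direct lookup, no tree traversal
        return DESC[n]

    def ancestors(n):              # all nodes whose descendant set contains n
        return {m for m, s in DESC.items() if n in s}

    def parent(n):                 # the ancestor with the smallest descendant set
        anc = ancestors(n)
        return min(anc, key=lambda m: len(DESC[m])) if anc else None

    def children(n):               # descendants whose parent is n
        return {d for d in DESC[n] if parent(d) == n}

    print(descendants("A"))        # {'B', 'C', 'D', 'E'}
    print(ancestors("D"))          # {'A', 'B'}
    print(parent("D"))             # B
    print(children("A"))           # {'B', 'C'}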
We conduct an extensive experimental evaluation by using several datasets: digital archives (EAD collections), INEX 2009 Wikipedia collection, and two widely-used synthetic datasets (XMark and XGen). We show that NESTOR-based data structures and query primitives consistently outperform state-of-the-art solutions for XPath processing at execution time and they are competitive in terms of both memory occupation and pre-processing time.
Structured data sources promise to be the next driver of a significant socio-economic impact for both people and companies. Nevertheless, accessing them through formal languages, such as SQL or SPARQL, can become cumbersome and frustrating for end-users. To overcome this issue, keyword search in databases is becoming the technology of choice, even if it suffers from efficiency and effectiveness problems that prevent it from being adopted at Web scale.
In this paper, we motivate the need for a reference architecture for keyword search in databases in order to favor the development of scalable and effective components, also borrowing methods from neighboring fields, such as information retrieval and natural language processing. Moreover, we point out the need for a companion evaluation framework, able to assess the efficiency and the effectiveness of such new systems in the light of real and compelling use cases.
In this paper, we present a novel measure for ranking evaluation, called Twist (τ). It is a measure for informational intents, it handles both binary and graded relevance, and it shares the scene mainly with Average Precision (AP), the cumulated-gain family of metrics, such as Discounted Cumulated Gain (DCG), and Rank-Biased Precision (RBP).
The above mentioned metrics adopt different user models but share a common approach: they measure the “utility” of a ranked list for the user and this “utility” is the user motivation for continuing to scan the result list when non-relevant documents are retrieved. The different user models adopted account for the way in which this “utility” (or gain) is computed.
τ stems from a different observation: searching is nowadays a commodity, like water, electricity and the like, and it is natural for users to assume that it is available, that it fits their needs, and that it works well. In this sense, they may not perceive the “utility” they gain in finding relevant documents but rather perceive that the system is just doing what it is expected to do. On the other hand, they may feel uneasy when the system returns non-relevant documents in wrong positions, since they are then forced to do additional work to get the desired information, work they would not have expected to do when using a commodity. Thus, τ tries to grasp the avoidable effort caused to the user by the actual ranking of the system with respect to an ideal ranking.
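The following deliberately simplified sketch illustrates the “avoidable effort” intuition only; it is not the formal definition of τ provided in the paper. It counts how many extra non-relevant documents the user has to scan before each relevant one, compared with the ideal ranking:

    def extra_scans(rel):
        """Non-relevant documents scanned before reaching each relevant one."""
        scanned, seen_nonrel = [], 0
        for r in rel:
            if r > 0:
                scanned.append(seen_nonrel)
            else:
                seen_nonrel += 1
        return scanned

    def avoidable_effort(run_rel, ideal_rel):
        return sum(a - i for a, i in zip(extra_scans(run_rel), extra_scans(ideal_rel)))

    run   = [0, 1, 0, 1, 1]   # relevance of the documents as ranked by the system
    ideal = [1, 1, 1, 0, 0]   # the same judgements, ideally ordered
    print(avoidable_effort(run, ideal))   # 1 + 2 + 2 = 5 extra documents scanned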
We provide a formal definition of τ as well as a demonstration of its properties. We introduce the notion of effort-gain plots, which allow us to easily spot those systems that look similar from a utility/gain perspective but are actually different in terms of the effort required of their users to attain that utility/gain. Finally, by means of an extensive experimental evaluation with TREC collections, τ is proven not to be highly correlated with existing metrics, to be stable when shallow pools are employed, and to have a good discriminative power.
In short, τ grasps different aspects of system performances with respect to traditional metrics, it does not require extensive and costly assessments, and it is a robust tool for detecting differences between systems.
Digital Libraries (DL) are the main conduits for accessing our cultural heritage and they have to address the requirements and needs of very diverse memory institutions, namely Libraries, Archives and Museums (LAM). Therefore, the interoperability among the Digital Library Systems (DLS) which manage the digital resources of these institutions is a key concern in the field.
DLS are rooted in two foundational models of what a digital library is and how it should work, namely the DELOS Reference Model and the Streams, Structures, Spaces, Scenarios, Societies (5S) model. Unfortunately these two models are not exploited enough to improve interoperability among systems.
To this end, we express these foundational models by means of ontologies which exploit the methods and technologies of Semantic Web and Linked Data. Moreover, we link the proposed ontologies for the foundational models to those currently used for publishing cultural heritage data in order to maximize interoperability.
We design an ontology which allows us to model and map the high level concepts of both the 5S model and the DELOS Reference Model. We provide detailed ontologies for all the domains of such models, namely the user, content, functionality, quality, policy and architectural component domains in order to make available a working tool for making DLS interoperate together at a high level of abstraction. Finally, we provide a concrete use case about digital annotation of illuminated manuscripts to show how to apply the proposed ontologies and illustrate the achieved interoperability between the 5S and DELOS Reference models.