Information Retrieval
The group has a long and strong research history covering the whole information retrieval process, from theoretical modelling of retrieval, through the building of large-scale text retrieval systems, to the interactive evaluation of multimedia information retrieval systems. The research interests of the group include:
- Theoretical modelling of IR systems
- Probabilistic retrieval
- Web information retrieval
- Implementation of large-scale IR systems
- Multimedia (Image, Video, Audio) information retrieval
- Intranet/Enterprise and Blog search
- Distributed and Peer-to-Peer retrieval
- User Modelling and the development of novel adaptive interaction techniques
- Evaluation of IR systems
- Text mining and knowledge discovery
- Multilingual information retrieval
- Semantic Web and information retrieval
The group maintains strong links with researchers in Machine Learning and Human-Computer Interaction, as well as with industry through knowledge and technology transfer.
- RAENG Fellowship
Foundations research in information retrieval inspired by quantum theory.
Prof C.J. van Rijsbergen, 2007-2012.
Academic Staff: Dr Leif Azzopardi, Prof Joemon Jose, Dr Iadh Ounis, Prof C J (Keith) van Rijsbergen.
Research Fellows: Dr Sachi Arafat
Research Assistants and Research Students: Dr Craig Macdonald, Mr Richard McCreadie, Mr Andrew McHugh, Mr Yashar Moshfeghi, Mr Jie Peng, Mr Rodrygo Santos, Mr Ke Zhou
- Theoretical development of probabilistic and logic-based models
- multimedia IR systems
- information analysis and access across media
- evaluation and usability of IR systems
- data mining of large data sets
- Web information retrieval
- citation/link analysis
- implementation and evaluation of large-scale IR systems
- performance prediction and optimisation
- information retrieval in context
- multilingual retrieval
- interaction techniques based on implicit relevance feedback and summarisation
- adaptive information retrieval
- Intranet/Enterprise and Blog search
Using emotion to diversify document rankings
Moshfeghi, Y.
A query-basis approach to parametrizing novelty-biased cumulative gain
Leelanupab, T.
Navigating the user query space
Cummins, R.
Quantum theory-inspired search
Aerts, D., Bruza, P., Hou, Y., Jose, J.
Temporal pseudo-relevance feedback in microblog retrieval
Whiting, S., Klampanos, I.A.
Exploiting semantics on external resources to gather visual examples for video retrieval
Vallet, D.
Bridging the gap between expert and novice users for video search
Halvey, M.
Using Historical Click Data to Increase Interleaving Sensitivity
Kharitonov, E., Macdonald, C.
About learning models with multiple query dependent features
Macdonald, C.
News Vertical Search: When and What to Display to Users
Mccreadie, R.
User model-based metrics for offline query suggestion evaluation
Kharitonov, E., Macdonald, C.
Can twitter replace newswire for breaking news?
Petrovic, S., Osborne, M., Mccreadie, R.
Selecting effective expansion terms for diversity
Vargas, S., Santos, R.L.T., Macdonald, C.
Identifying local events by using microblogs as social sensors
Albakour, M.-D.
Inferring conceptual relationships to improve medical records search
Limsopatham, N., Macdonald, C.
COPE: interactive image retrieval using conversational recommendation
Balcer, B., Halvey, M.
On the usefulness of query features for learning to rank
Macdonald, C.
An examination of content farms in web search using crowdsourcing
McCreadie, R., Macdonald, C.
On building a reusable Twitter corpus
McCreadie, R., Soboroff, I., Lin, J., Macdonald, C.
Effect of dynamic pruning safety on learning to rank effectiveness
Macdonald, C.
Past Events
From Search to Adaptive Search (12 November, 2012)
Speaker: Udo Kruschwitz
Modern search engines have been moving away from very simplistic interfaces that aimed at satisfying a user's need with a single-shot query. Interactive features such as query suggestions and faceted search are now integral parts of Web search engines. Generating good query modification suggestions or alternative queries to assist a searcher remains, however, a challenging issue. Query log analysis is one of the major strands of work in this direction. While much research has been performed on query logs collected on the Web as a whole, query log analysis to enhance search on smaller and more focused collections (such as intranets, digital libraries and local Web sites) has attracted less attention. The talk will look at a number of directions we have explored at the University of Essex in addressing this problem by automatically acquiring continuously updated domain models using query and click logs (as well as other sources).
Context data in lifelog retrieval (19 November, 2012)
Speaker: Liadh Kelly
Advances in digital technologies for information capture combined with
massive increases in the capacity of digital storage media mean that it is
now possible to capture and store much of one's life experiences in a
personal lifelog. Information can be captured from a myriad of personal
information devices including desktop computers, mobile phones, digital
cameras, and various sensors, including GPS, Bluetooth, and biometric
devices. This talk centers on the investigation of the challenges of
retrieval in this emerging domain and on the examination of the utility of
several implicitly recorded and derived context types in meeting these
challenges. For these investigations, unique and rich multimodal personal
lifelog collections of 20 months' duration are used. These collections
contain all items accessed on subjects' PCs and laptops (email, web pages,
word documents, etc), passively captured images depicting subjects' lives
using the SenseCam device (http://research.microsoft.com/sensecam), and
mobile text messages sent and received. Items are annotated with several
rich sources of automatically derived context data types including
biometric data (galvanic skin response, heart rate, etc), geo-location
(captured using GPS data), people present (captured using Bluetooth data),
weather conditions, light status, and several context types related to the
dates and times of accesses to items.
Probabilistic rule-based argumentation for norm-governed learning agents (28 January, 2013)
Speaker: Sebastian Riedel
There is a vast and ever-increasing amount of unstructured textual data at our disposal. The ambiguity, variability and expressivity of language make this data difficult to analyse, mine, search, visualise, and, ultimately, base decisions on. These challenges have motivated efforts to enable machine reading: computers that can read text and convert it into semantic representations, such as the Google Knowledge Graph for general facts, or pathway databases in the biomedical domain. These representations can then be harnessed by machines and humans alike. At the heart of machine reading is relation extraction: reading text to create a semantic network of entities and their relations, such as employeeOf(Person,Company), regulates(Protein,Protein) or causes(Event,Event).
In this talk I will present a series of graphical models and matrix factorisation techniques that can learn to extract relations. I will start by contrasting a fully supervised approach with one that leverages pre-existing semantic knowledge (for example, in the Freebase database) to reduce annotation costs. I will then present ways to extract additional relations that are not yet part of the schema, and for which no pre-existing semantic knowledge is available. I will show that by doing so we can not only extract richer knowledge, but also improve extraction quality of relations within the original schema. This helps to improve over the previous state of the art by more than 10 percentage points in mean average precision.
[IR] Searching the Temporal Web: Challenges and Current Approaches (04 February, 2013)
Speaker: Nattiya Kanhabua
In this talk, we will give a survey of current approaches to searching the
temporal web. In such a web collection, the contents are created and/or
edited over time, and examples are web archives, news archives, blogs,
micro-blogs, personal emails and enterprise documents. Unfortunately,
traditional IR approaches based on term-matching only can give
unsatisfactory results when searching the temporal web. The reasons for this
are manifold: 1) the collection is strongly time-dependent, i.e., with
multiple versions of documents, 2) the contents of documents are about
events that happened in particular time periods, 3) the meanings of semantic
annotations can change over time, and 4) a query representing an information
need can be time-sensitive, a so-called temporal query.
Several major challenges in searching the temporal web will be discussed,
namely, 1) How to understand temporal search intent represented by
time-sensitive queries? 2) How to handle the temporal dynamics of queries
and documents? and 3) How to explicitly model temporal information in
retrieval and ranking models? To this end, we will present current
approaches to the addressed problems as well as outline the directions for
future research.
Time-Biased Gain (21 February, 2013)
Speaker: Charlie Clarke
Time-biased gain provides a unifying framework for information retrieval evaluation, generalizing many traditional effectiveness measures while accommodating aspects of user behavior not captured by these measures. By using time as a basis for calibration against actual user data, time-biased gain can reflect aspects of the search process that directly impact user experience, including document length, near-duplicate documents, and summaries. Unlike traditional measures, which must be arbitrarily normalized for averaging purposes, time-biased gain is reported in meaningful units, such as the total number of relevant documents seen by the user. In work reported at SIGIR 2012, we proposed and validated a closed-form equation for estimating time-biased gain, explored its properties, and compared it to standard approaches. In work reported at CIKM 2012, we used stochastic simulation to numerically approximate time-biased gain, an approach that provides greater flexibility, allowing us to accommodate different types of user behavior and increasing the realism of the effectiveness measure. In work reported at HCIR 2012, we extended our stochastic simulation to model the variation between users. In this talk, I will provide an overview of time-biased gain, and outline our ongoing and future work, including extensions to evaluate query suggestion, diversity, and whole-page relevance. This is joint work with Mark Smucker.
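As a rough illustration of the idea, the sketch below sums the gain obtained at each rank, weighted by the fraction of users assumed to be still searching when they reach that rank. It assumes an exponential decay governed by a half-life parameter and uses made-up gains and arrival times; the function name, the parameter values and the timing inputs are all hypothetical and are not taken from the papers mentioned above.

```python
import math

def time_biased_gain(gains, arrival_times, half_life):
    """Sum each document's gain weighted by an exponential decay in time.

    gains[k]         -- gain the user obtains from the document at rank k
    arrival_times[k] -- assumed time (seconds) at which the user reaches rank k
    half_life        -- assumed time after which half of the users have stopped
    """
    if len(gains) != len(arrival_times):
        raise ValueError("gains and arrival_times must have the same length")
    total = 0.0
    for gain, t in zip(gains, arrival_times):
        # Fraction of users assumed to be still reading at time t.
        survival = math.exp(-t * math.log(2) / half_life)
        total += gain * survival
    return total

# Hypothetical ranking: relevant documents at ranks 1 and 3.
example_gains = [1.0, 0.0, 1.0]
example_times = [0.0, 30.0, 60.0]
example_tbg = time_biased_gain(example_gains, example_times, half_life=60.0)
```

With these inputs, the document reached exactly at the half-life contributes half of its gain, so a relevant document found late counts for less than one found early.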
Modelling Time & Demographics in Search Logs (01 March, 2013)
Speaker: Milad Shokouhi
Knowing users' context offers great potential for personalizing web search results or related services such as query suggestion and query completion. Contextual features cover a wide range of signals; query time, user’s location, search history and demographics can all be regarded as contextual features that can be used for search personalization.
In this talk, we’ll focus on two main questions:
1) How can we use the existing contextual features, in particular time, for improving search results (Shokouhi & Radinsky, SIGIR’12)?
2) How can we infer missing contextual features, in particular user demographics, based on search history (Bi et al., WWW2013)?
Our results confirm that (1) contextual features matter and (2) that many of them can be inferred from search history.
Scientific Lenses over Linked Data: Identity Management in the Open PHACTS project (11 March, 2013)
Speaker: Alasdair Gray, University of Manchester
The discovery of new medicines requires pharmacologists to interact with a number of information sources ranging from tabular data to scientific papers, and other specialized formats. The Open PHACTS project, a collaboration of research institutions and major pharmaceutical companies, has developed a linked data platform for integrating multiple pharmacology datasets that form the basis for several drug discovery applications. The functionality offered by the platform has been drawn from a collection of prioritised drug discovery business questions created as part of the Open PHACTS project. Key features of the linked data platform are:
1) Domain specific API making drug discovery linked data available for a diverse range of applications without requiring the application developers to become knowledgeable of semantic web standards such as SPARQL;
2) Just-in-time identity resolution and alignment across datasets enabling a variety of entry points to the data and ultimately to support different integrated views of the data;
3) Centrally cached copies of public datasets to support interactive response times for user-facing applications.
Within complex scientific domains such as pharmacology, operational equivalence between two concepts is often context-, user- and task-specific. Existing linked data integration procedures and equivalence services do not take the context and task of the user into account. We enable users of the Open PHACTS platform to control the notion of operational equivalence by applying scientific lenses over linked data. The scientific lenses vary the links that are activated between the datasets, which affects the data returned to the user.
Bio
Alasdair is a researcher in the MyGrid team at the University of Manchester. He is currently working on the Open PHACTS project which is building an Open Pharmacological Space to integrate drug discovery data. Alasdair gained his PhD from Heriot-Watt University, Edinburgh, and then worked as a post-doctoral researcher in the Information Retrieval Group at the University of Glasgow. He has spent the last 10 years working on novel knowledge management projects investigating issues of relating data sets.
Reusing Historical Interaction Data for Faster Online Learning to Rank for IR (12 March, 2013)
Speaker: Anne Schuth
Online learning to rank for information retrieval (IR) holds promise for allowing the development of "self-learning" search engines that can automatically adjust to their users. With the large amount of e.g., click data that can be collected in web search settings, such techniques could enable highly scalable ranking optimization. However, feedback obtained from user interactions is noisy, and developing approaches that can learn from this feedback quickly and reliably is a major challenge.
In this paper we investigate whether and how previously collected (historical) interaction data can be used to speed up learning in online learning to rank for IR. We devise the first two methods that can utilize historical data (1) to make feedback available during learning more reliable and (2) to preselect candidate ranking functions to be evaluated in interactions with users of the retrieval system. We evaluate both approaches on 9 learning to rank data sets and find that historical data can speed up learning, leading to substantially and significantly higher online performance. In particular, our preselection method proves highly effective at compensating for noise in user feedback. Our results show that historical data can be used to make online learning to rank for IR much more effective than previously possible, especially when feedback is noisy.
Query Classification for a Digital Library (18 March, 2013)
Speaker: Deirdre Lungley
The motivation for our query classification is the insight it gives digital content providers into what their users are searching for and hence how their collections could be extended. This talk details two query classification methodologies we have implemented as part of the GALATEAS project (http://www.galateas.eu/): one log-based, the other using wikified queries to learn a Labelled LDA model. An analysis of their respective classification errors indicates the method best suited to particular category groups.
Entity Linking for Semantic Search (29 April, 2013)
Speaker: Edgar Meij
Semantic annotations have recently received renewed interest with the explosive increase in the amount of textual data being produced, the advent of advanced NLP techniques, and the maturing of the web of data. Such annotations hold the promise for improving information retrieval algorithms and applications by providing means to automatically understand the meaning of a piece of text. Indeed, when we look at the level of understanding that is involved in modern-day search engines (on the web or otherwise), we come to the obvious conclusion that there is still a lot of room for improvement. Although some recent advances are pushing the boundaries already, information items are still retrieved and ordered mainly using their textual representation, with little or no knowledge of what they actually mean. In this talk I will present my recent and ongoing work, which addresses the challenges associated with leveraging semantic annotations for intelligent information access. I will introduce a recently proposed method for entity linking and show how it can be applied to several tasks related to semantic search on collections of different types, genres, and origins.
Fast and Reliable Online Learning to Rank for Information Retrieval (06 May, 2013)
Speaker: Katja Hofmann
Online learning to rank for information retrieval (IR) holds promise for allowing the development of "self-learning search engines" that can automatically adjust to their users. With the large amount of e.g., click data that can be collected in web search settings, such techniques could enable highly scalable ranking optimization. However, feedback obtained from user interactions is noisy, and developing approaches that can learn from this feedback quickly and reliably is a major challenge.
In this talk I will present my recent work, which addresses the challenges posed by learning from natural user interactions. First, I will detail a new method, called Probabilistic Interleave, for inferring user preferences from users' clicks on search results. I show that this method allows unbiased and fine-grained ranker comparison using noisy click data, and that this is the first such method that allows the effective reuse of historical data (i.e., collected for previous comparisons) to infer information about new rankers. Second, I show that Probabilistic Interleave enables new online learning to rank approaches that can reuse historical interaction data to speed up learning by several orders of magnitude, especially under high levels of noise in user feedback. I conclude with an outlook on research directions in online learning to rank for IR that are opened up by our results.
Personality Computing (13 May, 2013)
Speaker: Alessandro Vinciarelli
Personality is one of the driving factors behind everything we do and experience
in life. During the last decade, the computing community has shown an
ever-increasing interest in this psychological construct, especially when it comes
to efforts aimed at making machines socially intelligent, i.e. capable of interacting with
people in the same way as people do. This talk will show the work being done in this
area at the School of Computing Science. After an introduction to the concept of
personality and its main applications, the presentation will illustrate experiments
on speech-based automatic perception and recognition. Furthermore, the talk will
outline the main issues and challenges still open in the domain.
Discovering, Modeling, and Predicting Task-by-Task Behaviour of Search Engine Users (20 May, 2013)
Speaker: Salvatore Orlando
Users of web search engines are increasingly issuing queries to accomplish their daily tasks (e.g., “finding a recipe”, “booking a flight”, “reading online news”, etc.). In this work, we propose a two-step methodology for discovering latent tasks that users try to perform through search engines. Firstly, we identify user tasks from individual user sessions stored in query logs. In our vision, a user task is a set of possibly non-contiguous queries (within a user search session), which refer to the same need. Secondly, we discover collective tasks by aggregating similar user tasks, possibly performed by distinct users. To discover tasks, we propose to adopt clustering algorithms based on novel query similarity functions, in turn obtained by exploiting specific features, and both unsupervised and supervised learning approaches. All the proposed solutions were evaluated on a manually built ground truth.
Furthermore, we introduce the Task Relation Graph (TGR) as a representation of users' search behaviors from a task-by-task perspective, by exploiting the collective tasks obtained so far. The task-by-task behavior is captured by weighting the edges of TGR with a relatedness score computed between pairs of tasks, as mined from the query log. We validated our approach on a concrete application, namely a task recommender system, which suggests related tasks to users on the basis of the task predictions derived from the TGR. Finally, we showed that the task recommendations generated by our technique are beyond the reach of existing query suggestion schemes, and that our solution is able to recommend tasks that users will likely perform in the near future.
Work in collaboration with Claudio Lucchese, Gabriele Tolomei, Raffaele Perego, and Fabrizio Silvestri.
References:
[1] C. Lucchese, S. Orlando, R. Perego, F. Silvestri, G. Tolomei. "Identifying Task-based Sessions in Search Engine Query Logs". Fourth ACM Int.l Conference on Web Search and Data Mining (WSDM 2011), Hong Kong, February 9-12, 2011.
[2] C. Lucchese, S. Orlando, R. Perego, F. Silvestri, G. Tolomei. "Discovering Tasks from Search Engine Query Logs". To appear in ACM Transactions on Information Systems (TOIS).
[3] C. Lucchese, S. Orlando, R. Perego, F. Silvestri, G. Tolomei. "Modeling and Predicting the Task-by-Task Behavior of Search Engine Users". To appear in Proc. OAIR 2013, Int.l Conference in the RIAO series.
A study of Information Management in the Patient Surgical Pathway in NHS Scotland (03 June, 2013)
Speaker: Matt-Mouley Bouamrane
We conducted a study of information management processes across the patient surgical pathway in NHS Scotland. While the majority of General Practitioners (GPs) consider electronic information systems as an essential and integral part of their work during the patient consultation, many were not fully satisfied with the functionalities of these systems. A majority of GPs considered that the national eReferral system streamlined referral processes. Almost all GPs reported marked variability in the quality of discharge information. Preoperative processes vary significantly across Scotland, with most services using paper-based systems. There is insufficient use made of information provided through the patient electronic referral and a considerable duplication of effort with the work already performed in primary care. Three health boards have implemented electronic preoperative information systems. These have transformed clinical practices and facilitated communication and information-sharing among the multi-disciplinary team and within the health boards. Substantial progress has been made towards improving information transfer and sharing within the surgical pathway in recent years but there remains scope for further improvements at the interface between services.
The Use of Correspondence Analysis in Information Retrieval (11 July, 2013)
Speaker: Dr Taner Dincer
This presentation will introduce the application of Correspondence Analysis (CA) to Information Retrieval. CA is a well-established multivariate, statistical, exploratory data analysis technique. Multivariate data analysis techniques usually operate on a rectangular array of real numbers called a data matrix, whose rows represent r observations (for example, r terms/words in documents) and whose columns represent c variables (for example, c documents, resulting in an r × c term-by-document matrix). Multivariate data analysis refers to analysing the data in a manner that takes into account the relationships among observations and also among variables. In contrast to univariate statistics, it is concerned with the joint nature of measurements. The objective of exploratory data analysis is to explore the relationships among objects and among variables over measurements for the purpose of visual inspection. In particular, by using CA one can visually study the “Divergence From Independence” (DFI) among observations and among variables.
For Information Retrieval (IR), CA can serve three different uses: 1) as an analysis tool to visually inspect the results of information retrieval experiments, 2) as a basis to unify the probabilistic approaches to the term weighting problem, such as Divergence From Randomness and Language Models, and 3) as a term weighting model itself, "term weighting based on measuring divergence from independence". In this presentation, the uses of CA for these three purposes are exemplified.
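To make the notion of divergence from independence concrete, the following sketch compares each cell of a small term-by-document count matrix with the count that would be expected if terms and documents were independent. It reports the standardised residual (observed minus expected, divided by the square root of expected). The toy counts and the function name are hypothetical, and this is an illustration of the general idea rather than the speaker's term weighting model.

```python
import math

def standardised_residuals(counts):
    """Return (observed - expected) / sqrt(expected) for every cell.

    counts is an r x c list of lists: rows are terms, columns are documents.
    expected[i][j] = row_total[i] * col_total[j] / grand_total, i.e. the count
    implied by independence between terms and documents.
    Assumes no row or column is entirely zero (expected would then be 0).
    """
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(row_totals)
    residuals = []
    for i, row in enumerate(counts):
        out_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            out_row.append((observed - expected) / math.sqrt(expected))
        residuals.append(out_row)
    return residuals

# Hypothetical 3-term by 2-document count matrix.
toy_counts = [
    [4, 0],
    [2, 2],
    [0, 4],
]
toy_residuals = standardised_residuals(toy_counts)
```

In this toy matrix the middle term is spread evenly over both documents, so its residuals are zero, while the first and last terms diverge from independence in opposite directions.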
[SICSA DVF] Language variation and influence in social media (15 July, 2013)
Speaker: Dr. Jacob Eisenstein
Languages vary by speaker and situation, and change over time. While variation and change are inhibited in written corpora such as news text, they are endemic to social media, enabling large-scale investigation of language's social and temporal dimensions. The first part of this talk will describe a method for characterizing group-level language differences, using the Sparse Additive Generative Model (SAGE). SAGE is based on a re-parametrization of the multinomial distribution that is amenable to sparsity-inducing regularization and facilitates joint modeling across many author characteristics. The second part of the talk concerns change and influence. Using a novel dataset of geotagged word counts, we induce a network of linguistic influence between cities, aggregating across thousands of words. We then explore the demographic and geographic factors that drive spread of new words between cities. This work is in collaboration with Amr Ahmed, Brendan O'Connor, Noah A. Smith, and Eric P. Xing.
Biography
Jacob Eisenstein is an Assistant Professor in the School of Interactive Computing at Georgia Tech. He works on statistical natural language processing, focusing on social media analysis, discourse, and latent variable models. Jacob was a Postdoctoral researcher at Carnegie Mellon and the University of Illinois. He completed his Ph.D. at MIT in 2008, winning the George M. Sprowls dissertation award.
How cost affects search behaviour (21 July, 2013)
Speaker: Leif Azzopardi
In this talk, I will run through the work I will be presenting at SIGIR on "How cost affects search behavior". The empirical analysis is motivated and underpinned using the Search Economic Theory that I proposed at SIGIR 2011.
Toward Models and Measures of Findability (21 July, 2013)
Speaker: Colin Wilkie
In this 10-minute talk, I will provide an overview of the project I am working on, which is about Findability, and review some of the existing models and measures of findability, before outlining the models that I have been working on.
Quantum Language Models (19 August, 2013)
Speaker: Alessandro Sordoni
A joint analysis of both Vector Space and Language Models for IR
using the mathematical framework of Quantum Theory revealed how both
models allocate the space of density matrices. A density matrix is
shown to be a general representational tool capable of leveraging
capabilities of both VSM and LM representations thus paving the way
for a new generation of retrieval models. The new approach is called
Quantum Language Modeling (QLM) and has shown its efficiency and
effectiveness in modeling term dependencies for Information
Retrieval.
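A small numerical sketch may help to show how a density matrix can cover both kinds of representation. It assumes a three-term vocabulary, treats a single term as a projector onto a basis vector and a term dependency as a projector onto an equal superposition of two basis vectors, and scores each event as the trace of its projector multiplied by the density matrix. The vocabulary, weights and helper names are illustrative assumptions, and the sketch does not reproduce the QLM estimation procedure.

```python
import math

def outer(u, v):
    """Outer product |u><v| of two real vectors."""
    return [[a * b for b in v] for a in u]

def trace_of_product(a, b):
    """tr(AB) for two square matrices of the same size."""
    n = len(a)
    return sum(a[i][k] * b[k][i] for i in range(n) for k in range(n))

# A unigram language model viewed as a diagonal density matrix (trace 1).
unigram_probs = [0.5, 0.3, 0.2]
rho_diag = [[unigram_probs[i] if i == j else 0.0 for j in range(3)]
            for i in range(3)]

# Projector onto term 0 alone, and onto an equal superposition of terms 0 and 1.
e0 = [1.0, 0.0, 0.0]
dep = [1 / math.sqrt(2), 1 / math.sqrt(2), 0.0]
proj_term0 = outer(e0, e0)
proj_dep = outer(dep, dep)

p_term0 = trace_of_product(rho_diag, proj_term0)   # recovers the unigram probability
p_dep_diag = trace_of_product(rho_diag, proj_dep)  # average of the two term probabilities

# A non-diagonal density matrix: a mixture of the dependency state and term 2.
rho_mixed = [[0.8 * proj_dep[i][j] + (0.2 if i == j == 2 else 0.0)
              for j in range(3)] for i in range(3)]
p_dep_mixed = trace_of_product(rho_mixed, proj_dep)
```

The diagonal matrix can only assign the dependency event the average of its two term probabilities, whereas the non-diagonal matrix can give that event its own weight, which is the extra expressiveness the abstract alludes to.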
Exploration and contextualization: towards reusable tools for the humanities (16 September, 2013)
Speaker: Marc Bron
The introduction of new technologies, access to large electronic
cultural heritage repositories, and the availability of new
information channels continue to change the way humanities
researchers work and the questions they seek to answer. In this talk I
will discuss how the research cycle of humanities researchers has been
affected by these changes and argue for the continued development of
interactive information retrieval tools to support the research
practices of humanities researchers. Specifically, I will focus on two
phases in the humanities research cycle: the exploration phase and
contextualization phase. In the first part of the talk I discuss work
on the development and evaluation of search interfaces aimed at
supporting exploration. In the second part of the talk I will focus on
how information retrieval technology focused on identifying relations
between concepts may be used to develop applications that support
contextualization.
Validity and Reliability in Cranfield-like Evaluation in Information Retrieval (23 September, 2013)
Speaker: Julián Urbano
The Cranfield paradigm to Information Retrieval evaluation has been used for half a century now as the means to compare retrieval techniques and advance the state of the art accordingly. However, this paradigm makes certain assumptions that remain a research problem in Information Retrieval and that may invalidate our experimental results.
In this talk I will approach the Cranfield paradigm as a statistical estimator of certain probability distributions that describe the final user experience. These distributions are estimated with a test collection, which actually computes system-related distributions that are assumed to be correlated with the target user-related distributions. From the point of view of validity, I will discuss the strength of that correlation and how it affects the conclusions we draw from an evaluation experiment. From the point of view of reliability, I will discuss past and current practice to measure the reliability of test collections and review several of them accordingly.
