Information Retrieval
The group has a long and strong research history spanning the whole information retrieval process, from theoretical modelling of retrieval, through building large-scale text retrieval systems, to the interactive evaluation of multimedia information retrieval systems. The research interests of the group include:
- Theoretical modelling of IR systems
- Probabilistic retrieval
- Web information retrieval
- Implementation of large-scale IR systems
- Multimedia (Image, Video, Audio) information retrieval
- Intranet/Enterprise and Blog search
- Distributed and Peer-to-Peer retrieval
- User Modelling and the development of novel adaptive interaction techniques
- Evaluation of IR systems
- Text mining and knowledge discovery
- Multilingual information retrieval
- Semantic Web and information retrieval
The group maintains strong links with researchers in Machine Learning and Human-Computer Interaction, as well as with industry through knowledge and technology transfer.
- RAENG Fellowship
Foundations research in information retrieval inspired by quantum theory.
Prof C.J. van Rijsbergen, 2007-2012.
Academic Staff: Dr Leif Azzopardi, Prof Joemon Jose, Dr Iadh Ounis, Prof C J (Keith) van Rijsbergen.
Research Fellows: Dr Sachi Arafat
Research Assistants and Research Students: Dr Craig Macdonald, Mr Richard McCreadie, Mr Andrew McHugh, Mr Yashar Moshfeghi, Mr Jie Peng, Mr Rodrygo Santos, Mr Ke Zhou
- Theoretical development of probabilistic and logic-based models
- multimedia IR systems
- information analysis and access across media
- evaluation and usability of IR systems
- data mining of large data sets
- Web information retrieval
- citation/link analysis
- implementation and evaluation of large-scale IR systems
- performance prediction and optimisation
- information retrieval in context
- multilingual retrieval
- interaction techniques based on implicit relevance feedback and summarisation
- adaptive information retrieval
- Intranet/Enterprise and Blog search
COPE: interactive image retrieval using conversational recommendation
Balcer, B., Halvey, M.
On the usefulness of query features for learning to rank
Macdonald, C.
An examination of content farms in web search using crowdsourcing
McCreadie, R., Macdonald, C.
On building a reusable Twitter corpus
McCreadie, R., Soboroff, I., Lin, J., Macdonald, C.
Effect of dynamic pruning safety on learning to rank effectiveness
Macdonald, C.
Scheduling queries across replicas
Freire, A., Macdonald, C.
Exploiting term dependence while handling negation in medical search
Limsopatham, N., Macdonald, C.
CrowdTerrier: automatic crowdsourced relevance assessments with Terrier
McCreadie, R., Macdonald, C.
Identifying top news using crowdsourcing
McCreadie, R., Macdonald, C.
A task-specific query and document representation for medical records search
Limsopatham, N., Macdonald, C.
Hybrid query scheduling for a replicated search engine
Freire, A., Macdonald, C.
Exploiting query logs and field-based models to address term mismatch in an HIV/AIDS FAQ retrieval system
Thuma, E., Rogers, S.
Evaluating bad query abandonment in an iterative SMS-Based FAQ retrieval system
Thuma, E., Rogers, S.
Efficient and Effective Retrieval using Selective Pruning
Tonellotto, N., Macdonald, C.
The whens and hows of learning to rank for web search
Macdonald, C.
Searching in peer-to-peer networks
Klampanos, I.A., and Jose, J.M.
Bayesian probabilistic models for image retrieval
Stathopoulos, V.
Assessing and predicting vertical intent for web queries
Zhou, K.
Finding data resources in a virtual observatory using SKOS vocabularies
Gray, A.J.G.
Can RDB2RDF tools feasibly expose large science archives for data integration?
Gray, A.J.G.
This Week’s Events
There are no events scheduled for this week
Upcoming Events
Fast and Reliable Online Learning to Rank for Information Retrieval
Information Retrieval
Speaker: Katja Hofmann
Date: 06 May, 2013
Time: 16:00 - 17:00
Location: Sir Alwyn Williams Building, 423 Seminar Room
Online learning to rank for information retrieval (IR) holds promise for allowing the development of "self-learning search engines" that can automatically adjust to their users. With the large amounts of data (e.g., click data) that can be collected in web search settings, such techniques could enable highly scalable ranking optimization. However, feedback obtained from user interactions is noisy, and developing approaches that can learn from this feedback quickly and reliably is a major challenge.
In this talk I will present my recent work, which addresses the challenges posed by learning from natural user interactions. First, I will detail a new method, called Probabilistic Interleave, for inferring user preferences from users' clicks on search results. I show that this method allows unbiased and fine-grained ranker comparison using noisy click data, and that this is the first such method that allows the effective reuse of historical data (i.e., collected for previous comparisons) to infer information about new rankers. Second, I show that Probabilistic Interleave enables new online learning to rank approaches that can reuse historical interaction data to speed up learning by several orders of magnitude, especially under high levels of noise in user feedback. I conclude with an outlook on research directions in online learning to rank for IR that are opened up by our results.
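As a rough illustration of how clicks on a merged result list can be used to compare two rankers, here is a minimal Python sketch of team-draft interleaving. This is a simpler, well-known interleaving scheme and not the Probabilistic Interleave method presented in the talk. The function names and the example document identifiers are hypothetical.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng):
    """Merge two rankings; return the merged list and each document's team."""
    total = len(set(ranking_a) | set(ranking_b))
    interleaved, team_of = [], {}
    picks = {"A": 0, "B": 0}
    rankings = {"A": ranking_a, "B": ranking_b}
    while len(interleaved) < total:
        # The ranker with fewer picks goes next; a coin flip breaks ties.
        if picks["A"] != picks["B"]:
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            team = "A" if rng.random() < 0.5 else "B"
        # Take that ranker's best document that has not been shown yet.
        doc = next((d for d in rankings[team] if d not in team_of), None)
        if doc is None:
            # This ranker is exhausted, so the other one picks instead.
            team = "B" if team == "A" else "A"
            doc = next(d for d in rankings[team] if d not in team_of)
        team_of[doc] = team
        interleaved.append(doc)
        picks[team] += 1
    return interleaved, team_of


def click_credit(team_of, clicked_docs):
    """Count clicks per team; the team with more clicks wins the comparison."""
    credit = {"A": 0, "B": 0}
    for doc in clicked_docs:
        credit[team_of[doc]] += 1
    return credit


# Hypothetical example: two rankers over four documents, one click at rank 1.
rng = random.Random(0)
shown, team_of = team_draft_interleave(["d1", "d2", "d3"], ["d2", "d4", "d1"], rng)
credit = click_credit(team_of, [shown[0]])
```

Each click is credited to the ranker that contributed the clicked document, so aggregating credit over many queries gives a noisy preference signal between the two rankers.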
Past Events
From Search to Adaptive Search (12 November, 2012)
Speaker: Udo Kruschwitz
Modern search engines have been moving away from very simplistic interfaces that aimed at satisfying a user's need with a single-shot query. Interactive features such as query suggestions and faceted search are now integral parts of Web search engines. Generating good query modification suggestions or alternative queries to assist a searcher remains, however, a challenging issue. Query log analysis is one of the major strands of work in this direction. While much research has been performed on query logs collected on the Web as a whole, query log analysis to enhance search on smaller and more focused collections (such as intranets, digital libraries and local Web sites) has attracted less attention. The talk will look at a number of directions we have explored at the University of Essex in addressing this problem by automatically acquiring continuously updated domain models using query and click logs (as well as other sources).
Context data in lifelog retrieval (19 November, 2012)
Speaker: Liadh Kelly
Advances in digital technologies for information capture combined with
massive increases in the capacity of digital storage media mean that it is
now possible to capture and store much of one's life experiences in a
personal lifelog. Information can be captured from a myriad of personal
information devices including desktop computers, mobile phones, digital
cameras, and various sensors, including GPS, Bluetooth, and biometric
devices. This talk centers on the investigation of the challenges of
retrieval in this emerging domain and on the examination of the utility of
several implicitly recorded and derived context types in meeting these
challenges. For these investigations, unique, rich multimodal personal
lifelog collections of 20 months' duration are used. These collections
contain all items accessed on subjects' PCs and laptops (email, web pages,
Word documents, etc.), passively captured images depicting subjects' lives
using the SenseCam device (http://research.microsoft.com/sensecam), and
mobile text messages sent and received. Items are annotated with several
rich sources of automatically derived context data types including
biometric data (galvanic skin response, heart rate, etc.), geo-location
(captured using GPS data), people present (captured using Bluetooth data),
weather conditions, light status, and several context types related to the
dates and times of accesses to items.
Probabilistic rule-based argumentation for norm-governed learning agents (28 January, 2013)
Speaker: Sebastian Riedel
There is a vast and ever-increasing amount of unstructured textual data at our disposal. The ambiguity, variability and expressivity of language make this data difficult to analyse, mine, search, visualise, and, ultimately, base decisions on. These challenges have motivated efforts to enable machine reading: computers that can read text and convert it into semantic representations, such as the Google Knowledge Graph for general facts, or pathway databases in the biomedical domain. These representations can then be harnessed by machines and humans alike. At the heart of machine reading is relation extraction: reading text to create a semantic network of entities and their relations, such as employeeOf(Person,Company), regulates(Protein,Protein) or causes(Event,Event).
In this talk I will present a series of graphical models and matrix factorisation techniques that can learn to extract relations. I will start by contrasting a fully supervised approach with one that leverages pre-existing semantic knowledge (for example, in the Freebase database) to reduce annotation costs. I will then present ways to extract additional relations that are not yet part of the schema, and for which no pre-existing semantic knowledge is available. I will show that by doing so we can not only extract richer knowledge but also improve the extraction quality of relations within the original schema. This improves on the previous state of the art by more than 10 percentage points in mean average precision.
[IR] Searching the Temporal Web: Challenges and Current Approaches (04 February, 2013)
Speaker: Nattiya Kanhabua
In this talk, we will give a survey of current approaches to searching the
temporal web. In such a web collection, the contents are created and/or
edited over time, and examples are web archives, news archives, blogs,
micro-blogs, personal emails and enterprise documents. Unfortunately,
traditional IR approaches based on term-matching only can give
unsatisfactory results when searching the temporal web. The reason for this
is multifold: 1) the collection is strongly time-dependent, i.e., with
multiple versions of documents, 2) the contents of documents are about
events that happened at particular time periods, 3) the meanings of semantic
annotations can change over time, and 4) a query representing an information
need can be time-sensitive, a so-called temporal query.
Several major challenges in searching the temporal web will be discussed,
namely, 1) How to understand temporal search intent represented by
time-sensitive queries? 2) How to handle the temporal dynamics of queries
and documents? and 3) How to explicitly model temporal information in
retrieval and ranking models? To this end, we will present current
approaches to the addressed problems as well as outline the directions for
future research.
Time-Biased Gain (21 February, 2013)
Speaker: Charlie Clarke
Time-biased gain provides a unifying framework for information retrieval evaluation, generalizing many traditional effectiveness measures while accommodating aspects of user behavior not captured by these measures. By using time as a basis for calibration against actual user data, time-biased gain can reflect aspects of the search process that directly impact user experience, including document length, near-duplicate documents, and summaries. Unlike traditional measures, which must be arbitrarily normalized for averaging purposes, time-biased gain is reported in meaningful units, such as the total number of relevant documents seen by the user. In work reported at SIGIR 2012, we proposed and validated a closed-form equation for estimating time-biased gain, explored its properties, and compared it to standard approaches. In work reported at CIKM 2012, we used stochastic simulation to numerically approximate time-biased gain, an approach that provides greater flexibility, allowing us to accommodate different types of user behavior and increase the realism of the effectiveness measure. In work reported at HCIR 2012, we extended our stochastic simulation to model the variation between users. In this talk, I will provide an overview of time-biased gain, and outline our ongoing and future work, including extensions to evaluate query suggestion, diversity, and whole-page relevance. This is joint work with Mark Smucker.
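To make the idea of a time-based effectiveness measure concrete, the following Python sketch sums per-document gain, discounted by the time a user would need to reach each document. It is a simplified illustration under stated assumptions, not the closed-form equation from the SIGIR 2012 work. The exponential decay, the half-life value, and the per-document reading times are all assumptions chosen for the example.

```python
import math


def time_biased_gain(gains, seconds_per_doc, half_life=60.0):
    """Sum gain over a ranking, discounted by the time taken to reach each rank.

    gains: gain of the document at each rank (e.g. 1 for relevant, 0 otherwise).
    seconds_per_doc: assumed time the user spends on each rank before moving on.
    half_life: assumed time (seconds) after which half of the users have stopped.
    """
    total = 0.0
    elapsed = 0.0  # time spent before reaching the current rank
    for gain, seconds in zip(gains, seconds_per_doc):
        # Fraction of users still searching after `elapsed` seconds.
        still_searching = math.exp(-elapsed * math.log(2) / half_life)
        total += gain * still_searching
        elapsed += seconds
    return total


# Hypothetical ranking: relevant documents at ranks 1 and 3, 60 seconds per document.
score = time_biased_gain([1, 0, 1], [60.0, 60.0, 60.0], half_life=60.0)
```

With binary gains, the score can be read as the expected number of relevant documents a user sees, which is the kind of meaningful unit the abstract refers to. Longer documents push later results further back in time and so reduce their contribution.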
Modelling Time & Demographics in Search Logs (01 March, 2013)
Speaker: Milad Shokouhi
Knowing users' context offers a great potential for personalizing web search results or related services such as query suggestion and query completion. Contextual features cover a wide range of signals; query time, user’s location, search history and demographics can all be regarded as contextual features that can be used for search personalization.
In this talk, we’ll focus on two main questions:
1) How can we use the existing contextual features, in particular time, to improve search results (Shokouhi & Radinsky, SIGIR’12)?
2) How can we infer missing contextual features, in particular user demographics, based on search history (Bi et al., WWW2013)?
Our results confirm that (1) contextual features matter and (2) that many of them can be inferred from search history.
Scientific Lenses over Linked Data: Identity Management in the Open PHACTS project (11 March, 2013)
Speaker: Alasdair Gray, University of Manchester
The discovery of new medicines requires pharmacologists to interact with a number of information sources ranging from tabular data to scientific papers, and other specialized formats. The Open PHACTS project, a collaboration of research institutions and major pharmaceutical companies, has developed a linked data platform for integrating multiple pharmacology datasets that form the basis for several drug discovery applications. The functionality offered by the platform has been drawn from a collection of prioritised drug discovery business questions created as part of the Open PHACTS project. Key features of the linked data platform are:
1) A domain-specific API making drug discovery linked data available for a diverse range of applications without requiring the application developers to become knowledgeable about semantic web standards such as SPARQL;
2) Just-in-time identity resolution and alignment across datasets, enabling a variety of entry points to the data and ultimately supporting different integrated views of the data;
3) Centrally cached copies of public datasets to support interactive response times for user-facing applications.
Within complex scientific domains such as pharmacology, operational equivalence between two concepts is often context-, user- and task-specific. Existing linked data integration procedures and equivalence services do not take the context and task of the user into account. We enable users of the Open PHACTS platform to control the notion of operational equivalence by applying scientific lenses over linked data. The scientific lenses vary the links that are activated between the datasets, which affects the data returned to the user.
Bio
Alasdair is a researcher in the MyGrid team at the University of Manchester. He is currently working on the Open PHACTS project which is building an Open Pharmacological Space to integrate drug discovery data. Alasdair gained his PhD from Heriot-Watt University, Edinburgh, and then worked as a post-doctoral researcher in the Information Retrieval Group at the University of Glasgow. He has spent the last 10 years working on novel knowledge management projects investigating issues of relating data sets.
Reusing Historical Interaction Data for Faster Online Learning to Rank for IR (12 March, 2013)
Speaker: Anne Schuth
Online learning to rank for information retrieval (IR) holds promise
for allowing the development of "self-learning" search engines that can
automatically adjust to their users. With the large amounts of data (e.g.,
click data) that can be collected in web search settings, such
techniques could enable highly scalable ranking optimization. However,
feedback obtained from user interactions is noisy, and developing
approaches that can learn from this feedback quickly and reliably is a
major challenge.
In this paper we investigate whether and how previously collected
(historical) interaction data can be used to speed up learning in
online learning to rank for IR. We devise the first two methods that
can utilize historical data (1) to make feedback available during
learning more reliable and (2) to preselect candidate ranking functions
to be evaluated in interactions with users of the retrieval system. We
evaluate both approaches on 9 learning to rank data sets and find that
historical data can speed up learning, leading to substantially and
significantly higher online performance. In particular, our
preselection method proves highly effective at compensating for noise in
user feedback. Our results show that historical data can be used to
make online learning to rank for IR much more effective than previously
possible, especially when feedback is noisy.
Query Classification for a Digital Library (18 March, 2013)
Speaker: Deirdre Lungley
The motivation for our query classification is the insight it gives the digital content provider into what his users are searching for and hence how his collection could be extended. This talk details two query classification methodologies we have implemented as part of the GALATEAS project (http://www.galateas.eu/): one log-based, the other using wikified queries to learn a Labelled LDA model. An analysis of their respective classification errors indicates the method best suited to particular category groups.
Entity Linking for Semantic Search (29 April, 2013)
Speaker: Edgar Meij
Semantic annotations have recently received renewed interest with the explosive increase in the amount of textual data being produced, the advent of advanced NLP techniques, and the maturing of the web of data. Such annotations hold the promise for improving information retrieval algorithms and applications by providing means to automatically understand the meaning of a piece of text. Indeed, when we look at the level of understanding that is involved in modern-day search engines (on the web or otherwise), we come to the obvious conclusion that there is still a lot of room for improvement. Although some recent advances are pushing the boundaries already, information items are still retrieved and ordered mainly using their textual representation, with little or no knowledge of what they actually mean. In this talk I will present my recent and ongoing work, which addresses the challenges associated with leveraging semantic annotations for intelligent information access. I will introduce a recently proposed method for entity linking and show how it can be applied to several tasks related to semantic search on collections of different types, genres, and origins.
