Information Retrieval

The group has a long and strong research record spanning the whole information retrieval process, from theoretical modelling of retrieval, through the building of large-scale text retrieval systems, to the interactive evaluation of multimedia information retrieval systems. The research interests of the group include:

- Theoretical modelling of IR systems
- Probabilistic retrieval
- Web information retrieval
- Implementation of large-scale IR systems
- Multimedia (Image, Video, Audio) information retrieval
- Intranet/Enterprise and Blog search
- Distributed and Peer-to-Peer retrieval
- User Modelling and the development of novel adaptive interaction techniques
- Evaluation of IR systems
- Text mining and knowledge discovery
- Multilingual information retrieval
- Semantic Web and information retrieval

The group maintains strong links with researchers in Machine Learning and Human-Computer Interaction, as well as with industry through knowledge and technology transfer.

  • RAENG Fellowship
    Foundations research in information retrieval inspired by quantum theory.
    Prof C.J. van Rijsbergen, 2007-2012.
  • Theoretical development of probabilistic and logic-based models
  • multimedia IR systems
  • information analysis and access across media
  • evaluation and usability of IR systems
  • data mining of large data sets
  • Web information retrieval
  • citation/link analysis
  • implementation and evaluation of large-scale IR systems
  • performance prediction and optimisation
  • information retrieval in context
  • multilingual retrieval
  • interaction techniques based on implicit relevance feedback and summarisation
  • adaptive information retrieval
  • Intranet/Enterprise and Blog search

COPE: interactive image retrieval using conversational recommendation
Balcer, B., Halvey, M., Jose, J.M., and Brewster, S. (2012) COPE: interactive image retrieval using conversational recommendation. In: BCS-HCI '12 - 26th Annual BCS Interaction Specialist Group Conference on People and Computers, 12-14 Sep 2012, Birmingham, UK.

On the usefulness of query features for learning to rank
Macdonald, C., Santos, R.L.T., and Ounis, I. (2012) On the usefulness of query features for learning to rank. In: CIKM 2012: 21st ACM International Conference on Information and Knowledge Management, 29 Oct - 2 Nov 2012, Maui HI, USA.

An examination of content farms in web search using crowdsourcing
McCreadie, R., Macdonald, C., Ounis, I., Giles, J., and Jabr, F. (2012) An examination of content farms in web search using crowdsourcing. In: CIKM 2012: 21st ACM International Conference on Information and Knowledge Management, 29 Oct - 2 Nov 2012, Maui HI, USA.

On building a reusable Twitter corpus
McCreadie, R., Soboroff, I., Lin, J., Macdonald, C., Ounis, I., and McCullough, D. (2012) On building a reusable Twitter corpus. In: SIGIR 2012: 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 12-16 Aug 2012, Portland OR, USA.

Effect of dynamic pruning safety on learning to rank effectiveness
Macdonald, C., Tonellotto, N., and Ounis, I. (2012) Effect of dynamic pruning safety on learning to rank effectiveness. In: SIGIR 2012: 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 12-16 Aug 2012, Portland OR, USA.

Scheduling queries across replicas
Freire, A., Macdonald, C., Tonellotto, N., Ounis, I., and Cacheda, F. (2012) Scheduling queries across replicas. In: SIGIR 2012: 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 12-16 Aug 2012, Portland OR, USA.

Exploiting term dependence while handling negation in medical search
Limsopatham, N., Macdonald, C., McCreadie, R., and Ounis, I. (2012) Exploiting term dependence while handling negation in medical search. In: SIGIR 2012: 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 12-16 Aug 2012, Portland OR, USA.

CrowdTerrier: automatic crowdsourced relevance assessments with Terrier
McCreadie, R., Macdonald, C., and Ounis, I. (2012) CrowdTerrier: automatic crowdsourced relevance assessments with Terrier. In: SIGIR 2012: 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 12-16 Aug 2012, Portland OR, USA.

Identifying top news using crowdsourcing
McCreadie, R., Macdonald, C., and Ounis, I. (2013) Identifying top news using crowdsourcing. Information Retrieval, 16 (2). pp. 179-209. ISSN 1386-4564 (doi:10.1007/s10791-012-9186-z)

A task-specific query and document representation for medical records search
Limsopatham, N., Macdonald, C., and Ounis, I. (2013) A task-specific query and document representation for medical records search. In: 35th European Conference on Information Retrieval (ECIR 2013), 24-27 Mar 2013, Moscow, Russia.

Hybrid query scheduling for a replicated search engine
Freire, A., Macdonald, C., Tonellotto, N., Ounis, I., and Cacheda, F. (2013) Hybrid query scheduling for a replicated search engine. In: 35th European Conference on Information Retrieval (ECIR 2013), 24-27 Mar 2013, Moscow, Russia.

Exploiting query logs and field-based models to address term mismatch in an HIV/AIDS FAQ retrieval system
Thuma, E., Rogers, S., and Ounis, I. (2013) Exploiting query logs and field-based models to address term mismatch in an HIV/AIDS FAQ retrieval system. In: Proceedings of 18th International Conference on Application of Natural Language to Information Systems (NLDB2013), 19-21 June 2013, University of Salford, MediaCityUK. University of Salford, Salford, UK. (In Press)

Evaluating bad query abandonment in an iterative SMS-Based FAQ retrieval system
Thuma, E., Rogers, S., and Ounis, I. (2013) Evaluating bad query abandonment in an iterative SMS-Based FAQ retrieval system. In: OAIR 2013, 22-24 May 2013, Lisbon, Portugal. (In Press)

Efficient and Effective Retrieval using Selective Pruning
Tonellotto, N., Macdonald, C., and Ounis, I. (2013) Efficient and Effective Retrieval using Selective Pruning. In: 6th ACM International Conference on Web Search and Data Mining, Rome, Italy. (In Press)

The whens and hows of learning to rank for web search
Macdonald, C., Santos, R.L.T., and Ounis, I. (2013) The whens and hows of learning to rank for web search. Information Retrieval. ISSN 1386-4564 (doi:10.1007/s10791-012-9209-9) (In Press)

Searching in peer-to-peer networks
Klampanos, I.A., and Jose, J.M. (2013) Searching in peer-to-peer networks. Computer Science Review. ISSN 1574-0137 (doi:10.1016/j.cosrev.2012.07.001) (In Press)

Bayesian probabilistic models for image retrieval
Stathopoulos, V., and Jose, J.M. (2011) Bayesian probabilistic models for image retrieval. Journal of Machine Learning Research: Workshop and Conference Proceedings, 17. pp. 41-47. ISSN 1938-7228

Assessing and predicting vertical intent for web queries
Zhou, K., Cummins, R., Halvey, M., Lalmas, M., and Jose, J.M. (2012) Assessing and predicting vertical intent for web queries. Lecture Notes in Computer Science, 7224. pp. 499-502. ISSN 0302-9743 (doi:10.1007/978-3-642-28997-2_50)

Finding data resources in a virtual observatory using SKOS vocabularies
Gray, A.J.G., Gray, N., and Ounis, I. (2008) Finding data resources in a virtual observatory using SKOS vocabularies. Lecture Notes in Computer Science, 5071. pp. 189-192. ISSN 0302-9743 (doi:10.1007/978-3-540-70504-8_19)

Can RDB2RDF tools feasibly expose large science archives for data integration?
Gray, A.J.G., Gray, N., and Ounis, I. (2009) Can RDB2RDF tools feasibly expose large science archives for data integration? Lecture Notes in Computer Science, 5554. pp. 491-505. ISSN 0302-9743 (doi:10.1007/978-3-642-02121-3_37)

This Week’s Events

There are no events scheduled for this week

Upcoming Events

Fast and Reliable Online Learning to Rank for Information Retrieval

Information Retrieval
Speaker: Katja Hofmann
Date: 06 May, 2013
Time: 16:00 - 17:00
Location: Sir Alwyn Williams Building, 423 Seminar Room

Online learning to rank for information retrieval (IR) holds promise for allowing the development of "self-learning search engines" that can automatically adjust to their users. With the large amount of e.g., click data that can be collected in web search settings, such techniques could enable highly scalable ranking optimization. However, feedback obtained from user interactions is noisy, and developing approaches that can learn from this feedback quickly and reliably is a major challenge.

In this talk I will present my recent work, which addresses the challenges posed by learning from natural user interactions. First, I will detail a new method, called Probabilistic Interleave, for inferring user preferences from users' clicks on search results. I show that this method allows unbiased and fine-grained ranker comparison using noisy click data, and that this is the first such method that allows the effective reuse of historical data (i.e., collected for previous comparisons) to infer information about new rankers. Second, I show that Probabilistic Interleave enables new online learning to rank approaches that can reuse historical interaction data to speed up learning by several orders of magnitude, especially under high levels of noise in user feedback. I conclude with an outlook on research directions in online learning to rank for IR that are opened up by our results.

Past Events

From Search to Adaptive Search (12 November, 2012)

Speaker: Udo Kruschwitz

Modern search engines have been moving away from very simplistic interfaces that aimed at satisfying a user's need with a single-shot query. Interactive features such as query suggestions and faceted search are now integral parts of Web search engines. Generating good query modification suggestions or alternative queries to assist a searcher remains, however, a challenging issue. Query log analysis is one of the major strands of work in this direction. While much research has been performed on query logs collected on the Web as a whole, query log analysis to enhance search on smaller and more focused collections (such as intranets, digital libraries and local Web sites) has attracted less attention. The talk will look at a number of directions we have explored at the University of Essex in addressing this problem by automatically acquiring continuously updated domain models using query and click logs (as well as other sources).

Context data in lifelog retrieval (19 November, 2012)

Speaker: Liadh Kelly

Advances in digital technologies for information capture, combined with massive increases in the capacity of digital storage media, mean that it is now possible to capture and store much of one's life experiences in a personal lifelog. Information can be captured from a myriad of personal information devices including desktop computers, mobile phones, digital cameras, and various sensors, including GPS, Bluetooth, and biometric devices. This talk centers on the investigation of the challenges of retrieval in this emerging domain and on the examination of the utility of several implicitly recorded and derived context types in meeting these challenges. For these investigations, unique, rich multimodal personal lifelog collections of 20 months' duration are used. These collections contain all items accessed on subjects' PCs and laptops (email, web pages, Word documents, etc.), passively captured images depicting subjects' lives using the SenseCam device (http://research.microsoft.com/sensecam), and mobile text messages sent and received. Items are annotated with several rich sources of automatically derived context data types including biometric data (galvanic skin response, heart rate, etc.), geo-location (captured using GPS data), people present (captured using Bluetooth data), weather conditions, light status, and several context types related to the dates and times of accesses to items.

Probabilistic rule-based argumentation for norm-governed learning agents (28 January, 2013)

Speaker: Sebastian Riedel

There is a vast and ever-increasing amount of unstructured textual data at our disposal. The ambiguity, variability and expressivity of language make this data difficult to analyse, mine, search, visualise, and, ultimately, base decisions on. These challenges have motivated efforts to enable machine reading: computers that can read text and convert it into semantic representations, such as the Google Knowledge Graph for general facts, or pathway databases in the biomedical domain. These representations can then be harnessed by machines and humans alike. At the heart of machine reading is relation extraction: reading text to create a semantic network of entities and their relations, such as employeeOf(Person,Company), regulates(Protein,Protein) or causes(Event,Event).

In this talk I will present a series of graphical models and matrix factorisation techniques that can learn to extract relations. I will start by contrasting a fully supervised approach with one that leverages pre-existing semantic knowledge (for example, in the Freebase database) to reduce annotation costs. I will then present ways to extract additional relations that are not yet part of the schema, and for which no pre-existing semantic knowledge is available. I will show that by doing so we can not only extract richer knowledge, but also improve extraction quality of relations within the original schema. This helps to improve over the previous state of the art by more than 10 percentage points in mean average precision.

Searching the Temporal Web: Challenges and Current Approaches (04 February, 2013)

Speaker: Nattiya Kanhabua

In this talk, we will give a survey of current approaches to searching the temporal web. In such a web collection, the contents are created and/or edited over time; examples are web archives, news archives, blogs, micro-blogs, personal emails and enterprise documents. Unfortunately, traditional IR approaches based only on term matching can give unsatisfactory results when searching the temporal web. The reasons for this are manifold: 1) the collection is strongly time-dependent, i.e., with multiple versions of documents, 2) the contents of documents are about events that happened at particular time periods, 3) the meanings of semantic annotations can change over time, and 4) a query representing an information need can be time-sensitive, a so-called temporal query.

Several major challenges in searching the temporal web will be discussed, namely, 1) How to understand temporal search intent represented by time-sensitive queries? 2) How to handle the temporal dynamics of queries and documents? and 3) How to explicitly model temporal information in retrieval and ranking models? To this end, we will present current approaches to the addressed problems as well as outline the directions for future research.

Time-Biased Gain (21 February, 2013)

Speaker: Charlie Clarke

Time-biased gain provides a unifying framework for information retrieval evaluation, generalizing many traditional effectiveness measures while accommodating aspects of user behavior not captured by these measures. By using time as a basis for calibration against actual user data, time-biased gain can reflect aspects of the search process that directly impact user experience, including document length, near-duplicate documents, and summaries. Unlike traditional measures, which must be arbitrarily normalized for averaging purposes, time-biased gain is reported in meaningful units, such as the total number of relevant documents seen by the user. In work reported at SIGIR 2012, we proposed and validated a closed-form equation for estimating time-biased gain, explored its properties, and compared it to standard approaches. In work reported at CIKM 2012, we used stochastic simulation to numerically approximate time-biased gain, an approach that provides greater flexibility, allowing us to accommodate different types of user behavior and increase the realism of the effectiveness measure. In work reported at HCIR 2012, we extended our stochastic simulation to model the variation between users. In this talk, I will provide an overview of time-biased gain, and outline our ongoing and future work, including extensions to evaluate query suggestion, diversity, and whole-page relevance. This is joint work with Mark Smucker.

Modelling Time & Demographics in Search Logs (01 March, 2013)

Speaker: Milad Shokouhi

Knowing users' context offers great potential for personalizing web search results or related services such as query suggestion and query completion. Contextual features cover a wide range of signals: query time, user’s location, search history and demographics can all be regarded as contextual features that can be used for search personalization.

In this talk, we’ll focus on two main questions:

1) How can we use the existing contextual features, in particular time, for improving search results (Shokouhi & Radinsky, SIGIR’12)?

2) How can we infer missing contextual features, in particular user demographics, based on search history (Bi et al., WWW2013)?

Our results confirm that (1) contextual features matter and (2) many of them can be inferred from search history.

Scientific Lenses over Linked Data: Identity Management in the Open PHACTS project (11 March, 2013)

Speaker: Alasdair Gray, University of Manchester

The discovery of new medicines requires pharmacologists to interact with a number of information sources ranging from tabular data to scientific papers, and other specialized formats. The Open PHACTS project, a collaboration of research institutions and major pharmaceutical companies, has developed a linked data platform for integrating multiple pharmacology datasets that form the basis for several drug discovery applications. The functionality offered by the platform has been drawn from a collection of prioritised drug discovery business questions created as part of the Open PHACTS project. Key features of the linked data platform are:

1) Domain-specific API making drug discovery linked data available for a diverse range of applications without requiring the application developers to become knowledgeable about semantic web standards such as SPARQL;

2) Just-in-time identity resolution and alignment across datasets, enabling a variety of entry points to the data and ultimately supporting different integrated views of the data;

3) Centrally cached copies of public datasets to support interactive response times for user-facing applications.

Within complex scientific domains such as pharmacology, operational equivalence between two concepts is often context-, user- and task-specific. Existing linked data integration procedures and equivalence services do not take the context and task of the user into account. We enable users of the Open PHACTS platform to control the notion of operational equivalence by applying scientific lenses over linked data. The scientific lenses vary the links that are activated between the datasets, which affects the data returned to the user.

Bio

Alasdair is a researcher in the MyGrid team at the University of Manchester. He is currently working on the Open PHACTS project which is building an Open Pharmacological Space to integrate drug discovery data. Alasdair gained his PhD from Heriot-Watt University, Edinburgh, and then worked as a post-doctoral researcher in the Information Retrieval Group at the University of Glasgow. He has spent the last 10 years working on novel knowledge management projects investigating issues of relating data sets.

http://www.cs.man.ac.uk/~graya/

Reusing Historical Interaction Data for Faster Online Learning to Rank for IR (12 March, 2013)

Speaker: Anne Schuth

Online learning to rank for information retrieval (IR) holds promise for allowing the development of "self-learning" search engines that can automatically adjust to their users. With the large amount of e.g., click data that can be collected in web search settings, such techniques could enable highly scalable ranking optimization. However, feedback obtained from user interactions is noisy, and developing approaches that can learn from this feedback quickly and reliably is a major challenge.

In this paper we investigate whether and how previously collected (historical) interaction data can be used to speed up learning in online learning to rank for IR. We devise the first two methods that can utilize historical data (1) to make feedback available during learning more reliable and (2) to preselect candidate ranking functions to be evaluated in interactions with users of the retrieval system. We evaluate both approaches on 9 learning to rank data sets and find that historical data can speed up learning, leading to substantially and significantly higher online performance. In particular, our preselection method proves highly effective at compensating for noise in user feedback. Our results show that historical data can be used to make online learning to rank for IR much more effective than previously possible, especially when feedback is noisy.

Query Classification for a Digital Library (18 March, 2013)

Speaker: Deirdre Lungley

The motivation for our query classification is the insight it gives digital content providers into what their users are searching for and hence how their collections could be extended. This talk details two query classification methodologies we have implemented as part of the GALATEAS project (http://www.galateas.eu/): one log-based, the other using wikified queries to learn a Labelled LDA model. An analysis of their respective classification errors indicates the method best suited to particular category groups.

Entity Linking for Semantic Search (29 April, 2013)

Speaker: Edgar Meij



Semantic annotations have recently received renewed interest with the explosive increase in the amount of textual data being produced, the advent of advanced NLP techniques, and the maturing of the web of data. Such annotations hold the promise for improving information retrieval algorithms and applications by providing means to automatically understand the meaning of a piece of text. Indeed, when we look at the level of understanding that is involved in modern-day search engines (on the web or otherwise), we come to the obvious conclusion that there is still a lot of room for improvement. Although some recent advances are pushing the boundaries already, information items are still retrieved and ordered mainly using their textual representation, with little or no knowledge of what they actually mean. In this talk I will present my recent and ongoing work, which addresses the challenges associated with leveraging semantic annotations for intelligent information access. I will introduce a recently proposed method for entity linking and show how it can be applied to several tasks related to semantic search on collections of different types, genres, and origins.