Information, Data and Analysis Section

Technological advances in sensing, data acquisition, mobile devices and the Internet are producing ever greater volumes of data, sampled more rapidly and comprehensively than ever before. If we are to acquire novel insights and knowledge from this data, these advances need to be matched by innovations in data management, storage and retrieval, and ultimately in data analytics. The many forms of data, and their complexity and variation, present challenges ranging from information and data systems, through algorithms and inference about patterns via modelling, to visualisation, communication and human-computer interaction.

The Information, Data and Analysis Section is led by Professor Roderick Murray-Smith, and has 13 academics and 35 Post-Doctoral Fellows, Research Assistants and Ph.D. students active in this area. Our research is organised into four world-leading groups: data systems, human-computer interaction & machine learning, information retrieval, and computer vision & autonomous systems.


This Week’s Events

Big Crisis Data - an exciting frontier for applied computing.

Group: Information Retrieval (IR)
Speaker: Carlos Castillo, Eurecat
Date: 24 April, 2017
Time: 15:00
Location: Sir Alwyn Williams Building, 422 Seminar Room

Social media is an invaluable source of time-critical information during a crisis. However, emergency response and humanitarian relief organizations that would like to use this information struggle with an avalanche of social media messages, exceeding their capacity to process them. In this talk, we will look at how interdisciplinary research has enabled the creation of new tools for emergency managers, decision makers, and affected communities. These tools typically incorporate a combination of automatic processing and crowdsourcing. The talk will also look at ethical issues of this line of research.

Upcoming Events


IDA Seminar:

Group: Section for Information, Data and Analysis (IDA)
Speaker: Sebastian Nowozin, Microsoft Research Cambridge
Date: 04 May, 2017
Time: 14:00 - 15:00
Location: Sir Alwyn Williams Building, 423 Seminar Room

Past Events

ProbUI: Generalising Touch Target Representations to Enable Declarative Gesture Definition for Probabilistic GUIs (20 April, 2017)

Speaker: Daniel Buschek

We present ProbUI, a mobile touch GUI framework that merges ease of use of declarative gesture definition with the benefits of probabilistic reasoning. It helps developers to handle uncertain input and implement feedback and GUI adaptations. ProbUI replaces today's static target models (bounding boxes) with probabilistic gestures ("bounding behaviours"). It is the first touch GUI framework to unite concepts from three areas of related work: 1) Developers declaratively define touch behaviours for GUI targets. As a key insight, the declarations imply simple probabilistic models (HMMs with 2D Gaussian emissions). 2) ProbUI derives these models automatically to evaluate users' touch sequences. 3) It then infers intended behaviour and target. Developers bind callbacks to gesture progress, completion, and other conditions. We show ProbUI's value by implementing existing and novel widgets, and report developer feedback from a survey and a lab study.
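
The core idea, evaluating a touch sequence against an HMM with 2D Gaussian emissions, can be illustrated with a minimal sketch in plain NumPy (this is not ProbUI's actual API; states, parameters and sequences below are made up). A sequence that matches a declared "bounding behaviour" scores a higher log-likelihood than one that departs from it:

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    """Log density of a 2D Gaussian with diagonal covariance."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def hmm_loglik(touches, means, variances, trans, start):
    """Forward-algorithm log-likelihood of a touch sequence (T x 2 array of
    x/y positions) under an HMM with per-state 2D Gaussian emissions."""
    S = len(start)
    alpha = np.log(start) + np.array(
        [gaussian_logpdf(touches[0], means[s], variances[s]) for s in range(S)])
    for t in range(1, len(touches)):
        emit = np.array(
            [gaussian_logpdf(touches[t], means[s], variances[s]) for s in range(S)])
        alpha = emit + np.array(
            [np.logaddexp.reduce(alpha + np.log(trans[:, s])) for s in range(S)])
    return np.logaddexp.reduce(alpha)

# Two hand-specified states: "on the button" and "slid off the button".
means = np.array([[0.0, 0.0], [5.0, 0.0]])
variances = np.ones((2, 2))
trans = np.array([[0.90, 0.10],    # mostly stay on the button...
                  [0.05, 0.95]])   # ...and mostly stay off once departed
start = np.array([0.99, 0.01])

tap = np.zeros((5, 2))                                        # rests on the button
slide = np.array([[0, 0], [1.5, 0], [3, 0], [4.5, 0], [5, 0]], dtype=float)
print(hmm_loglik(tap, means, variances, trans, start) >
      hmm_loglik(slide, means, variances, trans, start))      # → True
```

A full system would hold one such model per declared behaviour and infer the intended behaviour and target from the relative likelihoods.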

Information Foraging in Environments (31 March, 2017)

Speaker: Kevin Ong

Kevin is a PhD student in the ISAR Research Group at RMIT University, Australia. He has previously worked on logs from the National Archives UK, the Peter MacCallum Cancer Institute, the Westfield Group and Listcorp.

In this talk, he will present his work on information foraging in physical and virtual environments. The first part covers "Understanding information foraging in physical environments - a log analysis", and the second, "Information foraging in virtual environments - an observational study".

Semantic Search at Bloomberg. (27 March, 2017)

Speaker: Edgar Meij


Large-scale knowledge graphs (KGs) store relationships between entities that are increasingly being used to improve the user experience in search applications. At Bloomberg we are currently in the process of rolling out our own knowledge graph and in this talk I will describe some of the semantic search applications that we aim to support. In particular, I will be discussing some of our recent papers on context-specific entity recommendations and automatically generating textual descriptions for arbitrary KG relationships.


Dr. Edgar Meij is a senior scientist at Bloomberg. Before this, he was a research scientist at Yahoo Labs and a postdoc at the University of Amsterdam, where he also obtained his PhD. His research focuses on advancing the state of the art in semantic search at Web scale, by designing entity-oriented search systems that employ knowledge graphs, entity linking, NLP, and machine learning techniques to improve the user experience, search, and recommendations. He has co-authored 50+ peer-reviewed papers and regularly teaches at the post-graduate level, including university courses, summer schools, and conference tutorials.

Assessing User Engagement in Information Retrieval Systems (20 March, 2017)

Speaker: Mengdie Zhuang


In this study, we investigated both user actions from log files and the results of the User Engagement Scale, both drawn from a study of people interacting with a retrieval interface over an image collection, but with a non-purposeful task. Our results suggest that selected behaviour measures are associated with selected user perceptions of engagement (i.e., focused attention, felt involvement, and novelty), while typical search and browse measures have no association with aesthetics or perceived usability. This finding can lead towards a more systematic user-centred evaluation model.


Mengdie Zhuang is a PhD student from the University of Sheffield, UK. Her research focuses on evaluation metrics of Information Retrieval Systems.

Access, Search and Enrichment in Temporal Collections (06 March, 2017)

Speaker: Avishek Anand

There have been numerous efforts recently to digitize previously published content and preserving born-digital content leading to the widespread growth of large temporal text repositories. Temporal collections are continuously growing text collections which contain versions of documents spanning over long time periods and present many opportunities for historical, cultural and political analyses. Consequently there is a growing need for methods that can efficiently access, search and mine them. In this talk we deal with approaches in each of these aspects -- access, search and enrichment. First, I describe some of the access methods for searching temporal collections. Specifically, how do we index text to support temporal workloads? Secondly, I will describe retrieval models, which exploit historical information, essential in searching such collections. That is, how do we rank documents given temporal query intents? Finally, I will present some of the ongoing efforts to mine such collections for enriching Knowledge sources like Wikipedia.

A stochastic formulation of a dynamical singly constrained spatial interaction model (02 March, 2017)

Speaker: Mark Girolami

One of the challenges of 21st-century science is to model the evolution of complex systems. One example of practical importance is urban structure, for which the dynamics may be described by a series of non-linear first-order ordinary differential equations. Whilst this approach provides a reasonable model of spatial interaction, relevant in areas as diverse as public health and urban retail structure, it is somewhat restrictive owing to uncertainties arising in the modelling process.

We address these shortcomings by developing a dynamical singly constrained spatial interaction model, based on a system of stochastic differential equations. Our model is ergodic and the invariant distribution encodes our prior knowledge of spatio-temporal interactions. We proceed by performing inference and prediction in a Bayesian setting, and explore the resulting probability distributions with a position-specific Metropolis-adjusted Langevin algorithm. Insights from studies of retail structure within the city of London are used as illustration.
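
To illustrate the simulation side only, a system of stochastic differential equations of this general kind can be integrated with the Euler-Maruyama scheme. The drift below is a toy relaxation toward fixed levels, not the singly constrained spatial interaction term of the talk, and all numbers are made up:

```python
import numpy as np

def euler_maruyama(drift, sigma, x0, dt, n_steps, rng):
    """Integrate dX = drift(X) dt + sigma dW with the Euler-Maruyama scheme."""
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        dw = rng.normal(scale=np.sqrt(dt), size=x.shape)
        x = x + drift(x) * dt + sigma * dw
        path.append(x.copy())
    return np.array(path)

# Toy drift: each zone relaxes toward a fixed "attractiveness" level.
target = np.array([1.0, 2.0, 0.5])
drift = lambda x: -(x - target)

rng = np.random.default_rng(0)
path = euler_maruyama(drift, sigma=0.05, x0=[0.0, 0.0, 0.0],
                      dt=0.01, n_steps=2000, rng=rng)
print(np.round(path[-1], 1))  # settles near the target values, up to noise
```

Bayesian inference over such a model would then treat the drift's parameters as unknowns and explore their posterior, e.g. with the Langevin-based sampler mentioned in the abstract.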

Collaborative Information Retrieval. (27 February, 2017)

Speaker: Nyi Nyi Htun

Presentation of 2 papers to appear at CHIIR 2017.

Paper 1:

Title: How Can We Better Support Users with Non-Uniform Information Access in Collaborative Information Retrieval?

Abstract: The majority of research in Collaborative Information Retrieval (CIR) has assumed that collaborating team members have uniform information access. However, practice and research have shown that there may not always be uniform information access among team members, e.g. in healthcare, government, etc. To the best of our knowledge, there has not been a controlled user evaluation to measure the impact of non-uniform information access on CIR outcomes. To address this shortcoming, we conducted a controlled user evaluation using 2 non-uniform access scenarios (document removal and term blacklisting) and 1 full and uniform access scenario. Following this, a design interview was undertaken to provide interface design suggestions. Evaluation results show that neither of the 2 non-uniform access scenarios had a significant negative impact on collaborative and individual search outcomes. Design interview results suggested that awareness of the team’s query history and intersecting viewed/judged documents could potentially help users share their expertise without disclosing sensitive information.

Paper 2:

Title: An Interface for Supporting Asynchronous Multi-Level Collaborative Information Retrieval

Abstract: Case studies and observations from different domains including government, healthcare and legal, have suggested that Collaborative Information Retrieval (CIR) sometimes involves people with unequal access to information. This type of scenario has been referred to as Multi-Level CIR (MLCIR). In addition to supporting collaboration, MLCIR systems must ensure that there is no unintended disclosure of sensitive information; this is an under-investigated area of research. We present results of an evaluation of an interface we have designed for MLCIR scenarios. Pairs of participants used the interface under 3 different information access scenarios for a variety of search tasks. These scenarios included one CIR and two MLCIR scenarios, namely: full access (FA), document removal (DR) and term blacklisting (TR). Design interviews were conducted post evaluation to obtain qualitative feedback from participants. Evaluation results showed that our interface performed well for both DR and FA scenarios, but for TR, team members with less access had a negative influence on their partner’s search performance, demonstrating insights into how different MLCIR scenarios should be supported. Design interview results showed that our interface helped the participants to reformulate their queries, understand their partner’s performance, reduce duplicated work and review their team’s search history without disclosing sensitive information.

A Comparison of Document-at-a-Time and Score-at-a-Time Query Evaluation (14 February, 2017)

Speaker: Joel Mackenzie

We present an empirical comparison between document-at-a-time (DaaT) and score-at-a-time (SaaT) document ranking strategies within a common framework. Although both strategies have been extensively explored, the literature lacks a fair, direct comparison: such a study has been difficult due to vastly different query evaluation mechanics and index organizations. Our work controls for score quantization, document processing, compression, implementation language, implementation effort, and a number of details, arriving at an empirical evaluation that fairly characterizes the performance of three specific techniques: WAND (DaaT), BMW (DaaT), and JASS (SaaT). Experiments reveal a number of interesting findings. The performance gap between WAND and BMW is not as clear as the literature suggests, and both methods are susceptible to tail queries that may take orders of magnitude longer than the median query to execute. Surprisingly, approximate query evaluation in WAND and BMW does not significantly reduce the risk of these tail queries. Overall, JASS is slightly slower than either WAND or BMW, but exhibits much lower variance in query latencies and is much less susceptible to tail query effects. Furthermore, JASS query latency is not particularly sensitive to the retrieval depth, making it an appealing solution for performance-sensitive applications where bounds on query latencies are desirable.
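
For readers unfamiliar with the terminology, document-at-a-time traversal can be reduced to the following sketch (exhaustive scoring only; WAND and BMW follow the same traversal order but add upper-bound-based skipping, and the postings and scores here are illustrative):

```python
import heapq

def daat_topk(postings, k):
    """Exhaustive document-at-a-time top-k retrieval: visit candidate
    documents in doc-id order, fully scoring each before moving on. SaaT
    systems such as JASS instead traverse postings in impact order."""
    doc_ids = sorted({d for plist in postings.values() for d in plist})
    heap = []  # min-heap of (score, doc_id), size bounded by k
    for doc in doc_ids:
        score = sum(plist.get(doc, 0.0) for plist in postings.values())
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
    return sorted(heap, key=lambda p: -p[0])

postings = {  # term -> {doc_id: impact score}; values are illustrative
    "crisis": {1: 2.0, 3: 1.5, 7: 0.25},
    "data":   {1: 1.0, 2: 0.75, 7: 2.5},
}
print(daat_topk(postings, k=2))  # → [(3.0, 1), (2.75, 7)]
```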


Joel is a PhD candidate at RMIT University, Melbourne, Australia. He works with Dr J. Shane Culpepper and Assoc Prof. Falk Scholer on efficient and effective candidate generation for multi-stage retrieval. His research interests include index efficiency, multi-stage retrieval and distributed IR.

Unsupervised Event Extraction and Storyline Generation from Text (13 February, 2017)

Speaker: Dr. Yulan He

This talk consists of two parts. In the first part, I will present our proposed Latent Event and Categorisation Model (LECM), which is an unsupervised Bayesian model for the extraction of structured representations of events from Twitter without the use of any labelled data. The extracted events are automatically clustered into coherent event-type groups. The proposed framework has been evaluated on over 60 million tweets and has achieved a precision of 70%, outperforming the state-of-the-art open event extraction system by nearly 6%. The LECM model has been extended to jointly model event extraction and visualisation, which performs remarkably better than both the state-of-the-art event extraction method and a pipeline approach to event extraction and visualisation.

In the second part of my talk, I will present a non-parametric generative model to extract structured representations and evolution patterns of storylines simultaneously. In the model, each storyline is modelled as a joint distribution over some locations, organisations, persons, keywords and a set of topics. We further combine this model with the Chinese restaurant process so that the number of storylines can be determined automatically without human intervention. The proposed model is able to generate coherent storylines from new articles.

Yulan He is a Reader and Director of the Systems Analytics Research Institute at Aston University. She obtained her PhD degree in Spoken Language Understanding in 2004 from the University of Cambridge, UK. Prior to joining Aston, she was a Senior Lecturer at the Open University, a Lecturer at the University of Exeter and a Lecturer at the University of Reading. Her current research interests lie in the integration of machine learning and natural language processing for text mining and social media analysis. Yulan has published over 140 papers, most of which appeared in high-impact journals and at top conferences such as IEEE Transactions on Knowledge and Data Engineering, IEEE Intelligent Systems, KDD, CIKM, ACL, etc. She served as an Area Chair at NAACL 2016, EMNLP 2015, CCL 2015 and NLPCC 2015, and co-organised ECIR 2010 and IAPR 2007.

Applying Machine Learning to Data Exploration. (23 January, 2017)

Speaker: Charles Sutton

One of the first and most fundamental tasks in data mining is what we might call data understanding. Given a dump of data, what's in it? If modern machine learning methods are effective at finding patterns in data, then they should be effective at summarizing data sets so as to help data analysts develop a high-level understanding of them.

I'll describe several different approaches to this problem. First I'll describe a new approach to classic data mining problems, such as frequent itemset mining and frequent sequence mining, using a new principled model from probabilistic machine learning. Essentially, this casts the problem of pattern mining as one of structure learning in a probabilistic model. I'll describe an application to summarizing the usage of software libraries on Github.
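
For reference, the classic frequent itemset mining baseline the probabilistic approach is cast against can be sketched in a few lines of Apriori-style support counting (this is the standard method, not the structure-learning model proposed in the talk; the "library usage" baskets are toy examples in the spirit of the Github application):

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Apriori-style frequent itemset mining: count support level by level,
    growing candidates only from itemsets whose subsets were all frequent."""
    items = sorted({i for t in transactions for i in t})
    frequent, size = {}, 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        survivors = sorted({i for c in level for i in c})
        candidates = [frozenset(c) for c in combinations(survivors, size)
                      if all(frozenset(s) in level for s in combinations(c, size - 1))]
    return frequent

baskets = [frozenset(t) for t in
           [{"numpy", "scipy"}, {"numpy", "scipy", "pandas"}, {"numpy", "pandas"}]]
result = frequent_itemsets(baskets, min_support=2)
print(sorted((sorted(c), n) for c, n in result.items()))
```

A known weakness of this baseline, which motivates probabilistic alternatives, is that it returns a flood of redundant overlapping itemsets rather than a concise summary.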

A second attack on this general problem is based on cluster analysis. A good clustering can help a data analyst to explore and understand a data set, but what constitutes a good clustering may depend on domain-specific and application-specific criteria. I'll describe a new framework for interactive clustering that allows the analyst to examine a clustering and guide it in a way that is more useful for their information need.

Finally, topic modelling has proven to be a highly useful family of methods for data exploration, but it still requires a large amount of specialized effort to develop a new topic model for a specific data analysis scenario. I'll present new results on highly scalable inference for latent Dirichlet allocation based on recently proposed deep learning methods for probabilistic models.

Slides and relevant papers will be made available.

Rethinking eye gaze for human-computer interaction (19 January, 2017)

Speaker: Hans Gellersen

Eye movements are central to most of our interactions. We use our eyes to see and guide our actions and they are a natural interface that is reflective of our goals and interests. At the same time, our eyes afford fast and accurate control for directing our attention, selecting targets for interaction, and expressing intent. Even though our eyes play such a central part to interaction, we rarely think about the movement of our eyes and have limited awareness of the diverse ways in which we use our eyes --- for instance, to examine visual scenes, follow movement, guide our hands, communicate non-verbally, and establish shared attention. 

This talk will reflect on use of eye movement as input in human-computer interaction. Jacob's seminal work showed over 25 years ago that eye gaze is natural for pointing, albeit marred by problems of Midas Touch and limited accuracy. I will discuss new work on eye gaze as input that looks beyond conventional gaze pointing. This includes work on: gaze and touch, where we use gaze to naturally modulate manual input; gaze and motion, where we introduce a new form of gaze input based on the smooth pursuit movement our eyes perform when they follow a moving object; and gaze and games, where we explore social gaze in interaction with avatars and joint attention as multi-user input.

Hans Gellersen is Professor of Interactive Systems at Lancaster University. Hans' research interest is in sensors and devices for ubiquitous computing and human-computer interaction. He has worked on systems that blend physical and digital interaction, methods that infer context and human activity, and techniques that facilitate spontaneous interaction across devices. In recent work he is focussing on eye movement as a source of context information and modality for interaction. 

The Role of Relevance in Sponsored Search. (16 January, 2017)

Speaker: Fabrizio Silvestri

Sponsored search aims at retrieving the advertisements that, on the one hand, meet users’ intent as reflected in their search queries and, on the other hand, attract user clicks to generate revenue. Advertisements are typically ranked based on their expected revenue, computed as the product of their predicted probability of being clicked (namely, clickability) and their advertiser-provided bid. The relevance of an advertisement to a user query is implicitly captured by the predicted clickability of the advertisement, assuming that relevant advertisements are more likely to attract user clicks. However, this approach easily biases the ranking toward advertisements having rich click history. This may incorrectly lead to showing irrelevant advertisements whose clickability is not accurately predicted due to lack of click history. Another side effect consists of never giving a chance to new advertisements that may be highly relevant, due to their lack of click history. To address this problem, we explicitly measure the relevance between an advertisement and a query without relying on the advertisement’s click history, and present different ways of leveraging this relevance to improve user search experience without reducing search engine revenue. Specifically, we propose a machine learning approach that solely relies on text-based features to measure the relevance between an advertisement and a query. We discuss how the introduced relevance can be used in four important use cases: pre-filtering of irrelevant advertisements, recovering advertisements with little history, improving clickability prediction, and re-ranking of the advertisements on the final search result page. Offline experiments using large-scale query logs and online A/B tests demonstrate the superiority of the proposed click-oblivious relevance model and the important roles that relevance plays in sponsored search.
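
The ranking rule described above, expected revenue as predicted clickability times bid combined with a relevance pre-filter, can be sketched as follows (field names, scores and the threshold are illustrative, not the talk's actual system):

```python
def rank_ads(ads, relevance_threshold):
    """Rank ads by expected revenue (pCTR x bid) after pre-filtering those
    whose click-oblivious, text-based relevance falls below a threshold."""
    eligible = [ad for ad in ads if ad["relevance"] >= relevance_threshold]
    return sorted(eligible, key=lambda ad: ad["pctr"] * ad["bid"], reverse=True)

ads = [
    {"id": "a", "pctr": 0.10, "bid": 2.0, "relevance": 0.9},  # rich click history
    {"id": "b", "pctr": 0.02, "bid": 5.0, "relevance": 0.2},  # irrelevant, filtered
    {"id": "c", "pctr": 0.05, "bid": 3.0, "relevance": 0.8},  # new but relevant
]
print([ad["id"] for ad in rank_ads(ads, relevance_threshold=0.5)])  # → ['a', 'c']
```

The point of the pre-filter is visible in the example: ad "b" would rank on bid alone despite low relevance, while the relevant newcomer "c" survives even though its click history (and hence its pCTR estimate) is thin.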

Working toward computer generated music traditions (12 January, 2017)

Speaker: Bob Sturm

I will discuss research aimed at making computers intelligent and sensitive enough to work with music data, whether acoustic or symbolic. Invariably, this includes a lot of work in applying machine learning to music collections in order to divine distinguishing and identifiable characteristics of practices that defy strict definition. Many of the resulting machine music listening systems appear to be musically sensitive and intelligent, but their fraudulent ways can be revealed when they are used to create music in the styles they have been taught to identify. Such "evaluation by generation” is a powerful way to gauge the generality of what a machine has learned to do. I will present several examples, focusing in particular on our work applying deep LSTM networks to modelling folk music transcriptions, and ultimately generating new music traditions.



Studies of Disputed Authorship (09 January, 2017)

Speaker: Michael P. Oakes

Automatic author identification is a branch of computational stylometry, which is the computer analysis of writing style. It is based on the idea that an author’s style can be described by a unique set of textual features, typically the frequency of use of individual words, but sometimes considering the use of higher level linguistic features. Disputed authorship studies assume that some of these features are outside the author’s conscious control, and thus provide a reliable means of discriminating between individual authors. Many studies have successfully made use of high frequency function words like “the”, “of” and “and”, which tend to have grammatical functions rather than reveal the topic of the text. Their usage is unlikely to be consciously regulated by authors, but varies substantially between authors, texts, and even individual characters in Jane Austen’s novels. Using stylometric techniques, Oakes and Pichler (2013) were able to show that the writing style of the document “Diktat für Schlick” was much more similar to that of Wittgenstein than that of other philosophers of the Vienna Circle. Michael Oakes is currently researching the authorship of “The Dark Tower”, normally attributed to C. S. Lewis.
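
A minimal version of the function-word approach might look like this (the word list, texts and similarity measure are toy examples for illustration, not the features or data of the studies mentioned):

```python
import math
from collections import Counter

# High-frequency function words: largely topic-independent, which is what
# makes them useful for discriminating between authors.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it"]

def profile(text):
    """Relative frequencies of the function words: a crude stylometric
    fingerprint of the kind described above."""
    tokens = text.lower().split()
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    return [counts[w] / len(tokens) for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity between two frequency profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Attribution: compare the disputed text's profile with each candidate
# author's profile and pick the most similar.
disputed = "the tower stood at the edge of the wood and it was dark"
candidates = {
    "author_a": "the ship sailed out of the bay and it turned to the north",
    "author_b": "rain fell hard on rooftops while children slept soundly below",
}
best = max(candidates, key=lambda a: cosine(profile(disputed), profile(candidates[a])))
print(best)  # → author_a
```

Real studies use much longer texts, many more features, and statistical tests of significance rather than a single similarity score.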

Satisfying User Needs or Beating Baselines? Not always the same. (12 December, 2016)

Speaker: Walid Magdy

Information retrieval (IR) is mainly concerned with retrieving relevant documents to satisfy the information needs of users. Many IR tasks involving different genres and search scenarios have been studied for decades. Typically, researchers aim to improve retrieval effectiveness beyond the current “state-of-the-art”. However, revisiting the modeling of the IR task itself is often essential before seeking improvement of results. This includes reconsidering the assumed search scenario, the approach used to solve the problem, or even the conducted evaluation methodology. In this talk, some well-known IR tasks are explored to demonstrate that beating the state-of-the-art baseline is not always sufficient. Novel modeling, understanding, or approach to IR tasks could lead to significant improvements in user satisfaction compared to just improving “objective” retrieval effectiveness. The talk includes example IR tasks, such as printed document search, patent search, speech search, and social media search.

Supporting Evidence-based Medicine with Natural Language Processing (28 November, 2016)

Speaker: Dr. Mark Stevenson

The modern evidence-based approach to medicine is designed to ensure that patients are given the best possible care by basing treatment decisions on robust evidence. But the huge volume of information available to medical and health policy decision makers can make it difficult for them to decide on the best approach. Much of the current medical knowledge is stored in textual format and providing tools to help access it represents a significant opportunity for Natural Language Processing and Information Retrieval. However, automatically processing documents in this domain is not straightforward and doing so successfully requires a range of challenges to be overcome, including dealing with volume, ambiguity, complexity and inconsistency.  This talk will present a range of approaches from Natural Language Processing that support access to medical information. It will focus on three tasks: Word Sense Disambiguation, Relation Extraction and Contradiction Identification. The talk will outline the challenges faced when developing approaches for accessing information contained in medical documents, including the lack of available gold standard data to train systems. It will show how existing resources can help alleviate this problem by providing information that allows training data to be created automatically.

SHIP: The Single-handed Interaction Problem in Mobile and Wearable Computing (24 November, 2016)

Speaker: Hui-Shyong Yeo

Screen sizes on devices are becoming both smaller (e.g. smartwatches and music players) and larger (e.g. phablets and tablets). Each of these trends can make devices difficult to use with only one hand (e.g. the fat-finger or reachability problems). This Single-Handed Interaction Problem (SHIP) is not new, but it has been evolving with the growth of larger and smaller interaction surfaces. The problem is exacerbated when the other hand is occupied (encumbered) or not available (missing fingers/limbs). The use of voice commands or wrist gestures can be less robust, or perceived as awkward in public.

This talk will discuss several projects (RadarCat UIST 2016, WatchMI MobileHCI 2016, SWIM and WatchMouse) in which we are working towards achieving/supporting effective single-handed interaction for mobile and wearable computing. The work focusses on novel interaction techniques that are not being explored thoroughly for interaction purposes, using ubiquitous sensors that are widely available such as IMU, optical sensor and radar (eg. Google Soli, soon to be available).


Hui-Shyong Yeo is a second year PhD student in SACHI, University of St Andrews, advised by Prof. Aaron Quigley. Before that he worked as a researcher in KAIST for one year. Yeo has a wide range of interest within the field of HCI, including topics such as wearable, gestures, mixed reality and text entry. Currently he is focusing on single-handed interaction for his dissertation topic. He has published in conferences such as CHI, UIST, MobileHCI (honourable mention), SIGGRAPH and journals such as MTAP and JNCA.

Visit his homepage or twitter @hci_research

Demo of Google Soli Radar and Single Handed Smartwatch interaction (24 November, 2016)

Speaker: Hui-Shyong Yeo

This demo session will present the Google Soli Radar and single-handed smartwatch interaction systems.



IDA coffee breaks (22 November, 2016)

Speaker: everyone

A chance to catch up informally with members of the IDA section in the Computing Science Common Room.

Human Computation for Entity-Centric Information Access (21 November, 2016)

Speaker: Dr. Gianluca Demartini

Human Computation is a novel approach used to obtain manual data processing at scale by means of crowdsourcing. In this talk we will start by introducing the dynamics of crowdsourcing platforms and provide examples of their use to build hybrid human-machine information systems. We will then present ZenCrowd: a hybrid system for entity linking and data integration problems over linked data, showing how the use of human intelligence at scale in combination with machine-based algorithms outperforms traditional systems. In this context, we will then discuss efficiency and effectiveness challenges of micro-task crowdsourcing platforms, including spam, quality control, and job scheduling in crowdsourcing.

IDA coffee breaks (15 November, 2016)

Speaker: everyone

A chance to catch up informally with members of the IDA section in the Computing Science Common Room.

Control Theoretical Models of Pointing (11 November, 2016)

Speaker: Rod Murray-Smith

I will present an empirical comparison of four models from manual control theory on their ability to model targeting behaviour by human users using a mouse: McRuer's Crossover, Costello's Surge, second-order lag (2OL), and the Bang-bang model. Such dynamic models are generative, estimating not only movement time, but also pointer position, velocity, and acceleration on a moment-to-moment basis. We describe an experimental framework for acquiring pointing actions and automatically fitting the parameters of mathematical models to the empirical data. We present the use of time-series, phase space and Hooke plot visualisations of the experimental data, to gain insight into human pointing dynamics. We find that the identified control models can generate a range of dynamic behaviours that captures aspects of human pointing behaviour to varying degrees. Conditions with a low index of difficulty (ID) showed poorer fit because their unconstrained nature leads naturally to more dynamic variability. We report on characteristics of human surge behaviour in pointing.

We report differences in a number of controller performance measures, including Overshoot, Settling time, Peak time, and Rise time. We describe trade-offs among the models. We conclude that control theory offers a promising complement to Fitts' law based approaches in HCI, with models providing representations and predictions of human pointing dynamics which can improve our understanding of pointing and inform design.
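
One of the four models, the second-order lag (2OL), can be sketched as a damped spring pulling the pointer toward the target; the talk fits such parameters to recorded mouse trajectories, whereas the values below are made up for illustration:

```python
def second_order_lag(target, k, d, dt=0.01, n_steps=500):
    """Simulate a 2OL pointing movement: pointer acceleration is
    a = k * (target - x) - d * v, i.e. a spring toward the target with
    velocity damping. Returns the pointer position over time."""
    x, v, xs = 0.0, 0.0, []
    for _ in range(n_steps):
        a = k * (target - x) - d * v
        v += a * dt
        x += v * dt
        xs.append(x)
    return xs

traj = second_order_lag(target=100.0, k=30.0, d=8.0)
overshoot = max(traj) - 100.0   # one of the performance measures reported
print(round(traj[-1], 1))       # → 100.0 (the pointer has settled)
```

Because the model is generative, the same simulation yields position, velocity and acceleration at every moment, which is what enables the Overshoot, Settling-time, Peak-time and Rise-time comparisons mentioned above.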

IDA coffee breaks (08 November, 2016)

Speaker: everyone

A chance to catch up informally with members of the IDA section in the Computing Science Common Room.

Analysis of the Cost and Benefits of Search Interactions (07 November, 2016)

Speaker: Dr. Leif Azzopardi

Interactive Information Retrieval (IR) systems often provide various features and functions, such as query suggestions and relevance feedback, that a user may or may not decide to use. The decision to take such an option has associated costs and may lead to some benefit. Thus, a savvy user would take decisions that maximise their net benefit. In this talk, we will go through a number of formal models which examine the costs and benefits of various decisions that users, implicitly or explicitly, make when searching. We consider and analyse the following scenarios: (i) how long should a user's query be? (ii) should the user pose a specific or vague query? (iii) should the user take a suggestion or re-formulate? (iv) when should a user employ relevance feedback? and (v) when would the "find similar" functionality be worthwhile to the user? To this end, we analyse a series of cost-benefit models exploring a variety of parameters that affect the decisions at play. Through the analyses, we are able to draw a number of insights into different decisions, provide explanations for observed behaviours and generate numerous testable hypotheses. This work not only serves as a basis for future empirical work, but also as a template for developing other cost-benefit models involving human-computer interaction.
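The flavour of such a cost-benefit comparison can be shown with a toy calculation; the probabilities, gains and costs below are invented for illustration and are not taken from the paper:

```python
def net_benefit(prob_relevant, gain_per_relevant, cost):
    """Expected net benefit of a search action: expected gain minus cost."""
    return prob_relevant * gain_per_relevant - cost

# Hypothetical comparison of scenario (ii): a specific query costs more
# effort to formulate, but is more likely to retrieve relevant results.
vague = net_benefit(prob_relevant=0.3, gain_per_relevant=10.0, cost=1.0)
specific = net_benefit(prob_relevant=0.6, gain_per_relevant=10.0, cost=2.5)

# A "savvy" user picks whichever action has the higher net benefit.
best = "specific" if specific > vague else "vague"
```

Varying the cost and probability parameters is what generates the testable hypotheses the abstract refers to: each crossing point where the preferred action flips is a prediction about user behaviour.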

This talk is based on the recent ICTIR 2016 paper with Guido Zuccon:

IDA coffee breaks (01 November, 2016)

Speaker: everyone

A chance to catch up informally with members of the IDA section in the Computing Science Common Room.

I'm an information scientist - let me in! (31 October, 2016)

Speaker: Martin White

For the last 46 years Martin has been a professional information scientist, though often in secret. Since founding Intranet Focus Ltd he has found that his clients' awareness of research into topics such as information behaviour, information quality and information seeking is close to zero. This is especially true in information retrieval. In his presentation Martin will consider why this is the case, what the impact might be, and what (if anything) should and could be done to change this situation.

IDA coffee breaks (25 October, 2016)

Speaker: everyone

A chance to catch up informally with members of the IDA section in the Computing Science Common Room.

The problem of quantification in Information Retrieval and on Social Networks. (17 October, 2016)

Speaker: Gianni Amati

There is growing interest in knowing how fast information spreads on social networks, how many unique users are participating in an event, and the leading opinion polarity in a stream. Quantifying distinct elements in a flow of information is thus becoming a crucial problem in many real-time information retrieval and streaming applications. We discuss the state of the art of quantification and show that many problems can be interpreted within a common framework. We introduce a new probabilistic framework for quantification and show, as examples, how to count opinions in a stream and how to compute the degrees of separation of a network.
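A classical probabilistic approach to counting distinct elements in a stream is the Flajolet-Martin sketch; the minimal version below is offered as background for the kind of problem the talk addresses, not as the speaker's own framework:

```python
import hashlib

def trailing_zeros(n):
    """Number of trailing zero bits in n (returns 0 for n == 0)."""
    tz = 0
    while n > 0 and n & 1 == 0:
        n >>= 1
        tz += 1
    return tz

def fm_estimate(stream):
    """Flajolet-Martin estimate of the number of distinct items.

    Hash each item and track the maximum number of trailing zero bits R
    observed in any hash value; 2**R estimates the distinct count (up to
    a correction factor, omitted in this sketch). Memory use is constant
    regardless of stream length.
    """
    max_tz = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        max_tz = max(max_tz, trailing_zeros(h))
    return 2 ** max_tz

# 1000 distinct user ids, each seen three times in the stream
stream = [f"user{i}" for i in range(1000)] * 3
estimate = fm_estimate(stream)
```

The estimator has high variance on its own; practical systems average many such sketches (or use refinements like HyperLogLog) to tighten the estimate.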

Analytics over Parallel Multi-view Data (03 October, 2016)

Speaker: Dr. Deepak Padmanabhan

Conventional unsupervised data analytics techniques have largely focused on processing datasets of single-type data, e.g., text, ECG, sensor readings or image data. With increasing digitization, it has become common to have data objects with representations that encompass different "kinds" of information. For example, the same disease condition may be identified through EEG or fMRI data. Thus, a dataset of EEG-fMRI pairs would be considered a parallel two-view dataset. Datasets of text-image pairs (e.g., a description of a seashore, and an image of it) and text-text pairs (e.g., problem-solution text, or multi-language text from machine translation scenarios) are other common instances of multi-view data. The challenge in multi-view data analytics is to effectively leverage such parallel multi-view data to perform analytics tasks such as clustering, retrieval and anomaly detection. This talk will cover some emerging trends in processing parallel multi-view data and the different paradigms for doing so. In addition to surveying the different schools of techniques, with some specific techniques from each school, the talk will present some possibilities for future work in this area.


Dr. Deepak Padmanabhan is a lecturer with the Centre for Data Sciences and Scalable Computing at Queen's University Belfast. He obtained his B.Tech in Comp. Sc. and Engg. from Cochin University (Kerala, India), followed by his M.Tech and PhD, all in computer science, from Indian Institute of Technology Madras. Prior to joining Queen's, he was a researcher at IBM Research - India. He has over 40 publications across top venues in Data Mining, NLP, Databases and Information Retrieval. He co-authored a book on Operators for Similarity Search, published by Springer in 2015. He is the author of ~15 patent applications to the USPTO, including 4 granted patents. He is a recipient of the INAE Young Engineer Award 2015, and is a Senior Member of the ACM and the IEEE. His research interests include Machine Learning, Data Mining, NLP, Databases and Information Retrieval.

Improvising minds: Improvisational interaction and cognitive engagement (29 August, 2016)

Speaker: Adam Linson

In this talk, I present my research on improvisation as a general form of adaptive expertise. My interdisciplinary approach takes music as a tractable domain for empirical studies, which I have used to ground theoretical insights from HCI, AI/robotics, psychology, and embodied cognitive science. I will discuss interconnected aspects of digital musical instrument (DMI) interface design, a musical robotic AI system, and a music psychology study of sensorimotor influences on perceptual ambiguity. I will also show how I integrate this work with an inference-based model of neural functioning, to underscore implications beyond music. On this basis, I indicate how studies of musical improvisation can shed light on a domain-general capacity: our flexible, context-sensitive responsiveness to rapidly-changing environmental conditions.


Recognizing manipulation actions through visual accelerometer tracking, relational histograms, and user adaptation (26 August, 2016)

Speaker: Sebastian Stein

Activity recognition research in computer vision and pervasive computing has followed a remarkable trajectory from distinguishing full-body motion patterns to recognizing complex activities. Manipulation activities, such as those occurring in food preparation, are particularly challenging to recognize, as they involve many different objects and non-unique task orders, and are subject to personal idiosyncrasies. Video data and data from embedded accelerometers provide complementary information, which motivates an investigation of effective methods for fusing these sensor modalities.

In this talk I present a method for multi-modal recognition of manipulation activities that combines accelerometer data and video at multiple stages of the recognition pipeline. A method for accelerometer tracking is introduced that provides, for each accelerometer-equipped object, a location estimate in the camera view by identifying a point trajectory that matches the accelerometer data well. It is argued that associating accelerometer data with locations in the video provides a key link for modelling interactions between accelerometer-equipped objects and other visual entities in the scene. Estimates of accelerometer locations and their visual displacements are used to extract two new types of features: (i) Reference Tracklet Statistics, which characterise statistical properties of an accelerometer's visual trajectory, and (ii) RETLETS, a feature representation that encodes relative motion, using an accelerometer's visual trajectory as a reference frame for dense tracklets. In comparison to a traditional sensor fusion approach, where features are extracted from each sensor type independently and concatenated for classification, it is shown that combining RETLETS and Reference Tracklet Statistics with those sensor-specific features performs considerably better.

Specifically addressing scenarios in which a recognition system would be primarily used by a single person (e.g., cognitive situational support), this work investigates three methods for adapting activity models to a target user based on user-specific training data. Via randomized control trials it is shown that these methods indeed learn user idiosyncrasies.

The whole is greater than the sum of its parts: how semantic trajectories and recommendations may help tourism. (22 August, 2016)

Speaker: Dr. Chiara Renso

During the first part of this talk I will give an overview of my recent activity in the field of mobility data mining, with particular interest in the study of semantics in trajectory data and the experience with the recently concluded SEEK Marie Curie project [1]. Then I will present two highlights of tourism recommendation work based on the idea of semantic trajectories: TripBuilder [2] and GroupFinder [3]. TripBuilder is based on the analysis of enriched tourist trajectories extracted from Flickr photos to suggest itineraries constrained by a temporal budget and based on the traveller's preferences. The GroupFinder framework recommends a group of friends with whom to enjoy a visit to a place, balancing the friendship relations of the group members with the user's individual interests in the destination location.

[2] Igo Ramalho Brilhante, José Antônio Fernandes de Macêdo, Franco Maria Nardini, Raffaele Perego,Chiara Renso. On planning sightseeing tours with TripBuilder. Inf. Process. Manage. 51(2): 1-15 (2015)
[3]  Igo Ramalho Brilhante, José Antônio Fernandes de Macêdo, Franco Maria Nardini, Raffaele Perego,Chiara Renso. Group Finder: An Item-Driven Group Formation Framework. MDM 2016: 8-17


Dr. Chiara Renso holds a PhD and M.Sc. degree in Computer Science from the University of Pisa (1992, 1997).  She is a permanent researcher at the ISTI Institute of CNR, Italy.  Her research interests are related to spatio-temporal data mining, reasoning, data mining query languages, semantic data mining, and trajectory data mining.  She has been involved in several EU projects on mobility data mining.  She has been the scientific coordinator of an FP7 Marie-Curie project on semantic trajectories knowledge discovery called SEEK (  She was also coordinator of a bilateral CNR-CNPQ Italy-Brazil project on mobility data mining with the Federal University of Ceará.  She is author of more than 90 peer-reviewed publications.  She is co-editor of the book "Mobility Data: Modelling, Management, and Understanding" published by Cambridge Press in 2013; co-editor of the special issue of the Journal on Knowledge and Information Systems (KAIS) on context-aware data mining; and co-editor of a special issue of the International Journal of Knowledge and Systems Science (IJKSS) on modelling tools for extracting useful knowledge and decision making.  She has been co-chair of three editions of the Workshop on Semantic Aspects of Data Mining in conjunction with the IEEE ICDM conference.  She is a regular reviewer for ACM CIKM, ACM KDD, ACM SIGSPATIAL and many journals on these topics.

Skin Reading: Encoding Text in a 6-Channel Haptic Display (11 August, 2016)

Speaker: Granit Luzhnica

In this talk I will present a study we performed to investigate the communication of natural language messages using a wearable haptic display. Our research experiments investigated both the design of the haptic display and the methods for communication that use it. First, three wearable configurations are proposed based on haptic perception fundamentals and evaluated in a first study. To encode symbols, we use an overlapping spatiotemporal stimulation (OST) method that distributes stimuli spatially and temporally with a minimal gap. Second, we propose an encoding for the entire English alphabet and a training method for letters, words and phrases. A second study investigates communication accuracy. It puts four participants through five sessions, for an overall training time of approximately 5 hours per participant.

Casual Interaction for Smartwatch Feedback and Communication (01 July, 2016)

Speaker: Henning Pohl

Casual interaction strives to enable people to scale back their engagement with interactive systems while retaining the level of control they desire. In this talk, we will take a look at two recent developments in casual interaction systems. The first project to be presented is an indirect visual feedback system for smartwatches. Embedding LEDs into the back of a watch case enabled us to create a form of feedback that is less disruptive than vibration feedback and blends in with the body. We investigated how well such subtle feedback works in an in-the-wild study, which we will take a closer look at in this talk. Where the first project is a more casual form of feedback, the second project supports a more casual form of communication: emoji. Over the last years these characters have become more and more popular, yet entering them can take quite some effort. We have developed a novel emoji keyboard based on zooming interaction that makes it easier and faster to enter emoji.

Predicting Ad Quality for Native Advertisements (06 June, 2016)

Speaker: Dr Ke Zhou

Native advertising is a specific form of online advertising where ads replicate the look-and-feel of their serving platform. In such context, providing a good user experience with the served ads is crucial to ensure long-term user engagement. 


In this talk, I will explore the notion of ad quality, namely the effectiveness of advertising from a user experience perspective. I will discuss predicting quality for native ads from both the pre-click and post-click perspectives. With respect to pre-click ad quality, we design a learning framework to detect offensive native ads, showing that, to quantify ad quality, ad offensive user feedback rates are more reliable than the commonly used click-through rate metrics. We translate a set of user preference criteria into a set of ad quality features that we extract from the ad text, image and advertiser, and then use them to train a model able to identify offensive ads. In terms of post-click quality, we use ad landing page dwell time as our proxy and exploit various ad landing page features to predict landing pages with high dwell time.

Efficient Web Search Diversification via Approximate Graph Coverage (25 April, 2016)

Speaker: Carsten Eickhoff


In the case of general or ambiguous Web search queries, retrieval systems rely on result set diversification techniques in order to ensure an adequate coverage of underlying topics such that the average user will find at least one of the returned documents useful. Previous attempts at result set diversification employed computationally expensive analyses of document content and query intent. In this paper, we instead rely on the inherent structure of the Web graph. Drawing from the locally dense distribution of similar topics across the hyperlink graph, we cast the diversification problem as optimizing coverage of the Web graph. In order to reduce the computational burden, we rely on modern sketching techniques to obtain highly efficient yet accurate approximate solutions. Our experiments on a snapshot of Wikipedia as well as the ClueWeb'12 dataset show ranking performance and execution times competitive with the state of the art at dramatically reduced memory requirements.
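The greedy coverage step at the heart of such an approach can be sketched as follows. This toy version computes coverage exactly over explicit neighbour sets, whereas the paper approximates coverage with sketching techniques to keep memory and time low:

```python
def diversify(candidates, neighbours, k):
    """Greedy result-set diversification by graph coverage.

    candidates: ranked list of document ids.
    neighbours: dict mapping each document to the set of Web-graph nodes
    it covers (e.g. its hyperlink neighbourhood). At each step, pick the
    candidate covering the most not-yet-covered nodes.
    """
    covered, selected = set(), []
    for _ in range(k):
        best = max(
            (d for d in candidates if d not in selected),
            key=lambda d: len(neighbours[d] - covered),
        )
        selected.append(best)
        covered |= neighbours[best]
    return selected

# Hypothetical mini-example: d2 is largely redundant with d1, while d3
# covers a different region of the graph (a different underlying topic).
neighbours = {
    "d1": {1, 2, 3},
    "d2": {2, 3},
    "d3": {4, 5},
}
top = diversify(["d1", "d2", "d3"], neighbours, k=2)
```

Because similar topics are locally dense in the hyperlink graph, maximising node coverage tends to spread the selected documents across topics, which is exactly the diversification goal.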

Searching for better health: challenges and implications for IR (04 April, 2016)

Speaker: Dr. Guido Zuccon
A talk about why IR researchers should care about health search

In this talk I will discuss research problems and possible solutions related to helping the general public searching for health information online. I will show that although in the first instance this appears to be a domain-specific search task, research problems associated with this task have more general implications for IR and offer opportunities to develop advances that are applicable to the whole research field. In particular, in the talk I will focus on two aspects related to evaluation: (1) the inclusion of multiple dimensions of relevance in the evaluation of IR systems and (2) the modelling of query variations within the evaluation framework.

A Comparison of Primary and Secondary Relevance Judgements for Real-Life Topics (07 March, 2016)

Speaker: Dr Martin Halvey

The notion of relevance is fundamental to the field of Information Retrieval. Within the field a generally accepted conception of relevance as inherently subjective has emerged, with an individual's assessment of relevance influenced by numerous contextual factors. In this talk I present a user study that examines in detail the differences between primary and secondary assessors on a set of "real-world" topics which were gathered specifically for the work. By gathering topics which are representative of the staff and students at a major university, at a particular point in time, we aim to explore differences between primary and secondary relevance judgements for real-life search tasks. Findings suggest that while secondary assessors may find the assessment task challenging in various ways (they generally possess less interest and knowledge in secondary topics and take longer to assess documents), agreement between primary and secondary assessors is high.  
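Inter-assessor agreement of the kind reported here is commonly measured with Cohen's kappa, which corrects raw agreement for chance. The sketch below uses invented binary judgements and is illustrative background, not the measure necessarily used in the study:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two assessors' binary relevance judgements."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each assessor's marginal rate of judging "1".
    pa1, pb1 = sum(a) / n, sum(b) / n
    expected = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    return (observed - expected) / (1 - expected)

# Hypothetical judgements: 1 = relevant, 0 = not relevant.
primary   = [1, 1, 0, 0, 1, 0, 1, 0]
secondary = [1, 1, 0, 0, 1, 0, 0, 0]
kappa = cohens_kappa(primary, secondary)
```

A kappa near 1 indicates agreement well above chance; values near 0 mean the assessors agree no more often than random labelling would predict.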

Steps towards Profile-Based Web Site Search and Navigation (29 February, 2016)

Speaker: Prof. Udo Kruschwitz
Steps towards Profile-Based Web Site Search and Navigation

Web search in all its flavours has been the focus of research for decades with thousands of highly paid researchers competing for fame. Web site search has however attracted much less attention but is equally challenging. In fact, what makes site search (as well as intranet and enterprise search) even more interesting is that it shares some common problems with general Web search but also offers a good number of additional problems that need to be addressed in order to make search on a Web site no longer a waste of time. At previous visits to Glasgow I talked about turning the log files collected on a Web site into usable, adaptive data structures that can be used in search applications (and which we call user or cohort profiles). This time I will focus on applying these profiles to a navigation scenario and illustrate how the automatically acquired profiles provide a practical use case for combining natural language processing and information retrieval techniques (as that is what we really do at Essex).

Sentiment and Preference Guided Social Recommendation. (22 February, 2016)

Speaker: Yoke Yie Chen
In this talk, I will focus on two knowledge sources for product recommendation: product reviews and user purchase trails.

Social recommender systems harness knowledge from social media to generate recommendations. Previous works in social recommender systems use social knowledge such as social tags, social relationship (social network) and microblogs.  In this talk, I will focus on two knowledge sources for product recommendation: product reviews and user purchase trails. In particular, I will present how we exploit the sentiment expressed in product reviews and user preferences which are implicitly contained in user purchase trails as the basis for recommendation.

Recent Advances in Search Result Diversification for the Web and Social Media (17 February, 2016)

Speaker: Ismail Sengor Altingovde
I will focus on the web search result diversification problem and present our novel contributions in the field.

In this talk, I will start with a short potpourri of our most recent research, with emphasis on topics related to web search engines and the social Web. Then, I will focus on the web search result diversification problem and present our novel contributions in three directions. Firstly, I will present how the normalization of query relevance scores can boost the performance of state-of-the-art explicit diversification strategies. Secondly, I will introduce a set of new explicit diversification strategies based on score-based and rank-based aggregation methods. As a third contribution, I will present how query performance prediction (QPP) can be utilized to weight query aspects during diversification. Finally, I will discuss how these diversification methods perform in the context of Tweet search, and how we improve them using word embeddings.

Practical and theoretical problems on the frontiers of multilingual natural language processing (16 February, 2016)

Speaker: Dr Adam Lopez
Multilingual natural language processing (NLP) has been enormously successful over the last decade, highlighted by offerings like Google translate. What is left to do?

Multilingual natural language processing (NLP) has been enormously successful over the last decade, highlighted by offerings like Google translate. What is left to do? I'll focus on two quite different, very basic problems that we don't yet know how to solve. The first is motivated by the development of new, massively-parallel hardware architectures like GPUs, which are especially tantalizing for computation-bound NLP problems, and may open up new possibilities for the application and scale of NLP. The problem is that classical NLP algorithms are inherently sequential, so harnessing the power of such processors requires completely rethinking the fundamentals of the field. The second is motivated by the fact that NLP systems often fail to correctly understand, translate, extract, or generate meaning. We're poised to make serious progress in this area using the reliable method of applying machine learning to large datasets: in this case, large quantities of natural language text annotated with explicit meaning representations, which take the form of directed acyclic graphs. The problem is that probabilities on graphs are surprisingly hard to define. I'll discuss work on both of these fronts.

Information retrieval challenges in conducting systematic reviews (08 February, 2016)

Speaker: Julie Glanville
The presentation will also describe other areas where software such as text mining and machine learning have potential to contribute to the Systematic Review process

Systematic review (SR) is a research method that seeks to provide an assessment of the state of the research evidence on a specific question.  Systematic reviews (SRs) aim to be objective, transparent and replicable and seek to minimise bias by means of extensive  searches.


The challenges of extensive searching will be summarised.  As software tools and internet interconnectivity increase, we are seeing increasing use of a range of tools within the SR process (not only for information retrieval).  This presentation will present some  of the tools we are currently using within the Cochrane SR community and UK SRs, and the challenges which remain for efficient information retrieval.  The presentation will also describe other areas where software such as text mining and machine learning have potential to contribute to the SR process.

Learning to Hash for Large Scale Image Retrieval (14 December, 2015)

Speaker: Sean Moran

In this talk I will introduce two novel data-driven models that significantly improve the retrieval effectiveness of locality sensitive hashing (LSH), a popular randomised algorithm for nearest neighbour search that permits relevant data-points to be retrieved in constant time, independent of the database size.

To cut down the search space LSH generates similar binary hashcodes for similar data-points and uses the hashcodes to index database data-points into the buckets of a set of hashtables. At query time only those data-points that collide in the same hashtable buckets as the query are returned as candidate nearest neighbours. LSH has been successfully used for event detection in large scale streaming data such as Twitter [1] and for detecting 100,000 object classes on a single CPU [2].


The generation of similarity preserving binary hashcodes comprises two steps: projection of the data-points onto the normal vectors of a set of hyperplanes partitioning the input feature space followed by a quantisation step that uses a single threshold to binarise the resulting projections to obtain the hashcodes. In this talk I will argue that the retrieval effectiveness of LSH can be significantly improved by learning the thresholds and hyperplanes based on the distribution of the input data.
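The projection and single-threshold quantisation steps described above can be sketched as follows for random-hyperplane LSH. The dimensionality, bit count and random seed are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_hash(points, hyperplanes):
    """Hash points to binary codes: project each point onto every
    hyperplane's normal vector, then binarise with a zero threshold
    (one bit per hyperplane)."""
    return (points @ hyperplanes.T > 0).astype(int)

dim, n_bits = 16, 8
# Data-independent hyperplanes: normals drawn at random, threshold fixed
# at zero. These are the two assumptions the talk proposes to relax by
# learning thresholds and hyperplanes from the data distribution.
hyperplanes = rng.standard_normal((n_bits, dim))

x = rng.standard_normal(dim)
near = x + 0.01 * rng.standard_normal(dim)   # a close neighbour of x
far = rng.standard_normal(dim)               # an unrelated point

codes = lsh_hash(np.stack([x, near, far]), hyperplanes)
ham_near = int((codes[0] != codes[1]).sum())
ham_far = int((codes[0] != codes[2]).sum())
```

Similar points land on the same side of most hyperplanes and so share most hashcode bits, which is what makes the hashcodes usable as hashtable bucket keys.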


In the first part of my talk I will provide a high level introduction of LSH. I will then argue that LSH makes a set of limiting assumptions arising from its data-independence that hamper its retrieval effectiveness. This motivates the second and third parts of my talk in which I introduce two new models that address these limiting assumptions. 


Firstly, I will discuss a scalar quantisation model that can learn multiple thresholds per LSH hyperplane using a novel semi-supervised objective function [3]. Optimising this objective function results in thresholds that reduce information loss inherent in converting the real-valued projections to binary. Secondly, I will introduce a new two-step iterative model for learning the hashing hyperplanes [4]. In the first step the hashcodes of training data-points are regularised over an adjacency graph which encourages similar data-points to be assigned similar hashcodes. In the second step a set of binary classifiers are learnt so as to separate opposing bits (0,1) with maximum margin. Repeating both steps iteratively encourages the hyperplanes to evolve into positions that provide a much better bucketing of the input feature space compared to LSH.


For both algorithms I will present a set of query-by-example image retrieval results on standard image collections, demonstrating significantly improved retrieval effectiveness versus state-of-the-art hash functions, in addition to a set of interesting and previously unexpected results.

[1] Sasa Petrovic, Miles Osborne and Victor Lavrenko, Streaming First Story Detection with Application to Twitter, In NAACL'10.


[2] Thomas Dean, Mark Ruzon, Mark Segal, Jonathon Shlens, Sudheendra Vijayanarasimhan,  and Jay Yagnik, Fast, Accurate Detection of 100,000 Object Classes on a Single Machine, In CVPR'13.


[3] Sean Moran, Victor Lavrenko and Miles Osborne. Neighbourhood Preserving Quantisation for LSH, In SIGIR'13.


[4] Sean Moran and Victor Lavrenko. Graph Regularised Hashing. In ECIR'15.




An electroencephalography (EEG)-based real-time feedback training system for cognitive brain-machine interface (cBMI) (04 November, 2015)

Speaker: Kyuwan Choi

In this presentation, I will present a new cognitive brain-machine interface (cBMI) using cortical activities in the prefrontal cortex. In the cBMI system, subjects conduct directional imagination, which is more intuitive than the existing motor imagery. The subjects freely control a bar on the monitor using direction information extracted from the prefrontal cortex, and their prefrontal cortex is in turn activated by receiving the bar's movement as feedback. Furthermore, I will introduce an EEG-based wheelchair system using the cBMI concept. With the cBMI, it is possible to build a more intuitive BMI system. It could help improve the cognitive function of healthy people, and help activate the region around damaged areas in patients with prefrontal damage, such as patients with dementia or autism, by consistently activating their prefrontal cortex.

Adapting biomechanical simulation for physical ergonomics evaluation of new input methods (28 October, 2015)

Speaker: Myroslav Bachynskyi

Recent advances in sensor technology and computer vision have allowed new computer input methods to emerge rapidly. These methods are often considered more intuitive and easier to learn than the conventional keyboard or mouse; however, most of them are poorly assessed with respect to their physical ergonomics and the health impact of their use. The main reasons for this are the large input spaces provided by these interfaces, the absence of a reliable, cheap and easy-to-apply physical ergonomics assessment method, and the absence of biomechanics expertise among user interface designers. The goal of my research is to develop a physical ergonomics assessment method which supports interface designers at all stages of the design process, at low cost and without specialized knowledge. Our approach is to extend biomechanical simulation tools developed for medical and rehabilitation purposes and adapt them to the Human-Computer Interaction setting. The talk gives an overview of problems related to the development of the method and shows answers to some of the fundamental questions.

Detecting Swipe Errors on Touchscreens using Grip Modulation (22 October, 2015)

Speaker: Faizuddin Mohd Noor

We show that when users make errors on mobile devices, they make immediate and distinct physical responses that can be observed with standard sensors. We used three standard cognitive tasks (Flanker, Stroop and SART) to induce errors from 20 participants. Using simple low-resolution capacitive touch sensors placed around a standard device and a built-in accelerometer, we demonstrate that errors can be predicted using micro-adjustments to hand grip and movement in the period after swiping the touchscreen. In a per-user model, our technique predicted error with a mean AUC of 0.71 in Flanker and 0.60 in Stroop and SART using hand grip, while with the accelerometer the mean AUC in all tasks was ≥ 0.90. Using a pooled, non-user-specific model, our technique achieved a mean AUC of 0.75 in Flanker and 0.80 in Stroop and SART using hand grip, and an AUC for all tasks > 0.90 for the accelerometer. When combining these features we achieved an AUC of 0.96 (with false accept and reject rates both below 10%). These results suggest that hand grip and movement provide strong and very low latency evidence for mistakes, and could be a valuable component in interaction error detection and correction systems.
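AUC, the measure reported throughout, is the probability that a randomly chosen error trial receives a higher classifier score than a randomly chosen correct trial. A minimal sketch with invented scores (not data from the study):

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a positive (error) trial outscores a
    negative (correct) trial, with ties counting half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical per-trial classifier scores.
error_trials = [0.9, 0.8, 0.55]
correct_trials = [0.6, 0.4, 0.3, 0.2]
value = auc(error_trials, correct_trials)
```

An AUC of 0.5 corresponds to chance-level discrimination, so values of 0.9 and above, as reported for the accelerometer features, indicate a strongly separable signal.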

A conceptual model of the future of input devices (14 October, 2015)

Speaker: John Williamson

Turning sensor engineering into advances in human-computer interaction is slow, ad hoc and unsystematic. I'll discuss a fundamental approach to input device engineering, and illustrate how machine learning could have the exponentially accelerating impact in HCI that it has had in other fields.

[caveat: This is a proposal: there are only words, not results!]

Haptic Gaze Interaction - EVENT CANCELLED (05 October, 2015)

Speaker: Poika Isokoski

Eye trackers that can be (somewhat) comfortably worn for long periods are now available. Thus, computing systems can track the gaze vector and it can be used in interactions with mobile and embedded computing systems together with other input and output modalities. However, interaction techniques for these activities are largely missing. Furthermore, it is unclear how feedback from eye movements should be given to best support user's goals. This talk will give an overview of the results of our recent work in exploring haptic feedback on eye movements and building multimodal interaction techniques that utilize the gaze data. I will also discuss some possible future directions in this line of research.

Challenges in Metabolomics, and some Machine Learning Solutions (30 September, 2015)

Speaker: Simon Rogers

Large scale measurement of the metabolites present in an organism is very challenging, but potentially highly rewarding in the understanding of disease and the development of drugs. In this talk I will describe some of the challenges in analysis of data from Liquid Chromatography - Mass Spectrometry, one of the most popular platforms for metabolomics. I will present Statistical Machine Learning solutions to several of these challenges, including the alignment of spectra across experimental runs, the identification of metabolites within the spectra, and finish with some recent work on using text processing techniques to discover conserved metabolite substructures.

Engaging with Music Retrieval (09 September, 2015)

Speaker: Daniel Boland

Music collections available to listeners have grown at a dramatic pace, now spanning tens of millions of tracks. Interacting with a music retrieval system can thus be overwhelming, with users offered ‘too-much-choice’. The level of engagement required for such retrieval interactions can be inappropriate, such as in mobile or multitasking contexts. Using listening histories and work from music psychology, a set of engagement-stratified profiles of listening behaviour are developed. The challenge of designing music retrieval for different levels of user engagement is explored with a system allowing users to denote their level of engagement and thus the specificity of their music queries. The resulting interaction has since been adopted as a component in a commercial music system.

Building Effective and Efficient Information Retrieval Systems (26 June, 2015)

Speaker: Jimmy Lin

Machine learning has become the tool of choice for tackling challenges in a variety of domains, including information retrieval. However, most approaches focus exclusively on effectiveness, that is, the quality of system output. Yet real-world production systems need to search billions of documents in tens of milliseconds, which means that techniques also need to be efficient (i.e., fast). In this talk, I will discuss two approaches to building more effective and efficient information retrieval systems. The first is to directly learn ranking functions that are inherently more efficient, a thread of research dubbed "learning to efficiently rank". The second is through architectural optimizations that take advantage of modern processor architectures, by paying attention to low-level details such as cache misses and branch mispredicts. The combination of both approaches, in essence, allows us to "have our cake and eat it too" in building systems that are both fast and good.

Deep non-parametric learning with Gaussian processes (10 June, 2015)

Speaker: Andreas Damianou

This talk will discuss deep Gaussian process models, a recent approach to combining deep probabilistic structures with Bayesian nonparametrics. The obtained deep belief networks are constructed using continuous variables connected with Gaussian process mappings; therefore, the methodology used for training and inference deviates from traditional deep learning paradigms. The first part of the talk will thus outline the associated computational tools, revolving around variational inference. In the second part, we will discuss models obtained as special cases of the deep Gaussian process, namely dynamical / multi-view / dimensionality reduction models and nonparametric autoencoders. The above concepts and algorithms will be demonstrated with examples from computer vision (e.g. high-dimensional video, images) and robotics (motion capture data, humanoid robotics).

Intermittent Control in Man and Machine (30 April, 2015)

Speaker: Henrik Gollee

An intermittent controller generates a sequence of (continuous-time) parametrised trajectories whose parameters are adjusted intermittently, based on continuous observation. This concept is related to "ballistic" control and differs from i) discrete-time control in that the control is not constant between samples, and ii) continuous-time control in that the trajectories are reset intermittently.  The Intermittent Control paradigm evolved separately in the physiological and engineering literature. The talk will give details on the experimental verification of intermittency in biological systems and its applications in engineering.

Advantages of intermittent control compared to the continuous paradigm in the context of adaptation and learning will be discussed.

Get A Grip: Predicting User Identity From Back-of-Device Sensing (19 March, 2015)

Speaker: Mohammad Faizuddin Md Noor

We demonstrate that users can be identified from back-of-device handgrip changes during the course of interaction with a mobile phone, using simple, low-resolution capacitive touch sensors placed around a standard device. As a baseline, we replicated the front-of-screen experiments of Touchalytics and compared them with our results. We show that classifiers trained using back-of-device sensing can match or exceed the performance of classifiers trained using the Touchalytics approach. Our technique achieved a mean AUC, false accept rate and false reject rate of 0.9481, 3.52% and 20.66% for a vertical scrolling reading task, and 0.9974, 0.85% and 2.62% for a horizontal swiping game task. These results suggest that handgrip provides substantial evidence of user identity, and can be a valuable component of continuous authentication systems.

Towards Effective Non-Invasive Brain-Computer Interfaces Dedicated to Ambulatory Applications (19 March, 2015)

Speaker: Matthieu Duvinage

Disabilities affecting mobility, in particular, often lead to exacerbated isolation and thus fewer communication opportunities, resulting in a limited participation in social life. Additionally, as costs for the health-care system can be huge, rehabilitation-related devices and lower-limb prostheses (or orthoses) have been intensively studied so far. However, although many devices are now available, they rarely integrate the direct will of the patient. Indeed, they basically use motion sensors or the residual muscle activities to track the next move.

Therefore, to integrate more direct control from the patient, Brain-Computer Interfaces (BCIs) are here proposed and studied under ambulatory conditions. Basically, a BCI allows a user to control an electric device without activating muscles. In this work, the conversion of brain signals into a prosthesis kinematic control is studied following two approaches. First, the subject transmits his desired walking speed to the BCI. This high-level command is then converted into a kinematics signal thanks to a Central Pattern Generator (CPG)-based gait model, which is able to produce automatic gait patterns. Our work thus focuses on how BCIs behave in ambulatory conditions. The second strategy is based on the assumption that the brain is continuously controlling the lower limb, so a direct interpretation, i.e. decoding, of the brain signals is performed. Here, our work consists of determining which parts of the brain signals can be used.

Gait analysis from a single ear-worn sensor (17 March, 2015)

Speaker: Delaram Jarchi

Objective assessment of detailed gait patterns is important for clinical applications. One common approach to clinical gait analysis is to use multiple optical or inertial sensors affixed to the patient's body for detailed bio-motion and gait analysis. The complexity of sensor placement and issues related to consistent sensor placement have limited these methods to dedicated laboratory settings, requiring the support of a highly trained technical team. The use of a single sensor for gait assessment has many advantages, particularly in terms of patient compliance and the possibility of remote monitoring of patients in the home environment. In this talk we look into the assessment of a single ear-worn sensor (e-AR sensor) for gait analysis, developing signal processing techniques and using a number of reference platforms inside and outside the gait laboratory. Results are presented for two clinical applications: post-surgical follow-up and rehabilitation of orthopaedic patients, and investigation of gait changes in Parkinson's Disease (PD) patients.

Imaging without cameras (05 March, 2015)

Speaker: Matthew Edgar

Conventional cameras rely upon a pixelated sensor to provide spatial resolution. An alternative approach replaces the sensor with a pixelated transmission mask encoded with a series of binary patterns. Combining knowledge of the series of patterns and the associated filtered intensities, measured by single-pixel detectors, allows an image to be deduced through data inversion. At Glasgow we have been extending the concept of a `single-pixel camera' to provide continuous real-time video in excess of 10 Hz, at non-visible wavelengths, using efficient computer algorithms. We have so far demonstrated some applications for our camera such as imaging through smoke, through tinted screens, and detecting gas leaks, whilst performing sub-Nyquist sampling. We are currently investigating the most effective image processing strategies and basis scanning procedures for increasing the image resolution and frame rates for single-pixel video systems.
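
The data-inversion step is easiest to see with an orthogonal pattern set. Below is a toy sketch (not the Glasgow system, which uses binary masks on real hardware): a 4-element "scene" is measured against the rows of a 4×4 Hadamard matrix, each single-pixel reading being the dot product of the scene with one pattern; because the Hadamard matrix is orthogonal, correlating the measurements back against the patterns recovers the scene exactly.

```python
# Rows of a 4x4 Hadamard matrix serve as the projected patterns (+1/-1
# here for clarity; practical systems use pairs of binary masks).
H = [[1,  1,  1,  1],
     [1, -1,  1, -1],
     [1,  1, -1, -1],
     [1, -1, -1,  1]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

scene = [3.0, 1.0, 4.0, 1.0]          # unknown image, flattened

# Single-pixel detector: one total-intensity reading per pattern.
measurements = [dot(row, scene) for row in H]

# Inversion: H is orthogonal with H H^T = 4 I, so x = H^T y / 4.
recovered = [dot([H[i][j] for i in range(4)], measurements) / 4.0
             for j in range(4)]
print(recovered)  # -> [3.0, 1.0, 4.0, 1.0]
```

The sub-Nyquist (compressive) regime mentioned in the abstract corresponds to taking fewer measurements than pixels and replacing this direct inversion with a sparsity-regularised solver.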

Analysing UK Annual Report Narratives using Text Analysis and Natural Language Processing (23 February, 2015)

Speaker: Mahmoud El-Haj

In this presentation I will show the work we've done in our Corporate Financial Information Environment (CFIE) project. The project, funded by the ESRC and ICAEW, seeks to analyse UK financial narratives, their association with financial statement information, and their informativeness for investors, using Computational Linguistics, heuristic Information Extraction (IE) and Natural Language Processing (NLP). We automatically collected and analysed some 14,000 UK annual reports, covering the period 2002-2014, for the largest UK firms listed on the London Stock Exchange. We developed software for this purpose which is available online for general use by academics. The talk includes a demo of the software that we developed and used in our analysis: Wmatrix-import and Wmatrix. Wmatrix-import is a web-based tool to automatically detect and parse the structure of UK annual reports; the tool provides sectioning, word frequency and readability metrics. The output from Wmatrix-import serves as input for further NLP and corpus linguistic analysis by Wmatrix, a web-based corpus annotation and retrieval tool which currently supports the analysis of small to medium sized English corpora.





Compositional Data Analysis (CoDA) approaches to distance in information retrieval (20 February, 2015)

Speaker: Dr Paul Thomas

Many techniques in information retrieval produce counts from a sample, and it is common to analyse these counts as proportions of the whole; term frequencies are a familiar example. Proportions carry only relative information and are not free to vary independently of one another: for the proportion of one term to increase, one or more others must decrease. These constraints are hallmarks of compositional data. While there has long been discussion in other fields of how such data should be analysed, to our knowledge Compositional Data Analysis (CoDA) has not been considered in IR. In this work we explore compositional data in IR through the lens of distance measures, and demonstrate that common measures, naïve to compositions, have some undesirable properties which can be avoided with composition-aware measures. As a practical example, these measures are shown to improve clustering.
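
One standard composition-aware measure from the CoDA literature is the Aitchison distance: each proportion vector is mapped through the centred log-ratio (clr) transform, and ordinary Euclidean distance is taken in that space. A small sketch (the term proportions are invented, and the talk does not specify which composition-aware measures were used):

```python
import math

def clr(p):
    """Centred log-ratio transform of a composition (all parts > 0)."""
    g = math.exp(sum(math.log(x) for x in p) / len(p))  # geometric mean
    return [math.log(x / g) for x in p]

def aitchison(p, q):
    """Euclidean distance between clr-transformed compositions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(p), clr(q))))

p = [0.5, 0.3, 0.2]   # term proportions in document 1
q = [0.4, 0.4, 0.2]   # term proportions in document 2

print(aitchison(p, q))
# Scale invariance: rescaling the raw counts leaves the distance unchanged,
# a property plain Euclidean distance on counts does not have.
print(abs(aitchison([x * 10 for x in p], q) - aitchison(p, q)) < 1e-9)  # -> True
```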

Users versus Models: What observation tells us about effectiveness metrics (16 February, 2015)

Speaker: Dr. Paul Thomas

Retrieval system effectiveness can be measured in two quite different ways: by monitoring the behaviour of users and gathering data about the ease and accuracy with which they accomplish certain specified information-seeking tasks; or by using numeric effectiveness metrics to score system runs in reference to a set of relevance judgements.  In the second approach, the effectiveness metric is chosen in the belief that it predicts ease or accuracy.

This work explores that link, by analysing the assumptions and implications of a number of effectiveness metrics, and exploring how these relate to observable user behaviours.  Data recorded as part of a user study included user self-assessment of search task difficulty; gaze position; and click activity.  Our results show that user behaviour is influenced by a blend of many factors, including the extent to which relevant documents are encountered, the stage of the search process, and task difficulty.  These insights can be used to guide development of batch effectiveness metrics.

Towards Effective Retrieval of Spontaneous Conversational Spoken Content (08 January, 2015)

Speaker: Gareth J. F. Jones

Spoken content retrieval (SCR) has been the focus of various research initiatives for more than 20 years. Early research focused on retrieval of clearly defined spoken documents, principally from the broadcast news domain. The main focus of this work was the spoken document retrieval (SDR) task at TREC 6-9, at the end of which SDR was declared a largely solved problem. However, this soon proved a premature conclusion: it related to controlled recordings of professional news content and overlooked many of the potential challenges of searching more complex spoken content. Subsequent research has focused on more challenging tasks such as search of interview recordings and semi-professional internet content. This talk will begin by reviewing early work in SDR, explaining its successes and limitations; it will then outline work exploring SCR for more challenging tasks, such as identifying relevant elements in long spoken recordings such as meetings and presentations; provide a detailed analysis of the retrieval behaviour of spoken content elements when indexed using manual and automatic transcripts; and conclude with a summary of the challenges of delivering effective SCR for complex spoken content and initial attempts to address these challenges.

On Inverted Index Compression for Search Engine Efficiency (01 September, 2014)

Speaker: Matteo Catena

Efficient access to the inverted index data structure is a key aspect for a search engine to achieve fast response times to users’ queries. While the performance of an information retrieval (IR) system can be enhanced through the compression of its posting lists, there is little recent work in the literature that thoroughly compares and analyses the performance of modern integer compression schemes across different types of posting information (document ids, frequencies, positions). In this talk, we show the benefit of compression for different types of posting information to the space- and time-efficiency of the search engine. Comprehensive experiments have been conducted on two large, widely used document corpora and large query sets; using different modern integer compression algorithms, integrated into a modern IR system, the Terrier IR platform. While reporting the compression scheme which results in the best query response times, the presented analysis will also show the impact of compression on frequency and position posting information in Web corpora that have large volumes of anchor text.
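
A representative member of the integer-compression family studied here is variable-byte coding over d-gaps: document ids in a posting list are replaced by the differences between consecutive ids, and each gap is written in 7-bit chunks with the high bit marking the final byte. A sketch (illustrative only, not Terrier's implementation; the scheme that wins in the talk's experiments may differ):

```python
def vbyte_encode(gaps):
    out = bytearray()
    for n in gaps:
        while n >= 128:
            out.append(n & 0x7F)      # low 7 bits, more bytes follow
            n >>= 7
        out.append(n | 0x80)          # high bit set marks the final byte
    return bytes(out)

def vbyte_decode(data):
    gaps, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:                  # final byte of this integer
            gaps.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return gaps

docids = [3, 7, 21, 300, 305]
gaps = [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]  # [3, 4, 14, 279, 5]
encoded = vbyte_encode(gaps)
assert vbyte_decode(encoded) == gaps
print(len(encoded), "bytes instead of", 4 * len(docids))  # -> 6 bytes instead of 20
```

Because gaps are small for frequent terms, most postings fit in a single byte, which is the source of the space savings the talk quantifies.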

Interactive Visualisation of Big Music Data. (22 August, 2014)

Speaker: Beatrix Vad

Musical content can be described by a variety of features that are measured or inferred through the analysis of audio data. For a large music collection, this establishes the possibility of retrieving information about its structure and underlying patterns. Dimensionality reduction techniques can be used to gain insight into such a high-dimensional dataset and to enable visualisation on two-dimensional screens. In this talk we investigate the usability of these techniques with respect to an interactive, mood-based exploration interface for large music collections. A method employing Gaussian Processes to extend the visualisation with additional information about its composition is presented and evaluated.

Behavioural Biometrics for Mobile Touchscreen Devices (22 August, 2014)

Speaker: Daniel Buschek

Inference in non-linear dynamical systems – a machine learning perspective (08 July, 2014)

Speaker: Carl Rasmussen

Inference in discrete-time non-linear dynamical systems is often done using the Extended Kalman Filtering and Smoothing (EKF) algorithm, which provides a Gaussian approximation to the posterior based on local linearisation of the dynamics. In challenging problems, when the non-linearities are significant and the signal-to-noise ratio is poor, the EKF performs poorly. In this talk we will discuss an alternative algorithm developed in the machine learning community which is based on message passing in factor graphs and the Expectation Propagation (EP) approximation. We will show that this method provides a consistent and accurate Gaussian approximation to the posterior, enabling system identification using Expectation Maximisation (EM) even in cases where the EKF fails.

Adaptive Interaction (02 June, 2014)

Speaker: Professor Andrew Howes
A utility maximization approach to understanding human interaction with technology

This lecture describes a theoretical framework for the behavioural sciences that holds high promise for theory-driven research and design in Human-Computer Interaction. The framework is designed to tackle the adaptive, ecological, and bounded nature of human behaviour. It is designed to help scientists and practitioners reason about why people choose to behave as they do and to explain which strategies people choose in response to utility, ecology, and cognitive information processing mechanisms. A key idea is that people choose strategies so as to maximise utility given constraints. The framework is illustrated with a number of examples including pointing, multitasking, skim-reading, online purchasing, Signal-Detection Theory and diagnosis, and the influence of reputation on purchasing decisions. Importantly, these examples span from perceptual/motor coordination, through cognition, to social interaction. Finally, the lecture discusses the challenging idea that people seek to find optimal strategies, and the implications for behavioural investigation in HCI.

Web-scale Semantic Ranking (16 May, 2014)

Speaker: Dr Nick Craswell
Bing Ranking Techniques

Semantic ranking models score documents based on closeness in meaning to the query rather than by just matching keywords. To implement semantic ranking at Web scale, we have designed and deployed a new multi-level ranking system that combines the best of inverted index and forward index technologies. I will describe this infrastructure, which is currently serving many millions of users, and explore several types of semantic models: translation models, syntactic pattern matching, and topical matching models. Our experiments demonstrate that these semantic ranking models significantly improve relevance over our existing baseline system. This is a repeat of a WWW 2014 industry track talk.

Optimized Interleaving for Retrieval Evaluation (28 April, 2014)

Speaker: Filip Radlinski

Interleaving is an online evaluation technique for comparing the relative quality of information retrieval functions by combining their result lists and tracking clicks. A sequence of such algorithms has been proposed, each shown to address problems in earlier algorithms. In this talk, I will formalize and generalize this process by introducing a formal model: after identifying a set of desirable properties for interleaving, I will show that an interleaving algorithm can be obtained as the solution to an optimization problem within those constraints. This approach makes explicit the parameters of the algorithm, as well as assumptions about user behaviour. Further, it leads to an unbiased and more efficient interleaving algorithm than any previous approach, as I will show with a novel log-based analysis of user search behaviour.
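
A concrete baseline from the family of algorithms the talk builds on is team-draft interleaving: the two rankers alternately "pick" their highest-ranked document not yet in the merged list, and clicks are later credited to whichever ranker picked the clicked document. A minimal sketch (the per-round coin flip of the published algorithm is fixed here so the example is deterministic):

```python
def team_draft_interleave(ranking_a, ranking_b, a_first=True):
    """Merge two rankings, remembering which 'team' picked each item.
    `a_first` stands in for the per-round randomisation of the real algorithm."""
    merged, team = [], {}
    picks_a = picks_b = 0
    all_docs = set(ranking_a) | set(ranking_b)
    while len(merged) < len(all_docs):
        turn_a = (picks_a <= picks_b) if a_first else (picks_a < picks_b)
        primary, other = (ranking_a, ranking_b) if turn_a else (ranking_b, ranking_a)
        candidate = next((d for d in primary if d not in team), None)
        if candidate is None:               # primary ranking exhausted
            candidate = next(d for d in other if d not in team)
        team[candidate] = 'A' if turn_a else 'B'
        merged.append(candidate)
        if turn_a:
            picks_a += 1
        else:
            picks_b += 1
    return merged, team

merged, team = team_draft_interleave(['d1', 'd2', 'd3'], ['d2', 'd4', 'd1'])
print(merged)  # -> ['d1', 'd2', 'd3', 'd4']

clicks = ['d2']                        # hypothetical user clicks
credit = {'A': 0, 'B': 0}
for c in clicks:
    credit[team[c]] += 1
print(credit)  # -> {'A': 0, 'B': 1}
```

The talk's contribution is to derive such an algorithm as the solution of an optimization problem rather than hand-designing the picking rule.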

Gaussian Processes for Big Data (03 April, 2014)

Speaker: Dr James Hensman

Gaussian Process (GP) models are widely applicable models of functions, and are used extensively in statistics and machine learning for regression, classification and as components of more complex models. Inference in a Gaussian process model usually costs O(n^3) operations, where n is the number of data. In the Big Data (tm) world, it would initially seem unlikely that GPs might contribute due to this computational requirement.

Parametric models have been successfully applied to Big Data (tm) using the Robbins-Monro gradient method, which allows data to be processed individually or in small batches. In this talk, I'll show how these ideas can be applied to Gaussian Processes. To do this, I'll form a variational bound on the marginal likelihood: we discuss the properties of this bound, including the conditions where we recover exact GP behaviour.

Our methods have allowed GP regression on hundreds of thousands of data points using a standard desktop machine.

Composite retrieval of heterogeneous web search (24 March, 2014)

Speaker: Horatiu Bota

Traditional search systems generally present a ranked list of documents as answers to user queries. In aggregated search systems, results from different and increasingly diverse verticals (image, video, news, etc.) are returned to users. For instance, many such search engines return both images and web documents as answers to the query "flower". Aggregated search has become a very popular paradigm. In this talk, we go one step further and study a different search paradigm: composite retrieval. Rather than returning and merging results from different verticals, as is the case with aggregated search, we propose to return to users a set of "bundles", where a bundle is composed of "cohesive" results from several verticals. For example, for the query "London Olympics", one bundle per sport could be returned, each containing results extracted from news, videos, images, or Wikipedia. Composite retrieval can promote exploratory search in a way that helps users understand the diversity of results available for a specific query and decide what to explore in more detail.


We proposed and evaluated a variety of approaches to construct bundles that are relevant, cohesive and diverse. We also utilize both entities and terms as surrogates to represent items, and demonstrate their effectiveness in bridging the "mismatch" gap among heterogeneous sources. Compared with three baselines (traditional "general web only" ranking, federated search ranking and aggregated search), our evaluation results demonstrate significant performance improvement for a highly heterogeneous web collection.

Query Auto-completion & Composite retrieval (17 March, 2014)

Speaker: Stewart Whiting and Horatiu Bota

=Recent and Robust Query Auto-Completion by Stewart Whiting=

Query auto-completion (QAC) is a common interactive feature that assists users in formulating queries by providing completion suggestions as they type. In order for QAC to minimise the user’s cognitive and physical effort, it must: (i) suggest the user’s intended query after minimal input keystrokes, and (ii) rank the user’s intended query highly in completion suggestions. QAC must be both robust and time-sensitive – that is, able to sufficiently rank both consistently and recently popular queries in completion suggestions. Addressing this trade-off, we propose several practical completion suggestion ranking approaches, including: (i) a sliding window of query popularity evidence from the past 2-28 days, (ii) the query popularity distribution in the last N queries observed with a given prefix, and (iii) short-range query popularity prediction based on recently observed trends. Through real-time simulation experiments, we extensively investigated the parameters necessary to maximise QAC effectiveness for three openly available query log datasets with prefixes of 2-5 characters: MSN and AOL (both English), and Sogou 2008 (Chinese). Results demonstrate consistent and language-independent improvements of up to 9.2% over a non-temporal QAC baseline for all query logs with prefix lengths of 2-3 characters. Hence, this work is an important step towards more effective QAC approaches.
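
Approach (i) above can be sketched as ranking completions for a prefix by their popularity within a sliding window of recent days. A minimal illustration (the query log below is invented, not drawn from MSN, AOL or Sogou):

```python
from collections import Counter
from datetime import date, timedelta

# (query, day issued) pairs standing in for a real query log.
log = [
    ("facebook",    date(2023, 5, 1)),   # outside the window below
    ("facebook",    date(2023, 5, 20)),
    ("fast trains", date(2023, 5, 19)),
    ("fast trains", date(2023, 5, 18)),
    ("fa cup",      date(2023, 5, 20)),
    ("fa cup",      date(2023, 5, 20)),
    ("fa cup",      date(2023, 5, 19)),
]

def complete(prefix, today, window_days=7, k=3):
    """Rank completions of `prefix` by popularity in the last `window_days`."""
    cutoff = today - timedelta(days=window_days)
    counts = Counter(q for q, day in log
                     if q.startswith(prefix) and day >= cutoff)
    return [q for q, _ in counts.most_common(k)]

print(complete("fa", date(2023, 5, 20)))
# -> ['fa cup', 'fast trains', 'facebook']
```

Shrinking or widening `window_days` is exactly the recency/robustness trade-off the abstract describes: a short window tracks trending queries, a long one favours consistently popular ones.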


=Composite retrieval of heterogeneous web search by Horatiu Bota=

Traditional search systems generally present a ranked list of documents as answers to user queries. In aggregated search systems, results from different and increasingly diverse verticals (image, video, news, etc.) are returned to users. For instance, many such search engines return both images and web documents as answers to the query "flower". Aggregated search has become a very popular paradigm. In this talk, we go one step further and study a different search paradigm: composite retrieval. Rather than returning and merging results from different verticals, as is the case with aggregated search, we propose to return to users a set of "bundles", where a bundle is composed of "cohesive" results from several verticals. For example, for the query "London Olympics", one bundle per sport could be returned, each containing results extracted from news, videos, images, or Wikipedia. Composite retrieval can promote exploratory search in a way that helps users understand the diversity of results available for a specific query and decide what to explore in more detail.


We proposed and evaluated a variety of approaches to construct bundles that are relevant, cohesive and diverse. We also utilize both entities and terms as surrogates to represent items, and demonstrate their effectiveness in bridging the "mismatch" gap among heterogeneous sources. Compared with three baselines (traditional "general web only" ranking, federated search ranking and aggregated search), our evaluation results demonstrate significant performance improvement for a highly heterogeneous web collection.

Studying the performance of semi-structured p2p information retrieval (10 March, 2014)

Speaker: Rami Alkhawaldeh

In recent decades, retrieval systems deployed over peer-to-peer (P2P) overlay networks have been investigated as an alternative to centralised search engines. Although modern search engines provide efficient document retrieval, they have several drawbacks, including: a single point of failure, maintenance costs, privacy risks, information monopolies of search engine companies, and difficulty retrieving hidden documents on the web (i.e. the deep web). P2P information retrieval (P2PIR) systems promise an alternative distributed system to the traditional centralised search engine architecture. Users and creators of web content in such networks have full control over what information they wish to share as well as how they share it.




Researchers have been tackling several challenges to build effective P2PIR systems: (i) collection (peer) representation during indexing, (ii) peer selection during search, to route queries to relevant peers, and (iii) merging of the final peer results. Semi-structured P2P networks (i.e., partially decentralised unstructured overlay networks) offer an intermediate design that minimizes the weaknesses of both centralised and completely decentralised overlay networks and combines the advantages of the two topologies. An evaluation framework for this kind of network is therefore necessary to compare the performance of different P2P approaches and to guide the development of new and more powerful approaches. In this work, we study the performance of three cluster-based semi-structured P2PIR models and explain the effect of several important design considerations and parameters on retrieval performance, as well as the robustness of these types of network.


4pm @ Level 4

Inside The World’s Playlist (23 February, 2014)

Speaker: Manos Tsagkias


We describe the algorithms behind Streamwatchr, a real-time system for analyzing the music listening behavior of people around the world. Streamwatchr collects music-related tweets, extracts artists and songs, and visualises the results in two ways: (i) currently trending songs and artists, and (ii) newly discovered songs.


Machine Learning for Back-of-the-Device Multitouch Typing (17 December, 2013)

Speaker: Daniel Buschek

Dublin City Search: An evolution of search to incorporate city data (24 November, 2013)

Speaker: Dr Veli Bicer, IBM Research Dublin
[…]ors, devices, social networks, governmental applications, or service networks. In such a diversity of information, answering specific information needs of city inhabitants requires holistic information retrieval techniques, capable of harnessing differen[…]

Dr Veli Bicer is a researcher at the Smarter Cities Technology Center of IBM Research in Dublin. His research interests include semantic data management, semantic search, software engineering and statistical relational learning. He obtained his PhD from the Karlsruhe Institute of Technology, Karlsruhe, Germany, and his B.Sc. and M.Sc. degrees in computer engineering from Middle East Technical University, Ankara, Turkey.

IDI Seminar: Uncertain Text Entry on Mobile Devices (21 November, 2013)

Speaker: Daryl Weir

Modern mobile devices typically rely on touchscreen keyboards for input. Unfortunately, users often struggle to enter text accurately on virtual keyboards. We undertook a systematic investigation into how to best utilize probabilistic information to improve these keyboards. We incorporate a state-of-the-art touch model that can learn the tap idiosyncrasies of a particular user, and show in an evaluation that character error rate can be reduced by up to 7% over a baseline, and by up to 1.3% over a leading commercial keyboard. We furthermore investigate how users can explicitly control autocorrection via how hard they touch.
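
The core idea of a probabilistic touch model can be sketched as maximum-likelihood decoding: model each key as a Gaussian over touch locations and pick the key under which the observed tap is most likely. The key positions and variance below are invented, and the real system additionally learns per-user tap offsets and combines the touch model with a language model:

```python
import math

# Hypothetical key centres along one keyboard row (x position in mm).
key_centres = {'q': 5.0, 'w': 15.0, 'e': 25.0}

def decode(tap_x, sigma=4.0):
    """Return the key maximising the Gaussian log-likelihood of the tap."""
    def log_lik(centre):
        return -((tap_x - centre) ** 2) / (2 * sigma ** 2)
    return max(key_centres, key=lambda k: log_lik(key_centres[k]))

print(decode(13.0))  # tap nearest 'w' -> 'w'
print(decode(21.0))  # -> 'e'
```

With equal variances this reduces to nearest-centre decoding; the gains reported in the talk come from learning per-user, per-key offsets and variances instead of assuming them.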

Economic Models of Search (18 November, 2013)

Speaker: Leif Azzopardi


Predicting Screen Touches From Back-of-Device Grip Changes (14 November, 2013)

Speaker: Faizuddin Mohd Noor

We demonstrate that front-of-screen targeting on mobile phones can be predicted from back-of-device grip manipulations. Using simple, low-resolution capacitive touch sensors placed around a standard phone, we outline a machine learning approach to modelling the grip modulation and inferring front-of-screen touch targets. We experimentally demonstrate that grip is a remarkably good predictor of touch, and we can predict touch position 200ms before contact with an accuracy of 18mm.
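As a rough illustration of the modelling idea (not the authors' pipeline), a least-squares fit can map a grip feature to a touch position before the finger lands. The data below is synthetic and one-dimensional, whereas the actual system used multiple low-resolution capacitive sensors.

```python
import random

random.seed(1)

# Synthetic training data: one aggregate back-of-device grip feature per
# sample, and the eventual front-of-screen touch x-position in mm.
true_slope, true_intercept = 3.0, 10.0
grips = [random.uniform(0, 10) for _ in range(200)]
touches = [true_slope * g + true_intercept + random.gauss(0, 1.0) for g in grips]

# Closed-form simple linear regression (ordinary least squares).
n = len(grips)
mean_x, mean_y = sum(grips) / n, sum(touches) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(grips, touches)) \
    / sum((x - mean_x) ** 2 for x in grips)
intercept = mean_y - slope * mean_x

# Predict the touch point from a new grip reading, before contact occurs.
predicted = slope * 5.0 + intercept
```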


Online Learning in Explorative Multi Period Information Retrieval (11 November, 2013)

Speaker: Marc Sloan


In Multi Period Information Retrieval we consider retrieval as a stochastic yet controllable process: the ranking action continuously controls the retrieval system's dynamics, and an optimal ranking policy is sought in order to maximise overall user satisfaction. Different aspects of this process can be fixed, giving rise to different search scenarios. One such application is to fix the search intent and learn from a population of users over time. Here we use a multi-armed bandit algorithm and apply techniques from finance to learn optimally diverse and explorative search results for a query. We can also fix the user and dynamically model the search over multiple pages of results using relevance feedback. Likewise, we are currently investigating applying the same technique to session search using a Markov Decision Process.
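The bandit flavour of this approach can be sketched with UCB1 choosing among a handful of candidate rankings; the click-through rates below are invented, and the talk's actual algorithm and finance-inspired techniques are not shown.

```python
import math
import random

random.seed(42)

# Three candidate result rankings (arms) with hypothetical true click rates.
TRUE_CTR = [0.1, 0.6, 0.3]

counts = [0] * 3    # times each ranking was shown
rewards = [0.0] * 3  # accumulated clicks per ranking

def ucb_arm(t):
    """Play each arm once, then pick by the UCB1 upper confidence bound."""
    for a in range(3):
        if counts[a] == 0:
            return a
    return max(range(3), key=lambda a: rewards[a] / counts[a]
               + math.sqrt(2 * math.log(t) / counts[a]))

for t in range(1, 2001):
    arm = ucb_arm(t)
    clicked = 1.0 if random.random() < TRUE_CTR[arm] else 0.0
    counts[arm] += 1
    rewards[arm] += clicked
```

Over time the algorithm concentrates on the ranking users click most, while still occasionally exploring the alternatives.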

Stopping Information Search: An fMRI Investigation (04 November, 2013)

Speaker: Eric Walden

Information search has become an increasingly important factor in people's use of information systems.  In both personal and workplace environments, advances in information technology and the availability of information have enabled people to perform far more search and access much more information for decision making than in the very recent past.  One consequence of this abundance of information has been an increasing need for people to develop better heuristic methods for stopping search, since information available for most decisions now overwhelms people's cognitive processing capabilities and in some cases is almost infinite.  Information search has been studied in much past research, and cognitive stopping rules have also been investigated.  The present research extends and expands on previous behavioral research by investigating brain activation during searching and stopping behavior using functional Magnetic Resonance Imaging (fMRI) techniques.  We asked subjects to search for information about consumer products and to stop when they believed they had enough information to make a subsequent decision about whether to purchase that product.  They performed these tasks while in an MRI machine.  Brain scans were taken that measured brain activity throughout task performance.  Results showed that different areas of the brain were active for searching and stopping, that different brain regions were used for several different self-reported stopping rules, that stopping is a neural correlate of inhibition, suggesting a generalized stopping mechanism in the brain, and that certain individual difference variables make no difference in brain regions active for stopping.  The findings extend our knowledge of information search, stopping behavior, and inhibition, contributing to both the information systems and neuroscience literatures.  Implications of our findings for theory and practice are discussed.

Towards Technically assisted Sensitivity Review of UK Digital Public Records (21 October, 2013)

Speaker: Tim Gollins

There are major difficulties involved in identifying sensitive information in digital public records. These difficulties, if not addressed, will together with the challenge of managing the risks of failing to identify sensitive documents, force government departments into the precautionary closure of large swaths of digital records. Such closures will inhibit timely, open and transparent access by citizens and others in civic society. Precautionary closures will also prevent social scientists’ and contemporary historians’ access to valuable qualitative information, and their ability to contextualise studies of emerging large scale quantitative data. Closely analogous problems exist in UK local authorities, the third sector, and in other countries which are covered by the same or similar legislation and regulation. In 2012, having conducted investigations and earlier research into this problem, and with new evidence of immediate need emerging from the 20 year rule transition process, The UK National Archives (TNA) highlighted this serious issue facing government departments in the UK Public Records system; the Abaca project is the response.


The talk will outline the role of TNA, the background to sensitivity review, the impact of the move to born digital records, the nature of the particular challenge of reviewing them for sensitivity, and the broad approach that the Abaca Project is taking.




Accelerating research on big datasets with Stratosphere (14 October, 2013)

Speaker: Moritz Schubotz
Stratosphere is a research project investigating new paradigms for scalable, complex analytics on massively-parallel data sets.

Stratosphere is a research project investigating new paradigms for scalable, complex analytics on massively-parallel data sets. The core concept of Stratosphere is the PACT programming model, which extends MapReduce with second-order functions such as Match, CoGroup and Cross, allowing researchers to describe complex analytics tasks naturally. The result is directed acyclic data-flow graphs that are optimized for parallel execution by a cost-based optimizer incorporating user-code properties, and executed by the Nephele Data Flow Engine. Nephele is a massively parallel data flow engine that handles resource management, work scheduling, communication, and fault tolerance.

In the seminar session we introduce Stratosphere and show how researchers can set up their working environment quickly and start doing research right away. As a proof of concept, we present how a simple Java program, parallelized and optimized by Stratosphere, obtained top results at the "exotic" Math search task at NTCIR-10. While other research groups optimized index structures and data formats and waited several hours for their indices to be built on high-end hardware, we could focus on the essential program logic, use basic data types, and run the experiments on a heterogeneous desktop cluster in a few minutes.
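To give a feel for PACT's second-order functions, here are toy equivalents of Match and Cross. The real API is Java-based and distributed; these Python stand-ins only illustrate the contracts, and the example records are invented.

```python
def pact_match(left, right, key_left, key_right, udf):
    """Match: pair records from two inputs that share a key, then apply a UDF."""
    index = {}
    for rec in right:
        index.setdefault(key_right(rec), []).append(rec)
    return [udf(l, r) for l in left for r in index.get(key_left(l), [])]

def pact_cross(left, right, udf):
    """Cross: apply a UDF to every pair in the Cartesian product."""
    return [udf(l, r) for l in left for r in right]

# Hypothetical inputs: users and their click records, joined on user id.
users = [("u1", "alice"), ("u2", "bob")]
clicks = [("u1", "page7"), ("u1", "page9"), ("u3", "page2")]
joined = pact_match(users, clicks,
                    key_left=lambda u: u[0], key_right=lambda c: c[0],
                    udf=lambda u, c: (u[1], c[1]))
```

In Stratosphere these operators carry annotations the optimizer uses to choose shipping and local execution strategies; that machinery has no analogue in the sketch.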

IDI Seminar: Around-device devices: utilizing space and objects around the phone (07 October, 2013)

Speaker: Henning Pohl

For many people their phones have become their main everyday tool. While phones can fulfill many different roles, they also require users to (1) make do with affordances not specialized for the specific task, and (2) closely engage with the device itself. In this talk, I propose utilizing the space and objects around the phone to offer better task affordances and to create an opportunity for casual interactions. Around-device devices are a class of interactors that do not require the user to bring special tangibles, but instead repurpose items already found in the user’s surroundings. I'll present a survey study, where we determined which places and objects are available to around-device devices. I'll also talk about a prototype implementation of hand interactions and object tracking for future mobiles with built-in depth sensing.

IDI Seminar: Extracting meaning from audio – a machine learning approach (03 October, 2013)

Speaker: Jan Larsen

Validity and Reliability in Cranfield-like Evaluation in Information Retrieval (23 September, 2013)

Speaker: Julián Urbano

The Cranfield paradigm to Information Retrieval evaluation has been used for half a century now as the means to compare retrieval techniques and advance the state of the art accordingly. However, this paradigm makes certain assumptions that remain a research problem in Information Retrieval and that may invalidate our experimental results.

In this talk I will approach the Cranfield paradigm as a statistical estimator of certain probability distributions that describe the final user experience. These distributions are estimated with a test collection, which actually computes system-related distributions that are assumed to be correlated with the target user-related distributions. From the point of view of validity, I will discuss the strength of that correlation and how it affects the conclusions we draw from an evaluation experiment. From the point of view of reliability, I will discuss past and current practice in measuring the reliability of test collections and review several of them accordingly.

Exploration and contextualization: towards reusable tools for the humanities. (16 September, 2013)

Speaker: Marc Bron

The introduction of new technologies, access to large electronic cultural heritage repositories, and the availability of new information channels continues to change the way humanities researchers work and the questions they seek to answer. In this talk I will discuss how the research cycle of humanities researchers has been affected by these changes and argue for the continued development of interactive information retrieval tools to support the research practices of humanities researchers. Specifically, I will focus on two phases in the humanities research cycle: the exploration phase and the contextualization phase. In the first part of the talk I discuss work on the development and evaluation of search interfaces aimed at supporting exploration. In the second part of the talk I will focus on how information retrieval technology focused on identifying relations between concepts may be used to develop applications that support contextualization.

Quantum Language Models (19 August, 2013)

Speaker: Alessandro Sordoni

A joint analysis of both Vector Space and Language Models for IR using the mathematical framework of Quantum Theory revealed how both models allocate the space of density matrices. A density matrix is shown to be a general representational tool capable of leveraging the capabilities of both VSM and LM representations, thus paving the way for a new generation of retrieval models. The new approach is called Quantum Language Modeling (QLM) and has shown its efficiency and effectiveness in modeling term dependencies for Information Retrieval.


Toward Models and Measures of Findability (21 July, 2013)

Speaker: Colin Wilkie
A summary of the work being undertaken on Findability

In this 10-minute talk, I will provide an overview of the project I am working on, which is about Findability, review some of the existing models and measures of findability, and outline the models that I have been working on.

How cost affects search behaviour (21 July, 2013)

Speaker: Leif Azzopardi
Find out about how microeconomic theory predicts user behaviour...

In this talk, I will run through the work I will be presenting at SIGIR on "How cost affects search behavior". The empirical analysis is motivated and underpinned using the Search Economic Theory that I proposed at SIGIR 2011. 

[SICSA DVF] Language variation and influence in social media (15 July, 2013)

Speaker: Dr. Jacob Eisenstein
Dr. Eisenstein works on statistical natural language processing, focusing on social media analysis, discourse, and latent variable models

Languages vary by speaker and situation, and change over time.  While variation and change are inhibited in written corpora such as news text, they are endemic to social media, enabling large-scale investigation of language's social and temporal dimensions. The first part of this talk will describe a method for characterizing group-level language differences, using the Sparse Additive Generative Model (SAGE). SAGE is based on a re-parametrization of the multinomial distribution that is amenable to sparsity-inducing regularization and facilitates joint modeling across many author characteristics. The second part of the talk concerns change and influence. Using a novel dataset of geotagged word counts, we induce a network of linguistic influence between cities, aggregating across thousands of words. We then explore the demographic and geographic factors that drive spread of new words between cities. This work is in collaboration with Amr Ahmed, Brendan O'Connor, Noah A. Smith, and Eric P. Xing.
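A toy sketch of the SAGE idea follows: a group's word distribution is modelled as a background log-frequency plus a sparse additive offset. Soft-thresholding the smoothed log-odds is used here as a crude stand-in for the paper's regularized estimator, and the word counts are invented.

```python
import math

def sage_offsets(group_counts, background_counts, threshold=0.5):
    """Sparse additive offsets: log-odds vs. background, soft-thresholded to zero."""
    g_total = sum(group_counts.values())
    b_total = sum(background_counts.values())
    vocab = len(background_counts)
    offsets = {}
    for word, b in background_counts.items():
        g = group_counts.get(word, 0)
        # Add-one smoothed log ratio of group frequency to background frequency.
        eta = math.log((g + 1) / (g_total + vocab)) \
            - math.log((b + 1) / (b_total + vocab))
        # Soft-threshold: small deviations are set exactly to zero (sparsity).
        offsets[word] = math.copysign(max(abs(eta) - threshold, 0.0), eta)
    return offsets

background = {"the": 100, "film": 40, "gnarly": 2}
group = {"the": 60, "film": 10, "gnarly": 30}
off = sage_offsets(group, background)
```

Only words that genuinely distinguish the group (here, "gnarly") receive non-zero offsets; common words collapse to the background, which is what makes the model's parameters interpretable.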

Jacob Eisenstein is an Assistant Professor in the School of Interactive Computing at Georgia Tech. He works on statistical natural language processing, focusing on social media analysis, discourse, and latent variable models. Jacob was a Postdoctoral researcher at Carnegie Mellon and the University of Illinois. He completed his Ph.D. at MIT in 2008, winning the George M. Sprowls dissertation award.


The Use of Correspondence Analysis in Information Retrieval (11 July, 2013)

Speaker: Dr Taner Dincer
This presentation will introduce the application of Correspondence Analysis in Information Retrieval

This presentation will introduce the application of Correspondence Analysis (CA) to Information Retrieval. CA is a well-established multivariate, statistical, exploratory data analysis technique. Multivariate data analysis techniques usually operate on a rectangular array of real numbers called a data matrix, whose rows represent r observations (for example, r terms/words in documents) and whose columns represent c variables (for example, c documents, resulting in an r×c term-by-document matrix). Multivariate data analysis analyses the data in a manner that takes into account the relationships among observations and also among variables; in contrast to univariate statistics, it is concerned with the joint nature of measurements. The objective of exploratory data analysis is to explore the relationships among objects and among variables over measurements for the purpose of visual inspection. In particular, by using CA one can visually study the “Divergence From Independence” (DFI) among observations and among variables.

For Information Retrieval (IR), CA can serve three different uses: 1) As an analysis tool to visually inspect the results of information retrieval experiments, 2) As a basis to unify the probabilistic approaches to term weighting problem such as Divergence From Randomness and Language Models, and 3) As a term weighting model itself, "term weighting based on measuring divergence from independence". In this presentation, the uses of CA for these three purposes are exemplified.
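One simple way to illustrate the divergence-from-independence idea is to subtract, for each cell of the term-by-document matrix, the count expected if terms and documents were independent. The published DFI weighting schemes are more refined; this is a minimal sketch on invented counts.

```python
def dfi_weights(matrix):
    """Term-by-document weights as the divergence of observed counts from the
    counts expected under independence of terms and documents."""
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    total = sum(row_sums)
    return [[obs - (row_sums[i] * col_sums[j]) / total
             for j, obs in enumerate(row)]
            for i, row in enumerate(matrix)]

# Rows: terms; columns: documents.
counts = [[4, 0],   # term concentrated in document 0: informative
          [2, 2]]   # term spread evenly: close to independence
w = dfi_weights(counts)
```

A term whose occurrences match the independence expectation gets weight near zero; concentration in particular documents yields large positive weights, which is the signal DFI-based term weighting exploits.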

A study of Information Management in the Patient Surgical Pathway in NHS Scotland (03 June, 2013)

Speaker: Matt-Mouley Bouamrane

We conducted a study of information management processes across the patient surgical pathway in NHS Scotland. While the majority of General Practitioners (GPs) consider electronic information systems as an essential and integral part of their work during the patient consultation, many were not fully satisfied with the functionalities of these systems. A majority of GPs considered that the national eReferral system streamlined referral processes. Almost all GPs reported marked variability in the quality of discharge information. Preoperative processes vary significantly across Scotland, with most services using paper based systems. There is insufficient use made of information provided through the patient electronic referral and a considerable duplication of effort with the work already performed in primary care. Three health-boards have implemented electronic preoperative information systems. These have transformed clinical practices and facilitated communication and information-sharing among the multi-disciplinary team and within the health boards. Substantial progress has been made towards improving information transfer and sharing within the surgical pathway in recent years but there remains scope for further improvements at the interface between services.

Interdependence and Predictability of Human Mobility and Social Interactions (23 May, 2013)

Speaker: Mirco Musolesi

The study of the interdependence of human movement and social ties of individuals is one of the most interesting research areas in computational social science. Previous studies have shown that human movement is predictable to a certain extent at different geographic scales. One of the open problems is how to improve the prediction exploiting additional available information. In particular, one of the key questions is how to characterise and exploit the correlation between movements of friends and acquaintances to increase the accuracy of the forecasting algorithms.

In this talk I will discuss the results of our analysis of the Nokia Mobile Data Challenge dataset showing that, by means of multivariate nonlinear predictors, it is possible to exploit mobility data of friends in order to improve user movement forecasting. This can be seen as a process of discovering correlation patterns in networks of linked social and geographic data. I will also show how mutual information can be used to quantify this correlation; I will demonstrate how to use this quantity to select individuals with correlated mobility patterns in order to improve movement prediction. Finally, I will show how the exploitation of data related to friends improves dramatically the prediction with respect to the case of information of people that do not have social ties with the user.
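Mutual information between two users' discretised location sequences can be computed directly from empirical frequencies; the sequences below are invented toy data, not the Nokia dataset.

```python
import math
from collections import Counter

def mutual_information(seq_a, seq_b):
    """Mutual information (in bits) between two aligned location sequences."""
    n = len(seq_a)
    pa, pb = Counter(seq_a), Counter(seq_b)
    pab = Counter(zip(seq_a, seq_b))
    mi = 0.0
    for (a, b), c in pab.items():
        p_joint = c / n
        mi += p_joint * math.log2(p_joint / ((pa[a] / n) * (pb[b] / n)))
    return mi

# Movements between cells "H"(ome) and "W"(ork): user2 mostly tracks user1,
# while user3 follows an unrelated pattern.
user1 = ["H", "H", "W", "W", "H", "W", "H", "W"]
user2 = ["H", "H", "W", "W", "H", "W", "W", "W"]
user3 = ["W", "H", "W", "H", "W", "H", "W", "H"]
```

Ranking candidate friends by this quantity is one way to select the individuals whose mobility data most improves prediction of the target user.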

Discovering, Modeling, and Predicting Task-by-Task Behaviour of Search Engine Users (20 May, 2013)

Speaker: Salvatore Orlando

Users of web search engines are increasingly issuing queries to accomplish their daily tasks (e.g., “finding a recipe”, “booking a flight”, “reading online news”, etc.). In this work, we propose a two-step methodology for discovering latent tasks that users try to perform through search engines. Firstly, we identify user tasks from individual user sessions stored in query logs. In our vision, a user task is a set of possibly non-contiguous queries (within a user search session), which refer to the same need. Secondly, we discover collective tasks by aggregating similar user tasks, possibly performed by distinct users. To discover tasks, we propose to adopt clustering algorithms based on novel query similarity functions, in turn obtained by exploiting specific features, and both unsupervised and supervised learning approaches.  All the proposed solutions were evaluated on a manually-built ground-truth.

Furthermore, we introduce the Task Relation Graph (TGR) as a representation of users' search behaviors from a task-by-task perspective, by exploiting the collective tasks obtained so far. The task-by-task behavior is captured by weighting the edges of TGR with a relatedness score computed between pairs of tasks, as mined from the query log.  We validated our approach on a concrete application, namely a task recommender system, which suggests related tasks to users on the basis of the task predictions derived from the TGR. Finally, we showed that the task recommendations generated by our technique are beyond the reach of existing query suggestion schemes, and that our solution is able to recommend tasks that users will likely perform in the near future. 


Work in collaboration with Claudio Lucchese, Gabriele Tolomei, Raffaele Perego, and Fabrizio Silvestri.



[1] C. Lucchese, S. Orlando, R. Perego, F. Silvestri, G. Tolomei. "Identifying Task-based Sessions in Search Engine Query Logs". Forth ACM Int.l Conference on Web Search and Data Mining (WSDM 2011), Hong Kong, February 9-12, 2011

[2] C. Lucchese, S. Orlando, R. Perego, F. Silvestri, G. Tolomei. "Discovering Tasks from Search Engine Query Logs", To appear on ACM Transactions on Information Systems (TOIS). 

[3] C. Lucchese, S. Orlando, R. Perego, F. Silvestri, G. Tolomei. "Modeling and Predicting the Task-by-Task Behavior of Search Engine Users". To appear in Proc. OAIR 2013, Int.l Conference in the RIAO series.

Personality Computing (13 May, 2013)

Speaker: Alessandro Vinciarelli



Personality is one of the driving factors behind everything we do and experience in life. During the last decade, the computing community has shown an ever increasing interest in this psychological construct, especially when it comes to efforts aimed at making machines socially intelligent, i.e. capable of interacting with people in the same way as people do. This talk will show the work being done in this area at the School of Computing Science. After an introduction to the concept of personality and its main applications, the presentation will illustrate experiments on speech-based automatic perception and recognition. Furthermore, the talk will outline the main issues and challenges still open in the domain.

Fast and Reliable Online Learning to Rank for Information Retrieval (06 May, 2013)

Speaker: Katja Hofmann

Online learning to rank for information retrieval (IR) holds promise for allowing the development of "self-learning search engines" that can automatically adjust to their users. With the large amount of e.g., click data that can be collected in web search settings, such techniques could enable highly scalable ranking optimization. However, feedback obtained from user interactions is noisy, and developing approaches that can learn from this feedback quickly and reliably is a major challenge.


In this talk I will present my recent work, which addresses the challenges posed by learning from natural user interactions. First, I will detail a new method, called Probabilistic Interleave, for inferring user preferences from users' clicks on search results. I show that this method allows unbiased and fine-grained ranker comparison using noisy click data, and that this is the first such method that allows the effective reuse of historical data (i.e., collected for previous comparisons) to infer information about new rankers. Second, I show that Probabilistic Interleave enables new online learning to rank approaches that can reuse historical interaction data to speed up learning by several orders of magnitude, especially under high levels of noise in user feedback. I conclude with an outlook on research directions in online learning to rank for IR, that are opened up by our results.
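The sampling step of probabilistic interleaving can be sketched as below. This is a simplified toy, with documents drawn with probability decaying in rank, and it omits the method's key contribution: the marginalisation over ranker assignments that makes credit assignment unbiased and lets historical data be reused.

```python
import random

random.seed(7)

def rank_probs(ranking, tau=3.0):
    """Probability of drawing each document, proportional to 1 / rank^tau."""
    weights = {doc: 1.0 / (r + 1) ** tau for r, doc in enumerate(ranking)}
    z = sum(weights.values())
    return {doc: w / z for doc, w in weights.items()}

def probabilistic_interleave(ranking_a, ranking_b, length=4):
    """Build an interleaved list, remembering which ranker filled each slot."""
    probs = {0: rank_probs(ranking_a), 1: rank_probs(ranking_b)}
    interleaved, credit = [], []
    while len(interleaved) < length:
        ranker = random.randrange(2)                     # coin-flip the ranker
        docs = [d for d in probs[ranker] if d not in interleaved]
        weights = [probs[ranker][d] for d in docs]
        doc = random.choices(docs, weights=weights)[0]   # sample by rank prob
        interleaved.append(doc)
        credit.append(ranker)
    return interleaved, credit

a = ["d1", "d2", "d3", "d4"]
b = ["d3", "d1", "d4", "d2"]
lst, credit = probabilistic_interleave(a, b)
```

Clicks on the interleaved list would then be credited to rankers via the stored (or, in the full method, marginalised) assignments to compare the two rankers.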

Entity Linking for Semantic Search (29 April, 2013)

Speaker: Edgar Meij

Semantic annotations have recently received renewed interest with the explosive increase in the amount of textual data being produced, the advent of advanced NLP techniques, and the maturing of the web of data. Such annotations hold the promise for improving information retrieval algorithms and applications by providing means to automatically understand the meaning of a piece of text. Indeed, when we look at the level of understanding that is involved in modern-day search engines (on the web or otherwise), we come to the obvious conclusion that there is still a lot of room for improvement. Although some recent advances are pushing the boundaries already, information items are still retrieved and ordered mainly using their textual representation, with little or no knowledge of what they actually mean. In this talk I will present my recent and ongoing work, which addresses the challenges associated with leveraging semantic annotations for intelligent information access. I will introduce a recently proposed method for entity linking and show how it can be applied to several tasks related to semantic search on collections of different types, genres, and origins. 

Flexible models for high-dimensional probability distributions (04 April, 2013)

Speaker: Iain Murray

Statistical modelling often involves representing high-dimensional probability distributions. The textbook baseline methods, such as mixture models (non-parametric Bayesian or not), often don’t use data efficiently. Whereas the machine learning literature has proposed methods, such as Gaussian process density models and undirected neural network models, that are often too computationally expensive to use. Using a few case-studies, I will argue for increased use of flexible autoregressive models as a strong baseline for general use.
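The autoregressive idea, factorising a joint distribution by the chain rule with one tractable conditional per dimension, can be shown on 3-bit vectors with counted, Laplace-smoothed conditionals. Real models of this kind replace the count tables with flexible regressors; the data here is invented.

```python
from collections import Counter

def fit_autoregressive(data):
    """Chain-rule model for 3-bit vectors: p(x) = p(x1) p(x2|x1) p(x3|x1,x2),
    with each conditional estimated by Laplace-smoothed counting."""
    tables = []
    for d in range(3):
        counts = Counter((x[:d], x[d]) for x in data)  # (context, value) counts
        ctx_totals = Counter(x[:d] for x in data)      # context counts
        tables.append((counts, ctx_totals))

    def prob(x):
        p = 1.0
        for d, (counts, ctx_totals) in enumerate(tables):
            p *= (counts[(x[:d], x[d])] + 1) / (ctx_totals[x[:d]] + 2)
        return p
    return prob

# Toy data with strongly dependent bits; the model captures the dependence.
data = [(1, 1, 0)] * 6 + [(0, 0, 1)] * 4
p = fit_autoregressive(data)
```

Because each conditional is individually normalised, the product is a valid joint distribution by construction, which is the property that makes autoregressive models attractive as density estimators.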

Query Classification for a Digital Library (18 March, 2013)

Speaker: Deirdre Lungley

The motivation for our query classification is the insight it gives the digital content provider into what his users are searching for and hence how his collection could be extended. This talk details two query classification methodologies we have implemented as part of the GALATEAS project: one log-based, the other using wikified queries to learn a Labelled LDA model. An analysis of their respective classification errors indicates the method best suited to particular category groups. 

Reusing Historical Interaction Data for Faster Online Learning to Rank for IR (12 March, 2013)

Speaker: Anne Schuth


Online learning to rank for information retrieval (IR) holds promise for allowing the development of "self-learning" search engines that can automatically adjust to their users. With the large amount of e.g., click data that can be collected in web search settings, such techniques could enable highly scalable ranking optimization. However, feedback obtained from user interactions is noisy, and developing approaches that can learn from this feedback quickly and reliably is a major challenge.


In this paper we investigate whether and how previously collected (historical) interaction data can be used to speed up learning in online learning to rank for IR. We devise the first two methods that can utilize historical data (1) to make feedback available during learning more reliable and (2) to preselect candidate ranking functions to be evaluated in interactions with users of the retrieval system. We evaluate both approaches on 9 learning to rank data sets and find that historical data can speed up learning, leading to substantially and significantly higher online performance. In particular, our preselection method proves highly effective at compensating for noise in user feedback. Our results show that historical data can be used to make online learning to rank for IR much more effective than previously possible, especially when feedback is noisy.

Scientific Lenses over Linked Data: Identity Management in the Open PHACTS project (11 March, 2013)

Speaker: Alasdair Gray, University of Manchester


The discovery of new medicines requires pharmacologists to interact with a number of information sources ranging from tabular data to scientific papers, and other specialized formats. The Open PHACTS project, a collaboration of research institutions and major pharmaceutical companies, has developed a linked data platform for integrating multiple pharmacology datasets that form the basis for several drug discovery applications. The functionality offered by the platform has been drawn from a collection of prioritised drug discovery business questions created as part of the Open PHACTS project. Key features of the linked data platform are:

1) Domain specific API making drug discovery linked data available for a diverse range of applications without requiring the application developers to become knowledgeable of semantic web standards such as SPARQL;

2) Just-in-time identity resolution and alignment across datasets enabling a variety of entry points to the data and ultimately to support different integrated views of the data;

3) Centrally cached copies of public datasets to support interactive response times for user-facing applications.


Within complex scientific domains such as pharmacology, operational equivalence between two concepts is often context-, user- and task-specific. Existing linked data integration procedures and equivalence services do not take the context and task of the user into account. We enable users of the Open PHACTS platform to control the notion of operational equivalence by applying scientific lenses over linked data. The scientific lenses vary the links that are activated between the datasets, which affects the data returned to the user.



Alasdair is a researcher in the MyGrid team at the University of Manchester. He is currently working on the Open PHACTS project which is building an Open Pharmacological Space to integrate drug discovery data. Alasdair gained his PhD from Heriot-Watt University, Edinburgh, and then worked as a post-doctoral researcher in the Information Retrieval Group at the University of Glasgow. He has spent the last 10 years working on novel knowledge management projects investigating issues of relating data sets.

Modelling Time & Demographics in Search Logs (01 March, 2013)

Speaker: Milad Shokouhi

Knowing users' context offers a great potential for personalizing web search results or related services such as query suggestion and query completion. Contextual features cover a wide range of signals; query time, user’s location,  search history and demographics can all  be regarded as contextual features that can be used for search personalization.

In this talk, we’ll focus on two main questions:

1) How can we use existing contextual features, in particular time, to improve search results (Shokouhi & Radinsky, SIGIR’12)?

2) How can we infer missing contextual features, in particular user demographics, from search history (Bi et al., WWW 2013)?


Our results confirm that (1) contextual features matter and (2) that many of them can be inferred from search history.

Pre-interaction Identification By Dynamic Grip Classification (28 February, 2013)

Speaker: Faizuddin Mohd Noor

We present a novel authentication method to identify users as they pick up a mobile device. We use a combination of back-of-device capacitive sensing and accelerometer measurements to perform classification, and obtain increased performance compared to previous accelerometer-only approaches. Our initial results suggest that users can be reliably identified during the pick-up movement before interaction commences.

Time-Biased Gain (21 February, 2013)

Speaker: Charles Clarke
Time-biased gain provides a unifying framework for information retrieval evaluation

Time-biased gain provides a unifying framework for information retrieval evaluation, generalizing many traditional effectiveness measures while accommodating aspects of user behavior not captured by these measures. By using time as a basis for calibration against actual user data, time-biased gain can reflect aspects of the search process that directly impact user experience, including document length, near-duplicate documents, and summaries. Unlike traditional measures, which must be arbitrarily normalized for averaging purposes, time-biased gain is reported in meaningful units, such as the total number of relevant documents seen by the user. In work reported at SIGIR 2012, we proposed and validated a closed-form equation for estimating time-biased gain, explored its properties, and compared it to standard approaches. In work reported at CIKM 2012, we used stochastic simulation to numerically approximate time-biased gain, an approach that provides greater flexibility, allowing us to accommodate different types of user behavior and increases the realism of the effectiveness measure. In work reported at HCIR 2012, we extended our stochastic simulation to model the variation between users. In this talk, I will provide an overview of time-biased gain, and outline our ongoing and future work, including extensions to evaluate query suggestion, diversity, and whole-page relevance. This is joint work with Mark Smucker.
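The closed-form measure can be sketched directly from its definition: each document's gain is discounted by the probability that the user is still reading when it is reached, modelled as exponential decay in elapsed time. The half-life of 224 seconds is, to the best of my recollection, the calibration reported by Smucker and Clarke; treat it and the example relevances and reading times as illustrative.

```python
import math

def time_biased_gain(relevances, read_times, half_life=224.0):
    """TBG: gains discounted by the decay D(t) = exp(-t ln 2 / h), where t is
    the time at which each document is reached and h is the half-life (s)."""
    tbg, elapsed = 0.0, 0.0
    for rel, rt in zip(relevances, read_times):
        decay = math.exp(-elapsed * math.log(2) / half_life)
        tbg += rel * decay   # gain in units of expected relevant docs seen
        elapsed += rt        # time advances by this document's reading time
    return tbg

# Three results: relevant, non-relevant, relevant; reading times in seconds.
g = time_biased_gain([1, 0, 1], [30.0, 10.0, 30.0])
```

Because the result is in meaningful units (expected relevant documents seen), no arbitrary normalization is needed before averaging across queries.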

Evaluating Bad Query Abandonment in an Iterative SMS-Based FAQ Retrieval System (14 February, 2013)

Speaker: Edwin Thuma

We investigate how many iterations users are willing to tolerate in an iterative Frequently Asked Question (FAQ) system that provides information on HIV/AIDS. This is part of work in progress that aims to develop an automated Frequently Asked Question system that can be used to provide answers on HIV/AIDS related queries to users in Botswana. Our system engages the user in the question answering process by following an iterative interaction approach in order to avoid giving inappropriate answers to the user. Our findings provide us with an indication of how long users are willing to engage with the system. We subsequently use this to develop a novel evaluation metric to use in future developments of the system. As an additional finding, we show that the previous search experience of the users has a significant effect on their future behaviour.

[IR] Searching the Temporal Web: Challenges and Current Approaches (04 February, 2013)

Speaker: Nattiya Kanhabua

In this talk, we will give a survey of current approaches to searching the temporal web. In such a web collection, the contents are created and/or edited over time; examples are web archives, news archives, blogs, micro-blogs, personal emails and enterprise documents. Unfortunately, traditional IR approaches based on term-matching only can give unsatisfactory results when searching the temporal web. The reason for this is multifold: 1) the collection is strongly time-dependent, i.e., with multiple versions of documents, 2) the contents of documents are about events that happened at particular time periods, 3) the meanings of semantic annotations can change over time, and 4) a query representing an information need can be time-sensitive, a so-called temporal query.

Several major challenges in searching the temporal web will be discussed, namely: 1) How to understand the temporal search intent represented by time-sensitive queries? 2) How to handle the temporal dynamics of queries and documents? 3) How to explicitly model temporal information in retrieval and ranking models? To this end, we will present current approaches to the addressed problems as well as outline directions for future research.
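One common way to model temporal information in ranking, of the kind such surveys cover, is to mix a term-matching score with a temporal-proximity prior. A hypothetical sketch — the function names, the exponential decay and the mixing weight are illustrative assumptions, not a specific published model:

```python
import math

def temporal_score(query_time, doc_time, decay=0.1):
    """Exponential proximity prior: documents whose timestamp is far
    from the query's time of interest are penalised. Times in days."""
    return math.exp(-decay * abs(query_time - doc_time))

def rank(docs, text_scores, query_time, lam=0.7):
    """docs: {doc_id: timestamp_in_days}; text_scores: {doc_id: score}.

    Linearly mixes the term-matching score with temporal proximity;
    lam weights the textual component. Returns doc ids, best first.
    """
    scored = {
        d: lam * text_scores[d] + (1 - lam) * temporal_score(query_time, t)
        for d, t in docs.items()
    }
    return sorted(scored, key=scored.get, reverse=True)
```

The hard problems the talk lists live outside this sketch: inferring `query_time` from a time-sensitive query, and choosing which of a document's many versions and timestamps to score.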

Probabilistic rule-based argumentation for norm-governed learning agents (28 January, 2013)

Speaker: Sebastian Riedel

There is a vast and ever-increasing amount of unstructured textual data at our disposal. The ambiguity, variability and expressivity of language make this data difficult to analyse, mine, search, visualise, and, ultimately, base decisions on. These challenges have motivated efforts to enable machine reading: computers that can read text and convert it into semantic representations, such as the Google Knowledge Graph for general facts, or pathway databases in the biomedical domain. These representations can then be harnessed by machines and humans alike. At the heart of machine reading is relation extraction: reading text to create a semantic network of entities and their relations, such as employeeOf(Person,Company), regulates(Protein,Protein) or causes(Event,Event).

In this talk I will present a series of graphical models and matrix factorisation techniques that can learn to extract relations. I will start by contrasting a fully supervised approach with one that leverages pre-existing semantic knowledge (for example, in the Freebase database) to reduce annotation costs. I will then present ways to extract additional relations that are not yet part of the schema, and for which no pre-existing semantic knowledge is available. I will show that by doing so we can not only extract richer knowledge, but also improve extraction quality for relations within the original schema. This improves over the previous state of the art by more than 10 percentage points in mean average precision.
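The matrix factorisation view treats the data as a sparse binary matrix with entity pairs as rows and relations (both schema relations and textual patterns) as columns, and scores each cell as the dot product of learned embeddings. A toy, self-contained sketch with full binary supervision and logistic loss — real models of this kind train with sampled negatives over much larger schemas, so treat the training loop as illustrative:

```python
import math
import random

def train_factorisation(facts, pairs, relations, dim=8, epochs=300, lr=0.1):
    """facts: set of (entity_pair, relation) tuples observed in text or a KB.

    Learns one embedding per entity pair and per relation so that the
    sigmoid of their dot product approaches 1 for observed facts and 0
    otherwise. Returns a scoring function over (pair, relation)."""
    random.seed(0)
    P = {p: [random.gauss(0, 0.1) for _ in range(dim)] for p in pairs}
    R = {r: [random.gauss(0, 0.1) for _ in range(dim)] for r in relations}
    for _ in range(epochs):
        for p in pairs:
            for r in relations:
                target = 1.0 if (p, r) in facts else 0.0
                s = sum(a * b for a, b in zip(P[p], R[r]))
                g = target - 1.0 / (1.0 + math.exp(-s))  # logistic gradient
                for i in range(dim):
                    P[p][i], R[r][i] = (P[p][i] + lr * g * R[r][i],
                                        R[r][i] + lr * g * P[p][i])
    return lambda p, r: sum(a * b for a, b in zip(P[p], R[r]))
```

Because embeddings are shared across columns, evidence from textual patterns can raise the score of a schema relation that was never directly observed for a pair — which is how such models predict relations outside the original schema.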

IDI Seminar (29 November, 2012)

Speaker: Konstantinos Georgatzis
Efficient Optimisation for Data Visualisation as an Information Retrieval Task

Visualisation of multivariate data sets is often done by mapping data onto a low-dimensional display with nonlinear dimensionality reduction (NLDR) methods. We have introduced a formalism where NLDR for visualisation is treated as an information retrieval task, and a novel NLDR method called the Neighbor Retrieval Visualiser (NeRV), which outperforms previous methods. The remaining concern is that NeRV has quadratic computational complexity in the number of data points. We introduce an efficient learning algorithm for NeRV where relationships between data are approximated through mixture modelling, yielding near-linear computational complexity in the number of data points. The method is much faster to optimise as the number of data points grows, and it maintains good visualisation performance.
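In the information retrieval formalism, the original NeRV cost trades off two Kullback–Leibler divergences between neighbourhood distributions in the data space and the display: one penalises missed true neighbours (recall), the other false neighbours retrieved from the display (precision). A minimal sketch of that quadratic-cost objective, with a fixed Gaussian neighbourhood width as a simplifying assumption:

```python
import math

def neighbour_dist(points, sigma=1.0):
    """Row i gives the probability of picking each other point as a
    neighbour of point i, from a Gaussian falloff over distances."""
    n = len(points)
    rows = []
    for i in range(n):
        w = [0.0 if j == i else
             math.exp(-sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                      / (2 * sigma ** 2))
             for j in range(n)]
        z = sum(w)
        rows.append([x / z for x in w])
    return rows

def nerv_cost(data, embedding, lam=0.5):
    """NeRV objective: lam * KL(p||q) penalises missed neighbours
    (recall); (1 - lam) * KL(q||p) penalises false ones (precision)."""
    P, Q = neighbour_dist(data), neighbour_dist(embedding)
    kl = lambda a, b: sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    return sum(lam * kl(P[i], Q[i]) + (1 - lam) * kl(Q[i], P[i])
               for i in range(len(data)))
```

The quadratic cost is visible in the all-pairs loops; the mixture-modelling speed-up in the talk replaces these pairwise terms with interactions between points and a small number of mixture components.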

Context data in lifelog retrieval (19 November, 2012)

Speaker: Liadh Kelly

Advances in digital technologies for information capture combined with
massive increases in the capacity of digital storage media mean that it is
now possible to capture and store much of one's life experiences in a
personal lifelog. Information can be captured from a myriad of personal
information devices including desktop computers, mobile phones, digital
cameras, and various sensors, including GPS, Bluetooth, and biometric
devices. This talk centers on the investigation of the challenges of
retrieval in this emerging domain and on the examination of the utility of
several implicitly recorded and derived context types in meeting these
challenges. For these investigations unique rich multimodal personal
lifelog collections of 20 months duration are used. These collections
contain all items accessed on subjects' PCs and laptops (email, web pages,
word documents, etc), passively captured images depicting subjects' lives
using the SenseCam device, and
mobile text messages sent and received. Items are annotated with several
rich sources of automatically derived context data types including
biometric data (galvanic skin response, heart rate, etc), geo-location
(captured using GPS data), people present (captured using Bluetooth data),
weather conditions, light status, and several context types related to the
dates and times of accesses to items.


From Search to Adaptive Search (12 November, 2012)

Speaker: Udo Kruschwitz

Modern search engines have been moving away from very simplistic interfaces that aimed at satisfying a user's need with a single-shot query. Interactive features such as query suggestions and faceted search are now integral parts of Web search engines. Generating good query modification suggestions or alternative queries to assist a searcher remains however a challenging issue. Query log analysis is one of the major strands of work in this direction. While much research has been performed on query logs collected on the Web as a whole, query log analysis to enhance search on smaller and more focused collections (such as intranets, digital libraries and local Web sites) has attracted less attention. The talk will look at a number of directions we have explored at the University of Essex in addressing this problem by automatically acquiring continuously updated domain models using query and click logs (as well as other sources).
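A simple instance of a continuously updated domain model built from query logs is a query-flow count: how often searchers followed one query with another within a session. The sketch below is an illustrative assumption about this general approach, not the Essex system itself:

```python
from collections import defaultdict

def build_model(sessions):
    """sessions: list of query lists, one per user session from the log.

    Counts how often query b directly follows query a. Cheap to
    recompute incrementally as new log data arrives, which keeps the
    domain model continuously updated."""
    follows = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for a, b in zip(session, session[1:]):
            if a != b:
                follows[a][b] += 1
    return follows

def suggest(follows, query, k=3):
    """Top-k modification suggestions for a query, by follow frequency."""
    ranked = sorted(follows[query].items(), key=lambda item: -item[1])
    return [q for q, _ in ranked[:k]]
```

On a small, focused collection such as an intranet the log is sparse, which is why click data and other sources are folded into the model alongside raw query transitions.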


Roderick Murray-Smith

Professor Roderick Murray-Smith

Professor (Computing Science)

Research interests: Mobile Human Computer Interaction; Machine Learning; Brain Computer Interaction; Dynamic Systems; Probabilistic Inference

Peter Triantafillou

Professor Peter Triantafillou

Chair of Data Systems Engineering (School of Computing Science)

Research interests: Big Data Management Systems: NoSQL systems, index-based and Hadoop-style massively parallel data access, statistical structures and query processing and optimization; Cloud Infrastructures; Crowdsourcing to improve Data System Internals and Information/Knowledge Acquisition.

Iadh Ounis

Professor Iadh Ounis

Professor of Information Retrieval (Computing Science)

Research interests: Web and enterprise search engines; Large-scale information retrieval systems; Opinion finding and results diversification; Searching and mining within electronic health records; Social media retrieval (blog, twitter, news, etc).

Joemon Jose

Professor Joemon Jose

Professor of Information Retrieval (Computing Science)

Research interests: Adaptive and personalized search systems; Multimodal interaction for information retrieval; Emotion based Search and browsing; Temporal Information Retrieval; Search result diversification and aggregation; Recommendation and collaborative filtering

Paul Siebert

Dr Paul Siebert

Reader (Computing Science)

Simon Rogers

Dr Simon Rogers

Senior Lecturer (Computing Science)

Dr John Williamson

Lecturer (School of Computing Science)

Dr Ke Yuan

Lecturer in Machine Learning in Computational Biology (Computing Science)

Dr Craig MacDonald

Lord Kelvin Adam Smith Fellow in Sensor Systems (School of Computing Science)

Nikos Ntarmos

Dr Nikos Ntarmos

Lord Kelvin Adam Smith Fellow (School of Computing Science)

Research interests: data management, distributed systems, big data, indexing, query processing, NoSQL, graph data, scale out, scale up, databases, systems

Christos Anagnostopoulos

Dr Christos Anagnostopoulos

Big Data Research Fellow (Computing Science)

Research interests: Large-scale Mobile and Distributed Computing Systems, Machine and Statistical Learning, Stochastic Optimization

Dr Bjorn Jensen

Lecturer in Applied Machine Learning (Computing Science)