The explosion of digital media content is driving the need for more effective solutions for managing and searching large repositories of images, video and audio. Media enterprises rely on costly and time-consuming manual indexing of content, which typically produces inconsistent and inadequate results. At the same time, users increasingly expect content to be searchable. Today’s manual annotation processes cannot satisfy these demands, so digital media indexing needs to be automated to unlock the value of the repositories. To address this problem, IBM Research is developing a novel solution called Marvel.
The Marvel multimedia analysis and retrieval system uses statistical machine learning techniques to build models of semantic concepts in image and video content, including events, objects, people, places, scenes, and topics. Marvel uses the models to automatically index new content, which allows users to search repositories that have not been manually indexed. An essential first step in Marvel is the manual annotation of a representative subset of content for training purposes. This annotated content is then processed using statistical machine learning techniques to build the models used by the indexing system.
IBM’s Efficient Video Annotation (EVA) System allows users to rapidly index the semantic concepts in large video datasets by clicking on positive and negative examples
IBM’s Efficient Video Annotation (EVA) System, which is being developed by the Intelligent Information Management Department, is a novel Web tool designed to support distributed collaborative indexing of semantic concepts in large image and video collections. The EVA Web-based user interface is easy to use: annotators simply click on positive and negative examples of semantic concepts in the content shown on the screen (see the figure above). Users can label image and video content rapidly with either the mouse or the keyboard. The EVA system also includes functions that let users set an entire page to positive or negative, which can greatly speed up indexing of both rare and very popular concepts. Annotators can also customize the EVA screen layout and can work on one or more semantic concepts at a time as they go through a large image or video data set.
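The page-level labeling function described above can be illustrated with a short Python sketch. It is a minimal illustration only, not EVA's implementation: the function name `label_page`, its parameters, and the shot identifiers are all hypothetical. The idea is that the annotator sets a default label for the whole page and then clicks only on the exceptions.

```python
def label_page(shot_ids, page_default, exceptions=()):
    """Label every shot on a page (hypothetical helper, not part of EVA).

    shot_ids:     identifiers of the key-frames shown on the page
    page_default: label applied to the whole page (True = positive)
    exceptions:   shots the annotator clicked to flip away from the default
    """
    flipped = set(exceptions)
    unknown = flipped - set(shot_ids)
    if unknown:
        raise ValueError(f"exceptions not on this page: {sorted(unknown)}")
    # Each shot gets the page default unless the annotator flipped it.
    return {
        shot: (not page_default) if shot in flipped else page_default
        for shot in shot_ids
    }


# A rare concept: set the page to negative, then click the single positive.
labels = label_page(["shot1", "shot2", "shot3"], page_default=False,
                    exceptions=["shot2"])
```

With this scheme the number of clicks per page equals the number of exceptions, which is why a page default helps most when a concept is either very rare or very common.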
The EVA system was recently used in the TRECVID Annotation Forum, which developed a large corpus of training and evaluation data for the annual TREC Video Evaluation Benchmark. More than 100 participants in the forum used the EVA system to label 39 semantic concepts in 80 hours of video. The annotated data set was then made available to participating institutions for use in developing systems that automatically index new video content. It provided a common ground truth for systems in the TRECVID high-level feature task, which involved the detection of 10 of these semantic concepts in a new large video data set. The remainder of the concepts, along with the 10 from the high-level feature task, were made available for use in the TRECVID search task for answering the benchmark query topics.
IBM’s EVA System allows users to zoom in on each key-frame image as they rapidly navigate through the video data set
Besides producing high-quality image and video training data, an important goal of the EVA project is to investigate research issues related to image understanding, the subjectivity of image indexing, and human-machine interaction. The use of the EVA System for the TRECVID annotation effort is providing tremendous opportunities for studying these issues.
The primary design goals for the EVA system were to allow rapid indexing of semantic concepts by end users, to provide basic administrative functions, and to allow full metering of the collaborative annotation process. The administrative functions allow the creation of user groups and accounts and the dynamic allocation of workload. In addition, the EVA system collects user data during the annotation process, including the time spent on each page, the number and size of thumbnails, and statistics about keyboard and mouse usage. Metering the annotation process has provided valuable feedback, not only for improving the EVA system but also for improving the quality of the annotations produced in the first large annotation effort for TRECVID.
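The metering data listed above could be recorded one page view at a time and then summarized per annotator. The following Python sketch is illustrative only: the `PageLog` record, its field names, and the `summarize` function are hypothetical and are not taken from the EVA implementation.

```python
from dataclasses import dataclass


@dataclass
class PageLog:
    """One metered page view (hypothetical record layout)."""
    user: str
    seconds_on_page: float
    thumbnails: int      # number of thumbnails shown on the page
    thumbnail_px: int    # thumbnail size chosen by the annotator
    key_presses: int
    mouse_clicks: int


def summarize(logs):
    """Aggregate page logs into per-user throughput and input statistics."""
    totals = {}
    for log in logs:
        t = totals.setdefault(
            log.user, {"seconds": 0.0, "thumbnails": 0, "keys": 0, "clicks": 0}
        )
        t["seconds"] += log.seconds_on_page
        t["thumbnails"] += log.thumbnails
        t["keys"] += log.key_presses
        t["clicks"] += log.mouse_clicks

    summary = {}
    for user, t in totals.items():
        inputs = t["keys"] + t["clicks"]
        summary[user] = {
            # Thumbnails reviewed per minute of page time.
            "thumbnails_per_minute": (
                60.0 * t["thumbnails"] / t["seconds"] if t["seconds"] > 0 else 0.0
            ),
            # Fraction of input events that came from the keyboard.
            "keyboard_share": t["keys"] / inputs if inputs else 0.0,
        }
    return summary


logs = [
    PageLog("annotator1", 30.0, 20, 120, 5, 15),
    PageLog("annotator1", 30.0, 40, 120, 15, 5),
]
report = summarize(logs)
```

Statistics of this kind are one way an administrator might compare screen layouts or input methods, although the source does not say which measures were actually derived from the collected data.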
The performance of semantic concept modeling and retrieval systems, such as Marvel, depends greatly on the quality of the training data. False positive and false negative errors in the training annotations can therefore adversely affect performance. However, arriving at high-quality annotations for large image and video collections is a daunting task: it is inherently time consuming, subjective and error prone. Ideally, redundant annotations should be obtained when possible to resolve mistakes and differences due to subjectivity. The added overhead of redundant annotations must be considered as a trade-off, though, since it requires greater overall human effort and can slow down the indexing process. The EVA system allows great flexibility in configuring the redundancy factor individually for each semantic concept, for example to match it to the popularity of the concept. This can have a great impact on the overall performance of modeling and detecting the semantic concepts in new video content.
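As an illustration of how a per-concept redundancy factor might be applied, the Python sketch below resolves redundant labels by majority vote. This is an assumption made for the example: the source does not state how EVA reconciles disagreeing annotators, and the concept names, redundancy values, and the `resolve_labels` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical redundancy factors: labels required per shot for each concept.
REDUNDANCY = {"face": 1, "explosion": 3}


def resolve_labels(annotations, redundancy, default_factor=1):
    """Resolve redundant annotations by majority vote (assumed strategy).

    annotations: iterable of (shot_id, concept, annotator, is_positive)
    Returns {(shot_id, concept): True | False | None}, where None marks a
    tie that needs review. Pairs with fewer labels than the concept's
    redundancy factor are left out because they are not yet complete.
    """
    votes = defaultdict(list)
    for shot_id, concept, _annotator, is_positive in annotations:
        votes[(shot_id, concept)].append(bool(is_positive))

    resolved = {}
    for (shot_id, concept), labels in votes.items():
        if len(labels) < redundancy.get(concept, default_factor):
            continue  # still waiting for more annotators
        positives = sum(labels)
        negatives = len(labels) - positives
        if positives == negatives:
            resolved[(shot_id, concept)] = None
        else:
            resolved[(shot_id, concept)] = positives > negatives
    return resolved


annotations = [
    ("shot1", "explosion", "ann_a", True),
    ("shot1", "explosion", "ann_b", True),
    ("shot1", "explosion", "ann_c", False),
    ("shot2", "explosion", "ann_a", True),   # only 1 of 3 required labels
    ("shot1", "face", "ann_a", False),
]
resolved = resolve_labels(annotations, REDUNDANCY)
```

A higher factor for a concept means more human effort per shot in exchange for a better chance of catching mistakes, which is the trade-off described above.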
IBM’s EVA System provides administrator views that report the progress of users in completing an annotation workload in terms of number of concepts and amount of video data completed
Related Publications
Timo Volkmer, John R. Smith, Apostol (Paul) Natsev, Murray Campbell and Milind Naphade. A Web-based System for Collaborative Annotation of Large Image and Video Collections. In Proceedings of the 13th annual ACM international conference on Multimedia. November 2005.






