ESPS team members are involved in several real-world pilot projects that stress different requirements of stream-processing applications.
Market Data Processing : Financial Market (FM) customers are in an arms race to process increasingly large amounts of market data with ever shorter latencies and ever more sophisticated in-line analytics. This goal can only be achieved by keeping the processing path from information receipt to transmitted result as short and fast as possible. Most FM organizations build proprietary solutions in which streams of information, such as market data, are processed and correlated through the minimum necessary sequence of stages to produce the required results. System S not only supports this pipelined approach to processing market data, but also provides a holistic platform that supports scalable, distributed stream processing with automated resource allocation, seamless leveraging of advanced hardware, failure resiliency, programming abstractions and tooling for high-performance streaming applications.
Mining Astronomical Data : The primary goal of this pilot project is to enable the discovery of new celestial objects through real-time exploration of the massive volume of signals that the next generation of software radio telescope arrays captures as high-bitrate data streams. We define signal exploration as the detection of novel statistical patterns that are dynamically “learnt” as being of interest. Even under this restricted definition, signal exploration is hard to formulate rigorously. We thus partition the problem space into two broad research domains. The first domain deals with the design of approximate and distributed real-time signal processing algorithms for specific, well-posed signal detection and tracking problems. Cosmic ray shower detection is one example of a well-posed signal detection problem, while tracking celestial objects with known signal characteristics is an example of a well-posed signal tracking problem. The second domain pertains to the discovery of novel and meaningful statistical patterns in the signal. This problem is first tackled under the assumption of unlimited available resources and no real-time constraints, and these constraints are then added progressively. Solving these problems involves determining the set of signal-processing stream operators that produces the most accurate results given the available resources. Research problems in this project span a wide range, from the design of approximate analytic operators for this domain, to the associated algebra for operator composition and optimization, to high-performance data transport and flow optimizations that can handle the high data rates.
Manufacturing Process Control and Monitoring : In the semiconductor manufacturing environment, several different distributed data monitors gather information (data streams) about different steps in the process, tool operation, wafer defects, test results, etc. The streams of data collected in such environments are extremely heterogeneous, both in type - ranging from structured sensor measurements (e.g. temperature, pressure, chemical composition) to unstructured wafer images - and in temporal granularity, data rates, etc. Significant improvements in the performance of the manufacturing process may be achieved by analyzing the available streams with statistical techniques and using the results to drive process control. We propose to perform multivariate analysis across several of these data streams to identify cross-tool, cross-step and cross-process dependencies that the limited analysis of current statistical process control (SPC) techniques cannot capture. To deal with the volumes of data, we propose to use hierarchical and incremental statistical analysis techniques. In our proposed approach, small groups of data streams are first statistically analyzed to create intermediate summary statistics (each of which may be independently used for analysis and process control). These summary statistics are then aggregated across several data streams and analyzed using other (potentially similar) statistical techniques to obtain more comprehensive results. Computation savings arise from this hierarchical evaluation, in which results are reused across different time-scales and groups of data streams, allowing large-scale analysis across many different tools, data modalities, steps and processes. At the same time, to support real-time operation, the intermediate results included in these summary statistics may also be used individually to drive process control incrementally. Refinements to any control decision made by this independent analysis may be provided after the hierarchical analysis.
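The hierarchical, incremental evaluation described above can be illustrated with a minimal Python sketch. It is not the project's implementation: it assumes, purely for illustration, that the summary statistics are a per-stream count, mean and variance, and the names Summary, update, merge and summarize are hypothetical. Each stream is summarized incrementally (Welford's update), and summaries are then combined pairwise (the formula of Chan et al.) without revisiting the raw samples, which is where the reuse across groups of streams comes from.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Summary:
    """Mergeable summary of a stream: count, mean, sum of squared deviations.

    Hypothetical stand-in for the richer (multivariate) statistics that a
    real deployment would use.
    """
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        # Welford's incremental update: one sample at a time, O(1) state.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "Summary") -> "Summary":
        # Pairwise combination of two summaries; raw samples are not needed.
        n = self.n + other.n
        if n == 0:
            return Summary()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Summary(n, mean, m2)

    @property
    def variance(self) -> float:
        # Population variance of everything summarized so far.
        return self.m2 / self.n if self.n else 0.0


def summarize(values: Iterable[float]) -> Summary:
    """Build a first-level summary for one stream (or one time window)."""
    s = Summary()
    for v in values:
        s.update(v)
    return s


# First level: one summary per (hypothetical) tool sensor stream.
tool_a = summarize([1.0, 2.0, 3.0])
tool_b = summarize([4.0, 5.0, 6.0, 7.0])

# Second level: aggregate the summaries instead of re-reading the samples.
combined = tool_a.merge(tool_b)
```

In this sketch, tool_a and tool_b can each drive a local control decision as soon as they are updated, while combined supplies the broader view that may later refine that decision.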
Building an Expert Network : This project is the basis for the Atlas for Lotus Connections offering from IBM. It is a social networking application that allows users to visualize their current network of contacts and see how they can extend that network to tap into valuable resources and trusted experts across an entire organization. By mining information from the different components of Lotus Connections, Atlas compiles and displays information that helps people better understand professional networks and whom they can reach out to for information. The different components of Atlas allow users to: visualize and analyze social networks in an organization, identify the shortest path to reach someone, find expertise across extended networks, and visualize and manage their personal networks. Here again, participating personnel's email, chat and other streams of data are mined continuously, and the information in the social network is enriched and kept current.
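As an illustration of the "shortest path to reach someone" capability, the following Python sketch runs a breadth-first search over an unweighted contact graph. How Atlas actually computes paths is not described here, so the algorithm choice, the adjacency-list representation, the function name shortest_path and the sample names are all assumptions made for the example.

```python
from collections import deque
from typing import Dict, List, Optional


def shortest_path(graph: Dict[str, List[str]],
                  start: str, goal: str) -> Optional[List[str]]:
    """Return a shortest chain of contacts from start to goal, or None.

    graph maps each person to the people they can reach directly
    (hypothetical representation; edges are treated as unweighted).
    """
    if start == goal:
        return [start]
    parents: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for contact in graph.get(person, ()):
            if contact in parents:
                continue  # already reached by a path that is no longer
            parents[contact] = person
            if contact == goal:
                # Walk the parent links back to start, then reverse.
                path = [contact]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(contact)
    return None  # goal is not reachable from start


# Hypothetical contact graph.
contacts = {
    "ana": ["bo", "cy"],
    "bo": ["dee"],
    "cy": ["dee"],
    "dee": ["eli"],
}
route = shortest_path(contacts, "ana", "eli")
```

Breadth-first search explores contacts in order of distance, so the first time the goal is reached the path found has the fewest intermediaries. A continuously updated network would additionally need the graph to be refreshed as new email and chat data arrive, which this sketch does not cover.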
