Annotated Publications

Recent, Query Processing and Optimization, Autonomic and Grid Computing, Data Cleansing and Data Analysis, Miscellaneous

Recent

  How to Barter Bits for Chronons: Compression and Bandwidth Trade Offs for Database Scans  (A. Holloway, V. Raman, G. Swart, D. DeWitt). ACM SIGMOD Intl. Conf. on Management of Data (SIGMOD), 2007. To appear.
  Lazy, Adaptive RID-List Intersection, and its application to Index Anding  (V. Raman, L. Qiao, W. Han, I. Narang, Y. Chen, A. Yang, and F. Lin). ACM Intl. Conf. on Management of Data (SIGMOD), 2007. To appear.
  How to Wring a Table Dry: Entropy Compression of Relations and Querying of Compressed Relations (slides)  (V. Raman, G. Swart). Intl. Conf. on Very Large Data Bases (VLDB), 2006. Describes a method that compresses relations down to their entropy, achieving compression factors to 8 to 40x, and techniques for applying equality and range predicates on compressed data.

Grid Computing and Autonomic Computing
  Automatic Query Parallelization with Non-Dedicated Computers: An Evaluation of Adaptivity Options  (N. W. Paton, V. Raman, G. Swart, I. Narang). Intl. Conf. on Autonomic Computing (ICAC), 2006. Experimental comparison of various ways of adapting to bursty loads during parallel query processing.
  Automatic Statistics Collection in Action  (P. J. Haas, M. Kandil, A. Lerner, V. Markl, I. Popianov, V. Raman, D. C. Zilio). ACM SIGMOD Intl. Conf on Management of Data (SIGMOD), 2005. Demonstration of the automatic statistics collection feature of DB2 UDB.
  Autonomic Index Evolution  (V. Raman, K. Ranganathan). Unpublished Manuscript, 2005. This paper describes a microeconomic mechanism for an index structure that evolves and restructures itself based on suppy and demand: demand for search requests and supply of indexer nodes. Work in Progress.
  System and method for asynchronous data replication without persistence for distributed computing  (B. Lindsay, I. Narang, V. Raman). US Patent Application 20050044088, 2005.
  Parallel Querying with Non-Dedicated Computers.  (V. Raman, W. Han, and I. Narang). International Conference on Very Large Data Bases (VLDB), Trondheim, Norway, 2005. This paper describes a new method of parallel query processing called DITN, which is based on dynamic outsourcing of join processing tasks to non-dedicated, heterogeneous computers. In DITN, partitioning is not the means of parallelism. Data layout decisions are taken outside the scope of the DBMS, and handled within the storage software; query processors see a Data In The Network image. DITN also introduces an intra-fragment parallelism where each node executes an independent select-project-join- aggregate-group by block, with no tuple exchange between nodes. The main advantage of DITN is gradual scaleout as the workload grows, by using non-dedicated computers. Furthermore, DITN cleanly handles heterogeneous nodes, and well adapts during execution to node failures or load spikes. The paper has detailed algorithms and performance measurements.
  QOS-based Data Access and Placement for Federated Information Systems.  (W. Li, V. Batra, V. Raman, W. Han, and I. Narang). International Conf. on Very Large Data Bases (VLDB), Norway, 2005.
  Load and Network Aware Query Routing for Information Integration  (W. Li, V. Batra, V. Raman, W. Han, K. Candan, and I. Narang). IEEE International Conf. on Data Engineering (ICDE), Tokyo, 2005.
  Tesla: an On Demand Information System  (V. Raman and I. Narang). IBM Data Management Potpourri Workshop, Toronto, 2004.
  Towards an Information Infrastructure for the Grid  (S. Bourbonnais, V. M. Gogate, L. M. Haas, R. W. Horman, S. Malaika, I. Narang, and V. Raman). IBM Systems Journal, 43(4), 2004. In this paper we present our vision of an information infrastructure for grid computing, which is based on a service-oriented architecture. The infrastructure supports a virtualized view of the computing and data resources, is autonomic (driven by policies) in order to meet application goals for quality of service, and is compatible with the standards being developed in the technical community. We describe how we are implementing this vision in IBM today and how we expect the implementation to evolve in the future.
  Automatic Statistics Collection in DB2 UDB  (A. Aboulnaga, P. J. Haas, S. Lightstone, G. M. Lohman, V. Markl, I. Popivanov, and V. Raman). International Conference on Very Large Data Bases (VLDB), Toronto, 2004. Describes the automatic statistics collection feature of DB2 UDB. This manual has more details.
  Services for Data Access and Processing on Grids  (V. Raman, I. Narang, C. Crone, L. Haas, S. Malaika, T. Mukai, D. Wolfson, and C. Baru). Database Access and Integration Services Working Group, Global Grid Forum (GGF) 5, 2002

Query Processing and Optimization

  POP/FED: Progressive Query Optimization for Federated Queries in DB2  (H. Kache, W. Han, V. Markl, V. Raman, and S. Ewen). International Conf. on Very Large Data Bases (VLDB), 2006. Demonstration of progressive optimization in Information Integrator.
  Adaptive Query Processing: Why, how, when, what next  (A. Deshpande, J. M. Hellerstein, V. Raman). ACM SIGMOD Intl Conf on Management of Data (SIGMOD), 2006. Tutorial.
  Progressive Query Optimization for Federated Queries  (S. Ewen, H. Kache, V. Markl, and V. Raman). Intl. Conf. on Extending Data Base Technology (EDBT), 2006.
  Adaptive Query Processing  (A. Deshpande, V. Raman). Intl Conf on Management of Data (COMAD), 2005. Tutorial.
  Determining validity ranges of query plans based on suboptimality.  (V. Markl and V. Raman). US Patent Application 20050267866, 2005.
  System, method, and computer program product for progressive query processing  (G. Lohman, V. Markl, H. Pirahesh, V. Raman, and D. Simmen). US Patent Application 20050097078, 2005.
  Progressive Optimization in Action  (V. Raman, V. Markl, D. Simmen, G. M. Lohman, and H. Pirahesh). International Conf. on Very Large Data Bases (VLDB), Toronto, 2004. Demonstration of the progressive optimization software prototype.
  Robust Query Processing through Progressive Optimization (acm)  (V. Markl, V. Raman, D. Simmen, G. M. Lohman, and H. Pirahesh). ACM SIGMOD International Conf. on Management of Data (SIGMOD), Paris, 2004. Relational DBMSs choose query execution strategies using an optimizer which relies heavily on accurate statistics about the data distribution, especially cardinalities. These cardinalities can be wrongly estimated due to many reasons: inaccurate statistics, data correlations, parameter markers, and so on. The problem is that such wrong estimates can cause the RDBMS to choose a poor execution strategy, that takes much longer to execute than the ideal strategy. This paper describes a new approach called Progressive Query Optimization (POP), which can automatically detect and recover from cardinality estimation errors. POP instruments query execution with checkpoint operators, validates cardinality estimates against actual values, and dynamically switches query plans mid-flight if necessary. The paper details the POP algorithms, implementation, and an experimental study on real workloads showing up to 2 orders of magnitude speedup.
  LEO: An Autonomic Query Optimizer for DB2  (G. Lohman, V. Markl, and V. Raman). IBM Systems Journal, 42(1), 2003. Overview of the LEO project.
  Using State Modules for Adaptive Query Processing (ieee)  (V. Raman, A. V. Deshpande, and J. M. Hellerstein). International Conference on Data Engineering (ICDE), Bangalore, March 2003. We present a query architecture in which join operators are decomposed into their constituent data structures (State Modules, or SteMs), and data¤ow among these SteMs is managed adaptively by an Eddy routing operator. Breaking the encapsulation of joins serves two purposes. First, it allows the Eddy to observe multiple physical operations embedded in a join algorithm, allowing for better calibration and control of these operations. Second, the SteM on a relation serves as a shared materialization point, enabling multiple competing access methods to share results, which can be leveraged by multiple competing join algorithms. Our architecture extends prior work signi£cantly, allowing continuously adaptive decisions for most major aspects of traditional query optimization: choice of access methods and join algorithms, ordering of operators, and choice of a query spanning tree. SteMs introduce signi£cant routing flexibility to the Eddy, enabling more opportunities for adaptation, but also introducing the possibility of incorrect query results. We present constraints on Eddy routing through SteMs that ensure correctness while preserving a great deal of flexibility. We also demonstrate the bene£ts of our architecture via experiments in the Telegraph data¤ow system. We show that even a simple routing policy allows signi£cant flexibility in adaptation, including novel effects like the automatic “hybridization” of multiple algorithms for a single join.
  TelegraphCQ: Continuous Dataflow Processing for an Uncertain World  (S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, and M. Shah). Conference on Innovative Data Systems Research (CIDR), Asilomar, January 2003. Describes stream query processing in the Telegraph dataflow system.
  Continuously Adaptive Continuous Query Processing  (S. Madden, M. Shah, J. M. Hellerstein, and V. Raman). ACM SIGMOD International Conf. on Management of Data, 2002. We present a continuously adaptive, continuous query (CACQ) implementation based on the eddy query processing framework. We show that our design provides significant performance benefits over existing approaches to evaluating continuous queries, not only because of its adaptivity, but also because of the aggressive crossquery sharing of work and space that it enables. By breaking the abstraction of shared relational algebra expressions, our Telegraph CACQ implementation is able to share physical operators – both selections and join state – at a very fine grain. We augment these features with a grouped-filter index to simultaneously evaluate multiple selection predicates. We include measurements of the performance of our core system, along with a comparison to existing continuous query approaches.
  Partial Results for Interactive Query Processing  (V.Raman and J. M. Hellerstein). ACM SIGMOD International Conf. on Management of Data, 2002. Traditional query processors generate full, accurate query results, either in batch or in pipelined fashion. We argue that this strict model is too rigid for exploratory queries over diverse and distributed data sources, such as sources on the Internet. Instead, we propose a looser model of querying in which a user submits a broad initial query outline, and the system continually generates partial result tuples that may contain values for only some of the output fields. The user can watch these partial results accumulate at the user interface, and accordingly refine the query by specifying their interest in different kinds of partial results. After describing our querying model and user interface, we present a query processing architecture for this model which is implemented in the Telegraph dataflow system. Our architecture ...
  Interactive Query Processing  (Vijayshankar Raman). PhD disseration, University of California at Berkeley, December 2001.
  Online Dynamic Reordering (acm)  (V. Raman, B. Raman, and J. M. Hellerstein). The VLDB Journal, 9(3), 2000.
  Informix Under CONTROL: Online Query Processing   (J. M. Hellerstein, R. Avnur, and V. Raman). Data Mining and Knowledge Discovery Journal, 4(4), October 2000. Describes the implementation of interactive query processing features inside a prototype version of the Informix Dynamic Server.
  Adaptive Query Processing: Technology in Evolution   (J. M. Hellerstein, M. Franklin, S. Chandrasekaran, A. Deshpande, K. Hildrum, S. Madden, V. Raman, and M. Shah). IEEE Data Engineering Bulletin, 23(2), June 2000.
  Online Dynamic Reordering for Interactive Data Processing (pdf)   (V. Raman, B. Raman, and J. M. Hellerstein). Proc. International Conf. on Very Large Data Bases (VLDB), Edinburgh, September 1999.Selected as one of best papers.

Data Analysis, Data Cleansing and Transformation

  Potter's Wheel: An Interactive Framework for Data Cleaning and Transformation (pdf)  (V. Raman and J. M. Hellerstein). Proc. International Conf. on Very Large Data Bases (VLDB), Rome, 2001. This paper describes the engineering and algorithms behind the Potter's Wheel data cleansing software. One really nifty idea here is called "super split" -- it learns the best way to split a structured file into columns using minimum desciption length (MDL) method. Another neat idea is the user interface, that uses a spreadsheet to tightly integrate transformation and anomaly detection -- users specify transformations on a spreadsheet by showing their effect on example values; concurrently, the system mines the data for anomalies and report them to the user, who can then fix them with further transformations. This UI has influenced the design of the Cohera product, later bought by People Soft and now Oracle.
  Interactive Data Analysis   (V. Raman).CASCON Workshop on Databases for Decision Support, Toronto, Dec 1999.
  Interactive Data Analysis: The CONTROL Project  ( R. Avnur, A. Chou, J.M. Hellerstein, C. Hidber, C. Olston, V. Raman, T. Roth, and P. J. Haas). IEEE Computer, 32(8), August 1999. Data analysis is fundamentally an iterative process in which you issue a query, receive a response, formulate the next query based on the response, and repeat. You usually don't issue a single, perfectly chosen query and get the information you want from a database; indeed, the purpose of data analysis is to extract unknown information, and in most situations, there is no one perfect query. People naturally start by asking broad, big-picture questions and then continually refine their questions based on feedback and domain knowledge. In the Control (Continuous Output and Navigation Technology with Refinement Online) project at the University of California, Berkeley, the authors are working with collaborators at IBM, Informix, and elsewhere to explore ways to improve human-computer interaction during data analysis. The Control project's goal is to develop interactive, intuitive techniques for analyzing massive data sets.
  Scalable Spreadsheets for Interactive Data Analysis  (V. Raman, A. Chou, and J. M. Hellerstein). Proc ACM-SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), Philadelphia, May 1999. Describes ABC, the precursor of Potter's Wheel. ABC was a scalable spreadsheet for data analysis that combines exploration, grouping, and aggregation. The idea was that spreadsheets are more natural to use than DBMSs, and so we should scale up spreadsheets so that they can handle a wider range of applications. It aimed to provide interactive responses in place of long delays, and support intuitive, direct manipulation operations in place of complex queries. The paper has a bunch of nice high-level ideas but not too many specifics -- the Potter's Wheel paper has some more details, but much of the challenges remain for future work.
  CONTROL: Continuous Output and Navigation Technology with Refinement On-Line  (R. Avnur, J.M. Hellerstein, B. Lo, C. Olston, B. Raman, V. Raman, T. Roth, and K. Wylie). Proc. ACM-SIGMOD International Conference on Management of Data, Seattle, 1998. Demonstration of Online Reordering and Ripple joins in a prototype of the Informix Dynamic Server DBMS.

Miscellaneous -- Indexing, Clustering, Networks, ...

  Locality Preserving Dictionaries: Theory and Application to Clustering in Databases  (V. Raman). Proc. ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), Philadelphia, 1999. Describes a new index structure called a Locality Preserving Dictionaries (LPD) -- one where all data items within a range lie together, within a space that is a small function of the number of items in the range. The idea is to partition the memory space and place items in sorted order, with judiciously placed gaps between them, so that insert, delete, and search take logarithmic time (amortized). The paper gives algorithms and analysis that is backed by no code -- I hope to implement these at some point!
  Permutation routing in wavelength-routed wrapped-around shuffle networks using fewer wavelengths  (G. Mohan, C. Siva Ram Murthy, and R. Vijayshankar).Computer Networks and ISDN Systems, 30(24), December 1998.

IBM publications

Wei Han, Vijayshankar Raman and Inderpal S. Narang. Parallel Querying with Non-Dedicated Computers. VLDB 2005 -- International Conference on Very Large Data Bases. May 2005.

Mokhtar Kandil, Alberto Lerner, Volker G. Markl, Ivan Popivanov, Vijayshankar Raman, Danny Zilio and Peter J. Haas. Automated Statistics Collection in Action. SIGMOD 2005. February 2005.

Inderpal S. Narang, Wen-Syan Li and Vijayshankar Raman. Tesla: An On Demand Information System. IBM Data Management Potpourri Workshop. IBM, August 2004.

Vijayshankar Raman, Volker G. Markl, David E. Simmen, Guy M. Lohman and Mir Hamid Pirahesh. Progressive Optimization in Action. VLDB 2004. June 2004.

Ashraf I. Aboulnaga, Peter J. Haas, Sam Lightstone, Guy M. Lohman, Volker G. Markl, Ivan Popivanov and Vijayshankar Raman. Automated Statistics Collection in DB2 Stinger. VLDB 2004. June 2004.

Serge Bourbonnais, Vitthal M. Gogate, Laura Haas, Randy Horman, Susan Malaika, Inderpal S. Narang and Vijayshankar Raman. Towards an Information Infrastructure for the Gird. IBM Systems Journal 43(4):665-88, June 2004.