Recent,
Query Processing and Optimization,
Autonomic and Grid Computing,
Data Cleansing and Data Analysis,
Miscellaneous
Recent
| |
How to Barter Bits for Chronons: Compression and Bandwidth Trade Offs for Database Scans (A. Holloway, V. Raman, G. Swart, D. DeWitt). ACM SIGMOD Intl. Conf. on Management of Data (SIGMOD), 2007. To appear. |
| |
Lazy, Adaptive RID-List Intersection, and its application to
Index Anding (V. Raman, L. Qiao, W. Han, I. Narang, Y. Chen, A. Yang, and F. Lin).
ACM Intl. Conf. on Management of Data (SIGMOD), 2007. To appear.
|
Grid Computing and Autonomic Computing
| |
Automatic Statistics Collection in Action (P. J. Haas, M. Kandil, A. Lerner, V. Markl, I. Popianov, V. Raman, D. C. Zilio). ACM SIGMOD Intl. Conf on Management of Data (SIGMOD), 2005. Demonstration of the automatic statistics collection feature of DB2 UDB. |
| | Autonomic Index Evolution (V. Raman, K. Ranganathan). Unpublished Manuscript, 2005. This paper describes a microeconomic mechanism for an index structure that evolves and restructures itself based on suppy and demand: demand for search requests and supply of indexer nodes. Work in Progress. |
| | Parallel Querying with Non-Dedicated Computers. (V. Raman, W. Han, and I. Narang). International Conference on Very Large Data Bases (VLDB), Trondheim, Norway, 2005. This paper describes a new method of parallel query processing called DITN, which is based on dynamic
outsourcing of join processing tasks to non-dedicated, heterogeneous computers. In DITN, partitioning is not the means of parallelism. Data layout decisions are taken outside the scope of the DBMS, and handled within the storage software; query processors see a Data In The Network image. DITN also introduces an intra-fragment parallelism where each node executes an independent select-project-join- aggregate-group by block, with no tuple exchange between nodes. The main advantage of DITN is gradual scaleout as the workload grows, by using non-dedicated computers. Furthermore, DITN cleanly handles heterogeneous nodes, and well adapts during execution to node failures or load spikes. The paper has detailed algorithms and performance measurements. |
| |
Towards an Information Infrastructure for the Grid (S. Bourbonnais, V. M. Gogate, L. M. Haas, R. W. Horman, S. Malaika, I. Narang, and V. Raman). IBM Systems Journal, 43(4), 2004. In this paper we present our vision of an information infrastructure for grid computing, which is based on a service-oriented architecture. The infrastructure supports a virtualized view of the computing and data resources, is autonomic (driven by policies) in order to meet application goals for quality of service, and is compatible with the standards being developed in the technical community. We describe how we are implementing this vision in IBM today and how we expect the implementation to evolve in the future.
|
| |
Automatic Statistics Collection in DB2 UDB (A. Aboulnaga, P. J. Haas, S. Lightstone, G. M. Lohman,
V. Markl, I. Popivanov, and V. Raman). International
Conference on Very Large Data Bases (VLDB), Toronto, 2004. Describes the automatic statistics collection feature of DB2 UDB. This manual has more details. |
Query Processing and Optimization
| |
Progressive Optimization in Action (V. Raman,
V. Markl, D. Simmen, G. M. Lohman, and H. Pirahesh). International Conf. on Very Large Data Bases (VLDB), Toronto, 2004. Demonstration of the progressive optimization software prototype.
|
| |
Robust Query Processing through Progressive Optimization (acm) (V. Markl, V. Raman, D. Simmen, G. M. Lohman, and H. Pirahesh). ACM SIGMOD International Conf. on Management of Data (SIGMOD), Paris, 2004. Relational DBMSs choose query execution strategies using an optimizer which relies heavily on accurate statistics about the data distribution, especially cardinalities. These cardinalities can be wrongly estimated due to many reasons: inaccurate statistics, data correlations, parameter markers, and so on. The problem is that such wrong estimates can cause the RDBMS to choose a poor execution strategy, that takes much longer to execute than the ideal strategy. This paper describes a new approach called Progressive Query Optimization (POP), which can automatically detect and recover from cardinality estimation errors. POP instruments query execution with checkpoint operators, validates cardinality estimates against actual values, and dynamically switches query plans mid-flight if necessary. The paper details the POP algorithms, implementation, and an experimental study on real workloads showing up to 2 orders of magnitude speedup. |
| |
Using State Modules for Adaptive Query Processing (ieee) (V. Raman, A. V. Deshpande, and J. M. Hellerstein). International Conference on Data Engineering (ICDE), Bangalore, March 2003. We present a query architecture in which join operators are decomposed into their constituent data
structures (State Modules, or SteMs), and data¤ow among these SteMs is managed adaptively by an
Eddy routing operator. Breaking the encapsulation of joins serves two purposes. First, it allows the Eddy
to observe multiple physical operations embedded in a join algorithm, allowing for better calibration
and control of these operations. Second, the SteM on a relation serves as a shared materialization point,
enabling multiple competing access methods to share results, which can be leveraged by multiple competing
join algorithms. Our architecture extends prior work signi£cantly, allowing continuously adaptive decisions for most major aspects of traditional query optimization: choice of access methods and join
algorithms, ordering of operators, and choice of a query spanning tree. SteMs introduce signi£cant routing flexibility to the Eddy, enabling more opportunities for adaptation, but also introducing the possibility of incorrect query results. We present constraints on Eddy routing through SteMs that ensure correctness while preserving a great deal of flexibility. We also demonstrate the bene£ts of our architecture via experiments in the Telegraph data¤ow system. We show that even a simple routing policy allows signi£cant flexibility in adaptation, including novel effects like the automatic “hybridization” of multiple algorithms for a single join.
|
| |
TelegraphCQ:
Continuous Dataflow Processing for an Uncertain
World (S.
Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M.
Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman,
F. Reiss, and M. Shah). Conference on Innovative Data Systems Research (CIDR), Asilomar, January 2003. Describes stream query processing in the Telegraph dataflow system.
|
| |
Continuously
Adaptive Continuous Query Processing (S. Madden, M. Shah, J. M. Hellerstein, and V. Raman). ACM SIGMOD International Conf. on Management of Data, 2002. We present a continuously adaptive, continuous query (CACQ) implementation based on the eddy query processing framework. We show that our design provides significant performance benefits over existing approaches to evaluating continuous queries, not only because of its adaptivity, but also because of the aggressive crossquery sharing of work and space that it enables. By breaking the abstraction of shared relational algebra expressions, our Telegraph CACQ implementation is able to share physical operators – both selections and join state – at a very fine grain. We augment these features with a grouped-filter index to simultaneously evaluate multiple selection predicates. We include measurements of the performance of our core system, along with a comparison to existing continuous query approaches. |
| |
Partial Results
for Interactive Query Processing (V.Raman and J. M. Hellerstein). ACM SIGMOD International Conf. on Management of Data, 2002. Traditional query processors generate full, accurate query results, either in batch or in pipelined fashion. We argue that this strict model is too rigid for exploratory queries over diverse and distributed data sources, such as sources on the
Internet. Instead, we propose a looser model of querying in which a user submits a broad initial query outline, and the system continually generates partial result tuples that may contain values for only some of the output fields. The user can watch these partial results accumulate at the user interface, and accordingly refine the query by specifying their interest in different kinds of partial results. After describing our querying model and user interface, we present a query processing architecture for this model which is implemented in the Telegraph dataflow system. Our architecture ... |
| |
Interactive Query Processing
(Vijayshankar
Raman). PhD disseration, University of California at Berkeley,
December 2001.
|
| |
Informix Under CONTROL: Online Query Processing (J. M. Hellerstein, R. Avnur, and V. Raman). Data Mining and Knowledge Discovery Journal, 4(4), October 2000. Describes the implementation of interactive query processing features inside a prototype version of the Informix Dynamic Server. |
Data Analysis, Data Cleansing and Transformation
| |
Potter's Wheel: An Interactive Framework for Data Cleaning and Transformation (pdf) (V. Raman and J. M. Hellerstein). Proc. International Conf. on Very
Large Data Bases (VLDB), Rome, 2001. This paper describes the engineering and algorithms behind the Potter's Wheel data cleansing software. One really nifty idea here is called "super split" -- it learns the best way to split a structured file into columns using minimum desciption length (MDL) method. Another neat idea is the user interface, that uses a spreadsheet to tightly integrate transformation and anomaly detection -- users specify transformations on a spreadsheet by showing their effect on example values; concurrently, the system mines the data for anomalies and report them to the user, who can then fix them with further transformations. This UI has influenced the design of the Cohera product, later bought by People Soft and now Oracle.
|
| |
Interactive Data Analysis: The CONTROL Project
(
R. Avnur, A. Chou, J.M. Hellerstein, C. Hidber, C. Olston, V. Raman, T. Roth,
and P. J. Haas). IEEE Computer, 32(8), August 1999. Data analysis is fundamentally an iterative process in which you issue a query, receive a response, formulate the next query based on the response, and repeat. You usually don't issue a single, perfectly chosen query and get the information you want from a database; indeed, the purpose of data analysis is to extract unknown information, and in most situations, there is no one perfect query. People naturally start by asking broad, big-picture questions and then continually refine their questions based on feedback and domain knowledge. In the Control (Continuous Output and Navigation Technology with Refinement Online) project at the University of California, Berkeley, the authors are working with collaborators at IBM, Informix, and elsewhere to explore ways to improve human-computer interaction during data analysis. The Control project's goal is to develop interactive, intuitive techniques for analyzing massive data sets.
|
| |
Scalable Spreadsheets for Interactive Data Analysis (V. Raman, A. Chou, and J. M. Hellerstein). Proc ACM-SIGMOD Workshop on Research
Issues in Data Mining and Knowledge Discovery (DMKD), Philadelphia, May 1999. Describes ABC, the precursor of Potter's Wheel. ABC was a scalable spreadsheet for data analysis that combines exploration, grouping, and aggregation. The idea was that spreadsheets are more natural to use than DBMSs, and so we should scale up spreadsheets so that they can handle a wider range of applications. It aimed to provide interactive responses in place of long delays, and support intuitive, direct manipulation operations in place of complex queries. The paper has a bunch of nice high-level ideas but not too many specifics -- the Potter's Wheel paper has some more details, but much of the challenges remain for future work. |
Miscellaneous -- Indexing, Clustering, Networks, ...
| | Locality Preserving Dictionaries: Theory and Application to Clustering in Databases (V. Raman). Proc. ACM
SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), Philadelphia, 1999. Describes a new index structure called a Locality Preserving Dictionaries (LPD) -- one where all data items within a range lie together, within a space that is a small function of the number of items in the range. The idea is to partition the memory space and place items in sorted order, with judiciously placed gaps between them, so that insert, delete, and search take logarithmic time (amortized). The paper gives algorithms and analysis that is backed by no code -- I hope to implement these at some point!
|
IBM publications
Wei Han, Vijayshankar Raman and Inderpal S. Narang. Parallel Querying with Non-Dedicated Computers.
VLDB 2005 -- International Conference on Very Large Data Bases. May 2005.
Mokhtar Kandil, Alberto Lerner, Volker G. Markl, Ivan Popivanov, Vijayshankar Raman, Danny Zilio and Peter J. Haas. Automated Statistics Collection in Action.
SIGMOD 2005. February 2005.
Inderpal S. Narang, Wen-Syan Li and Vijayshankar Raman. Tesla: An On Demand Information System.
IBM Data Management Potpourri Workshop. IBM, August 2004.
Vijayshankar Raman, Volker G. Markl, David E. Simmen, Guy M. Lohman and Mir Hamid Pirahesh.
Progressive Optimization in Action.
VLDB 2004. June 2004.
Ashraf I. Aboulnaga, Peter J. Haas, Sam Lightstone, Guy M. Lohman, Volker G. Markl, Ivan Popivanov and Vijayshankar Raman.
Automated Statistics Collection in DB2 Stinger.
VLDB 2004. June 2004.
Serge Bourbonnais, Vitthal M. Gogate, Laura Haas, Randy Horman, Susan Malaika, Inderpal S. Narang and Vijayshankar Raman. Towards an Information Infrastructure for the Gird.
IBM Systems Journal 43(4):665-88, June 2004.