Mastery

Recovering logical models from legacy applications

The objective of the Mastery project is to develop tools for recovering higher-level logical models from legacy applications. Logical models assist programmers in understanding large, complex applications, and enable them to plan and carry out various commonly recurring program transformation tasks.
While our long-term goal is to reverse engineer various kinds of logical models, including data models, process models, and business rules, our focus so far has been on recovering logical data models (resembling UML class diagrams) through semantic analysis of Cobol programs.

Our work in this project follows two tracks:

  • Developing algorithms based on static analysis for inferring logical data models from programs
  • Creating infrastructure for analyzing Cobol programs, and implementing tools for logical data model recovery

What are logical data models?

A logical data model differs from the physical data model, which is the declared structure of variables and records in the program. The logical data model is inferred by analyzing the code that uses the declared data. It is a value-added model obtained by eliminating irrelevant details from the physical model, adding relevant details that are missing or implicit in the physical model, and reorganizing the result into a more logical structure that is consistent with the way the declared data is used in the code. The physical model is typically harder to understand than the logical model for three reasons: it may reflect implementation concerns (such as efficiency), the programming language may be unable to express higher-level abstractions, and the model may have degenerated over decades of continuous maintenance and evolution.

Automatic logical model recovery

Our static-analysis-based algorithms for inferring logical data models have the following features:

  • Two variables X and Y are assigned the same logical type if one of them is copied to or compared to the other.
  • We have two distinct approaches for inferring implicit record structure within logical types and subtyping relationships between types. An efficient approach, which we have implemented in our tool, is based on fast heuristics. A more expensive approach uses a path-sensitive dataflow analysis that identifies "tag" fields in records (fields that indicate subtyping) using predicate analysis (see our publications). The more expensive approach is sound but currently unimplemented; the efficient approach is not provably sound, but is likely to give sound results in most practical cases.
  • Associations between types are inferred via an analysis of the operations the program performs on declared primary-key fields.
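The first feature above, assigning the same logical type to variables that are copied or compared to each other, amounts to computing equivalence classes over variables. The following is a minimal Python sketch of that idea using a union-find structure. It is not the project's implementation: the statement encoding (`op`, `x`, `y`), the class and function names, and the Cobol variable names are all hypothetical, and record structure, subtyping, and associations are not modeled.

```python
from collections import defaultdict


class LogicalTypes:
    """Union-find over variable names; each set stands for one logical type."""

    def __init__(self):
        self.parent = {}

    def find(self, var):
        # Register unseen variables as their own singleton type.
        self.parent.setdefault(var, var)
        root = var
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[var] != root:
            self.parent[var], var = root, self.parent[var]
        return root

    def unify(self, x, y):
        root_x, root_y = self.find(x), self.find(y)
        if root_x != root_y:
            self.parent[root_y] = root_x

    def classes(self):
        groups = defaultdict(set)
        for var in list(self.parent):
            groups[self.find(var)].add(var)
        return sorted(sorted(group) for group in groups.values())


def infer_logical_types(statements):
    """statements: iterable of (op, x, y) tuples (a hypothetical encoding)."""
    types = LogicalTypes()
    for op, x, y in statements:
        if op in ("copy", "compare"):
            # Copying or comparing two variables puts them in the same type.
            types.unify(x, y)
        else:
            # Other operations only register the variables they mention.
            types.find(x)
            types.find(y)
    return types.classes()


# Hypothetical statements, roughly corresponding to Cobol MOVE / IF / ADD.
statements = [
    ("copy", "CUST-ID", "WS-ID"),          # MOVE CUST-ID TO WS-ID
    ("compare", "WS-ID", "ORD-CUST-ID"),   # IF WS-ID = ORD-CUST-ID
    ("copy", "ORD-TOTAL", "WS-AMOUNT"),    # MOVE ORD-TOTAL TO WS-AMOUNT
    ("add", "WS-LIMIT", "WS-COUNT"),       # ADD WS-LIMIT TO WS-COUNT
]

for logical_type in infer_logical_types(statements):
    print(logical_type)
```

In this sketch the three identifier variables end up in one class even though `CUST-ID` and `ORD-CUST-ID` are never directly related, because both are connected through `WS-ID`. The arithmetic statement does not merge its operands, since the rule above covers only copies and comparisons.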


Tool implementation

The tool is implemented as a plugin for the Eclipse IDE platform. It features an automated engine for recovering logical data models from programs, a facility (within the recovery engine) for building links between logical model elements and the corresponding physical model elements, a browser for the logical and physical models that also allows navigation of the links, a set of pre-defined queries (plus support for ad hoc OCL queries) on both models to support program exploration and understanding, and facilities for manually editing automatically inferred logical models. Logical and physical models are represented persistently using the Eclipse Modeling Framework infrastructure.
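The links between logical and physical model elements can be pictured as a bidirectional index. The sketch below is a plain Python illustration under that assumption and does not use the Eclipse Modeling Framework or reflect the tool's actual data structures; the `TraceLinks` class and all element names are hypothetical.

```python
from collections import defaultdict


class TraceLinks:
    """Bidirectional links between logical and physical model elements."""

    def __init__(self):
        self._to_physical = defaultdict(set)
        self._to_logical = defaultdict(set)

    def link(self, logical, physical):
        # Record the link in both directions so either side can be navigated.
        self._to_physical[logical].add(physical)
        self._to_logical[physical].add(logical)

    def physical_for(self, logical):
        return sorted(self._to_physical.get(logical, ()))

    def logical_for(self, physical):
        return sorted(self._to_logical.get(physical, ()))


links = TraceLinks()
# Hypothetical element names: one logical attribute backed by two declarations.
links.link("Customer.id", "CUST-REC.CUST-ID")
links.link("Customer.id", "WS-ID")
links.link("Order.total", "ORD-REC.ORD-TOTAL")

print(links.physical_for("Customer.id"))
print(links.logical_for("WS-ID"))
```

Storing each link in both directions keeps navigation cheap from either model, which is the kind of lookup a model browser would need.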



Last updated 12 Jun 2008
