Research: Program Development Environment

As-is Parser for Program Fragments

The Target of Analysis: Program Fragments

The main work of programming is to understand and modify program fragments. A program fragment is a part or the whole of a source file; it may lack definitions and may be ill-formed. A program template, which contains extra tokens and omits parts of the code, is also a fragment; it serves as input to tools that automatically generate customized program text. In the case of the C language, a source file containing preprocessor directives is also ill-formed. Although programmers read and modify fragments throughout the programming process, there is no support for building small, convenient tools that analyze and modify fragments to accomplish the task at hand. Here, "small convenient tools" means small analysis and/or modification programs specialized for the target program fragments. Because these tools depend on their targets, they are not reusable, so we need an environment for building them at very low cost. Such tools are very similar to sed and AWK scripts, which are specialized for their target files and are discarded once the task is finished. Sed and AWK let us build a small script at low cost.

As-is Parser, tolerant of ill-formed program fragments

To build small, convenient tools at low cost, we need a common parser for program fragments that lack definitions and are ill-formed. In the case of the C language, a standard parser requires the definitions of all identifiers to parse a declaration correctly, because typedef-ed identifiers and variables share the same name space. For example, the declaration "T (v);" cannot be parsed as a declaration without the definition of T, because the same fragment could also be a statement calling the function T().

To overcome the problems of missing definitions and ill-formedness, the parser "TEBA" is designed to allow partially inaccurate sub-trees in a syntax tree. Even if a fragment is ill-formed, TEBA parses it as-is.

The Program Analyzer "TEBA"

TEBA translates a program fragment into an attributed token sequence that represents an abstract syntax tree. Tokens are well defined in any fragment, so they are easy to parse. After tokenizing, TEBA tries to inject marker tokens for syntactic elements by applying coarse-grained parsing and heuristic rules. If a part of the fragment is well defined, it is parsed accurately; if not, that part is kept as a plain sequence of tokens.

Tokens generated by TEBA carry attributes such as sort and correspondence. The "sort" attribute is a token type defined for every token, corresponding to the token names used in lexical analyzers such as lex. The "correspondence" attribute is an identifier that pairs two tokens, such as left and right parentheses. Marker tokens are special: they represent the empty token and carry only attribute values.

Based on token sequences, we can build program transformation tools by pattern matching over the token sequence. After transforming the token sequence, the modified text of the fragment is obtained by concatenating all tokens.


Current and Future Work for TEBA

TEBA has been tested by parsing open-source software such as GNU coreutils, GNU Emacs, and FreeBSD. Current work in progress:

Future work:

Program Pattern Transformation, preserving program styles and comments.

Program Pattern Transformation

In programming, we often repeat the same edit operations to change the specification of functions, variables, and so on. For example, when we remove an argument from a function definition, we must find all calls to the function and edit them. If we had a sed or AWK specialized for program structure, such work would become simple. Some tools and environments, such as Eclipse and its tools, support automatic modification of program fragments based on syntax trees. However, it is difficult to build small, convenient tools in those environments, because we must mentally translate every edit on the program text into operations on syntax trees, and then write those operations down as a program. We need program transformation tools that we can build more intuitively and quickly.

One of the simplest solutions is to support program patterns, analogous to the regular expressions over strings used in sed and AWK. Because TEBA can parse ill-formed program fragments and is extensible with new extra tokens, it is easy to build a tool that generates a token-sequence pattern from a program fragment written as a program pattern. Fortunately, TEBA already has a pattern transformation module for token sequences, used to implement partial parsing.

Examples of Program Transformation

We often define a wrapper function to pull together a function call and the check of its post-conditions. The following example changes malloc() calls into calls to such a wrapper function.

/*--- before transformation ---*/
void test(void)
{
    Obj *p;  /* Obj is an undefined type. */
    if ((p = (Obj *)malloc(sizeof(Obj) * SIZE)) == NULL) {
        fprintf(stderr, "no memory\n");
        exit(1);
    }
    proc(p);
    free(p);
}


/*--- after transformation ---*/
void test(void)
{
    Obj *p;  /* Obj is an undefined type. */
    p =  (Obj *)my_malloc(sizeof(Obj) * SIZE);
    proc(p);
    free(p);
}

A program pattern should be almost the same as a program fragment, but with some abstractions that allow differences between occurrences. TEBA adopts the following pattern description:

%before

if ((${v:ID_VAR} = ${cast:EXPR}malloc(${size:EXPR})) == NULL)
  ${stmt:STMT}$;

%after

${v} = ${cast}my_malloc(${size});

%end

Another example adds debug code after function calls. Because TEBA can parse program fragments containing preprocessor directives, we can write patterns that include directives.

%before

${v:ID_VAR} = total(${ar:EXPR}, ${n:EXPR});

%after

${v} = total(${ar}, ${n});
#ifdef DEBUG_TOTAL
fprintf(stderr, "DEBUG: total = %d (at %d)\n", ${v}, __LINE__);
#endif

%end

We can also remove the DEBUG fragments with a pattern:

%before

#ifdef DEBUG_TOTAL
${stmts:STMT*}$;
#endif

%after

%end

Token Sequence Pattern

All patterns are parsed by TEBA itself and translated into pattern descriptions over token sequences. For example, the program pattern for malloc() shown above is translated into the following description:

{ $t01#E0001:BEGIN_STMT $t02:'if' $t03:SP $t04#B0001:'('
    $t05#B0002:'(' $v:ID_VAR $t06:SP $t07:'='
      $t08:SP $cast:EXPR $t09:'malloc' $t10#B0003:'(' $size:EXPR $t11#B0003:')'
    $t12#B0002:')' $t13:SP $t14:'==' $t15:SP $t16:'NULL'
  $t17#B0001:')' $t18:SP
  $t19#E0002:BEGIN_STMT $stmt:STMT $t20#E0002:END_STMT $t21#E0001:END_STMT
} => {
  '':BEGIN_STMT $v ' ':SPACE_B '=':OPE ' ':SPACE_B
    $cast 'my_malloc':ID_FUNC '(':PAR_L $size ')':PAR_R ';':SEMI  '':END_STMT }

An Application of Transformation: Reverse Macro Replacement Tool

The reverse macro replacement tool is an application of this transformation. We often define macros to abstract repeated descriptions in programs. Given a macro definition, the reverse macro replacement tool automatically replaces occurrences of the macro's expanded text with macro calls. Occurrences may differ slightly from the macro definition: they may have extra parentheses and comments, and the macro definition itself may have extra parentheses or a compound block with curly braces. The tool generates token patterns that accept these differences and then replaces all matching occurrences. To implement this tool, we built a parser for sub-expressions, which will be merged into TEBA itself in the near future. An example of replacement follows.

Before applying the tool, the macro PROC has just been defined:

#define PROC(foo) do { if (foo) { \
   bar(foo); \
} } while(0)

void test(int x) {
    if (x) {  /* this is the target, having no do-while. */
      bar(x);
    }
}

After replacement by the tool, the if-statement has become a macro call. The semicolon after "PROC(x)" is also added automatically:

#define PROC(foo) do { if (foo) { \
   bar(foo); \
} } while(0)

void test(int x) {
    PROC(x);
}

In our study of GNU Emacs 23.3, we found 45 macros that were not used at about 300 places where macro calls should have been used. The tool can find and replace these neglected macro uses.

Current and Future Work

Unfortunately, the current expressive power of program patterns is poor; improving it is work in progress. One of the big challenges is how to express conditions on data flow and control flow. If we could describe flows in a pattern, TEBA would become more tolerant of variations in flow. The current TEBA has no dependency analysis; I will develop it in the future.

Another challenge is how to preserve the style and comments of the target program. The current tool preserves all tokens where pattern transformations are not applied, but all styles and comments within transformed regions are lost. If comments are described in patterns, they should be merged into the program fragments after transformation. A study of style and comment preservation is in progress.

Program development using small filtering tools for analysis and transformation

The purpose of my research is to enhance the UNIX-style programming environment. UNIX is a wonderful program development environment; its best practice is connecting small filtering tools with pipes and shell scripts. TEBA enables us to build small tools specialized for programs. TEBA's small tools are connectable filters whose input and output are token sequences in text form. We can also mix original UNIX tools with TEBA tools, for example using grep to find tokens matching a pattern. By using TEBA, we can write tool scripts whenever we need them and accomplish our programming tasks quickly and accurately.

Copyright(C) 2012, Yoshida Atsushi, All rights reserved.