 <?xml version="1.0"?>
 <!--
     * Licensed to the Apache Software Foundation (ASF) under one
     * or more contributor license agreements.  See the NOTICE file
     * distributed with this work for additional information
     * regarding copyright ownership.  The ASF licenses this file
     * to you under the Apache License, Version 2.0 (the
     * "License"); you may not use this file except in compliance
     * with the License.  You may obtain a copy of the License at
     *
     *   http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing,
     * software distributed under the License is distributed on an
     * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     * KIND, either express or implied.  See the License for the
     * specific language governing permissions and limitations
     * under the License.
 -->
 <document>
   <properties>
     <title>The BCEL API</title>
   </properties>

   <body>
     <section name="The BCEL API">
       <p>
         The <font face="helvetica,arial">BCEL</font> API abstracts from
         the concrete circumstances of the Java Virtual Machine and how to
         read and write binary Java class files. The API mainly consists
         of three parts:
       </p>

       <p>

         <ol type="1">
           <li> A package that contains classes that describe "static"
             constraints of class files, i.e., they reflect the class file format and
             are not intended for byte code modifications. The classes may be
             used to read and write class files from or to a file.  This is
             especially useful for analyzing Java classes without having the
             source files at hand.  The main data structure is called
             <tt>JavaClass</tt>, which contains methods, fields, etc.</li>

           <li> A package to dynamically generate or modify
             <tt>JavaClass</tt> or <tt>Method</tt> objects.  It may be used to
             insert analysis code, to strip unnecessary information from class
             files, or to implement the code generator back-end of a Java
             compiler.</li>

           <li> Various code examples and utilities like a class file viewer,
             a tool to convert class files into HTML, and a converter from
             class files to the <a
                     href="http://jasmin.sourceforge.net">Jasmin</a> assembly
             language.</li>
         </ol>
       </p>

     <subsection name="JavaClass">
       <p>
         The "static" component of the <font
               face="helvetica,arial">BCEL</font> API resides in the package
         <tt>org.apache.bcel.classfile</tt> and closely represents class
         files. All of the binary components and data structures declared
         in the <a
               href="http://docs.oracle.com/javase/specs/">JVM
         specification</a> and described in section <a
               href="#2 The Java Virtual Machine">2</a> are mapped to classes.

         <a href="#Figure 3">Figure 3</a> shows a UML diagram of the
         hierarchy of classes of the <font face="helvetica,arial">BCEL
       </font>API. <a href="#Figure 8">Figure 8</a> in the appendix also
         shows a detailed diagram of the <tt>ConstantPool</tt> components.
       </p>

       <p align="center">
         <a name="Figure 3">
           <img src="../images/javaclass.gif"/> <br/>
           Figure 3: UML diagram for the JavaClass API</a>
       </p>

       <p>
         The top-level data structure is <tt>JavaClass</tt>, which in most
         cases is created by a <tt>ClassParser</tt> object that is capable
         of parsing binary class files. A <tt>JavaClass</tt> object
         basically consists of fields, methods, and symbolic references to
         the super class and to the implemented interfaces.
       </p>

       <p>
         The constant pool serves as some kind of central repository and is
         thus of outstanding importance for all components.
         <tt>ConstantPool</tt> objects contain an array of fixed size of
         <tt>Constant</tt> entries, which may be retrieved via the
         <tt>getConstant()</tt> method taking an integer index as argument.
         Indexes to the constant pool may be contained in instructions as
         well as in other components of a class file and in constant pool
         entries themselves.
       </p>
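
       <p>
         The indirection between constant pool entries can be illustrated
         without BCEL itself. The following is a minimal sketch in plain
         Java; <tt>MiniPool</tt>, <tt>Utf8Entry</tt>, and
         <tt>ClassEntry</tt> are hypothetical stand-ins and not part of
         the BCEL API. It assumes the class file convention that slot 0
         is unused and that a class entry stores only the index of the
         UTF-8 entry holding its name.
       </p>

       <source><![CDATA[
```java
// Hypothetical miniature constant pool; none of these types exist in BCEL.
interface PoolEntry {}

final class Utf8Entry implements PoolEntry {
    final String bytes;
    Utf8Entry(String bytes) { this.bytes = bytes; }
}

final class ClassEntry implements PoolEntry {
    final int nameIndex; // index of a Utf8Entry in the same pool
    ClassEntry(int nameIndex) { this.nameIndex = nameIndex; }
}

final class MiniPool {
    private final PoolEntry[] entries; // fixed size; slot 0 stays unused

    MiniPool(PoolEntry... entries) { this.entries = entries; }

    PoolEntry getConstant(int index) { return entries[index]; }

    // Follow the index stored in a class entry to the name it refers to.
    String classNameAt(int index) {
        ClassEntry c = (ClassEntry) getConstant(index);
        return ((Utf8Entry) getConstant(c.nameIndex)).bytes;
    }

    public static void main(String[] args) {
        MiniPool pool = new MiniPool(null, new ClassEntry(2),
                new Utf8Entry("java/lang/String"));
        System.out.println(pool.classNameAt(1)); // java/lang/String
    }
}
```
]]></source>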

       <p>
         Methods and fields contain a signature, symbolically defining
         their types.  Access flags like <tt>public static final</tt> occur
         in several places and are encoded by an integer bit mask, e.g.,
         <tt>public static final</tt> corresponds to the Java expression
       </p>


       <source>int access_flags = ACC_PUBLIC | ACC_STATIC | ACC_FINAL;</source>
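
       <p>
         How such a bit mask is combined and queried can be tried out with
         the JDK alone. The sketch below uses
         <tt>java.lang.reflect.Modifier</tt> as a stand-in, on the
         assumption that BCEL's <tt>ACC_*</tt> constants carry the values
         defined by the class file format (0x0001, 0x0008, and 0x0010 for
         the three flags used here). The class name
         <tt>AccessFlagsDemo</tt> is hypothetical.
       </p>

       <source><![CDATA[
```java
import java.lang.reflect.Modifier;

// Stand-in example: uses the JDK's Modifier constants, not BCEL's ACC_* constants.
class AccessFlagsDemo {
    // Combine the three flags into one bit mask.
    static int publicStaticFinal() {
        return Modifier.PUBLIC | Modifier.STATIC | Modifier.FINAL;
    }

    // Test a single flag by masking it out of the combined value.
    static boolean isStatic(int accessFlags) {
        return (accessFlags & Modifier.STATIC) != 0;
    }

    public static void main(String[] args) {
        int flags = publicStaticFinal();
        System.out.println(Integer.toHexString(flags)); // 19
        System.out.println(Modifier.toString(flags));   // public static final
        System.out.println(isStatic(flags));            // true
    }
}
```
]]></source>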

       <p>
         As mentioned in <a href="jvm.html#Java_class_file_format">section
         2.1</a> already, several components may contain <em>attribute</em>
         objects: classes, fields, methods, and <tt>Code</tt> objects
         (introduced in <a href="jvm.html#Method_code">section 2.3</a>).  The
         latter is an attribute itself that contains the actual byte code
         array, the maximum stack size, the number of local variables, a
         table of handled exceptions, and some optional debugging
         information coded as <tt>LineNumberTable</tt> and
         <tt>LocalVariableTable</tt> attributes. Attributes are in general
         specific to some data structure, i.e., no two components share the
         same kind of attribute, though this is not explicitly
         forbidden. In the figure the <tt>Attribute</tt> classes are stereotyped
         with the component they belong to.
       </p>

     </subsection>

     <subsection name="Class repository">
       <p>
         Using the provided <tt>Repository</tt> class, reading class files into
         a <tt>JavaClass</tt> object is quite simple:
       </p>

       <source>JavaClass clazz = Repository.lookupClass("java.lang.String");</source>

       <p>
         The repository also contains methods providing the dynamic equivalent
         of the <tt>instanceof</tt> operator, and other useful routines:
       </p>

       <source>
 if (Repository.instanceOf(clazz, super_class)) {
     ...
 }
       </source>

     </subsection>

     <h4>Accessing class file data</h4>

       <p>
         Information within the class file components may be accessed like
         Java Beans via intuitive set/get methods. All of them also define
         a <tt>toString()</tt> method so that implementing a simple class
         viewer is very easy. In fact all of the examples used here have
         been produced this way:
       </p>

       <source>
 System.out.println(clazz);
 printCode(clazz.getMethods());
 ...
 public static void printCode(Method[] methods) {
     for (int i = 0; i &lt; methods.length; i++) {
         System.out.println(methods[i]);

         Code code = methods[i].getCode();
         if (code != null) { // Non-abstract method
             System.out.println(code);
         }
     }
 }
       </source>

     <h4>Analyzing class data</h4>
       <p>
         Last but not least, <font face="helvetica,arial">BCEL</font>
         supports the <em>Visitor</em> design pattern, so one can write
         visitor objects to traverse and analyze the contents of a class
         file. Included in the distribution is a class
         <tt>JasminVisitor</tt> that converts class files into the <a
               href="http://jasmin.sourceforge.net">Jasmin</a>
         assembler language.
       </p>

     <subsection name="ClassGen">
       <p>
         This part of the API (package <tt>org.apache.bcel.generic</tt>)
         supplies an abstraction level for creating or transforming class
         files dynamically. It makes the static constraints of Java class
         files like the hard-coded byte code addresses "generic". The
         generic constant pool, for example, is implemented by the class
         <tt>ConstantPoolGen</tt> which offers methods for adding different
         types of constants. Accordingly, <tt>ClassGen</tt> offers an
         interface to add methods, fields, and attributes.
         <a href="#Figure 4">Figure 4</a> gives an overview of this part of the API.
       </p>

       <p align="center">
         <a name="Figure 4">
           <img src="../images/classgen.gif"/>
           <br/>
           Figure 4: UML diagram of the ClassGen API</a>
       </p>

     <h4>Types</h4>
       <p>
         We abstract from the concrete details of the type signature syntax
         (see <a href="jvm.html#Type_information">2.5</a>) by introducing the
         <tt>Type</tt> class, which is used, for example, by methods to
         define their return and argument types. Concrete sub-classes are
         <tt>BasicType</tt>, <tt>ObjectType</tt>, and <tt>ArrayType</tt>;
         the last of these consists of the element type and the number of
         dimensions. For commonly used types the class offers some
         predefined constants. For example, the method signature of the
         <tt>main</tt> method as shown in
         <a href="jvm.html#Type_information">section 2.5</a> is represented by:
       </p>

       <source>
 Type return_type = Type.VOID;
 Type[] arg_types = new Type[] { new ArrayType(Type.STRING, 1) };
       </source>

       <p>
         <tt>Type</tt> also contains methods to convert types into textual
         signatures and vice versa. The sub-classes contain implementations
         of the routines and constraints specified by the Java Language
         Specification.
       </p>
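
       <p>
         The textual signature format itself is defined by the JVM
         specification, so the round trip between types and signature
         strings can be sketched with the JDK class
         <tt>java.lang.invoke.MethodType</tt>. This is only an analogy to
         what <tt>Type</tt> offers, not BCEL code; the class name
         <tt>DescriptorDemo</tt> is hypothetical.
       </p>

       <source><![CDATA[
```java
import java.lang.invoke.MethodType;

// Analogy only: JDK MethodType instead of BCEL's Type class.
class DescriptorDemo {
    // Build the signature string of "void main(String[])" from its types.
    static String mainDescriptor() {
        return MethodType.methodType(void.class, String[].class)
                .toMethodDescriptorString();
    }

    // Parse a signature string back into return and parameter types.
    static MethodType parse(String descriptor) {
        return MethodType.fromMethodDescriptorString(
                descriptor, DescriptorDemo.class.getClassLoader());
    }

    public static void main(String[] args) {
        String d = mainDescriptor();
        System.out.println(d);                   // ([Ljava/lang/String;)V
        MethodType mt = parse(d);
        System.out.println(mt.returnType());     // void
        System.out.println(mt.parameterType(0)); // class [Ljava.lang.String;
    }
}
```
]]></source>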

     <h4>Generic fields and methods</h4>
       <p>
         Fields are represented by <tt>FieldGen</tt> objects, which may be
         freely modified by the user. If they have the access rights
         <tt>static final</tt>, i.e., are constants and of basic type, they
         may optionally have an initializing value.
       </p>

       <p>
         Generic methods contain methods to add exceptions the method may
         throw, local variables, and exception handlers. The latter two are
         represented by user-configurable objects as well. Because
         exception handlers and local variables contain references to byte
         code addresses, they also take the role of an <em>instruction
         targeter</em> in our terminology. Instruction targeters contain a
         method <tt>updateTarget()</tt> to redirect a reference. This is
         somewhat related to the Observer design pattern. Generic
         (non-abstract) methods refer to <em>instruction lists</em> that
         consist of instruction objects. References to byte code addresses
         are implemented by handles to instruction objects. If the list is
         updated the instruction targeters will be informed about it. This
         is explained in more detail in the following sections.
       </p>

       <p>
         The maximum stack size needed by the method and the maximum number
         of local variables used may be set manually or computed
         automatically via the <tt>setMaxStack()</tt> and
         <tt>setMaxLocals()</tt> methods.
       </p>

     <h4>Instructions</h4>
       <p>
         Modeling instructions as objects may look somewhat odd at first
         sight, but in fact enables programmers to obtain a high-level view
         upon control flow without handling details like concrete byte code
         offsets.  Instructions consist of an opcode (sometimes called
         tag), their length in bytes and an offset (or index) within the
         byte code. Since many instructions are immutable (e.g., stack
         operators), the <tt>InstructionConstants</tt> interface offers
         shareable predefined "fly-weight" constants to use.
       </p>

       <p>
         Instructions are grouped via sub-classing; the type hierarchy of
         instruction classes is illustrated by an (incomplete) figure in the
         appendix. The most important family of instructions are the
         <em>branch instructions</em>, e.g., <tt>goto</tt>, that branch to
         targets somewhere within the byte code. Obviously, this makes them
         candidates for playing an <tt>InstructionTargeter</tt> role,
         too. Instructions are further grouped by the interfaces they
         implement: there are, e.g., <tt>TypedInstruction</tt>s that are
         associated with a specific type like <tt>ldc</tt>, or
         <tt>ExceptionThrower</tt> instructions that may raise exceptions
         when executed.
       </p>

       <p>
         All instructions can be traversed via <tt>accept(Visitor v)</tt>
         methods, i.e., the Visitor design pattern. There is, however, a
         special trick in these methods that makes it possible to merge the
         handling of certain instruction groups. The <tt>accept()</tt>
         methods do not only call the corresponding <tt>visit()</tt> method,
         but call the <tt>visit()</tt> methods of their respective super
         classes and implemented interfaces first, i.e., the most specific
         <tt>visit()</tt> call is last. Thus one can group the handling of,
         say, all <tt>BranchInstruction</tt>s into one single method.
       </p>
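
       <p>
         The call order described above can be sketched with a miniature
         hierarchy. All types below (<tt>MiniVisitor</tt>,
         <tt>MiniBranch</tt>, <tt>MiniGoto</tt>, <tt>MiniIfne</tt>,
         <tt>BranchCounter</tt>) are hypothetical and merely mimic the
         naming of the BCEL classes; the sketch assumes a single level of
         super classes and no interfaces.
       </p>

       <source><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature hierarchy; these are not the BCEL classes.
interface MiniVisitor {
    void visitBranchInstruction(MiniBranch b);
    void visitGOTO(MiniGoto g);
    void visitIFNE(MiniIfne i);
}

abstract class MiniBranch {
    abstract void accept(MiniVisitor v);
}

final class MiniGoto extends MiniBranch {
    @Override void accept(MiniVisitor v) {
        v.visitBranchInstruction(this); // super class callback first
        v.visitGOTO(this);              // most specific callback last
    }
}

final class MiniIfne extends MiniBranch {
    @Override void accept(MiniVisitor v) {
        v.visitBranchInstruction(this);
        v.visitIFNE(this);
    }
}

// Handles every branch in visitBranchInstruction(); the specific
// callbacks only record the order in which they were invoked.
final class BranchCounter implements MiniVisitor {
    int branches;
    final List<String> trace = new ArrayList<>();

    @Override public void visitBranchInstruction(MiniBranch b) { branches++; trace.add("branch"); }
    @Override public void visitGOTO(MiniGoto g) { trace.add("goto"); }
    @Override public void visitIFNE(MiniIfne i) { trace.add("ifne"); }

    static BranchCounter run() {
        BranchCounter c = new BranchCounter();
        new MiniGoto().accept(c);
        new MiniIfne().accept(c);
        return c;
    }

    public static void main(String[] args) {
        BranchCounter c = run();
        System.out.println(c.branches + " " + c.trace); // 2 [branch, goto, branch, ifne]
    }
}
```
]]></source>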

       <p>
         For debugging purposes it may even make sense to "invent" your own
         instructions. In a sophisticated code generator like the one used
         as a backend of the <a href="http://barat.sourceforge.net">Barat
         framework</a> for static analysis one often has to insert
         temporary <tt>nop</tt> (No operation) instructions. When examining
         the produced code it may be very difficult to track back where the
         <tt>nop</tt> was actually inserted. One could think of a derived
         <tt>nop2</tt> instruction that contains additional debugging
         information. When the instruction list is dumped to byte code, the
         extra data is simply dropped.
       </p>

       <p>
         One could also think of new byte code instructions operating on
         complex numbers that are replaced by normal byte code upon
         load-time or are recognized by a new JVM.
       </p>

     <h4>Instruction lists</h4>
       <p>
         An <em>instruction list</em> is implemented by a list of
         <em>instruction handles</em> encapsulating instruction objects.
         References to instructions in the list are thus not implemented by
         direct pointers to instructions but by pointers to instruction
         <em>handles</em>. This makes appending, inserting and deleting
         areas of code very simple and also allows us to reuse immutable
         instruction objects (fly-weight objects). Since we use symbolic
         references, computation of concrete byte code offsets does not
         need to occur until finalization, i.e., until the user has
         finished the process of generating or transforming code. We will
         use the term instruction handle and instruction synonymously
         throughout the rest of the paper. Instruction handles may contain
         additional user-defined data using the <tt>addAttribute()</tt>
         method.
       </p>
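
       <p>
         Why handles make later edits harmless can be shown with a small
         model. The sketch below is a hypothetical miniature
         (<tt>MiniHandle</tt>, <tt>MiniList</tt>), not BCEL's
         <tt>InstructionList</tt>: a branch stores a reference to a handle,
         and byte code offsets are assigned in a single pass only when the
         list is finalized. The instruction lengths used (3 bytes for
         <tt>goto</tt>, 1 byte for <tt>nop</tt> and <tt>return</tt>) are
         those of the JVM instruction set.
       </p>

       <source><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of the handle idea; not BCEL's implementation.
final class MiniHandle {
    final String name;
    final int length;   // instruction length in bytes
    MiniHandle target;  // symbolic branch target, or null
    int position = -1;  // concrete offset, unknown until finalization

    MiniHandle(String name, int length) {
        this.name = name;
        this.length = length;
    }
}

final class MiniList {
    private final List<MiniHandle> handles = new ArrayList<>();

    MiniHandle append(String name, int length) {
        MiniHandle h = new MiniHandle(name, length);
        handles.add(h);
        return h;
    }

    MiniHandle insertBefore(MiniHandle point, String name, int length) {
        MiniHandle h = new MiniHandle(name, length);
        handles.add(handles.indexOf(point), h);
        return h;
    }

    // Finalization: assign byte offsets in one pass over the list.
    void setPositions() {
        int offset = 0;
        for (MiniHandle h : handles) {
            h.position = offset;
            offset += h.length;
        }
    }

    static int demo() {
        MiniList il = new MiniList();
        MiniHandle go = il.append("goto", 3);
        MiniHandle ret = il.append("return", 1);
        go.target = ret;                // reference a handle, not an offset
        il.insertBefore(ret, "nop", 1); // a later edit shifts the target
        il.setPositions();
        return go.target.position;      // resolved only now
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 4
    }
}
```
]]></source>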

       <p>
         <b>Appending:</b> One can append instructions or other instruction
         lists anywhere to an existing list. The instructions are appended
         after the given instruction handle. All append methods return a
         new instruction handle which may then be used as the target of a
         branch instruction, e.g.:
       </p>

       <source>
 InstructionList il = new InstructionList();
 ...
 GOTO g = new GOTO(null);
 il.append(g);
 ...
 // Use immutable fly-weight object
 InstructionHandle ih = il.append(InstructionConstants.ACONST_NULL);
 g.setTarget(ih);
       </source>

       <p>
         <b>Inserting:</b> Instructions may be inserted anywhere into an
         existing list. They are inserted before the given instruction
         handle. All insert methods return a new instruction handle which
         may then be used as the start address of an exception handler, for
         example.
       </p>

       <source>
 InstructionHandle start = il.insert(insertion_point, InstructionConstants.NOP);
 ...
 mg.addExceptionHandler(start, end, handler, "java.io.IOException");
       </source>

       <p>
         <b>Deleting:</b> Deletion of instructions is also very
         straightforward; all instruction handles and the contained
         instructions within a given range are removed from the instruction
         list and disposed. The <tt>delete()</tt> method may however throw
         a <tt>TargetLostException</tt> when there are instruction
         targeters still referencing one of the deleted instructions. The
         user is forced to handle such exceptions in a <tt>try-catch</tt>
         clause and redirect these references elsewhere. The <em>peep
         hole</em> optimizer described in the appendix gives a detailed
         example for this.
       </p>

       <source>
 try {
     il.delete(first, last);
 } catch (TargetLostException e) {
     for (InstructionHandle target : e.getTargets()) {
         for (InstructionTargeter targeter : target.getTargeters()) {
             targeter.updateTarget(target, new_target);
         }
     }
 }
       </source>

       <p>
         <b>Finalizing:</b> When the instruction list is ready to be dumped
         to pure byte code, all symbolic references must be mapped to real
         byte code offsets. This is done by the <tt>getByteCode()</tt>
         method which is called by default by
         <tt>MethodGen.getMethod()</tt>. Afterwards you should call
         <tt>dispose()</tt> so that the instruction handles can be reused
         internally. This helps to improve memory usage.
       </p>

       <source>
 InstructionList il = new InstructionList();

 ClassGen  cg = new ClassGen("HelloWorld", "java.lang.Object",
         "&lt;generated&gt;", ACC_PUBLIC | ACC_SUPER, null);
 ConstantPoolGen cp = cg.getConstantPool(); // constant pool shared with the class
 MethodGen mg = new MethodGen(ACC_STATIC | ACC_PUBLIC,
         Type.VOID, new Type[] { new ArrayType(Type.STRING, 1) },
         new String[] { "argv" }, "main", "HelloWorld", il, cp);
 ...
 cg.addMethod(mg.getMethod());
 il.dispose(); // Reuse instruction handles of list
       </source>

     <h4>Code example revisited</h4>
       <p>
         Using instruction lists gives us a generic view upon the code: In
         <a href="#Figure 5">Figure 5</a> we again present the code chunk
         of the <tt>readInt()</tt> method of the factorial example in section
         <a href="jvm.html#Code_example">2.6</a>: The local variables
         <tt>n</tt> and <tt>e1</tt> both hold two references to
         instructions, defining their scope.  There are two <tt>goto</tt>s
         branching to the <tt>iload</tt> at the end of the method. One of
         the exception handlers is displayed, too: it references the start
         and the end of the <tt>try</tt> block and also the exception
         handler code.
       </p>

       <p align="center">
         <a name="Figure 5">
           <img src="../images/il.gif"/>
           <br/>
           Figure 5: Instruction list for <tt>readInt()</tt> method</a>
       </p>

     <h4>Instruction factories</h4>
       <p>
         To simplify the creation of certain instructions the user can use
         the supplied <tt>InstructionFactory</tt> class, which offers many
         useful methods to create instructions from
         scratch. Alternatively, one can also use <em>compound
         instructions</em>: When producing byte code, some patterns
         typically occur very frequently, for instance the compilation of
         arithmetic or comparison expressions. You certainly do not want
         to rewrite the code that translates such expressions into byte
         code in every place they may appear. In order to support this, the
         <font face="helvetica,arial">BCEL</font> API includes a <em>compound
         instruction</em> (an interface with a single
         <tt>getInstructionList()</tt> method). Instances of classes
         implementing this interface may be used in any place where normal
         instructions would occur, particularly in append operations.
       </p>

       <p>
         <b>Example: Pushing constants</b> Pushing constants onto the
         operand stack may be coded in different ways. As explained in <a
               href="jvm.html#Byte_code_instruction_set">section 2.2</a> there are
         some "short-cut" instructions that can be used to make the
         produced byte code more compact. The smallest instruction to push
         a single <tt>1</tt> onto the stack is <tt>iconst_1</tt>; other
         possibilities are <tt>bipush</tt> (can be used to push values
         between -128 and 127), <tt>sipush</tt> (between -32768 and 32767),
         or <tt>ldc</tt> (load constant from constant pool).
       </p>

       <p>
         Instead of repeatedly selecting the most compact instruction in,
         say, a switch, one can use the compound <tt>PUSH</tt> instruction
         whenever pushing a constant number or string. It will produce the
         appropriate byte code instruction and insert entries into the
         constant pool if necessary.
       </p>

       <source>
 InstructionFactory f  = new InstructionFactory(class_gen);
 InstructionList    il = new InstructionList();
 ...
 il.append(new PUSH(cp, "Hello, world"));
 il.append(new PUSH(cp, 4711));
 ...
 il.append(f.createPrintln("Hello World"));
 ...
 il.append(f.createReturn(type));
       </source>
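
       <p>
         The kind of selection that <tt>PUSH</tt> spares the user for
         integer constants might look roughly like the following sketch.
         <tt>PushSelector</tt> is a hypothetical helper and not BCEL's
         implementation; it returns mnemonics only and assumes the
         one-byte <tt>iconst_m1</tt> to <tt>iconst_5</tt> forms of the JVM
         instruction set in addition to the instructions named above.
       </p>

       <source><![CDATA[
```java
// Hypothetical helper; it only names the instruction and emits no byte code.
class PushSelector {
    // Return the mnemonic of the most compact instruction pushing the value.
    static String opcodeFor(int value) {
        if (value >= -1 && value <= 5) {
            return value == -1 ? "iconst_m1" : "iconst_" + value; // 1 byte
        } else if (value >= Byte.MIN_VALUE && value <= Byte.MAX_VALUE) {
            return "bipush"; // 2 bytes: opcode plus signed byte
        } else if (value >= Short.MIN_VALUE && value <= Short.MAX_VALUE) {
            return "sipush"; // 3 bytes: opcode plus signed short
        }
        return "ldc";        // needs an entry in the constant pool
    }

    public static void main(String[] args) {
        System.out.println(opcodeFor(1));      // iconst_1
        System.out.println(opcodeFor(100));    // bipush
        System.out.println(opcodeFor(4711));   // sipush
        System.out.println(opcodeFor(100000)); // ldc
    }
}
```
]]></source>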

     <h4>Code patterns using regular expressions</h4>
       <p>
         When transforming code, for instance during optimization or when
         inserting analysis method calls, one typically searches for
         certain patterns of code to perform the transformation at. To
         simplify handling such situations <font
               face="helvetica,arial">BCEL </font>introduces a special feature:
         One can search for given code patterns within an instruction list
         using <em>regular expressions</em>. In such expressions,
         instructions are represented by their opcode names, e.g.,
         <tt>LDC</tt>; one may also use their respective super classes, e.g.,
         "<tt>IfInstruction</tt>". Meta characters like <tt>+</tt>,
         <tt>*</tt>, and <tt>(..|..)</tt> have their usual meanings. Thus,
         the expression
       </p>

       <source>"NOP+(ILOAD|ALOAD)*"</source>

       <p>
         represents a piece of code consisting of at least one <tt>NOP</tt>
         followed by a possibly empty sequence of <tt>ILOAD</tt> and
         <tt>ALOAD</tt> instructions.
       </p>
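
       <p>
         One way to picture such matching is to encode every instruction
         as a single character and hand the result to
         <tt>java.util.regex</tt>. The sketch below does this for four
         opcode names; <tt>OpcodePatternDemo</tt> and its letter table are
         hypothetical, and whether <tt>InstructionFinder</tt> works this
         way internally is an assumption of the sketch.
       </p>

       <source><![CDATA[
```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: one letter per opcode name, then an ordinary regex.
class OpcodePatternDemo {
    private static final Map<String, Character> CODES = new LinkedHashMap<>();
    static {
        CODES.put("NOP", 'n');
        CODES.put("ILOAD", 'i');
        CODES.put("ALOAD", 'a');
        CODES.put("RETURN", 'r');
    }

    // Replace each opcode name by its letter; used for code and patterns alike.
    static String encode(String text) {
        String result = text.replace(" ", "");
        for (Map.Entry<String, Character> e : CODES.entrySet()) {
            result = result.replace(e.getKey(), String.valueOf(e.getValue()));
        }
        return result;
    }

    // Return {start, end} instruction indexes of the first match, or null.
    static int[] firstMatch(String pattern, String code) {
        Matcher m = Pattern.compile(encode(pattern)).matcher(encode(code));
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    public static void main(String[] args) {
        int[] hit = firstMatch("NOP+(ILOAD|ALOAD)*",
                "ILOAD NOP NOP ALOAD ILOAD RETURN");
        System.out.println(hit[0] + ".." + hit[1]); // 1..5
    }
}
```
]]></source>

       <p>
         Because each instruction occupies exactly one character, the
         match positions reported by the regular expression engine are
         instruction indexes; the end index is exclusive.
       </p>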

       <p>
         The <tt>search()</tt> method of class
         <tt>org.apache.bcel.util.InstructionFinder</tt> gets a regular
         expression and a starting point as arguments and returns an
         iterator describing the area of matched instructions. Additional
         constraints to the matching area of instructions, which can not be
         implemented via regular expressions, may be expressed via <em>code
         constraint</em> objects.
       </p>

     <h4>Example: Optimizing boolean expressions</h4>
       <p>
         In Java, the boolean values <tt>true</tt> and <tt>false</tt> are
         mapped to 1 and 0, respectively. Thus, the simplest way to
         evaluate boolean expressions is to push a 1 or a 0 onto the operand
         stack depending on the truth value of the expression. But this way,
         the subsequent combination of boolean expressions (e.g., with
         <tt>&amp;&amp;</tt>) yields long chunks of code that push
         lots of 1s and 0s onto the stack.
       </p>

       <p>
         When the code has been finalized these chunks can be optimized
         with a <em>peep hole</em> algorithm: An <tt>IfInstruction</tt>
         (e.g., the comparison of two integers: <tt>if_icmpeq</tt>) that
         either produces a 1 or a 0 on the stack and is followed by an
         <tt>ifne</tt> instruction (branch if the stack value is not 0) may
         be replaced by the <tt>IfInstruction</tt> with its branch target
         replaced by the target of the <tt>ifne</tt> instruction:
       </p>

       <source>
 CodeConstraint constraint = new CodeConstraint() {
     public boolean checkCode(InstructionHandle[] match) {
         IfInstruction if1 = (IfInstruction) match[0].getInstruction();
         GOTO g = (GOTO) match[2].getInstruction();
         return (if1.getTarget() == match[3]) &amp;&amp;
             (g.getTarget() == match[4]);
     }
 };

 InstructionFinder f = new InstructionFinder(il);
 String pat = "IfInstruction ICONST_0 GOTO ICONST_1 NOP(IFEQ|IFNE)";

 for (Iterator e = f.search(pat, constraint); e.hasNext(); ) {
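     InstructionHandle[] match = (InstructionHandle[]) e.next();
     ...
     // Update target: redirect the first branch to the target of the last one
     BranchInstruction last = (BranchInstruction) match[5].getInstruction();
     ((BranchInstruction) match[0].getInstruction()).setTarget(last.getTarget());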
     ...
     try {
         il.delete(match[1], match[5]);
     } catch (TargetLostException ex) {
         ...
     }
 }
       </source>

       <p>
         The applied code constraint object ensures that the matched code
         really corresponds to the targeted expression pattern. Subsequent
         application of this algorithm removes all unnecessary stack
         operations and branch instructions from the byte code. If any of
         the deleted instructions is still referenced by an
         <tt>InstructionTargeter</tt> object, the reference has to be
         updated in the <tt>catch</tt>-clause.
       </p>

       <p>
         <b>Example application:</b>
         The expression:
       </p>

       <source>
 if ((a == null) || (i &lt; 2))
     System.out.println("Ooops");
       </source>

       <p>
         can be mapped to both of the chunks of byte code shown in <a
               href="#Figure 6">figure 6</a>. The left column represents the
         unoptimized code while the right column displays the same code
         after the peep hole algorithm has been applied:
       </p>

       <p align="center"><a name="Figure 6">
         <table>
           <tr>
             <td valign="top"><pre>
               5:  aload_0
               6:  ifnull        #13
               9:  iconst_0
               10: goto          #14
               13: iconst_1
               14: nop
               15: ifne          #36
               18: iload_1
               19: iconst_2
               20: if_icmplt     #27
               23: iconst_0
               24: goto          #28
               27: iconst_1
               28: nop
               29: ifne          #36
               32: iconst_0
               33: goto          #37
               36: iconst_1
               37: nop
               38: ifeq          #52
               41: getstatic     System.out
               44: ldc           "Ooops"
               46: invokevirtual println
               52: return
             </pre></td>
             <td valign="top"><pre>
               10: aload_0
               11: ifnull        #19
               14: iload_1
               15: iconst_2
               16: if_icmpge     #27
               19: getstatic     System.out
               22: ldc           "Ooops"
               24: invokevirtual println
               27: return
             </pre></td>
           </tr>
         </table>
       </a>
       </p>
     </subsection>
     </section>
   </body>
 </document>