Friday, November 23, 2012

RfC: Improving Maven's Performance

I typically work on relatively complex projects: one parent project and 20 or so modules. To handle that complexity, I have learned to use and appreciate Maven. OTOH, after 8 years or so with Maven, I still miss some aspects of Ant builds, in particular the speed. Maven does a good job when it comes to understandable build scripts (the biggest problem of Ant), but it can be painfully slow. Why is that? I could name several reasons, but the most obvious seems to be that Maven always builds the whole project, whereas Ant allows you to implement logic like

   if (module.isUpToDate()) {
     // Ignore it
   } else {
     // Build it
   }
Of course, Ant's syntax is completely different, but that's not the point, unless you are a fanatic XML hater and really believe that a Groovy or JSON syntax is faster by definition. (If so, stop reading, you picked the wrong posting!)
The absence of such an uptodate check isn't necessarily a problem. Most Maven plugins nowadays implement an uptodate check themselves. OTOH, if every plugin does its own uptodate check, and the module is possibly made up of other modules itself, then it all adds up.
Apart from that, uptodate checks can be unnecessarily slow. Consider the following situation, which I encounter quite frequently:
A module contains an XML schema. JAXB is used to create Java classes from the schema. If the schema is complex, then the module might easily contain several thousand Java source files.
This means that the Compiler Plugin needs to check the timestamps of several thousand Java and .class files before it can detect that it is uptodate. Likewise, the Jar Plugin will check the same thousands of .class files and compare them against the jar file before building it.
That's sad, because we could have a very easy and quick uptodate check by comparing the timestamps of the XML schema and the POM file (it does affect the build, doesn't it?) with that of the jar file. If we notice that the jar file is uptodate with regard to the other two, then we might ignore the module altogether. Ignoring it would mean to completely remove it from the reactor and not invoke the Compiler or Jar Plugins at all. Okay, that would help, but how do we achieve it without breaking the complete logic of Maven? Well, here's my proposal:
  1. Introduce a new lifecycle phase into Maven, which comes before everything else; let's call it "init". In other words, a typical Maven lifecycle would be "init, validate, compile, test, package, integration-test, verify, install, deploy" (see this document, if you need to learn about these phases).
  2. Create a new project property called "uptodate" with a default value of false (for upwards compatibility).
  3. Create a new Maven plugin called "maven-init-plugin" with a configuration like
       groupId: org.apache.maven.plugins
       artifactId: maven-init-plugin
       configuration:
         sourceResources:
           sourceResource:
             directory: src/main/schema
             includes:
               include: **/*.xsd
           sourceResource:
             directory: .
             includes:
               include: pom.xml
         targetResources:
           targetResource:
             directory: ${project.build.directory}
             includes:
               include: *.jar
     (Excuse the crude syntax, I have no idea how to display XML on blogspot.com!
      I hope you get the idea, though; a proper XML sketch follows after the list.)
     The plugin's purpose would be to perform an uptodate check by comparing source
     and target resources, and to set the "uptodate" flag accordingly.
      


  4. Modify the Maven core as follows: After the "init" phase, search for modules with isUptodate() == true and remove those modules from the reactor. Then run the other lifecycle phases.
That's it. Perfectly upwards compatible. Moderate changes. Much faster builds. How about that?
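
For illustration, here is how the configuration from step 3 might look as actual POM XML. Remember that this maven-init-plugin does not exist (yet); the element and goal names below are purely my own sketch of the proposal:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-init-plugin</artifactId>
  <executions>
    <execution>
      <phase>init</phase>
      <goals>
        <!-- hypothetical goal name -->
        <goal>uptodate-check</goal>
      </goals>
      <configuration>
        <sourceResources>
          <sourceResource>
            <directory>src/main/schema</directory>
            <includes>
              <include>**/*.xsd</include>
            </includes>
          </sourceResource>
          <sourceResource>
            <directory>.</directory>
            <includes>
              <include>pom.xml</include>
            </includes>
          </sourceResource>
        </sourceResources>
        <targetResources>
          <targetResource>
            <directory>${project.build.directory}</directory>
            <includes>
              <include>*.jar</include>
            </includes>
          </targetResource>
        </targetResources>
      </configuration>
    </execution>
  </executions>
</plugin>

The plugin would compare the newest timestamp among the source resources with the oldest timestamp among the target resources and set the "uptodate" property to true if, and only if, no source is newer than any target.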

    Friday, November 16, 2012

    DB2 Weirdness

    In the year 2012, what serious database might require code like this:
    private ResultSet getColumns(DatabaseMetaData pMetaData,
                                 String pCat,
                                 String pSchema,
                                 String pTableName)
        throws SQLException {
     if (pMetaData.getDatabaseProductName().startsWith("DB2")) {
       // Mimic DatabaseMetaData.getColumns(): map DB2 type names to java.sql.Types codes.
       final String q = "SELECT null, TABSCHEMA, TABNAME, COLNAME,"
         + " CASE TYPENAME"
         + " WHEN 'BIGINT' THEN -5"
         + " WHEN 'BLOB' THEN 2004"
         + " WHEN 'CHARACTER' THEN 1"
         + " WHEN 'DATE' THEN 91"
         + " WHEN 'INTEGER' THEN 4"
         + " WHEN 'SMALLINT' THEN 5"
         + " WHEN 'TIMESTAMP' THEN 93"
         + " WHEN 'VARCHAR' THEN 12"
         + " WHEN 'XML' THEN -1"
         + " ELSE NULL"
         + " END, TYPENAME, LENGTH FROM SYSCAT.COLUMNS"
         + " WHERE TABSCHEMA=? AND TABNAME=?";
       final PreparedStatement stmt =
         pMetaData.getConnection().prepareStatement(q);
       stmt.setString(1, pSchema);
       stmt.setString(2, pTableName);
       return stmt.executeQuery();
     } else {
       return pMetaData.getColumns(pCat, pSchema, pTableName, null);
     }
    }
    
    or this:
      private ResultSet getExportedKeys(DatabaseMetaData pMetaData)
         throws SQLException {
        if (pMetaData.getDatabaseProductName().startsWith("DB2")) {
          final String q = "SELECT null, TABSCHEMA, TABNAME,"
          +  " PK_COLNAMES, null, REFTABSCHEMA, REFTABNAME,"
          +  " FK_COLNAMES, COLCOUNT FROM SYSCAT.REFERENCES"
          +  " WHERE TABSCHEMA=? OR REFTABSCHEMA=?";
          final PreparedStatement stmt =
            pMetaData.getConnection().prepareStatement(q);
          stmt.setString(1, "EKFADM");
          stmt.setString(2, "EKFADM");
          return stmt.executeQuery();   
        } else {
          return pMetaData.getExportedKeys(null, "EKFADM", null);
        }
    }
    
    

    Thursday, October 18, 2012

    BPM Process Migration


    Having worked on several BPM projects for quite some time, I usually enjoy the help of the BPM server. In particular, BPMN etc. are excellent for conversations with the customers. Of course, you still need to translate the customer's desires into your own technical picture (which might differ considerably), but in the end you'll likely get something that gives the customer an "I know this" feeling, which is worth a lot. Of course, there are still gaps, problems, and all that stuff. However, what really sucks are upgrades of the project version.

    Disclaimer: I am no BPM expert, much less skilled in the theory, just an experienced user. This is just the result of my own thinking. In particular, don't mistake this post for a statement of my employer, Software AG, or Fujitsu. It reflects my impression of how to work with the webMethods BPM Server, or the Fujitsu Interstage Server. I have no idea how these ideas can be transferred to other BPM tools like, for example, Apache ServiceMix, or whatever.

    Terminology


    A BPM Process Model in the sense of this posting is a set of Process Nodes and a set of transitions between these nodes. In what follows, let PM be a process model, PN the set of PM's nodes, and TPN the set of transitions. PN contains two special subsets: the start nodes (SPN) and the end nodes (EPN). A process model typically reflects some kind of workflow and can be graphically visualized (see, for example, this picture:





    The possibility of graphical visualization is what's so attractive about BPM for non-technical folk.)
    A BPM Process State is an element of a universe U, typically an unstructured set of named objects. In the case of Interstage BPM, these named objects are strings; in the case of webMethods BPM, these objects can be complex (maps, arrays, etc.: the webMethods Pipeline):

    A BPM Process Instance is an element of the set PN x U: a combination of a process node and a process state. This definition is too general, of course. For example, the node must be reachable from a start node via a series of transitions out of TPN. However, for now we can ignore this.
    A BPM Process Model can have multiple versions. These versions are usually related; for example, the older version's sets of process nodes and transitions are frequently subsets of the newer version's. In general, however, they can be completely unrelated.
    A BPM Process Migration involves
    1. the creation of one or more new process models, or model versions.
    2. Possibly the removal of existing process models and process instances.
    3. Possibly a migration of process instances from one, or more process models to a new process model, typically a new version of their current model.
    This last part is the one that sucks, because it is completely unsupported. The developers are completely left alone. (All you can do is ensure some kind of compatibility, which usually implies leaving old software versions, or at least parts thereof, in place and hoping that old and new versions work fine together.)

    But what could such tools look like? This is what my post is about:
    • It should be possible to replace process models with new versions by migrating the process instances.
    • This means that a developer ought to be able to specify a mapping from the set PN1 x U to PN2 x U. (The mapping would usually be a Java class implementing a special interface; a sketch of what such an interface might look like follows below.)
    Example: A process state usually contains entries like these:
    ID (a database or other internal ID, for example of an incoming order; the process specific details are stored elsewhere and not as part of the pipeline, which would otherwise grow too big. However, the details are easily accessible.)
    State (a human readable process state, like "unconfirmed", "available", or "acknowledged".)
    Names (for example, the name of the orderer, etc. These are frequently not really required, but redundant and just copied from the details for the sake of convenience.)

    A new process version might introduce a new ID (for example, from another external system which is now connected to the process), a new state, or something like that. In order to get the existing process instances working with the new model, we can either
    - modify the process so that it supports null values, even if the values are mandatory from a business perspective, or
    - enhance the process state by adding the new values as part of the migration.
    Guess which I'd prefer? And guess which one we are left with now?
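
    To make the second option concrete, here is a minimal sketch of what a migration hook could look like, if the BPM servers offered one. Neither webMethods nor Interstage provides such an interface; the names (ProcessInstanceMigrator, Instance, and so on) are purely hypothetical:

    import java.util.Map;

    /**
     * Hypothetical migration hook: maps a process instance (node plus state)
     * of the old model version to an instance of the new version.
     */
    public interface ProcessInstanceMigrator {

        /** A process instance: a node ID plus the named objects of the process state. */
        final class Instance {
            private final String nodeId;              // element of PN
            private final Map<String, Object> state;  // element of U
            public Instance(String nodeId, Map<String, Object> state) {
                this.nodeId = nodeId;
                this.state = state;
            }
            public String getNodeId() { return nodeId; }
            public Map<String, Object> getState() { return state; }
        }

        /**
         * Maps an instance of the old model (PN1 x U) to an instance of the
         * new model (PN2 x U), e.g. by renaming the node and filling in the
         * new, now mandatory state entries.
         */
        Instance migrate(Instance oldInstance);
    }

    The BPM server would then simply apply such a migrator to every instance of the old model version when the new version is deployed.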


    Thursday, August 9, 2012

    Maven and property files

    After so many years (since 2004, indeed, when the first version of Maven 2 was still in development), I am still learning new stuff every day. For example, so far I have always been specifying properties in my POM files. But you can use external property files! There is a Properties Maven Plugin over at Mojo with a goal "properties:read-project-properties".
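
    A minimal usage snippet might look like the following. The plugin version and the property file name are just placeholders, of course; check the plugin's documentation for the current version:

    <plugin>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>properties-maven-plugin</artifactId>
      <version>1.0-alpha-2</version>
      <executions>
        <execution>
          <!-- read the properties early, so that later plugins can use them -->
          <phase>initialize</phase>
          <goals>
            <goal>read-project-properties</goal>
          </goals>
          <configuration>
            <files>
              <file>src/main/config/build.properties</file>
            </files>
          </configuration>
        </execution>
      </executions>
    </plugin>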

    Wednesday, August 8, 2012

    Maven is groovy!

    Recently, I had another one of those cases where Maven almost does the right thing, but not quite. Let me explain the use case:
    I've got a software component that can initialize the database from an SQL script. Such an SQL script (in what follows: the DDL, or data definition language, script) is ideally generated by the Hibernate Schema Exporter, aka "hbm2ddl", which in turn is available in Maven by running the Hibernate3 Maven Plugin. But what if just creating the database is not sufficient and you need to run a second SQL script (in what follows: the data script) to populate the DB with some initial entries? Well, I came up with the following solution:
    1. At build time, have Maven create the DDL script (below target/classes, so that it is available at run time)
    2. At development time, manually create the data script (in src/main/db)
    3. At build time, have Maven concatenate these scripts into a third SQL script (in what follows: The concatenated script, also below target/classes, as it must also be available at runtime)
    Question: How do we do that last step? The most obvious solution was the Maven Antrun Plugin; Ant's even got a "concat" task, which should do exactly what I want (including uptodate checks). However, I wasn't really happy with that solution, because Ant, or rather the "concat" task, behaved too unpredictably. (For example, no error was produced if either of the source files didn't exist. And error checking is where Ant scripts become really nasty.) In the end, I had to admit: it didn't work.
    So I came up with another idea: why not have a small Groovy script in the Maven POM? And, as is usually the case, someone else already had that idea, and there is a Maven plugin, the GMaven Plugin, which already provides just that:
    I can embed a Groovy snippet into my Maven POM and have it executed at a suitable point of my build script. Here's the snippet I came up with:
    <plugin>
      <groupId>org.codehaus.gmaven</groupId>
      <artifactId>gmaven-plugin</artifactId>
      <version>1.4</version>
      <executions>
        <execution>
          <phase>prepare-package</phase>
          <goals>
            <goal>execute</goal>
          </goals>
          <configuration>
            <source><![CDATA[
              def concat(s1, s2, t) {
                def java.io.File f1 = new java.io.File(s1)
                def java.io.File f2 = new java.io.File(s2)
                def java.io.File ft = new java.io.File(t)
                // lastModified() returns 0, if the file does not exist.
                def long l1 = f1.lastModified()
                def long l2 = f2.lastModified()
                def long lt = ft.lastModified()
                if (l1 == 0) {
                  throw new IllegalStateException("Source file must exist: " + f1);
                } else if (l2 == 0) {
                  throw new IllegalStateException("Source file must exist: " + f2);
                } else if (lt == 0 || l1 > lt || l2 > lt) {
                  // The target is missing, or older than one of the sources: (re-)create it.
                  java.io.File pd = ft.getParentFile()
                  if (pd != null && !pd.isDirectory() && !pd.mkdirs()) {
                    throw new IOException("Unable to create parent directory: " + pd)
                  }
                  println("Creating target file: " + ft)
                  println("Source1 = " + f1)
                  println("Source2 = " + f2)
                  // Remove a stale target first; append() would otherwise add to its old content.
                  ft.delete()
                  java.io.FileInputStream fi1 = new java.io.FileInputStream(f1)
                  java.io.FileInputStream fi2 = new java.io.FileInputStream(f2)
                  ft.append(fi1)
                  ft.append(fi2)
                  fi1.close()
                  fi2.close()
                } else {
                  println("Target file is uptodate: " + ft)
                  println("Source1 = " + f1)
                  println("Source2 = " + f2)
                }
              }
    concat("target/classes/com/softwareag/de/s/framework/demo/db/derby/initZero.sql",
    "src/main/db/init0.sql",
    "target/classes/com/softwareag/de/s/framework/demo/db/hsqldb/init0.sql")
    concat("target/classes/com/softwareag/de/s/framework/demo/db/derby/initZero.sql",
    "src/main/db/init0.sql",
    "target/classes/com/softwareag/de/s/framework/demo/db/hsqldb/init0.sql")
            ]]></source>
          </configuration>
        </execution>
      </executions>
    </plugin>

    First: In plain Java, I would have had to copy the source files to the target manually, perhaps in combination with a byte array for performance reasons. In Groovy, however, a file has got a method append(InputStream), which does exactly that. And although I am declaring the variable ft above as an instance of java.io.File, it is nevertheless a Groovy file, with all the added sugar of Groovy! Which is why embedding Groovy into the POM is much nicer than embedding Java.

    In the future, I will most likely never write Maven plugins again and use Groovy scripts instead.

    Second: We are inside a Maven POM or, to put it differently, inside an XML file. As a consequence, I've got to be careful with characters like '<' or '&'. Which is why the Groovy snippet above is wrapped in a CDATA section; alternatively, I could have escaped those characters as "&lt;" and "&amp;", or, even better, moved the snippet into an external script (in src/main/groovy). However, I believed I could make this posting's point better with an internal (albeit somewhat lengthy) snippet. Hope you agree, so let's be groovy!