Friday, 24 August 2007

Mini Axon for Java

A few months ago I was at LugRadio Live where I attended a talk by Michael Sparks from the Kamaelia project. Kamaelia is a python-based component framework for building concurrent applications. The core of the Kamaelia system is called Axon. The main concept of Axon is the Microprocess. A Microprocess is simply a class that implements a generator method. For those unfamiliar with Python, generators are Coroutines, i.e. a method where instead of executing and returning a value when called, it returns an object that can generate values. For example the following code will print 3 integers 1, 2 then 3.

def generate_ints(N):
    for i in range(N):
        yield i

gen = generate_ints(3)
print(gen.next())
print(gen.next())
print(gen.next())

The interesting part of this is that each time the next method is called control is returned back to just after where the last yield was called, restoring the state of all of the local variables. It is this feature that Axon exploits to implement its concurrent behaviour. In some ways its reminiscent of Java's green threads or SEDA

One thing that really stood out about Axon, is the getting started tutorial doesn't go through contrived use cases for the framework. Instead it starts with a tutorial that walks you through building cut down implementation of the framework. By implementing the system from the ground up (albeit a very cut down one), you get a much better understanding of the system when it comes to using it. It also demonstrates that the system is built on simple clear concepts.

One of the goals for Kamaelia was to support scalable network servers (although it is much more broadly applicable), which is interesting for implementing a mail server. I decided to run through the Mini-Axon tutorial but implementing it in Java rather than Python. The first issue is that Java does not support Coroutines. It is possible to simulate the behaviour by using an inner class that stores all state as fields rather than local variables. This was the basis for my initial implementation, but the implementation was clunky and intelligent. I then stumbled upon a library called Yielder. Yielder provides Coroutines through byte code manipulation of class files. To implement the above Python example in Java:

Iterable<Integer> gen = new Yielder<Integer>() {
    public void yieldNextCore() {
        for (int i = 1; i <= 3; i++) {
            yieldReturn(i);
        }
    }
}
Iterator<Integer> i = gen.iterator();
i.next();
i.next();
i.next();

To allow this behaviour the '-javaagent=lib/yielder.jar' is required so that subclasses of the Yielder class can be identified and modified. To implement Mini-Axon using Yielder a Microprocess is a direct subclass of the Yielder. The next most important class in the Axon framework is the Component. Functional areas of an application extend the Component class. In order to support safe concurrent applications Components behave according to a specific pattern, where each Component will have only a single reader or writer at one time. Each Component is initialised with a number of boxes (or queues, or buffers, or whichever term you prefer to use). In a typical situation, within the main method (yieldNextCore), the Component loops reading requests from its Inbox, processing them and writing responses to its Outbox. The Component should yield processing at an appropriate point during its execution. Components are not restricted to single inbox/outbox combinations. It is possible to implement aggregation or multicast by reading or writing from multiple boxes. Communication between Components is handled by wiring their boxes together. This can be done in a number of ways, a simple example is the a Postman which copies items from the outbox of one component into the inbox of another. It is also necessary to have a mechanism to actually run the components. This is handled by yet another Microprocess which can schedule the running of other Microprocesses.

Because everything in Axon is a Microprocess including scheduling components, there are all sorts of ways that Components can be wired together. It shows that the model that Axon uses is very flexible. I will probably do some experimentation with using Axon in Meldware. I am currently envisaging a solution using Apache MINA or Grizzly to generate events that would drive an Axon scheduler. The incoming data would be partitioned, either into commands (single text lines) or parts of a larger message request (e.g. the message part of an SMTP DATA command) and passed as messages between Components. Components would handle activities such as command parsing, retrieving mail data from the folders and the streaming the bodies of messages from the store. One of the trickier aspects of moving to an event based approach to building a system like Meldware is how transactions should be handled. Declarative transactions (such as those using in JBoss AOP) generally rely on a single thread processing the entire transaction. However if we want to move away from blocking I/O and use an event based approach, it will be necessary to have transactions that span multiple threads. To handle this case it looks like I will need to suspend and resume transactions manually via the JTA API. It should be possible to implement some sort of transactional Component that handles the suspension and resumption of transactions around the yield call. This class could be extended by Components that need transaction behaviour. How this would work with multiple components sharing the same transaction is something I still need to work through.

This is all still speculative. I am yet to implement any useful behaviour with my implementation of Axon, so I will see how it goes. There are still some bugs in the Yielder library (doesn't work with ecj and requires debug to be enabled), which I am working with the author of Yielder to resolve. If you are interested in the Java implementation of Axon it is available here.

Posted by mbarker at 4:42 PM in Meldware

Friday, 13 July 2007

The elephant rocks!

Some of the guys at SUN have been working on performance testing with Glassfish and PostgreSQL. Some of the results are pushing towards those achieved with an Oracle database. Postgres is currently the recommended database for use with Meldware, mainly due to its excellent BLOB support. BLOB support is something that very few database do well. Oracle is one of the few that implement BLOB locators properly (Postgres uses a paging mechanism). Unfortunately this is an area where MySQL does very badly. In MySQL the driver will either load all of the data into memory and do a single DB write (not particular useful for a system that needs to be scalable w.r.t memory) or it provides "emulated blobs" which use a series or prepared statements that use concat to add the data into the blob. Unfortunately this shifts the memory problem to the database and runs into some O(n^2) performance issues. In fact to support MySQL in Meldware we had to implement our own paging algorithm. Its great to see Postgres general DB performance look so good. Lets hope it continues to improve.
Posted by mbarker at 6:35 AM in Meldware

Thursday, 5 July 2007

LugRadio Live 2007

I going to speaking at LugRadio Live 2007. If you are in the UK head over to Wolverhampton this weekend (7th-8th of July). There are load of great speakers and its only £5. See you there!

LugRadio Live 2007
Posted by mbarker at 3:35 AM in Meldware

Tuesday, 5 June 2007

One of those a-ha moments.

The majority of my time on Meldware these days seems to be spent on our IMAP implementation. IMAP is a funny spec, every time I think I have a handle on it I discover another interesting quirk. Recently I have been trying to understand how to handle multi-accessed mailboxes. I.e. mailboxes that can be access by multiple clients at the same time. The behaviour in this situation is defined in a different rfc to the main IMAP4 spec.

The trickiest part of this is how to handle mailbox expunges. An IMAP server is supposed send updates of changes to a mailbox to any client that has that mailbox selected, except expunges. Expunges must not be sent unsolicited nor when a FETCH, STORE or SEARCH is executing. RFC 2180 specifies 4 different approaches to supporting expunge. Of the 4 approaches only 2 are worth considering.

  • Prevent EXPUNGEs when more than 1 client has a mailbox selected
  • Wait until appropriate time before showing other clients that the messages have been removed
  • Initially I intended implementing the first option. In order to do this I needed a locking mechanism that supported shared locks and upgrading of locks from shared to exclusive. Quite easy with the new Java concurrency API, well that is until we want to support clustering. Two quite obvious approaches jump out. Firstly do something in the database, which is okay until you need to deal with nodes that may die without cleanly removing their locks. We can work around this by applying timeouts to locks. Another option would be use JGroups and do implement some form of clustered locking manager. I rejected this, after trying to bend my mind round all of the possible race conditions that could occur in such an implementation.

    After considering this for some time, finally the penny dropped. What I should do is implement the second behaviour using a "version" number to determine which messages are currently visible. In the latest implementation, the Folder (our entity representing an IMAP mailbox) holds an expunge version number. We also maintain an expunge version value on the FolderEntry. A FolderEntry represents a many-many relationship between Messages and Folders. Unlike other mail servers, we don't duplicate emails when sending to multiple receivers on the same server. On initial delivery the FolderEntry's expunge version is set to Long.MAX_VALUE (2^63 - 1). When a client selects a mailbox it uses the expunge version to determine which messages are visible SELECT * FROM folderentry where expungeversion > :folder_expunge_version. When a client decides to expunge a folder it increments the Folder's expunge version and sets the expunge version of all of the deleted FolderEntries to that value. Other clients will not see the messages that have been deleted, until they decide to refresh their instance of the Folder entity.

    The only side effect of this solution is that we can't immediately delete the actual rows from the database. When a message is expunged we set an expunge date on the FolderEntry. With both IMAP and POP we set a timeout value for the connection, thereby ensuring that POP and IMAP clients are never more that about 30 minutes out of date with the current mailbox state. We will need to have some periodic mechanism for clearing down expunged messages. Probably after they have been expunged for more than 24 hours.

    Technorati Tags:

    Posted by mbarker at 4:22 PM in Meldware

    Monday, 4 June 2007

    Panto 0.4 release. Still really fast.

    I have made available a 0.4 version of Panto, Buni's fast mime parser. The jar file in only 30K in size and has no other dependencies. It makes use of the Simple Boyer-Moore string matching algorithm. Compared to the lastest version of Apache Mime4J it is 40 times faster on large messages (~2MB).

    Parsing a 2MB message 20 times:

    Apache Mime4J: 4049ms
    Buni Panto: 233ms

    Check out the performance tests

    Posted by mbarker at 9:54 AM in Meldware

    Wednesday, 16 May 2007

    LUG Presentations

    This Saturday (19th May) I am heading over to Manchester University to give a talk at ManLUG on Meldware. This will be the first of my contributions to our LUG Tour. I have also had a talk accepted for LugRadio Live 2007.

    null

    I am scheduled to speak at 3pm on Sunday afternoon. I expect I will be addressing lots of people on the tail end of some nasty hangovers.

    Technorati Tags:

    Posted by mbarker at 4:18 AM in Meldware

    Saturday, 5 May 2007

    Squeezing the SPAM Guice

    Like any half-way decent mail server, Meldware needs a SPAM filtering solution. There are a number of good Free and proprietary solutions for managing SPAM floating about. For Meldware I wanted something that was Java-based and embeddable, i.e. a library rather than some external system. That lead me to jASEN, which appeared to meet most of our requirements. After getting a few unit tests working and writing some transformation code from our email structures to those used by jASEN, I proceeded to deploy it to JBoss. That is where I ran into problems. The jASEN library has an interesting configuration mechanism. There is a central XML file that declares the configuration for system and specifies the plug-ins. Each plug-in class has its own properties file which is loaded within its init method. The property files would be loading by getting the resource's location from the classloader as a URL, and transforming that into a file name. This caused the following problems.

    • JBoss was unable to locate the file due to the URL -> file name transformation.
    • The configuration files could not be placed inside of a jar file, even though they were mostly boiler-plate.
    • It would be difficult to integrate this configuration mechanism into our administration tool

    Despite all of the this jASEN contained a collection of useful scanning algorithms. Rather than throw the baby out with the bath water it set about modifying jASEN so that its configuration could be externalised. The code appeared to be crying out for some form of dependency injection. To implement this I decided to have a whack at using the new Google Guice dependency injection tool.

    This proved to be the way to go. The majority of the configuration information was simply binding interfaces to their implementation. Guice takes the approach that most dependency injection configuration is boiler plate, therefore the majority of the work is done in Java rather than XML (or some other configuration format). Thereby adding an element of type safety to the binding. To inject a value into a class using Guice the @Inject annotation is used. Guice then uses the type of the value being injected and any annotation that is applied to that value (constructor/method parameter or field). To specify the bindings, I had to create a Module implementation. Assuming I had a class that required access to a probability calculator.

    First I have a probability calculator interface and implementation.

    public interface ProbabilityCalculator {
        double calculate(double[] probabilities, int start, int end);
    }
    
    public class CompoundCalculator implements ProbabilityCalculator {
        ...
    }
    

    And a class that uses the calculator.

    public class AnomalousCharacterScanner implements JasenPlugin {
        private final ProbabilityCalculator calculator;
    
        public AnomalousCharacterScanner(ProbabilityCalculator calculator) {
            this.calculator = calculator;
        }
    }

    Then I can write the configuration module to bind the implementation of the ProbabilityCalculator.

    public class JasenModule extends AbstractModule {
        public void configure() {
            bind(ProbabilityCalculator.class).to(CompoundCalculator.class).in(SINGLETON);
        }
    }
    Then I can create an injector which gives me an entry point into my application.
    Injector i = Guice.createInjector(new JasenModule());
    AnomalousCharacterScanner acs = i.getInstance(AnomalousCharacterScanner.class);

    I have used the SINGLETON specifier, because I know that the CompundCalculator is stateless and a single instance can happily be shared among many classes. Note that this is not a singleton in the GOF-style pattern, but a single instance within the scope of the injector that I created.

    Guice also has the concept of providers, which are like factories and can be used to create instances. I have a JasenMap object that holds a mapping of all of the spam and non-spam text tokens. This is created using a loader class, however I can isolate the loading of the map using a provider.

    First I need to write a provider for a JasenMap, which assumes that the JasenMap is just a serialized Java object

    public class InputStreamMapProvider implements Provider {
        private final JasenMap map;
        public InputStreamMapProvider(InputStream in) {
            ObjectInputStream oin = new ObjectInputStream(in);
            map = (JasenMap) oin.readObject();
        }
        
        public JasenMap get() {
            return map;
        }
    }

    I have a class that needs the JasenMap to be injected.

    public class RobinsonScanner implements JasenPlugin {
        private final JasenMap map;
        
        @Inject
        public RobinsonScanner(JasenMap map) {
            this.map = map;
        }
    }

    And I bind the provider in my JasenModule implementation

    FileInputStream f = new FileInputStream("/path/to/map/file");
    Provider p = new InputStreamMapProvider(f);
    bind(JasenMap.class).to(p);

    The type safety and simple binding strategy the Guice is useful, however some configuration information simply has to be specified at runtime. The simplest way to implement this is using the @Named annotation provided by Guice. The mechanism is quite simple. Specify the Named annotation and a string to identify the a value that needs to be externally configurable. The optional = true means that this configuration value can be excluded. I have specified a default value for the field.

    public class AnomalousCharacterScanner {
        @Inject(optional=true)
        @Named("AnomalousCharacterScanner.max")
        private float max = 0.9f;
    }

    Specify the value in a properties file.

    AnomalousCharacterScanner.max=0.75

    Bind a properties object using the Guice Names interface, from within the JasenModule.

    InputStream in = new FileInputStream("/path/to/config/file");
    Properties props = new Properties();
    p.load(f);
    Names.bindProperties(binder(), config);

    Guice will coerce the string into a class, enum or primitive value for you, but it will not coerce into any other types. I needed a way to convert some comma delimited strings into string arrays. To do this I need my own implementation of the Named annotation and a custom provider.

    private class MyNamedImpl implements Named {
    
        public Class annotationType() {
            return Named.class;
        }
        
        final String value;
    
        public MyNamedImpl(String value) {
          this.value = Objects.nonNull(value, "name");
        }
    
        public String value() {
          return this.value;
        }
    
        public int hashCode() {
          // This is specified in java.lang.Annotation.
          return 127 * "value".hashCode() ^ value.hashCode();
        }
    
        public boolean equals(Object o) {
          if (!(o instanceof Named)) {
            return false;
          }
    
          Named other = (Named) o;
          return value.equals(other.value());
        }
    }
    
    private static class ArrayProvider implements Provider {
        String value;
        
        public ArrayProvider(String value) {
            this.value = value;
        }
    
        public String[] get() {
            System.out.printf("Converting %s\n", value);
            if (value != null) {
                return value.split(",");
            } else {
                return new String[0];
            }
        }
    }
    

    I can then bind my properties.

    Properties props = new Properties();
    props.load(f);
    for (Map.Entry e : props.entrySet()) {
        String key = (String) e.getKey();
        String value = (String) e.getValue();
        bind(String[].class).annotatedWith(new MyNamedImpl(key)).toProvider(new ArrayProvider(value));
    }

    It is a little bit hacky in that it will create and bind providers for some fields that aren't actually string arrays, but this doesn't cause any problems for Guice. It will only use the provider for values that are actually are typed as string arrays. It will suck a few unnecessary CPU cycles, but it is unlikely to cause a performance hit.

    All told I am a big fan of Guice. I like its type safe binding, speed and simplicity. There are a load of other features I haven't touched on. E.g. method interception to do transactions, etc. The only feature I would like to see added at this point is a way to add custom coercions from strings into other objects, e.g. arrays, URLs, etc.

    We now have SPAM filtering built into Meldware using our own forked, Guice-ified implementation of jASEN. It integrates with Thunderbird quite nicely. It will be a feature of our M8 release

    Posted by mbarker at 8:01 AM in Meldware

    Monday, 19 March 2007

    Hilarious: SOAP v REST

    Interesting imaginary conversation about using SOAP to build applications. I think we'll stick to a REST-like interface for our webmail for the time being.

    In other news Panto 0.3 has been released. Wondering what happened to 0.2? I think its in Jamaica watching the cricket.

    Posted by mbarker at 6:19 PM in Meldware

    Monday, 26 February 2007

    More ANTLR

    A little while ago I though I ANTLR sussed. What a fool I was. After exclaiming my skills at understanding syncatic predicates, I found a bug where my definition of an atom didn't support an number of non-alphanumeric characters. Using the solution that I had implemented by hacking up the pre-existing ANTLR grammar it was basically unfixable, mainly due to the excessively ambiguous nature of the IMAP grammar. What I really needed was a context based lexer, the IMAP grammer is reasonably simple, just hightly ambigous. After considering throwing away the ANTLR solution, I bumped into this entry in the ANTLR documentation. The result was turfing all of the previously ANTLR grammar definition and replacing it with just a lexer. By using a single top level lexer token definition and by making the rest of the tokens protected I was able to get the behaviour I desired.

    Posted by mbarker at 12:40 PM in Meldware

    Tuesday, 20 February 2007

    It's Always in the Last Place You Look...

    We've managed to get Milestone 7 out the door now which is great. Andy has bestowed 'Beta' status onto the IMAP implementation, which I think acurately reflects the work that has gone into it recently. The main changes for IMAP in M7 include a new MIME parser, a new proxy pattern based API for the Mailbox, a complete reimplementation of our ANTLR parser (in fact its not a parser at all, its just a lexer) and a new Search implementation.

    Its the details of new search that I wanted to bore you with today. The previous implemenation of search was a just a bit naive, it would take in the incoming search terms, load all of the messages in the folder and scan them for any matches. Not particularly efficient, mostly because the search was implemented on the consumer side of the Mailbox API. So whilst in the midst of a 'make the DB do all of the work' rage, I added search to the Mailbox API. The caller simply creates a SearchKey and passes that to the folder proxy that it wants to search.

    MailboxService service; // Injected
    Mailbox mailbox = service.createMailboxProxy("alias");
    Folder inbox = mailbox.getFolder("INBOX");
    SearchKey recentKey = SearchKey.create(SearchKey.RECENT, true);
    SearchKey sizeKey = SearchKey.create(SearchKey.LARGER, 4096);
    SearchKey andKey = SearchKey.createAnd(recentKey, sizeKey);
    Collection<Long> uids = folder.search(andKey, true);

    The SearchKey is a flyweight that indicates key and value(s) to search for. The Mailbox takes a the keys and creates a number of SearchQuery objects depending on the type of query being issued. Where possible the implementation tries to convert the search into an EJB-QL query so that the heavy lifting for the search is handed off to the database. At the moment the implementation is fairly simple, as multiple search keys will result in multiple queries. At some point I will change the implementation so that it will merge some of the search keys into a single query. Some types of queries can't be merged (specifically text and body searches), but there is all sort of other optimisations that can be done, including reordering the search keys depending on how efficiently a given key can searched for.

    All in all the IMAP implementation is looking pretty complete. M7 still has a couple of bugs in the bodystruture implemenation (fixed in HEAD) and still requires more testing with clients other than T-Bird. Messages with attachment seem to crash Evolution and Outlook Express doesn't "see" the INBOX. Expect issues like these to be fixed soon.

    Posted by mbarker at 6:54 PM in Meldware

    Wednesday, 7 February 2007

    Panto 0.1 released

    Panto is now integrated into the Meldware code base and a separate release is available. This should now make it into our Milestone 7 release.

    Technorati Tags:

    Posted by mbarker at 10:15 AM in Meldware

    Wednesday, 17 January 2007

    Fetch Command & Evil Hacks

    I've just committed a working implementation of the FETCH command for our imap server, we were missing support for individual message body parts and doing partial fetches. I've had to redo some of our mime parser implementation in order to build our message & body parts correctly. We're currently using the Mime4J from the James mail server project. The implementation is quite solid and provides a nice SAX style model for hooking into the parser. It uses a byte by byte comparison to determine when it matches. It would be nice to improve this using something like Boyer-Moore to match the mime boundary.

    I ran into a bit of a quirk with my implementation of the code that writes Mime message out to the user. I currently walk a the message tree structure, writing each part to an OutputStream. When implementing the partial fetch I noticed that would of had to check the number of bytes written at each point. The proper fix (which I will do at some point) would be to lazily create an InputStream so that the message structure can be "read". For the time being I have implemented a quick hack.

    private static class SizeLimitedOutputStream 
        extends OutputStream {
    
        @Override
        public void write(byte[] b, int off, int len) throws IOException {
                
            if (pos >= end) {
                out.flush();
                throw LimitReachedException.instance();
            } else if (pos + len >= start) {
                int toSkip = Math.max(0, start - pos);
                int toTrunc = Math.max(0, (pos + len) - end);
                off += toSkip;
                len -= (toSkip + toTrunc);
                if (off < b.length) {
                    out.write(b, off, len);
                }
                pos += (len + toSkip);
            } else {
                pos += len;
            }
        }
    

    Basically throw the exception to signal that the size limit has been reached. Apparently the RIFE framework uses excecptions to implement continuations as I found in this entry in the Java Specialists' Newsletter which describes an approach to improve the preformance of exception creation.

    private static class LimitReachedException extends RuntimeException {
        private final static LimitReachedException e 
            = new LimitReachedException();
    
        public Throwable fillInStackTrace() {
            return null;
        }
            
        public static LimitReachedException instance() {
            return e;
        }
    }
    

    A sufficently evil hack. I will have the proper fix in before 1.0 (which will be quicker as it will handle offsets more efficiently), but this will do prior to then.

    Technorati Tags:

    Posted by mbarker at 10:52 PM in Meldware

    Thursday, 11 January 2007

    Bustin' out the ANTLR-fu

    For better or for worse we inherited an ANTLR grammar to handle the parsing of commands for our IMAP server. It has been a boon in terms of getting IMAP usable as quick as possible. Unfortunately the grammar contains a few niggles, e.g. for handling the range parameter of a fetch command, it would simply take the whole range as a single big string and parse it seperately using a combination of regular expressions and String.indexOf().

    This didn't seem right as we are using a very powerful parsing tool, we shouldn't need to do any seperate parsing. I set about fixing this, however it wasn't as simple as I initially thought. The main problem is that IMAP has a very ambigous grammar. The type of a particular token is heavily dependent upon where it occurs within the grammar. E.g. consider the imap command:

    1 FETCH 1 (FLAGS)
    It contains the token '1' as the tag and '1' as the sequence set. Within IMAP's grammer the first '1' value is considered to an atom (alpha-numeric string, plus some extra characters), in the second position it is non-zero number. This type of ambiguity causes problems for the ANTLR lexer as, without doing anything special, it does not have any notion of context, therefore will issue warning if you define atoms and non-zero number using the most obvious approach.

    ATOM : ( 'A'..'Z' | 'a'..'z' | '0'..'9' | '!' | '$' | ... )+;
    NZ_NUMBER : ( '1'..'9' ) ( '0'..'9' )*;

    Note how the NZ_NUMBER is a subset of ATOM set. My initial thought was to reduce the size of the tokens such that all tokens were single characters and handle the full tokens within the grammar, but it didn't work very well as I had to create a token for every character and implement each literal keyword as a grammar rule. Where to next? Well, it was necessary to remove the ambiguity. How? By defining tokens that fall into distinct sets. This is tricky when our tokens often will have a common prefex, e.g. 111A and 1111. ANTLR provides some options that can be used to resolve this, as it has the ability to do a lookahead in the input stream. Either explicitly, by increasing the setting for k (the value of the lexical lookahead), or by using a syntactic predicates to scan an arbitrary number of tokens ahead in the stream. My token definitions are as follows:

    protected
    DIGIT : '0'..'9';
    protected
    NUMBER : (DIGIT)+;
    protected
    NZ_NUMBER : '1'..'9' (DIGIT)*;
    protected
    ATOM_CHAR : ~( '\\' | '\"' | '\u007f'..'\u00FF' | '\u0000'..'\u001f' | ... );
    protected
    ATOM : (ATOM_CHAR)+;
    
    UBER_PREDICATE 
        : ( NZ_NUMBER DIGIT )     => NZ_NUMBER { $setType(NZ_NUMBER); }
        | ( NZ_NUMBER ATOM_CHAR ) => ATOM      { $setType(ATOM); }
        | ( NZ_NUMBER )           => NZ_NUMBER { $setType(NZ_NUMBER); }
        | ( NUMBER DIGIT )        => NUMBER    { $setType(NUMBER); }
        | ( NUMBER ATOM_CHAR )    => ATOM      { $setType(ATOM); }
        | ( NUMBER )              => NUMBER    { $setType(NUMBER); }
        | ( ATOM )                => ATOM      { $setType(ATOM); }
        ;

    The protected keyword means that the rules specified are only accessable as subrules from "non-protected" rules. I.e. there will be no ambiguity between ATOM and UBER_PREDICATE. The (ABC) => XYZ is the syntactic predicate, the ABC is the predicate that will be matched against. This can be an arbitrary length rule like NUMBER which is a list of DIGIT tokens. The XYZ it the production, I think of it as the state that the parser will be in as a result of the match against the predicate. The final bit $setType(XYZ) determines the token that will be returned to the parser. Note that although some of tokens are protected, they can be returned explicitly.

    The rules state that if a non-numeric character occurs after an numeric one consider it to be an atom rather than a number. An interesting point is that ANTLR, in the first example, will not use the order of declaration to resolve the ambiguity. When using a list of predicates as in the second example the order does become important. If order wasn't important then the rules (NUMBER DIGIT) => NUMBER and (NUMBER ATOM_CHAR) => ATOM would be considered ambiguous.

    This isn't the full story as now we have made ATOM and NZ_NUMBER distinct, therefore any numeric value input will only be interpreted as a NUMBER (or a NZ_NUMBER) and never as an ATOM. This is where the parser grammar comes in. We need to define a grammar rule for an atom that will accept both numeric and non-numeric values and as the grammar has information about the context of the input, it will not be ambigous.

    atom
        : ATOM
        | NUMBER
        | NZ_NUMBER
        ;

    There are a few more details, but this is crux of the changes to our parser for IMAP. Hopefully moving all of the parsing logic into a single place will further simplify and cleanse the IMAP implementation.

    Posted by mbarker at 1:30 PM in Meldware

    Saturday, 6 January 2007

    IMAP Performance - a Good Start

    I've spent some of the holiday period looking at IMAP performance and stability. I fleshed out the IMAP unit tests for the IMAP server, which highlighted a couple of bugs and has improved the stability. One aspect of our IMAP implementation is that it was coupled tightly to the folder implementation. The first step in improving the IMAP server was to introduce an API for our folder implemenation. After looking at a couple of options, the folder API uses an interface/proxy approach. The advantage of the proxy approach is that all sorts smart things can be done inside the proxy implemenation without much of an impact on the IMAP server.

    Once the API was in place I got a chance to make the first performance improvement. The key issue is related to the loading of messages upon each fetch request. The initial implemenation would load all of the message meta data for each fetch request then filter out the messages that are not required. The fix was to delegate this request down the database. It now generates the EJB-QL based on the IMAP sequence set passed in from the client.

    To test this I sent 500 mails to the same mailbox, then retrieved all of the messages (full message 1 by 1) via IMAP. Check out the following charts from jconsole to see the improvement.

    Original implemenation.

    Original Implemenation

    The first portion of the line is the SMTP delivery, Second portion that jags up to ~128MB is the loading of the messages via IMAP. The time spend receiving the messages via IMAP was 263 seconds.

    New Implemenation

    New Implemenation

    The time spent retrieving the messages reduced to 46 seconds and the memory used never gets above 55MB.

    Its a good start, and probably the most significant improvement to be made. There are a some more improvements to come, but unlikely to show them same sort of difference on a single user test. All of these changes are in CVS HEAD, IMAP is going to rock in our M7 release.

    Posted by mbarker at 12:09 PM in Meldware

    Monday, 11 December 2006

    Buni Begins

    So after some time deleveloping in the dark, the Buni community & the Meldware Communication Server now see the light of day. Its great to get the work out in the open. I have been busy fixing things in a variety of areas in the Mail Server IMAP, WebMail and event to some of our POP and SMTP code. The Mailbox implemenation we have now is starting to mature nicely. It works very well for POP, but is waiting on a few changes to be made to make our IMAP implemenation rock solid.

    Along with reams of Java hacking, I have found it necessary to teach myself ActionScript/ECMAScript, as our WebMail GUI is based on the Flex toolkit from Adobe. Despite some initial reluctance on my part, I have found Flex to be a pretty decent framework for rich web applications. Due to Flash using a byte code/JIT based mechanism for execution, it is much faster than JavaScript and it also seems to start and run much quicker than a Java Applet. Having spent 2.5 working on a large AJAX project my perference for a rich web application framework was heading toward GWT, but now I am not so sure. I guess the biggest challenge will be the lack of a large Flash 9 installed user base. Oh yeah, come on Adobe, lets see Flash open sourced.

    Posted by mbarker at 5:07 AM in Buni.org