Saturday, 5 May 2007
Squeezing the SPAM Guice
Like any half-way decent mail server, Meldware needs a SPAM filtering solution. There are a number of good Free and proprietary solutions for managing SPAM floating about. For Meldware I wanted something that was Java-based and embeddable, i.e. a library rather than some external system. That led me to jASEN, which appeared to meet most of our requirements. After getting a few unit tests working and writing some transformation code from our email structures to those used by jASEN, I proceeded to deploy it to JBoss. That is where I ran into problems.
The jASEN library has an interesting configuration mechanism. There is a central XML file that declares the configuration for the system and specifies the plug-ins. Each plug-in class has its own properties file, which is loaded within its init method. The properties files were loaded by getting the resource's location from the classloader as a URL and transforming that into a file name. This caused the following problems.
- JBoss was unable to locate the file due to the URL -> file name transformation.
- The configuration files could not be placed inside of a jar file, even though they were mostly boiler-plate.
- It would be difficult to integrate this configuration mechanism into our administration tool.
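The first two problems in the list above come from treating a classpath resource as a file. Here is a minimal JDK-only sketch of the difference. The class and method names (`ClasspathConfig`, `isPlainFile`, `load`) are hypothetical and not part of jASEN. A resource packed into a jar has a `jar:` location that has no file name of its own, whereas reading it as a stream works wherever the classloader can see it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.Properties;

// Hypothetical helper, not part of jASEN: contrasts file-based and
// stream-based access to a classpath resource.
class ClasspathConfig {

    // Only "file:" locations map directly onto a file name; "jar:" and
    // other schemes do not.
    static boolean isPlainFile(URI location) {
        return "file".equals(location.getScheme());
    }

    // Reads a properties resource as a stream, so it does not matter whether
    // the resource is a loose file or an entry inside a jar.
    // Returns empty properties if the resource is not on the classpath.
    static Properties load(ClassLoader loader, String resourceName) {
        Properties props = new Properties();
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in != null) {
                props.load(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

Loading through a stream also means the boiler-plate property files can ship inside the jar.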
Despite all of this, jASEN contained a collection of useful scanning algorithms. Rather than throw the baby out with the bath water, I set about modifying jASEN so that its configuration could be externalised. The code appeared to be crying out for some form of dependency injection. To implement this I decided to have a whack at using the new Google Guice dependency injection tool.
This proved to be the way to go. The majority of the configuration information was simply binding interfaces to their implementations. Guice takes the approach that most dependency injection configuration is boiler plate, so the majority of the work is done in Java rather than XML (or some other configuration format), which adds an element of type safety to the binding. To inject a value into a class using Guice, the @Inject annotation is used. Guice then uses the type of the value being injected, plus any annotation applied to that value (constructor/method parameter or field), to find the matching binding. To specify the bindings, I had to create a Module implementation. As an example, assume I have a class that requires access to a probability calculator.
First I have a probability calculator interface and implementation.
public interface ProbabilityCalculator {
    double calculate(double[] probabilities, int start, int end);
}
public class CompoundCalculator implements ProbabilityCalculator {
    ...
}
And a class that uses the calculator.
public class AnomalousCharacterScanner implements JasenPlugin {
    private final ProbabilityCalculator calculator;
    // Guice only calls a constructor with parameters if it is marked @Inject.
    @Inject
    public AnomalousCharacterScanner(ProbabilityCalculator calculator) {
        this.calculator = calculator;
    }
}
Then I can write the configuration module to bind the implementation of the ProbabilityCalculator.
public class JasenModule extends AbstractModule {
    public void configure() {
        bind(ProbabilityCalculator.class).to(CompoundCalculator.class).in(SINGLETON);
    }
}
Then I can create an injector which gives me an entry point into my application.
Injector i = Guice.createInjector(new JasenModule());
AnomalousCharacterScanner acs = i.getInstance(AnomalousCharacterScanner.class);
I have used the SINGLETON specifier because I know that the CompoundCalculator is stateless and a single instance can happily be shared among many classes. Note that this is not a singleton in the GOF-style pattern, but a single instance within the scope of the injector that I created.
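To make the distinction between a GOF singleton and a per-injector singleton concrete, here is a toy sketch in plain Java. `ToyInjector` and `ToyCalculator` are hypothetical names invented for this illustration. This is not how Guice is implemented; it only mimics the scoping behaviour described above, where the cached instance belongs to one injector rather than to a static field.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Toy stand-in for an injector, written only to illustrate scoping.
class ToyInjector {
    private final Map<Class<?>, Supplier<?>> bindings = new HashMap<>();
    private final Map<Class<?>, Object> singletons = new HashMap<>();

    // Binds a type to a factory that is called on every lookup.
    <T> void bind(Class<T> type, Supplier<? extends T> factory) {
        bindings.put(type, factory);
    }

    // Binds a type to a factory whose result is cached by this injector only.
    <T> void bindSingleton(Class<T> type, Supplier<? extends T> factory) {
        bindings.put(type, () -> {
            Object instance = singletons.get(type);
            if (instance == null) {
                instance = factory.get();
                singletons.put(type, instance);
            }
            return instance;
        });
    }

    <T> T getInstance(Class<T> type) {
        Supplier<?> factory = bindings.get(type);
        if (factory == null) {
            throw new IllegalStateException("No binding for " + type.getName());
        }
        return type.cast(factory.get());
    }
}

// Hypothetical stateless service used for the demonstration.
class ToyCalculator {
}
```

Two `ToyInjector` instances that each bind `ToyCalculator` as a singleton hand out two different objects, while repeated lookups on the same injector return the same one.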
Guice also has the concept of providers, which are like factories and can be used to create instances. I have a JasenMap object that holds a mapping of all of the spam and non-spam text tokens. This is created using a loader class; however, I can isolate the loading of the map using a provider.
First I need to write a provider for a JasenMap, which assumes that the JasenMap is just a serialized Java object.
public class InputStreamMapProvider implements Provider<JasenMap> {
    private final JasenMap map;
    // Reads the whole map up front; both checked exceptions come from ObjectInputStream.
    public InputStreamMapProvider(InputStream in) throws IOException, ClassNotFoundException {
        ObjectInputStream oin = new ObjectInputStream(in);
        map = (JasenMap) oin.readObject();
    }
    public JasenMap get() {
        return map;
    }
}
I have a class that needs the JasenMap to be injected.
public class RobinsonScanner implements JasenPlugin {
    private final JasenMap map;
    @Inject
    public RobinsonScanner(JasenMap map) {
        this.map = map;
    }
}
And I bind the provider in my JasenModule implementation.
FileInputStream f = new FileInputStream("/path/to/map/file");
Provider<JasenMap> p = new InputStreamMapProvider(f);
bind(JasenMap.class).toProvider(p);
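The provider above relies on the map being a plain serialized Java object. As a rough JDK-only illustration of that round trip, the sketch below uses `java.util.function.Supplier` in place of Guice's `Provider` and a `HashMap` in place of `JasenMap`. Both substitutions, and the names `SerializedObjectSupplier` and `serialize`, are assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Supplier;

// JDK-only stand-in for a provider that hands out one deserialized object.
class SerializedObjectSupplier<T> implements Supplier<T> {
    private final T value;

    // Reads the object eagerly, as InputStreamMapProvider does, and closes the stream.
    SerializedObjectSupplier(InputStream in, Class<T> type) {
        try (ObjectInputStream oin = new ObjectInputStream(in)) {
            value = type.cast(oin.readObject());
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Could not read serialized " + type.getName(), e);
        }
    }

    @Override
    public T get() {
        return value;
    }

    // Helper for the demonstration: serializes an object to a byte array.
    static byte[] serialize(Serializable object) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(object);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Because the object is read once in the constructor, every call to `get()` returns the same instance, which matches the behaviour of the provider shown earlier.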
The type safety and simple binding strategy of Guice are useful; however, some configuration information simply has to be specified at runtime. The simplest way to implement this is using the @Named annotation provided by Guice. The mechanism is quite simple. Specify the Named annotation and a string to identify a value that needs to be externally configurable. The optional=true attribute means that this configuration value can be left out, in which case the default value I have specified for the field is used.
public class AnomalousCharacterScanner {
    @Inject(optional=true)
    @Named("AnomalousCharacterScanner.max")
    private float max = 0.9f;
}
Specify the value in a properties file.
AnomalousCharacterScanner.max=0.75
Bind a properties object using the Guice Names class, from within the JasenModule.
InputStream in = new FileInputStream("/path/to/config/file");
Properties props = new Properties();
props.load(in);
Names.bindProperties(binder(), props);
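The effect of an optional named value with a default can be mimicked by hand in plain Java. The sketch below is not Guice code and the names `NamedFloat`, `lookup` and `parse` are hypothetical. It only shows the behaviour one would expect from the set-up above: the configured string is converted to a float if the property exists, and the field default is kept otherwise.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper that imitates an optional named float value.
class NamedFloat {

    // Returns the configured value if present, otherwise the default.
    static float lookup(Properties props, String name, float defaultValue) {
        String raw = props.getProperty(name);
        return raw == null ? defaultValue : Float.parseFloat(raw.trim());
    }

    // Parses properties from text so the example needs no file on disk.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

With the property line shown earlier, `lookup` yields 0.75; with an empty properties object it falls back to 0.9.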
Guice will coerce the string into a class, enum or primitive value for you, but it will not coerce into any other types. I needed a way to convert some comma delimited strings into string arrays. To do this I need my own implementation of the Named annotation and a custom provider.
private class MyNamedImpl implements Named {
    public Class<? extends Annotation> annotationType() {
        return Named.class;
    }
    final String value;
    public MyNamedImpl(String value) {
        this.value = Objects.nonNull(value, "name");
    }
    public String value() {
        return this.value;
    }
    public int hashCode() {
        // This is specified in java.lang.annotation.Annotation.
        return 127 * "value".hashCode() ^ value.hashCode();
    }
    public boolean equals(Object o) {
        if (!(o instanceof Named)) {
            return false;
        }
        Named other = (Named) o;
        return value.equals(other.value());
    }
}
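The hashCode in MyNamedImpl is not arbitrary. Guice looks bindings up by type plus annotation, so a hand-written annotation implementation has to be equal to, and hash the same as, the instance the JVM creates from an annotation in source code. Below is a small JDK-only check of that formula. `Label` is a hypothetical single-member annotation standing in for Guice's `@Named`, and `AnnotationHashDemo` is likewise made up for this sketch.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Hypothetical single-member annotation, standing in for @Named.
@Retention(RetentionPolicy.RUNTIME)
@interface Label {
    String value();
}

class AnnotationHashDemo {
    @Label("AnomalousCharacterScanner.max")
    static final Object MARKED = new Object();

    // Returns the annotation instance the JVM creates for the field above.
    static Label reflected() {
        try {
            return AnnotationHashDemo.class.getDeclaredField("MARKED").getAnnotation(Label.class);
        } catch (NoSuchFieldException e) {
            throw new IllegalStateException(e);
        }
    }

    // The formula from the java.lang.annotation.Annotation contract for an
    // annotation whose only member is a String called "value".
    static int handWrittenHash(String value) {
        return (127 * "value".hashCode()) ^ value.hashCode();
    }
}
```

Comparing `AnnotationHashDemo.reflected().hashCode()` with `handWrittenHash(...)` for the same string gives equal results, which is what lets a hand-written implementation be used interchangeably as a lookup key.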
private static class ArrayProvider implements Provider<String[]> {
    private final String value;
    public ArrayProvider(String value) {
        this.value = value;
    }
    public String[] get() {
        System.out.printf("Converting %s\n", value);
        if (value != null) {
            return value.split(",");
        } else {
            return new String[0];
        }
    }
}
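One detail of the conversion above is worth knowing: `String.split(",")` keeps surrounding whitespace, and an empty string splits into a one-element array containing the empty string. A slightly more defensive variant might look like the sketch below. `CommaList` is a hypothetical name, and whether trimming and dropping empty entries is wanted depends on the configuration values in question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical variant of the comma-splitting step that trims whitespace
// and drops empty entries.
class CommaList {
    static String[] parse(String value) {
        List<String> items = new ArrayList<>();
        if (value != null) {
            for (String part : value.split(",")) {
                String trimmed = part.trim();
                if (!trimmed.isEmpty()) {
                    items.add(trimmed);
                }
            }
        }
        return items.toArray(new String[0]);
    }
}
```

The same logic could sit inside the `get()` method of a provider like ArrayProvider.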
I can then bind my properties.
Properties props = new Properties();
props.load(f);
for (Map.Entry
It is a little bit hacky in that it will create and bind providers for some fields that aren't actually string arrays, but this doesn't cause any problems for Guice. It will only use the provider for values that are actually typed as string arrays. It will suck a few unnecessary CPU cycles, but it is unlikely to cause a noticeable performance hit.
All told I am a big fan of Guice. I like its type safe binding, speed and simplicity. There are a load of other features I haven't touched on, e.g. method interception to do transactions. The only feature I would like to see added at this point is a way to add custom coercions from strings into other objects, e.g. arrays, URLs, etc.
We now have SPAM filtering built into Meldware using our own forked, Guice-ified implementation of jASEN. It integrates with Thunderbird quite nicely. It will be a feature of our M8 release.
