In the previous post of this series, we looked at local persistence using Serializable and writing files to disc. A lot of issues were brought to light… in this post, we’ll look at BerkeleyDB, a very lightweight (1MB) yet high-performance embedded database library which addresses most (but not all) of those issues.
BerkeleyDB has been around for over 10 years and Oracle can quote Google as a case study. Unfortunately it uses a GPL-like licence. This means you have to open source any applications which make use of it. This is unusual in the Java world, where most licences are based on the Apache Licence and do not require you to open source your code. This alone may disqualify BerkeleyDB from your project (I know it does for us), but it is still worthwhile discussing its merits over a home-brew solution. Note that commercial licences are available from Oracle, with prices starting at $750 per processor.
There are four ways of using BerkeleyDB:
- Transactions, writing atomic database queries manually. This is the low level approach to using BerkeleyDB and not recommended.
- Collections wrapper, which effectively makes a database look like a big HashMap.
- Direct Persistence Layer (DPL), which uses annotations on class files to provide a JPA-like interface for type-safe persistence.
- JMX support. JMX was added to Java 5 to allow running applications to be monitored and managed.
In the previous post, we highlighted many of the shortcomings of a naive persistence store. It is probably best to go through each again, but from the perspective of BerkeleyDB. Once this is finished we’ll show some of the unique advantages of BerkeleyDB and then some code to get you up and running. As a lot of this is boilerplate we provide a wrapper layer which makes creating databases and adding secondary keys as simple as a single line of code each.
To clarify some terminology: an instance refers to a directory location and the BerkeleyDB process which is currently accessing it, and a database is one of the many key/value mappings that exist in that instance.
Portability and Maintenance
It is entirely up to the client how they want their data to be persistently stored… unfortunately that means defining a binary format.
BerkeleyDB uses bindings to map objects into binary form. The datafiles used by BerkeleyDB Java Edition should be compatible with other BerkeleyDB editions, so a well documented data structure could potentially be loaded from a C application. However, only one instance of BerkeleyDB can have access to the files at any time.
That said, BerkeleyDB can use the Java Serializable framework for saving objects to disc as a convenience. It goes about this in a very efficient way. As discussed previously, there is a lot of duplication involved in serialising many objects of the same type. BerkeleyDB avoids this by keeping the class descriptions once, in a StoredClassCatalog held in a separate database. For primitive types such as int, double and String, BerkeleyDB uses its own optimised bindings.
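To get a feel for the duplication involved, here is a minimal sketch using only the JDK (it does not touch BerkeleyDB, and the class and method names are made up for illustration). It compares the size of an Integer written through default Java serialisation, which embeds a class description in every independent stream, with the same int written as raw bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, for illustration only: not part of BerkeleyDB.
class SerialOverhead {

    // Bytes produced by default Java serialisation; the class
    // description is written alongside the value itself.
    static int serialisedSize(Object value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Bytes produced when only the int itself is written.
    static int rawSize(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        System.out.println("serialised: " + serialisedSize(42) + " bytes");
        System.out.println("raw:        " + rawSize(42) + " bytes");
    }
}
```

The gap between the two numbers is the per-record overhead that a shared class catalogue and the optimised primitive bindings are there to remove.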
It is possible to do both offline (i.e. take a tarball of the directory) and online hot backups of a BerkeleyDB instance… without seriously interrupting currently running operations.
It is worth pointing out here that BerkeleyDB runs in the same address space as the containing application. That means the data sits on the same machine as the application (no client-server model) and can only be accessed by one application at a time.
Object Relations
As with a home-brew solution, there is only one way to deal with the type hierarchy… pretend it doesn’t exist. A database is defined with a set of bindings for the keys and values, so if you pass in a subclass you’ll probably lose all the subclass information.
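A minimal sketch of how the subclass information gets lost, assuming a hand-written binding that only knows about the base class. Everything here (Animal, Dog, toBytes, fromBytes) is hypothetical and uses plain JDK classes rather than the BerkeleyDB binding API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical classes, for illustration only.
class SlicingSketch {

    static class Animal {
        final String name;
        Animal(String name) { this.name = name; }
    }

    static class Dog extends Animal {
        final int tricks;
        Dog(String name, int tricks) { super(name); this.tricks = tricks; }
    }

    // A "binding" written against Animal: it stores the name and nothing else.
    static byte[] toBytes(Animal animal) {
        return animal.name.getBytes(StandardCharsets.UTF_8);
    }

    // Reading back can only ever rebuild an Animal.
    static Animal fromBytes(byte[] data) {
        return new Animal(new String(data, StandardCharsets.UTF_8));
    }

    static Animal roundTrip(Animal animal) {
        return fromBytes(toBytes(animal));
    }

    public static void main(String[] args) {
        Animal restored = roundTrip(new Dog("Rex", 3));
        // The name survives, the Dog-specific field and type do not.
        System.out.println(restored.name + " is a Dog? " + (restored instanceof Dog));
    }
}
```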
Unfortunately shared objects that are present in more than one of your persistent objects will be duplicated the same as before. This leaves us wide-open to the N + 1 SELECTS problem if we wish to keep the objects in a separate database.
OS Dependence
BerkeleyDB puts a considerably smaller burden on the OS than any home-brew solution would. You can happily store millions of records and BerkeleyDB will only have created a handful of files. This means that the OS file handling overhead is almost completely removed and can result in ridiculous performance increases… we found a 10x increase over our naive store just on data access.
Restrictions on the keys and indices
BerkeleyDB no longer ties you to Java’s Serializable form (unless you want it!). However, this comes at a cost. Unless the class is already supported by one of the pre-bundled bindings, you must implement EntryBinding for all your objects. Alternatively you can use annotations through the DPL.
Persistent Object Scope
If you are using the low level API and using transactions, the rules are well documented… however you’ll most likely be using the Collection wrappers in which case your objects will only ever be transient or detached.
Concurrent Scalability
BerkeleyDB has absolutely first class concurrent scalability through transactions. This allows you to mark the beginning and end of a unit of work and commit it when finished. You will be notified of a failure if another thread modified data that was critical to your actions.
Version Control and Conflict Resolution
The transactional nature of BerkeleyDB allows us to employ a locking strategy. The transactional documentation has detailed notes on isolation and avoiding deadlock. If you make an incompatible change, you’ll hear about it. However, you’ll probably lose all of these benefits when you use the Collections wrapper. You are then back to the same problem as the home-brew solution, with the last thread winning in a write race.
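The write race is easy to reproduce without any database at all. The sketch below is only an analogy built on a JDK ConcurrentMap (it is not how BerkeleyDB transactions are implemented, and the names are invented): two writers read the same counter, then both write. With a plain put the second write silently discards the first; with a compare-and-replace the second writer is told its data is stale and can retry, which is roughly the kind of notification a transaction gives you.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical demo, for illustration only. The two "writers" are
// interleaved by hand so the outcome is deterministic.
class WriteRaceSketch {

    // Both writers read before either writes: one increment is lost.
    static int lastWriterWins() {
        ConcurrentMap<String, Integer> store = new ConcurrentHashMap<>();
        store.put("counter", 0);
        int readByA = store.get("counter");
        int readByB = store.get("counter");
        store.put("counter", readByA + 1);
        store.put("counter", readByB + 1); // overwrites A's update
        return store.get("counter");
    }

    // Each write only succeeds if the stored value is still the one that was read.
    static int compareAndRetry() {
        ConcurrentMap<String, Integer> store = new ConcurrentHashMap<>();
        store.put("counter", 0);
        int readByA = store.get("counter");
        int readByB = store.get("counter");
        store.replace("counter", readByA, readByA + 1); // A goes first and succeeds
        // B's first attempt is rejected, so it re-reads and tries again.
        while (!store.replace("counter", readByB, readByB + 1)) {
            readByB = store.get("counter");
        }
        return store.get("counter");
    }

    public static void main(String[] args) {
        System.out.println("last writer wins:  " + lastWriterWins());
        System.out.println("compare and retry: " + compareAndRetry());
    }
}
```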
Cascading
BerkeleyDB doesn’t improve on a home-brew solution here… you have to manually remember to persist internal objects that are stored in a separate database.
Caching
BerkeleyDB has a really efficient cache underneath which means it doesn’t need to hit the disc every time you want to read. You can also set up deferred writing, but then it’s up to you to handle the writing to disc. When I have used this in the past, guaranteed writing to disc was more important than speed ups.
Optimising Lookups
A major advantage that comes with using BerkeleyDB is the ability to set up a SecondaryDatabase. If you have a database which is a mapping from keys to values, then the secondary databases can be thought of as additional keys that index the primary keys or the values. It is therefore possible to perform reverse lookups or have multiple keys to a value.
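If the idea of a secondary database is new to you, this in-memory sketch may help. It is an analogy using plain JDK maps, not the SecondaryDatabase API, and all the names are invented: a primary map from key to value, plus a second map that indexes the primary keys by value and is kept in step on every put.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory analogy, for illustration only.
class SecondaryIndexSketch {

    // primary "database": key -> value
    private final Map<String, String> primary = new HashMap<>();
    // secondary "database": value -> all primary keys holding that value
    private final Map<String, Set<String>> byValue = new HashMap<>();

    void put(String key, String value) {
        String old = primary.put(key, value);
        if (old != null) {
            // drop the stale index entry for the overwritten value
            Set<String> keys = byValue.get(old);
            keys.remove(key);
            if (keys.isEmpty()) {
                byValue.remove(old);
            }
        }
        byValue.computeIfAbsent(value, v -> new HashSet<>()).add(key);
    }

    // normal lookup
    String get(String key) {
        return primary.get(key);
    }

    // reverse lookup through the secondary index
    Set<String> keysFor(String value) {
        Set<String> keys = byValue.getOrDefault(value, Collections.emptySet());
        return Collections.unmodifiableSet(keys);
    }
}
```

The bookkeeping in put is what BerkeleyDB takes off your hands: once the secondary database is configured, it is updated whenever the primary changes.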
Collection Wrapper
The BerkeleyDB authors had the foresight to create a very simple wrapper which implements the Collections API… although you lose fine-grained control of the transactional database, you get to think of your database as a HashMap<Key, Value>.
There are some subtleties to this… equality of keys/values is based on equality of the binary stored data of the object, not on the equals() method. In order to remain consistent, keys and values must therefore ensure that object1.equals(object2) implies binding(object1) is lexicographically equal to binding(object2).
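Here is a small sketch of how that can bite, using a made-up key class whose equals() ignores case. The two "bindings" are plain methods returning byte arrays, standing in for whatever binding you would really use; none of this is BerkeleyDB API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical key class and bindings, for illustration only.
class BinaryEqualitySketch {

    // A key that treats "Java" and "JAVA" as the same thing.
    static final class Tag {
        final String text;
        Tag(String text) { this.text = text; }

        @Override public boolean equals(Object o) {
            return o instanceof Tag && ((Tag) o).text.equalsIgnoreCase(text);
        }

        @Override public int hashCode() {
            return text.toLowerCase(Locale.ROOT).hashCode();
        }
    }

    // Inconsistent with equals(): stores the text exactly as typed.
    static byte[] naiveBinding(Tag tag) {
        return tag.text.getBytes(StandardCharsets.UTF_8);
    }

    // Consistent with equals(): normalises the text before storing it.
    static byte[] canonicalBinding(Tag tag) {
        return tag.text.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Tag a = new Tag("Java");
        Tag b = new Tag("JAVA");
        System.out.println("equals():          " + a.equals(b));
        System.out.println("naive bytes equal: " + Arrays.equals(naiveBinding(a), naiveBinding(b)));
        System.out.println("canonical equal:   " + Arrays.equals(canonicalBinding(a), canonicalBinding(b)));
    }
}
```

With the naive binding a stored map would treat the two tags as different keys even though equals() says otherwise, so if your equals() is anything other than field-by-field, the binding should write a canonical form.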
Using BerkeleyDB
Unfortunately there is quite a bit of boilerplate associated with creating a BerkeleyDB instance or database. I’d strongly suggest you read the getting started tutorials, but they are not exactly terse.
An instance is created in a directory by creating an Environment object from an EnvironmentConfig and a suitable directory
File directory = new File(MY_BASE_DIR);
EnvironmentConfig envConfig = new EnvironmentConfig();
// create the environment if it doesn't already exist
// (the directory itself must already exist)
envConfig.setAllowCreate(true);
// transactional means thread safe! and auto-committed writes
envConfig.setTransactional(true);
// use a (roughly) 16MB cache, the size is given in bytes
envConfig.setCacheSize(16384000);
// allow 10 second timeout for locks (the value is in microseconds)
// should be tolerant of heavy load but also catch deadlocks
envConfig.setLockTimeout(10000000);
// open new environment
Environment environment = new Environment(directory, envConfig);
Now that we have an instance living in a directory, we can create or load up databases within it. Each database has a DatabaseConfig which is created like so
DatabaseConfig config = new DatabaseConfig();
config.setAllowCreate(true);
config.setTransactional(true);
If you are going to use the Serializable form of objects then you probably want to have a Java catalogue database to store the duplicated data (see above). We need to pass this to the binding responsible for persisting Serializable objects.
Database catalogueDb = environment.openDatabase(null, "java_catalogue", config);
StoredClassCatalog catalogue = new StoredClassCatalog(catalogueDb);
For now, let’s just assume that you want a database with Integers for keys and Serializables as values.
// we need to give the database a unique string name in the instance
String name = "Integer_mySerializableClass";
// then we set up the bindings
EntryBinding keyBinding = new IntegerBinding();
EntryBinding valueBinding = new SerialBinding(catalogue, MyClass.class);
// now open up the database
Database database = environment.openDatabase(null, name, config);
// then we create a StoredMap which implements the Collections API
// (the final argument says whether writes are allowed)
Map bdb = new StoredMap(database, keyBinding, valueBinding, true);
So, you’re probably looking at all this and wondering if you really have to type it all the time. Luckily for you I’ve made a few wrapper classes (source included in the jar file) which make the above a two-liner. To create a new instance simply type
// create an instance
BDB instance = BDB.getInstance(INSTANCE_DIRECTORY, 16384000);
// and add as many databases as you like. Note that they are generic
Map<Integer, MyClass> bdb1 = instance.getDatabase("bdb1", Integer.class, MyClass.class);
BDBMap<String, String> bdb2 = instance.getDatabase("bdb2", String.class, String.class);
But that’s not all it does. It’ll also close the databases cleanly on JVM shutdown. Earlier on I said that you can add secondary databases to BDB. If you’re using my wrapper then adding a secondary key is as simple as implementing a 3-method interface IBDBIndex that gets called every time a new key/value pair is added to the database. In the following, we create a reverse lookup to bdb2
IBDBIndex<String, String> reverse = new IBDBIndex<String, String>() {
public Set<String> getIndex(String value) {
return Collections.singleton(value);
}
public Class<String> getIndexClass() {
return String.class;
}
public String getName() {
return "Reverse";
}
};
bdb2.setIndex(reverse, true);
// so now you can either do normal lookups
String value = bdb2.get("a key");
// or reverse lookups
String key = bdb2.get("a value");
You may not like using this interface… or you may find it to be too inefficient (as it assumes your secondary keys may reference collections of objects, not just single values). But feel free to look at the source code to see how to implement secondary databases your own way. They are very useful.
Conclusion
If you don’t mind the licence or locking yourself in to a single database vendor, don’t require a distributed database environment and are fine writing your own binary formats… then BerkeleyDB will probably solve all your persistence problems. There is, however, a heavyweight alternative… Hibernate, which has a dependency trail as large as an operating system and for which every introductory book is thicker than the Bible. That said, Hibernate addresses almost every problem you could ever have to do with persistence. I’ll finish off this series soon with a gentle introduction and some pointers on where to go.
UPDATE: as an (apparently unmaintained) alternative to BerkeleyDB… try JDBM. Some more are listed on Guther Schadow’s website.