Javablog
by Java coders, for Java coders

Persistence Options in Java, Part 2 – BerkeleyDB

November 10th, 2007 by Sam

In the previous post of this series, we looked at local persistence using Serializable and writing files to disc. A lot of issues were brought to light… in this post, we’ll look at BerkeleyDB, a very lightweight (1MB) yet high-performance embedded database library which addresses most (but not all) of those issues.

BerkeleyDB has been around for over 10 years and Oracle can quote Google as a case study. Unfortunately it uses a GPL-like licence. This means you have to open source any applications which make use of it. This is unusual in the Java world, where most licences are based on the Apache Licence and do not require you to open source your code. This alone may disqualify BerkeleyDB from your project (I know it does for us), but it is still worthwhile discussing its merits over a home-brew solution. Note that commercial licences are available from Oracle, with prices starting at $750 per processor.

There are four ways of using BerkeleyDB:

  • Transactions, writing atomic database queries manually. This is the low level approach to using BerkeleyDB and not recommended.
  • Collections wrapper, which effectively makes a database look like a big HashMap.
  • Direct Persistence Layer (DPL), which uses annotations on class files to provide a JPA-like interface for type-safe persistence.
  • JMX support, which exposes a running instance for monitoring and management. JMX (Java Management Extensions) has shipped with the JDK since Java 5.

In the previous post, we highlighted many of the shortcomings of a naive persistence store. It is probably best to go through each again, but from the perspective of BerkeleyDB. Once this is finished we’ll show some of the unique advantages of BerkeleyDB and then some code to get you up and running. As a lot of this is boilerplate we provide a wrapper layer which makes creating databases and adding secondary keys as simple as a single line of code each.

To clarify some terminology: an instance refers to a directory location and the BerkeleyDB process which is currently accessing it, and a database is one of the many key/value mappings that exist in that instance.

Portability and Maintenance

It is entirely up to the client how they want their data to be persistently stored… unfortunately that means defining a binary format.

BerkeleyDB uses bindings to map objects into binary form. Be aware that BerkeleyDB Java Edition writes its own on-disc format, which the C edition of BerkeleyDB cannot read, so sharing data with a C application means exporting it in a well documented format of your own. Also, only one instance of BerkeleyDB can have access to the files at any time.

That said, BerkeleyDB can use the Java Serializable framework for saving objects to disc as a convenience. It goes about this in a very efficient way. As discussed previously, there is a lot of duplication involved in serialising many objects of the same type. BerkeleyDB uses a StoredClassCatalog in a separate database. For primitive types such as int, double and String, BerkeleyDB uses its own optimised bindings.
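To get a feel for the duplication that a class catalogue avoids, here is a minimal sketch using only the JDK (it does not touch the BerkeleyDB API). It serialises the same objects once with a fresh stream per object, which repeats the class description every time, and once through a single shared stream, which writes the description only once. The names `ClassCatalogSketch` and `Reading` are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ClassCatalogSketch {
    // Hypothetical value type: small enough that the class description dominates.
    static final class Reading implements Serializable {
        private static final long serialVersionUID = 1L;
        final int value;

        Reading(int value) {
            this.value = value;
        }
    }

    // One shared stream: the class description is written once, then referenced.
    static int bytesWithSharedStream(List<Reading> readings) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (Reading reading : readings) {
                out.writeObject(reading);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // One stream per object: every record carries its own class description.
    static int bytesWithSeparateStreams(List<Reading> readings) {
        int total = 0;
        for (Reading reading : readings) {
            total += bytesWithSharedStream(Collections.singletonList(reading));
        }
        return total;
    }

    static List<Reading> sample(int count) {
        List<Reading> readings = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            readings.add(new Reading(i));
        }
        return readings;
    }

    public static void main(String[] args) {
        List<Reading> readings = sample(1000);
        System.out.println("separate streams: " + bytesWithSeparateStreams(readings) + " bytes");
        System.out.println("shared stream:    " + bytesWithSharedStream(readings) + " bytes");
    }
}
```

A key/value store cannot keep one stream open across records, which is why keeping the class descriptions in a catalogue of their own is the natural way to get the same saving.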

It is possible to do both offline (i.e. take a tarball of the directory) and online hot backups of a BerkeleyDB instance… without seriously interrupting currently running operations.

It is worth pointing out here that BerkeleyDB runs in the same address space as the containing application. That means the data sits on the same machine as the application (no client-server model) and can only be accessed by one application at a time.

Object Relations

As with a home-brew solution, there is only one way to deal with the type hierarchy… pretend it doesn’t exist. A database is defined with a set of bindings for the keys and values, so if you pass in a subclass you’ll probably lose all the subclass information.

Unfortunately shared objects that are present in more than one of your persistent objects will be duplicated the same as before. This leaves us wide-open to the N + 1 SELECTS problem if we wish to keep the objects in a separate database.

OS Dependence

BerkeleyDB puts a considerably smaller burden on the OS than any home-brew solution would. You can happily store millions of records and BerkeleyDB will only have created a handful of files. This means that the OS file handling overhead is almost completely removed and can result in ridiculous performance increases… we found a 10x increase over our naive store just on data access.

Restrictions on the keys and indices

BerkeleyDB no longer ties you to Java’s Serializable form (unless you want it!). However, this comes at a cost. Unless the class is already supported by one of the pre-bundled bindings, you must implement EntryBinding for all your objects. Alternatively you can use annotations through the DPL.
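To give an idea of what writing a binding involves, here is a rough sketch using only the JDK. The `Binding<T>` interface is hypothetical: it stands in for the role EntryBinding plays, with a plain byte[] where BerkeleyDB would use its own entry type. `Point` and `PointBinding` are made-up names as well.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class BindingSketch {
    // Hypothetical stand-in for a binding: object to bytes and back again.
    interface Binding<T> {
        byte[] toBytes(T object);

        T fromBytes(byte[] data);
    }

    // Hypothetical class we want to persist without Serializable.
    static final class Point {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // Fixed-width format: 4 bytes for x followed by 4 bytes for y.
    static final class PointBinding implements Binding<Point> {
        @Override
        public byte[] toBytes(Point point) {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeInt(point.x);
                out.writeInt(point.y);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return bytes.toByteArray();
        }

        @Override
        public Point fromBytes(byte[] data) {
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
                int x = in.readInt();
                int y = in.readInt();
                return new Point(x, y);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) {
        Binding<Point> binding = new PointBinding();
        byte[] stored = binding.toBytes(new Point(3, -7));
        Point loaded = binding.fromBytes(stored);
        System.out.println(stored.length + " bytes, x=" + loaded.x + ", y=" + loaded.y);
    }
}
```

The cost mentioned above is visible here: the byte layout is now yours to document and to keep compatible whenever the class changes.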

Persistent Object Scope

If you are using the low level API and using transactions, the rules are well documented… however you’ll most likely be using the Collection wrappers in which case your objects will only ever be transient or detached.

Concurrent Scalability

BerkeleyDB has absolutely first class concurrent scalability through transactions. This allows you to mark the beginning and end of a unit of work and commit it when finished. You will be notified of a failure if another thread modified data that was critical to your actions.

Version Control and Conflict Resolution

The transactional nature of BerkeleyDB allows us to employ a locking strategy. The transactional documentation has detailed notes on isolation and avoiding deadlock. If you make an incompatible change, you’ll hear about it. However, you’ll probably lose out on all the benefits of this when you use the Collections wrapper and you’ll be back to the same problem as the home-brew solution with the last thread winning in a write race.
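The write race is easy to reproduce without BerkeleyDB at all. The sketch below uses a ConcurrentHashMap as a stand-in store and replays two writers by hand, so the outcome is deterministic. The second method uses compare-and-replace, which is an optimistic check rather than BerkeleyDB's lock-based transactions; it is only there to show what "hearing about" a conflicting change looks like. All names are made up.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class WriteRaceSketch {
    // Both writers read the counter, then both write back: the last write wins.
    static int blindWrites() {
        ConcurrentMap<String, Integer> store = new ConcurrentHashMap<>();
        store.put("hits", 0);
        int seenByA = store.get("hits");
        int seenByB = store.get("hits");
        store.put("hits", seenByA + 1);
        store.put("hits", seenByB + 1); // silently overwrites A's update
        return store.get("hits");
    }

    // Each write states what it expects to replace, so a stale write is rejected.
    static int checkedWrites() {
        ConcurrentMap<String, Integer> store = new ConcurrentHashMap<>();
        store.put("hits", 0);
        int seenByA = store.get("hits");
        int seenByB = store.get("hits");
        store.replace("hits", seenByA, seenByA + 1);
        boolean committedB = store.replace("hits", seenByB, seenByB + 1);
        if (!committedB) {
            // B is told about the conflict and retries with a fresh read.
            int fresh = store.get("hits");
            store.replace("hits", fresh, fresh + 1);
        }
        return store.get("hits");
    }

    public static void main(String[] args) {
        System.out.println("blind writes:   " + blindWrites());   // prints 1
        System.out.println("checked writes: " + checkedWrites()); // prints 2
    }
}
```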

Cascading

BerkeleyDB doesn’t improve on a home-brew solution here… you have to manually remember to persist internal objects that are stored in a separate database.

Caching

BerkeleyDB has a really efficient cache underneath which means it doesn’t need to hit the disc every time you want to read. You can also set up deferred writing, but then it’s up to you to handle the writing to disc. When I have used this in the past, guaranteed writing to disc was more important than speed ups.

Optimising Lookups

A major advantage that comes with using BerkeleyDB is the ability to set up a SecondaryDatabase. If you have a database which is a mapping from keys to values, then the secondary databases can be thought of as additional keys that index the primary keys or the values. It is therefore possible to perform reverse lookups or have multiple keys to a value.
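Conceptually a secondary database is a second map that is kept in step with the primary one. The in-memory sketch below shows the idea with plain JDK collections; it is not the SecondaryDatabase API, and `SecondaryIndexSketch` and its methods are names invented for this example. The function passed to the constructor derives the secondary key from a value.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class SecondaryIndexSketch<K, V, S> {
    private final Map<K, V> primary = new HashMap<>();
    private final Map<S, Set<K>> secondary = new HashMap<>();
    private final Function<V, S> secondaryKeyOf;

    SecondaryIndexSketch(Function<V, S> secondaryKeyOf) {
        this.secondaryKeyOf = secondaryKeyOf;
    }

    // Every write to the primary map also updates the index.
    void put(K key, V value) {
        V previous = primary.put(key, value);
        if (previous != null) {
            // Drop the index entry that pointed at the overwritten value.
            S staleKey = secondaryKeyOf.apply(previous);
            Set<K> keys = secondary.get(staleKey);
            if (keys != null) {
                keys.remove(key);
                if (keys.isEmpty()) {
                    secondary.remove(staleKey);
                }
            }
        }
        secondary.computeIfAbsent(secondaryKeyOf.apply(value), k -> new HashSet<>()).add(key);
    }

    V get(K key) {
        return primary.get(key);
    }

    // Reverse lookup: which primary keys have a value with this secondary key?
    Set<K> primaryKeysFor(S secondaryKey) {
        Set<K> keys = secondary.get(secondaryKey);
        return keys == null ? Collections.<K>emptySet() : Collections.unmodifiableSet(keys);
    }

    public static void main(String[] args) {
        // Index people (id -> name) by the first letter of their name.
        SecondaryIndexSketch<Integer, String, Character> people =
                new SecondaryIndexSketch<>(name -> name.charAt(0));
        people.put(1, "Sam");
        people.put(2, "Sally");
        people.put(3, "Jeff");
        System.out.println(people.primaryKeysFor('S'));
    }
}
```

The real thing does this bookkeeping for you, and does it on disc inside the same instance as the primary database.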

Collection Wrapper

The BerkeleyDB authors had the foresight to create a very simple wrapper which implements the Collections API… although you lose fine grain control of the transactional database, you get to think of your database as a HashMap<Key, Value>.

There are some subtleties to this… equality of keys/values is based on equality of the binary stored data of the object, not on the equals() method. In order to remain consistent, keys and values must therefore ensure that object1.equals(object2) implies binding(object1) is byte-for-byte equal to binding(object2).
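Here is a small illustration of how this can bite, written against the JDK only. `Tag` is a made-up key type whose equals() ignores case, and the "store" is just a HashMap keyed on the bytes each binding produces, which is roughly how a byte-oriented store sees your keys. Neither binding method is BerkeleyDB API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

class ByteEqualitySketch {
    // Hypothetical key type: two tags are equal if they differ only in case.
    static final class Tag {
        final String text;

        Tag(String text) {
            this.text = text;
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof Tag && ((Tag) other).text.equalsIgnoreCase(text);
        }

        @Override
        public int hashCode() {
            return text.toLowerCase(Locale.ROOT).hashCode();
        }
    }

    // Naive binding: writes the text as typed, so equal tags can differ in bytes.
    static ByteBuffer naiveBinding(Tag tag) {
        return ByteBuffer.wrap(tag.text.getBytes(StandardCharsets.UTF_8));
    }

    // Consistent binding: normalises first, so equal tags always share their bytes.
    static ByteBuffer consistentBinding(Tag tag) {
        return ByteBuffer.wrap(tag.text.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8));
    }

    // Counts how many distinct keys a store that compares bytes would end up with.
    static int distinctStoredKeys(Function<Tag, ByteBuffer> binding, Tag... keys) {
        Map<ByteBuffer, String> store = new HashMap<>();
        for (Tag key : keys) {
            store.put(binding.apply(key), key.text);
        }
        return store.size();
    }

    public static void main(String[] args) {
        Tag first = new Tag("Java");
        Tag second = new Tag("JAVA");
        System.out.println(first.equals(second));                                              // prints true
        System.out.println(distinctStoredKeys(ByteEqualitySketch::naiveBinding, first, second));      // prints 2
        System.out.println(distinctStoredKeys(ByteEqualitySketch::consistentBinding, first, second)); // prints 1
    }
}
```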

Using BerkeleyDB

Unfortunately there is quite a bit of boilerplate associated with creating a BerkeleyDB instance or database. I’d strongly suggest you read the getting started tutorials, but they are not exactly terse.

An instance is created in a directory by creating an Environment object from an EnvironmentConfig and a suitable directory:

import java.io.File;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

File directory = new File(MY_BASE_DIR);
EnvironmentConfig envConfig = new EnvironmentConfig();
// create the environment if it doesn't already exist
// (the directory itself must already be there)
envConfig.setAllowCreate(true);
// transactional means thread safe! and auto-committed writes
envConfig.setTransactional(true);
// use a cache of roughly 16MB (the value is in bytes)
envConfig.setCacheSize(16384000);
// allow 10 second timeout for locks (the value is in microseconds)
// should be tolerant of heavy load but also catch deadlocks
envConfig.setLockTimeout(10000000);
// open new environment
Environment environment = new Environment(directory, envConfig);
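The two numeric settings above are easy to misread, so here is a tiny JDK-only sketch that derives them from readable units. It assumes the lock timeout is expressed in microseconds (which is what makes 10000000 come out as 10 seconds) and the cache size in bytes. The class and method names are made up.

```java
import java.util.concurrent.TimeUnit;

class EnvSettingsSketch {
    // Lock timeout in microseconds for a given number of seconds.
    static long lockTimeoutMicros(long seconds) {
        return TimeUnit.SECONDS.toMicros(seconds);
    }

    // Cache size in bytes for a given number of binary megabytes.
    static long cacheBytes(long megabytes) {
        return megabytes * 1024L * 1024L;
    }

    public static void main(String[] args) {
        System.out.println(lockTimeoutMicros(10)); // prints 10000000
        System.out.println(cacheBytes(16));        // prints 16777216
    }
}
```

Note that 16384000 is 16000 x 1024, so the cache in the snippet is a little under a full binary 16MB. That is harmless, but worth knowing if you tune the value later.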

Now that we have an instance living in a directory, we can create or load up databases within it. Each database has a DatabaseConfig which is created like so:

DatabaseConfig config = new DatabaseConfig();
config.setAllowCreate(true);
config.setTransactional(true);

If you are going to use the Serializable form of objects then you probably want to have a Java catalogue database to store the duplicated data (see above). We need to pass this to the binding responsible for persisting Serializable objects.

Database catalogueDb = environment.openDatabase(null, "java_catalogue", config);
StoredClassCatalog catalogue = new StoredClassCatalog(catalogueDb);

For now, let’s just assume that you want a database with Integers for keys and Serializables as values.

// we need to give the database a unique string name in the instance
String name = "Integer_mySerializableClass";
// then we set up the bindings
EntryBinding keyBinding = new IntegerBinding();
EntryBinding valueBinding = new SerialBinding(catalogue, MyClass.class);
// now open up the database
Database database = environment.openDatabase(null, name, config);
// then we create a StoredMap which implements the Collections API
// (the final argument allows writes through the map)
Map bdb = new StoredMap(database, keyBinding, valueBinding, true);

So, you’re probably looking at all this and wondering if you really have to type it all the time. Luckily for you I’ve made a few wrapper classes (source included in the jar file) which make the above a two-liner. To create a new instance simply type

// create an instance
BDB instance = BDB.getInstance(INSTANCE_DIRECTORY, 16384000);
// and add as many databases as you like. Note that they are generic
Map<Integer, MyClass> bdb1 = instance.getDatabase("bdb1", Integer.class, MyClass.class);
BDBMap<String, String> bdb2 = instance.getDatabase("bdb2", String.class, String.class);

But that’s not all it does. It’ll also close the databases cleanly on JVM shutdown. Earlier on I said that you can add secondary databases to BDB. If you’re using my wrapper then adding a secondary key is as simple as implementing a 3-method interface IBDBIndex that gets called every time a new key/value pair is added to the database. In the following, we create a reverse lookup to bdb2

IBDBIndex<String, String> reverse = new IBDBIndex<String, String>() {
    public Set<String> getIndex(String value) {
        return Collections.singleton(value);
    }
    public Class<String> getIndexClass() {
        return String.class;
    }
    public String getName() {
        return "Reverse";
    }
};
bdb2.setIndex(reverse, true);
 
// so now you can either do normal lookups
String value = bdb2.get("a key");
// or reverse lookups
String key = bdb2.get("a value");

You may not like using this interface… or you may find it to be too inefficient (as it assumes your secondary keys may reference collections of objects, not just single values). But feel free to look at the source code to see how to implement secondary databases your own way. They are very useful.
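If you do go your own way, the clean close on JVM shutdown mentioned earlier can be approximated with a shutdown hook. This is a sketch using only the JDK, not the wrapper's actual source: `ShutdownCloserSketch` and its methods are made-up names, and the BerkeleyDB handles are assumed to be wrapped as AutoCloseable (for example `() -> database.close()`). Resources are closed in reverse order of registration, so that databases go before the environment that owns them.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class ShutdownCloserSketch {
    private final Deque<AutoCloseable> resources = new ArrayDeque<>();

    // Registers the JVM shutdown hook that closes whatever is still open.
    static ShutdownCloserSketch install() {
        ShutdownCloserSketch closer = new ShutdownCloserSketch();
        Runtime.getRuntime().addShutdownHook(new Thread(closer::closeAll, "bdb-closer"));
        return closer;
    }

    // Register the environment first and its databases afterwards.
    synchronized void register(AutoCloseable resource) {
        resources.push(resource);
    }

    // Closes in reverse registration order; one failure does not stop the rest.
    synchronized void closeAll() {
        while (!resources.isEmpty()) {
            try {
                resources.pop().close();
            } catch (Exception e) {
                System.err.println("failed to close resource: " + e);
            }
        }
    }

    public static void main(String[] args) {
        ShutdownCloserSketch closer = install();
        closer.register(() -> System.out.println("closing environment"));
        closer.register(() -> System.out.println("closing database"));
        System.out.println("main finished, the hook runs next");
    }
}
```

Calling closeAll() yourself on a normal exit is still a good idea; the hook then finds nothing left to do.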

Conclusion

If you don’t mind the licence or locking yourself in to a single database vendor, don’t require a distributed database environment and are fine writing your own binary formats… then BerkeleyDB will probably solve all your persistence problems. However, there is a heavyweight alternative… Hibernate, which has a dependency trail as large as an operating system and whose every introductory book is thicker than the Bible. That said, Hibernate addresses almost every problem you could ever have to do with persistence. I’ll round off this series soon with a gentle introduction and some pointers on where to go.

UPDATE: as an (apparently unmaintained) alternative to BerkeleyDB… try JDBM. Some more are listed on Guther Schadow’s website.


This entry was posted by Sam on Saturday, November 10th, 2007 at 1:17 am, and is filed under BerkeleyDB, Collections, Database, Hibernate, Java, Persistence, SQL, Serializable.



5 comments on “Persistence Options in Java, Part 2 – BerkeleyDB”

I recently used Apache Cayenne in my current project. It works basically as Hibernate. It isn’t complex, I learned its secrets in just a couple of days. Recommended.

@KesheR cool, thanks… I see they are still working on their JPA layer though. I’ll check back in 6 months.

Erm!!,Java its cool.Reduct time implement code for Berkeley DB and not same C++ long time for implement.

BDBJE (which is what you linked to at the beginning of the article) is not the same as the BDB used in the Oracle/Google paper. That was the C version which is a considerably older implementation and uses a much different internal structure. That said, I have a lot of experience with BDBJE and it is very fast and reliable. It’s even faster if you roll your own serialization or use their DPL.

@AmericanJeff thanks for that… I also believe the C implementation has some form of distributed element to it. The Java “port” doesn’t have anything like that… you have to roll your own support for non-integrated databases. Interesting that you note the codebases are distinct, I have thought they were related but perhaps this is just PR from Oracle… the stability of BDB and the ubiquitous-ness of Java.
