Javablog
by Java coders, for Java coders

Persistence Options in Java, Part 1: Local Filesystem

October 25th, 2007 by Sam

There are many ways to save your data across sessions in Java, ranging from saving Serializable objects to files all the way to enterprise SQL frameworks. In this series of 3 blog posts you’ll get an idea of the options, with some simple usage examples. This is part 1, where we review a simple, non-scalable local persistence option and introduce the recurring issues of persistence.

Serializable and FileOutputStream

The simplest form of persistence is to serialise objects to their default binary Serializable output, or to use a library such as XStream to create XML output. One can then dedicate a directory (or directory structure) on the filesystem to saving one object per file. We’ll call this solution the “Serializable store”.

We have used this technique in the past for the caching of webpages (we’ve since learnt our lesson). We took an MD5SUM of a URL to get a unique(ish) String which can be used as the filename and the contents of the webpage can be saved as the contents of the file. Individual files can then be compressed using java.util.zip.
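As a rough sketch of the filename step described above (this is not our original code; the class and method names are made up for illustration), the hex MD5 digest of a URL can be computed with the standard MessageDigest class:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only.
class UrlFileNames {

    /** Returns the 32-character lowercase hex MD5 digest of the given URL. */
    static String fileNameFor(String url) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(url.getBytes(StandardCharsets.UTF_8));
            // %032x left-pads with zeros so the name is always 32 characters long.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to ship an MD5 implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fileNameFor("http://example.com/"));
    }
}
```

The result only contains the characters 0-9 and a-f, so it is safe to use as a filename on any common filesystem.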

If you are intending to do this, you’ll probably want to write your own wrapper class. I’d like to dissuade you from using this option as you’ll come up against the following recurring database problems:-

  • maintainability of the objects is restricted to the application
  • Serializable headers are duplicated across each object
  • object relations are not easily represented
  • relies on OS for intense directory and file handling
  • must be a map from a safe String to Serializable
  • persistent object scope is primitive
  • no concurrency
  • no version control
  • no cascading
  • no caching

You may be wondering what a lot of that actually means. As these concepts are critical to the choice of a persistence solution, I’ll go through each of these in turn.

Portability and Maintenance of the Store

In many cases it is not enough to simply consider a persistent application portable if the binary form of the store is the same across multiple OSes. Binary compatibility does not future-proof the data. If data is stored in binary form, it is almost certainly already tied to a specific version of your application. If you expect your data to live longer than the current version of your application, then a Serializable solution is not appropriate. You get some protection against this by using an XML format, although that dramatically increases the size of the store.

Duplication of Information

You may not be aware, but the Serializable form of small objects in Java is actually very inefficient. You might expect that an Integer is 4 bytes… but if you look at the serialised form, you’ll see that it is around 80 bytes! This is because Java writes a stream header followed by a full class descriptor (class name, serialVersionUID and field names, repeated for each superclass) before the 4 bytes of actual data. You can’t even get around this by overriding writeObject, which only controls how the field data is written.
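If you want to see the overhead for yourself, a small sketch like the following serialises an object into memory and reports the size. The class and method names are my own, chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical helper, for illustration only.
class SerializedSize {

    /** Returns the number of bytes the default serialised form of obj occupies. */
    static int sizeOf(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            // Should not happen when writing to an in-memory buffer.
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        System.out.println("Integer: " + sizeOf(Integer.valueOf(42)) + " bytes");
        System.out.println("String:  " + sizeOf("hello") + " bytes");
    }
}
```

In a one-object-per-file store this overhead is paid again in every file, because each file is its own stream.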

Object Relations

The duplication story gets even worse… if you have a shared object which is present in more than one of your persistent objects, it gets saved in each Serializable file dump. That means on loading from disc, it isn’t a shared object anymore… each instance is unique.
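Here is a minimal sketch of that effect. Author and Post are hypothetical classes, and a byte array stands in for each file on disc.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical classes, for illustration only.
class SharedReferenceDemo {

    static class Author implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        Author(String name) { this.name = name; }
    }

    static class Post implements Serializable {
        private static final long serialVersionUID = 1L;
        final Author author;
        Post(Author author) { this.author = author; }
    }

    /** Simulates saving one object to its own file and loading it back. */
    static <T extends Serializable> T roundTrip(T obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if two reloaded posts still share one Author instance. */
    static boolean sharedAfterReload() {
        Author sam = new Author("Sam");
        Post first = roundTrip(new Post(sam));   // "file" 1 contains a copy of sam
        Post second = roundTrip(new Post(sam));  // "file" 2 contains another copy
        return first.author == second.author;
    }

    public static void main(String[] args) {
        System.out.println("Still shared after reload? " + sharedAfterReload());
    }
}
```

Before saving, both posts point at the same Author; after loading, a change to one copy is invisible to the other.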

It is of course possible to write your classes so that they store keys to other persistent stores. Not only is this ugly and places a lot of the burden of persistence on the developer, but it also leaves you wide open to the N + 1 SELECTS problem, as discussed below in the Optimising Lookups section.

OS Dependence

One must bear in mind that OSes are not all born equal. OSes have caps on the number of files per folder and the number of folders in a directory, and they are typically less efficient at handling many files than a small selection of files.

Having a directory structure of 100 folders each with 100 folders inside, each containing 100 files is what you need to be able to support a million files representing Serializable objects. That means you need to write a hashmapping algorithm to map each key to a file on disc. This is a lot less efficient than you would imagine.
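One possible sketch of such a mapping is below. It assumes the key is already a filesystem-safe String (for example a hex digest), and the class name and the two-level 00 to 99 layout are my own choices for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, for illustration only.
class ShardedLayout {

    /** Maps a key to root/XX/YY/key, where XX and YY each run from 00 to 99. */
    static Path pathFor(Path root, String key) {
        // floorMod keeps the bucket in 0..9999 even when hashCode() is negative.
        int bucket = Math.floorMod(key.hashCode(), 10_000);
        String top = String.format("%02d", bucket / 100);
        String second = String.format("%02d", bucket % 100);
        return root.resolve(top).resolve(second).resolve(key);
    }

    public static void main(String[] args) {
        System.out.println(pathFor(Paths.get("store"), "abc"));
    }
}
```

String.hashCode() is defined by the language specification, so the same key always lands in the same folder. How evenly the files spread out depends entirely on the keys you feed it.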

Restrictions on the keys and indices

A Serializable store means that you can only have a single key to look up the objects, and it has to be a file system safe String. It is entirely in your hands to ensure that this constraint is respected… failure could result in the store breaking out of the designated directory and potentially becoming a security risk to your home directory.
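A minimal sketch of one such check, using java.nio.file (the class and method names are hypothetical): resolve the key against the store directory, normalise away any ".." segments, and refuse anything that ends up outside the store.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, for illustration only.
class SafeKeys {

    /** Resolves key inside storeDir, rejecting keys that would escape it. */
    static Path resolve(Path storeDir, String key) {
        Path root = storeDir.toAbsolutePath().normalize();
        // normalize() collapses "." and ".." so the check below sees the real target.
        Path candidate = root.resolve(key).normalize();
        if (!candidate.startsWith(root) || candidate.equals(root)) {
            throw new IllegalArgumentException("Key escapes the store: " + key);
        }
        return candidate;
    }

    public static void main(String[] args) {
        Path store = Paths.get("store");
        System.out.println(resolve(store, "page1"));
        try {
            resolve(store, "../secret");
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

This only guards against path traversal. It does nothing about symbolic links, reserved filenames or length limits, so a whitelist of allowed characters (or hashing the key) is still a good idea.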

You may need to have several ways of looking up an object… by name, by tags or by date. A simple store doesn’t offer any solutions except iterating through all the objects.

Persistent Object Scope

When you have a reference to a persistent object, it can be in one of several states:-

  • transient
  • persistent
  • detached

Transient means that the object you’re working with is not currently persistent. This is the case if you have just created the object and haven’t added it to the store yet, or if you have just removed it from the store permanently.

Persistent means that the object is currently backed by the persistent store and any modifications you make can be immediately seen by other actions in the same scope. This is a complicated issue which does not arise here, but which we will address in a later post.

Detached means that you have just obtained a copy of the object from the store, but it is not backed by the store and you’ll have to reattach or merge it back before your changes become persistent.

In a Serializable store, your objects are either transient or detached. And every detached object is unique… even if you call Store.get(key) twice in a row, each call incurs a disc hit and returns a new object.

This topic is intimately related to caching, version control and conflict resolution.

Concurrent Scalability

Concurrency is a very difficult subject, and I highly recommend Java Concurrency in Practice for a rigorous treatment of the subject in Java. You probably need to access your store from multiple threads, and using simple synchronized is not going to scale. You may get away with using ReadWriteLock for smaller projects, but it won’t scale if writing is a regular operation. Production databases use transactions to identify an atomic unit of work, which allows for highly optimised concurrency. You don’t want to start implementing that yourself.
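For the small-project case, a ReadWriteLock wrapper might look roughly like this. It is a sketch only: the class is hypothetical, and an in-memory map stands in for the file reads and writes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical store, for illustration only.
class LockedStore {
    private final Map<String, byte[]> data = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    /** Many readers may hold the read lock at the same time. */
    byte[] get(String key) {
        lock.readLock().lock();
        try {
            return data.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** A writer waits for all readers to finish and then blocks everyone else. */
    void put(String key, byte[] value) {
        lock.writeLock().lock();
        try {
            data.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        LockedStore store = new LockedStore();
        store.put("page1", new byte[] {1, 2, 3});
        System.out.println(store.get("page1").length + " bytes stored");
    }
}
```

Note that the single lock covers the whole store, so one slow write stalls every reader. That is exactly the scaling problem described above.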

Version Control and Conflict Resolution

Consider the situation of 2 threads (A and B) accessing a persistent object, where both read the object at the same time. Thread A makes a small edit and commits the object back to the store. Thread B takes a little longer to make its edits and then commits its changes to the store. The changes of thread B have overwritten thread A’s changes! This is probably not what you wanted.

There are many techniques for dealing with this which we will discuss later. A commonly used solution is to mark each object with a version number so that conflicting changes can be identified, resulting in thread B’s changes being rejected.
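To make the version number idea concrete, here is a minimal in-memory sketch. VersionedStore and its methods are hypothetical names, and this is not how any particular database implements it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical store, for illustration only.
class VersionedStore {

    /** An immutable value tagged with the version it was stored under. */
    static final class Versioned {
        final long version;
        final String value;
        Versioned(long version, String value) {
            this.version = version;
            this.value = value;
        }
    }

    private final Map<String, Versioned> data = new HashMap<>();

    synchronized void create(String key, String value) {
        data.put(key, new Versioned(1, value));
    }

    synchronized Versioned get(String key) {
        return data.get(key);
    }

    /**
     * Commits newValue only if nobody else has committed since the caller
     * read expectedVersion. Returns false when the change is rejected.
     */
    synchronized boolean commit(String key, long expectedVersion, String newValue) {
        Versioned current = data.get(key);
        if (current == null || current.version != expectedVersion) {
            return false;
        }
        data.put(key, new Versioned(expectedVersion + 1, newValue));
        return true;
    }

    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();
        store.create("page", "original");
        Versioned readByA = store.get("page");
        Versioned readByB = store.get("page");
        System.out.println("A commits: " + store.commit("page", readByA.version, "edit by A"));
        System.out.println("B commits: " + store.commit("page", readByB.version, "edit by B"));
    }
}
```

Thread B’s commit is refused because the version it read is stale. What B does next (reload and retry, merge, or give up) is a policy decision that the store cannot make for you.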

Cascading

Cascading is a term used to define what actually gets persisted when you save a persistent object. Consider an object which contains a collection of keys to other persistent objects. If you load an object like this, edit the objects in its collection and then add it back to the store… it will not have persisted its internals. This is probably not what you expected. The Serializable store solution does not help here… you have to manually remember to persist all the internal objects too!

Caching

Do not underestimate caching. It is very difficult to get right. You could implement your own cache by using Map<String, WeakReference<Serializable>> but then you’ll encounter more issues than you could shake a stick at including cleanup threads, capping memory usage and the lack of guarantees on object uniqueness. Basically, you can’t do caching without having side effects… if you need to cache, don’t even think about implementing it yourself. Consider the options instead.

Optimising Lookups

The infamous N + 1 SELECTS problem is very easy to fall into, and a Serializable store doesn’t offer any protection against it. Consider that you have a persistent object that contains a collection of other persistent objects and you want to get all of them and access them. You probably write something like this.

// One lookup for the parent object...
Persistent p = pStore.get(pKey);
for (String oKey : p.getOtherKeys()) {
    // ...plus one lookup per key: N more hits on the store.
    Other o = oStore.get(oKey);
    o.doStuff();
}

but now you’ve performed N + 1 lookups in the store! Database optimisation is all about reducing the number of queries to the store and returning the minimal amount of information you need. A Serializable store doesn’t offer any way around a query hit every time you want an object. It also returns entire objects every time you get them… it’s about as inefficient as you could possibly get. The only way around N + 1 SELECTS in this kind of store is to store the Other objects inside your persistent object, not just the keys. But then you incur a much larger data transfer every time you want your objects and they are now tied to that instance. And maybe you don’t need the other objects every time.

Later on we’ll see how it is possible to optimise object loading through SQL based solutions.

Conclusion

Hopefully you are not considering writing your own store backend after reading this. In the next post of this series, I’ll review BerkeleyDB as a single jar dependency that significantly improves on the Serializable store… but it is still not appropriate for all situations. In part 3 I’ll introduce the heavyweight (18 jars and as many megabytes) SQL solution Hibernate… paying specific attention to its implementation of the Java Persistence API, which can also be used in plain Java SE applications.

UPDATE: JDBM is a BSD-licensed, non-SQL embeddable database engine… although it is very basic and doesn’t give a Collections wrapper.


This entry was posted by Sam on Thursday, October 25th, 2007 at 10:34 pm, and is filed under BerkeleyDB, Concurrency, Database, Hibernate, Java, Persistence, SQL, Serializable.

