More memory or smaller memories?

In order to build a classification server that can handle thousands of classifiers and process huge amounts of data, we knew we would eventually have to do some major optimizations. To avoid doing any work prematurely, we postponed all optimizations that require design changes or reduce code readability until we were absolutely sure where to improve.

When we ran our first test in May it was obvious what our first bottleneck would be – the memory consumption of classifiers. It was really bad: raw classifier data expanded by a factor of about 5 when loaded into primary memory – a tiny classifier of 1 MB would take 5 MB as soon as it was fetched into memory. It was really easy to pinpoint the memory thieves.

Partners in crime – STL strings and maps

We were using STL maps to hold frequency distributions for tokens (features). Each token was mapped to its frequency in a map<string, unsigned int>. This is a very convenient and straightforward way to do it, but the memory overhead is not very attractive.
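For illustration, a minimal sketch of this straightforward approach could look like the following (the names are made up for the example and are not taken from our code base):

#include <map>
#include <string>

// One frequency distribution per class: each token maps to its count.
typedef std::map<std::string, unsigned int> FrequencyDistribution;

void addToken(FrequencyDistribution& dist, const std::string& token)
{
    ++dist[token]; // inserts the token with count 0 if missing, then increments it
}

unsigned int frequencyOf(const FrequencyDistribution& dist, const std::string& token)
{
    FrequencyDistribution::const_iterator it = dist.find(token);
    return it == dist.end() ? 0 : it->second;
}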

VS2005 STL string memory overhead

The actual sizes of types vary between platforms and STL implementations (these numbers are from the STL that comes with VS2005 on 32-bit Windows XP).

Each string takes at least 32 bytes:
size_type _Mysize = 4 bytes (string size)
size_type _Myres = 4 bytes (reserve size)
_Elem _Buf = 16 bytes (internal buffer for strings shorter than 16 bytes)
_Elem* _Ptr = 4 bytes (pointer to strings that don’t fit in the buffer)
this* = 4 bytes (this pointer)

Best case overhead for STL strings is 16 bytes, when the internal buffer is filled exactly. Worst case is for empty strings or strings longer than 15 characters, which gives an overhead of 32 bytes. So string overhead varies from 16 to 32 bytes.
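If you want to check the corresponding numbers for your own platform and STL, a quick sanity check is to print the raw object sizes (this only shows the size of the objects themselves, not any heap allocations they make):

#include <iostream>
#include <map>
#include <string>

int main()
{
    // Size of the string object itself (excluding any heap-allocated buffer).
    std::cout << "sizeof(std::string) = " << sizeof(std::string) << std::endl;
    // Size of the key/value pair stored in each map node.
    std::cout << "sizeof(map<string, unsigned int>::value_type) = "
              << sizeof(std::map<std::string, unsigned int>::value_type) << std::endl;
    return 0;
}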

VS2005 STL map memory overhead

Each entry in a map consists of an STL pair – the key and the value (first and second). A pair only has the memory overhead of the this pointer (4 bytes), plus whatever is inherited from the types it’s composed of. However, the map is implemented as a red-black tree consisting of linked nodes. Each pair is stored in a node, and nodes have quite a heavy memory overhead:

_Genptr _Left = 4 bytes (pointer to the left subtree)
_Genptr _Parent = 4 bytes (pointer to the parent node)
_Genptr _Right = 4 bytes (pointer to the right subtree)
char _Color = 1 byte (the color of the node)
char _Isnil = 1 byte (true if the node is the head)
this* = 4 bytes (this pointer)

So there is an 18-byte overhead per node and 4 bytes per pair, which sums up to 22 bytes per map entry.
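Put together, a simplified picture of what such a node holds – roughly modelled on the members listed above, not the exact VS2005 <xtree> layout – looks like this:

#include <string>
#include <utility>

// Simplified sketch of one tree node of a map<string, unsigned int>.
struct Node
{
    Node* _Left;                                  // 4 bytes: left subtree
    Node* _Parent;                                // 4 bytes: parent node
    Node* _Right;                                 // 4 bytes: right subtree
    std::pair<std::string, unsigned int> _Myval;  // the stored key/value pair
    char  _Color;                                 // 1 byte: red or black
    char  _Isnil;                                 // 1 byte: set for the head/sentinel node
};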

Strings in maps

Now, inserting a string shorter than 16 characters into a map<string, unsigned int> will consume 32 + 22 + 4 = 58 bytes. It could be even more if memory alignment kicks in for any of the allocations. In most cases this is perfectly fine and not even worth optimizing. In our case a memory overhead factor of 5 was not acceptable. Our language classifier takes about 14 MB on disk and should not take much more when loaded into memory – instead it blew up to about 65 MB. With 43 languages and probably around 30000 unique words per class (language) it gets really bloated: a back-of-the-envelope estimate of 43 × 30000 ≈ 1.3 million map entries at roughly 58 bytes each already lands in that region.

One solution

We needed to maintain the search and insertion speed of maps (time complexity O(log n)) but get rid of the overhead. Insertions are needed when classifiers are trained.

Maintaining search speed

Since we had already limited features to a maximum length of 32 bytes, we could use that information to create what we call memory lanes. A memory lane consists only of tokens of the same size, each followed by its frequency. In that manner we created 32 lanes: lane 1 with all tokens of size 1, lane 2 with all tokens of size 2, and so on. Tokens in memory lanes are sorted so we can use binary search.

Memory lane 1 could look like this (tokens of size 1 followed by the frequency)
a0031i0018y0003

and memory lane 3 like this
can0011far0004the0019zoo0001

By doing so we get rid of all the overhead and maintain search at O(log n).
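To make the idea concrete, here is a rough sketch of what a lookup in a single lane could look like. It assumes a packed layout where every entry is the token (tokenSize bytes) immediately followed by a binary unsigned int frequency; this is illustrative rather than our exact implementation (the examples above show the frequencies as text only for readability):

#include <cstring>
#include <vector>

// Look up a token's frequency in one memory lane: a contiguous buffer of
// fixed-size entries (tokenSize bytes of token + sizeof(unsigned int) bytes
// of frequency), kept sorted by token so binary search works in O(log n).
unsigned int laneFrequency(const std::vector<char>& lane,
                           const char* token, std::size_t tokenSize)
{
    const std::size_t entrySize = tokenSize + sizeof(unsigned int);
    std::size_t lo = 0;
    std::size_t hi = lane.size() / entrySize;   // number of entries in the lane

    while (lo < hi)
    {
        const std::size_t mid = lo + (hi - lo) / 2;
        const char* entry = &lane[mid * entrySize];
        const int cmp = std::memcmp(entry, token, tokenSize);

        if (cmp < 0)
            lo = mid + 1;
        else if (cmp > 0)
            hi = mid;
        else
        {
            unsigned int freq;
            std::memcpy(&freq, entry + tokenSize, sizeof freq);
            return freq;                        // token found
        }
    }
    return 0;                                   // token not in this lane
}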

Maintaining insertion speed (almost)

Maps allow fast insertions in O(log n), so we keep an intermediate map for each memory lane. When a classifier is trained, new tokens go into the map, while the frequency of tokens that already exist in the memory lane is increased in place. When the training session is over, the intermediate maps are merged into their respective memory lanes. This can be done in O(n) and is the major penalty. Note that explicit sorting is never required since maps are ordered. Another penalty occurs when both the map and the memory lane contain tokens – at that point two lookups may be needed (first in the memory lane and, if the token doesn’t exist there, a search through the map).
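The merge itself can then be a single pass over the (already sorted) lane and the (inherently ordered) intermediate map. A rough sketch, reusing the packed lane layout assumed above and assuming that tokens in the intermediate map are not already present in the lane:

#include <cstring>
#include <map>
#include <string>
#include <vector>

// Append one entry (token bytes + binary frequency) to a packed lane buffer.
void appendEntry(std::vector<char>& lane, const char* token,
                 std::size_t tokenSize, unsigned int freq)
{
    lane.insert(lane.end(), token, token + tokenSize);
    const char* f = reinterpret_cast<const char*>(&freq);
    lane.insert(lane.end(), f, f + sizeof freq);
}

// Merge the intermediate map into its memory lane in O(n). Both inputs are
// already sorted, so one pass is enough and no explicit sorting is required.
std::vector<char> mergeLane(const std::vector<char>& lane,
                            const std::map<std::string, unsigned int>& pending,
                            std::size_t tokenSize)
{
    const std::size_t entrySize = tokenSize + sizeof(unsigned int);
    std::vector<char> merged;
    merged.reserve(lane.size() + pending.size() * entrySize);

    std::size_t i = 0;
    std::map<std::string, unsigned int>::const_iterator it = pending.begin();

    while (i < lane.size() || it != pending.end())
    {
        const bool takeLane =
            it == pending.end() ||
            (i < lane.size() &&
             std::memcmp(&lane[i], it->first.data(), tokenSize) < 0);

        if (takeLane)
        {
            merged.insert(merged.end(), &lane[i], &lane[i] + entrySize);
            i += entrySize;
        }
        else
        {
            appendEntry(merged, it->first.data(), tokenSize, it->second);
            ++it;
        }
    }
    return merged;
}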

This solution reduced memory consumption by a factor of 4-5, at the cost of having to merge new training data into the memory lanes every now and then. This is perfectly fine for us, as training typically decreases over time (the training data becomes good enough) while classification increases.

A similar optimization for Java is described on the LingPipe blog.