Effective Micro-caching in Relational Database
In the previous post, we showed how fast MIA DB can be for a considerably small variety of requests under heavy load. In this case, it is because of caching. MIA DB quickly computes a cache for all the requests and serves the rest of them from the cache. It seems unfair, but it's not. The caching is integral part of MIA DB design. The reason for it is a huge improvement in the speed of requests and their throughput. At the same time, it simplifies a system design. So you basically have all of it or none of it.
Formerly, in GoodData we verified that caching can improve the speed of a heavily loaded business intelligence (BI) system. The system in GoodData was based on caches on application level. The main reason is a lack of caching capabilities integrated into existing databases. The application level caching is complicated for implementation and maintenance.
It has also its limits. The cache can only be made for a whole query. The cache identification can be based only on query definition which limits you to information within tables and columns. It affects caching efficiency because it prevents cache reusability between queries. Cache creation usually implies tables creation and it usually doesn't belong to fast operations in common databases.
In MIA DB, we took things a step further and addressed most of those issues. The caches work on a lower level than the usual query which improves the cache hit ratio. The cache identification is based on exact data content. It nullifies requirements on the application which uses the database. The cached operations are carefully chosen to maximise their reusability. We call this feature automatic micro-caching. This feature enforces a different query execution model which works really well with the typical BI system load.
Let's demonstrate automatic micro-caching in a simple example. Our friend Petr Šimeček examined speed of basic query aggregation on several different databases in one of his blog posts (in Czech). We link here an original result table:
|Redshift dw1.xlarge (50M rows)||15s|
|Redshift dw2.large (50M rows)||7s|
|Redshift dw1.xlarge (500M rows)||182s|
|Redshift dw2.large (500M rows)||53s|
VoltDB result (Test A, 0.07s) is suspicious compared to other databases. There is most probably prepared aggregation result during data load. Even if there are various HW configurations it allows us to put results into context and compare the speed at least in order of magnitude.
The dataset originates from this blog post. Source data contains 50 millions of rows with two columns, time and value. Both values are random which is not a typical situation for real data. Using random data disables compression and some advanced algorithms to be used for computation in our database. The data looks like the one in the following example.
created_on,value 2012-05-02T06:08:47Z,0.9270193106494844 2012-09-06T22:40:25Z,0.005334891844540834 2012-06-15T05:58:22Z,0.05611344985663891 2012-01-05T20:47:19Z,0.2171613881364465 2012-02-10T00:35:17Z,0.4581454689614475 2012-06-19T17:46:41Z,0.9841521594207734 2012-08-20T21:42:19Z,0.3296361684333533 2012-02-24T20:29:17Z,0.9760254160501063
The databases were tested with a simple query which computes four aggregated values (metrics) for each day like following SQL.
SELECT EXTRACT(year FROM created_on), EXTRACT(day FROM created_on), COUNT(*), MIN(value), MAX(value), AVG(value) FROM RandomData_T GROUP BY EXTRACT(year FROM created_on), EXTRACT(day FROM created_on) ORDER BY EXTRACT(year FROM created_on), EXTRACT(day FROM created_on);
First we ran the query in PostgreSQL, Vertica and MIA DB on various HW to find out vertical scalability.
|T430s 4c||AWS 8c||AWS 16c|
In the table above we can see nice vertical scaling and in this case we are faster than those databases. It is nice, but there is not much visible caching involved yet. So far, it would just be a little bit faster of a database.
But what would happen if we loaded an additional million rows? The same query takes only 38ms! It's 4% of the original time. In other databases, it usually takes an even longer time than in the first example. So now things start to become interesting. We gain almost two orders of magnitude in speed.
Another nice feature is that it works right out of the box. We don't need to give MIA DB a hint that the data is loaded incrementally. You can even do a full data load to a new table or even in a new customer project and it just works. The reason is MIA DB operates on a raw data level and automatically detects the same data and the same operation in them.
Let's explore this feature in more details. The original query computes four aggregations at once. In real BI systems there are usually many queries which come in an unpredictable order and time. For example, there are four different reports at the same dashboard which computes different metrics. In this case, we will compute four queries. The first one will compute SUM, the second one MIN and MAX and so on. Each row in the following table is separate query execution. It's run sequentially, but it can run in any particular order.
|SELECT||No-caching DB||MIA||Speed up|
As we can see, MIA DB can be a little bit slower for first query execution but it pays off once you execute next queries. First query execution comes with cache overhead. But an interesting and an unusual thing is that MIA DB don't cache whole queries but only partial operations which are needed for computation.
In case of the first query, the caches are created. They speed up every subsequent aggregation on the same table, but it doesn't just affect the same query. It speeds up all subsequent queries which use this table and use different filters. There are usually many facts in the same table in analytical projects so collective speed up grows even more.
The first query can be relatively slow. But following requests are much faster so user experience improves. Psychologically it has an even bigger impact on perceived system performance. Good user experience shapes the way a user interacts with the system. It encourages people to explore and gain knowledge. We can also eliminate the first query problem by computing commonly known reports shortly after a load of new data.
We showed the caching works also for incrementally loaded data. Let us show how it will work with the previous four requests in this scenario.
|SELECT||No-caching DB||MIA||Speed up|
The caching works well in this scenario but has an even bigger effect. The responses in this time range not only affects the immediate user experience. The caching also allows to serve many more concurrent users. There is a big probability that the user will utilise some existing cache even the one that doesn't request exactly the same report. The rest of it is computed fast enough for interactive work.
The speed up is not for free. We traded-off the memory for the gain of the speed which is a well-known technique called caching. The following table shows memory overhead of caching.
Keep in mind the test data is random which makes it hard to compress. In a real project, the situation will usually be several times better.
Let us reiterate that the caching is fully automatic. There are some options on how to speed up the system - which in a particular case can work very well. Eg. precomputing results at the time of load if you know the queries, optimizing particular queries (eg. in Postgres) or adding projections (e.g. in Vertica). But all of these techniques require manual work. Once you need to speed up hundreds of different analytic projects with thousands of unknown queries it doesn't scale well in terms of human resources because human time is costly.
We tried to demonstrate basic concepts of MIA DB on this simple example. The MIA DB has a unique design and it's behaviour is different compared to common databases. Actually urban dictionary nicely sums up what MIA DB is:
She is a crazy girl. Makes jokes most people do not understand. She is veryyyy pretty and has nice eyes. You will be stunned when you meet a Mia.