Inside the Data Industry

Vladimir Makovsky, 2015-01-16

The data industry is moving fast and competition in the field is fierce. Businesses in this industry are thriving, and the reason is the spread of all sorts of IT devices that produce huge amounts of data. Once you own a huge amount of data, you want to process it for a simple reason: data hides a lot of information, such as people's behaviour, shifts in market segments, various historic milestones, etc. Based on such information you can target your marketing campaign, increase your sales revenue, reallocate your resources to a different task, verify rumours and so on and so forth. Those who have the information also have a technological advantage; they can innovate and have the power to influence other people's decisions. It also improves our general knowledge of how the world works. The market is naturally driven by people's lust for knowledge and power, but also by the desire to compete with one another.

Those who don't have information fall behind and can lose their competitive advantage. No wonder venture capitalists have invested over $2 billion in data-processing companies. Currently the big market players are Cloudera ($1.2 billion), MongoDB ($231 mil.), CouchBase ($116 mil.), Vertica ($30 mil., acquired for $350 mil. by HP), ParAccel ($84.5 mil.), Domo ($248 mil.) and GoodData ($101 mil.). Let's not forget the billions invested in-house at companies like SAP (SAP HANA), Oracle, Netezza, Teradata, Google (BigQuery) and Facebook (Presto). This is an ongoing evolution. There will be winners and losers. For technical people interested in the data world, I would recommend a well-maintained blog which covers a lot of current databases and also important BI vendors. It has been around for a long time and brings interesting insights.

Databases

Let's focus just on databases. During the development of our database, I realised how hard it is to find your way around the current world of databases, even for software developers. The term "database" is applied to a wide range of products, including relational databases, graph databases, key-value (document) databases and even indexing and search engines. Non-IT people either don't know what a database is at all or have various notions of it, most commonly that a database is a list of items. I will stick with a broad and vague meaning of the term: a database is a system that provides access to data.

Overall there are hundreds of different databases on the market, though many of them are very niche players. The current world of databases is thus also very vibrant, with quite a lot of old and new rivalries (Oracle vs. Microsoft, Oracle vs. SAP). A new database appears almost every month. For example, a few months ago I spoke about our database with a guy from a small US-based company and, surprisingly, he told me that two of their developers had been working on a key-value database for two years and planned to go public soon. On the other hand, we met the guys from ScalienDB in 2011, while I was working at my previous company. In the end they failed after trying for three years; interestingly, they published a white paper on the design and the reasons why they failed. Also, the analytical database Vectorwise was acquired by Actian, which also acquired ParAccel (AWS Redshift is derived from ParAccel). Vectorwise is either discontinued or being merged with ParAccel. So don't expect a complete list to exist anywhere.

Even though the biggest database player is Oracle, it doesn't have as dominant a position as Microsoft in the PC desktop world, Google in the search industry or Facebook in social networking. Many developers know about Oracle or databases like Microsoft SQL Server, Postgres and MySQL. Some of them may know Teradata, DB2 or MariaDB. Not many of them know about MemSQL, Vertica, VoltDB or SAP HANA. Some may have heard about Presto (Facebook), BigQuery (Google) or Redshift (Amazon). And these are just the best-known databases. Then there are schema-less, or better said key-value, systems which are sometimes also considered databases, e.g. Riak, Cassandra, Memcached, MongoDB, CouchBase and Redis, and also Splunk and ElasticSearch for data processing and monitoring.

Nearly 10 years ago several companies started developing analytical (see below) columnar databases. Interestingly, ParAccel and Vertica both started working on their databases in 2005 as newly established companies. According to John Santaferraro from Actian, he wrote to Michael Stonebraker asking whether he knew any really good database engineers who could join their startup. Stonebraker allegedly responded that he was also doing a startup, which was Vertica. Also in 2005, SAP started work on SAP HANA by quietly acquiring the companies Callixa and P*time and putting distinguished database engineers into one team - the Tracker team. Last but not least, Google's BigQuery is based on Dremel. Dremel has been in production since 2006, so its development must have started even earlier. Is this a coincidence or a sort of corporate espionage :-)?

Separating sheep from goats

Though it doesn't happen very often, problems arise when your database reaches its limits and you start looking for another one. In such a case, you really want to be sure that your choice of database will be the right one for at least the next few years. The reason is that your database sits at the bottom of your system. It is technically hard to replace and takes a lot of effort and company resources which could be spent elsewhere, so the replacement is very expensive. It is usually hard to pick just one database; as the database guru Michael Stonebraker says, "One size doesn't fit all". It always boils down to your use cases, and you must do a lot of assessment and deal with a lot of unknowns and assumptions. With all the options available, there are always trade-offs.

There are a few business and technical criteria when choosing an (analytical) database:

type of deployment - in-house (on-premise) vs. cloud
scalability - cloud scalability vs. scalable on more nodes (up and out) vs. single-node databases
schema vs. schema-less data model (usually key-value)
in-memory vs. disk-based databases
transactional (OLTP) vs. analytical (OLAP)
total cost of ownership - free vs. paid (licensed vs. subscription based models)
set of functional product features


One of the major subdivisions of databases is whether a database is suited to OLAP or OLTP systems. As we develop an analytical database, I focus here just on OLAP-suited databases. The main difference between OLAP and OLTP systems is the read/write ratio and the data volume. In an OLAP system, you ask the system a lot of questions, mostly over a large amount of data. You also typically load data into the system in batches. So reads are much more frequent than writes in an OLAP system. In OLTP it's the other way round, i.e. you need to write or change tiny pieces of data very often. Obviously these areas overlap, but usually a database can excel in only one of the categories. About twenty or thirty years ago, database designers were convinced that the same database would be able both to store data quickly and to serve frequent queries. However, ten or fifteen years ago it turned out that, because of processor architecture, it is better to specialize OLAP-suited databases and design them very differently. SAP claims that they can do both OLTP and OLAP well.
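To make the difference in design a bit more tangible, here is a minimal Python sketch of the two storage layouts that usually sit behind it: row-oriented (typical for OLTP) and column-oriented (typical for the columnar analytical databases mentioned earlier). It is only an illustration under my own assumptions, not how any particular product works; the sample records, field names and function names are all made up for the example.

```python
from typing import Dict, List

# Hypothetical sales records; the field names are illustrative only.
ROWS: List[dict] = [
    {"order_id": 1, "customer": "alice", "amount": 30.0},
    {"order_id": 2, "customer": "bob", "amount": 12.5},
    {"order_id": 3, "customer": "alice", "amount": 7.5},
]


def to_columns(rows: List[dict]) -> Dict[str, list]:
    """Pivot row-oriented records into a column-oriented layout."""
    columns: Dict[str, list] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_amount_row_store(rows: List[dict]) -> float:
    # Analytical question on a row layout: every whole record is visited,
    # even though only one field of it is needed.
    return sum(row["amount"] for row in rows)


def total_amount_column_store(columns: Dict[str, list]) -> float:
    # Same question on a column layout: only the "amount" column is read.
    return sum(columns["amount"])


def update_amount_row_store(rows: List[dict], order_id: int, amount: float) -> bool:
    # OLTP-style point write: find one record and change it in place.
    for row in rows:
        if row["order_id"] == order_id:
            row["amount"] = amount
            return True
    return False
```

Both total functions return the same number; the point is how much data each has to touch. With millions of wide records, reading one column instead of every record is where an analytical database wins, while the in-place point update stays cheap in the row layout.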

The technology has matured so much over the last 10 years that you can now start considering a SaaS system for analysis, or even BI providers like GoodData. Nevertheless, this is a broad topic so I am not going to discuss it here. Let me just point out that using AWS you can have a new running instance of Vertica, SAP HANA or AWS Redshift within 10-40 minutes. BigQuery is perhaps the only true multitenant database - you don't have to install it and you use the API to access it. Google has nearly zero interruptions in service and can update the system whenever they need to. Although this is not the aim of other vendors, it is a real technological and process challenge - that's what separates the boys from the men :-). Google has a giant technological advantage in this area - you have to have a lot of skills, intelligence and know-how.

In the criteria above, I haven't mentioned performance. Everybody knows that time is money; however, I consider speed a feature. It is well summarized by Larry Page:

"As a product manager you should know that speed is product feature number one."

And measuring this feature is always hard. You can go through some benchmark tests, which may help you in the evaluation and in narrowing your choice down to a few candidates. Nevertheless, once you have narrowed down your search, I would recommend testing all candidates in an environment as close as possible to production - using real data and a real workload. Don't expect, don't predict. Measure and get the numbers. Learn from yesterday, live for today and hope for tomorrow. Have a nice day.
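A footnote on "measure and get the numbers": a small timing harness is often enough to get started. The Python sketch below is only an illustration under my own assumptions. It uses an in-memory SQLite database from the standard library as a stand-in for a candidate database, and the table, the query and the function names are invented for the example. In a real evaluation you would replace them with your own connection, your real data and your real workload.

```python
import sqlite3
import statistics
import time


def measure(run_query, repeats=5):
    """Run `run_query` several times and return wall-clock latencies in seconds."""
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Reduce raw latencies to a few numbers that are easy to compare."""
    return {
        "runs": len(latencies),
        "median_s": statistics.median(latencies),
        "worst_s": max(latencies),
    }


# Stand-in for a candidate database: in-memory SQLite with synthetic rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("r%d" % (i % 10), float(i)) for i in range(10000)],
)


def candidate_query():
    # Hypothetical analytical query; use queries from your real workload instead.
    return conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()


report = summarize(measure(candidate_query, repeats=5))
print(report)
```

Looking at the median and the worst run, rather than a single run, keeps one lucky or unlucky execution from deciding the comparison.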
