Databases as a service: FathomDB
I don’t think Google want to use BigTable. I think Google have to use BigTable because of the absurd scale that they’re working at.
Unstructured databases (like Amazon’s SimpleDB and Google’s Data Store – built on BigTable) are great in that they are easy to scale, have an uncomplicated model of how data is stored in them and a simple approach to how that data is queried. These shortcuts and simplifications on the data storage and retrieval are key to the scalability of the databases. The databases have basically been reduced down to being huge distributed hash tables.
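As a rough illustration of the "huge distributed hash table" idea, here is a minimal Python sketch in which each key is hashed to pick one of several shards. The `ShardedStore` class and its in-memory dict "shards" are assumptions made for illustration only; SimpleDB and BigTable partition data in their own, more sophisticated ways.

```python
import hashlib


class ShardedStore:
    """Toy key-value store that spreads keys over in-memory shards."""

    def __init__(self, shard_count):
        # Each dict stands in for a separate machine.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash of the key decides which shard owns it.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.shards)
        return self.shards[index]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)


sharded = ShardedStore(shard_count=4)
for n in range(100):
    sharded.put(f"user:{n}", {"name": f"user {n}"})
print(sharded.get("user:42"))
```

Because a lookup touches only the shard that owns the key, adding machines adds capacity; the price is that any query spanning many keys has to be assembled by the caller.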
Unstructured databases: the downside
Unfortunately, the same simplifications that enable enormous scalability are very punitive in the restrictions they place on us as programmers. When we think about modeling our data, we think that a user HAS-MANY, a manager IS-A employee, a blog comment BELONGS-TO a blog post, and so on. Those relationships are just not representable in unstructured databases; you have to synthesise them yourself in software.
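To make "synthesise them yourself" concrete, here is a minimal Python sketch of a BELONGS-TO relationship maintained by hand. A plain dict stands in for the unstructured store, and the key naming scheme and the helper names (`put_post`, `put_comment`, `comments_for`) are invented for illustration; they are not any real store's API.

```python
# A plain dict stands in for an unstructured key-value store.
store = {}


def put_post(post_id, title):
    store[f"post:{post_id}"] = {"title": title}
    # The application keeps its own index of comment ids per post.
    store.setdefault(f"post_comments:{post_id}", [])


def put_comment(comment_id, post_id, body):
    # BELONGS-TO is only a convention here: the store itself never
    # checks that post_id refers to a post that exists.
    store[f"comment:{comment_id}"] = {"post_id": post_id, "body": body}
    store.setdefault(f"post_comments:{post_id}", []).append(comment_id)


def comments_for(post_id):
    # The "join" is done by hand, one lookup per comment id.
    ids = store.get(f"post_comments:{post_id}", [])
    return [store[f"comment:{cid}"]["body"] for cid in ids]


put_post(1, "Databases as a service")
put_comment(10, 1, "First!")
put_comment(11, 1, "Nice post")
print(comments_for(1))
```

Every piece of bookkeeping in this sketch (the index key, keeping it in sync, following the ids) is code that the application has to own.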
These connections between objects are not merely the result of us all having been conditioned to “think in SQL” over the last 35 years. Rather, these are the real relationships between the actual objects we’re modelling; it was SQL that was designed so that it matched reality, not the other way round.
Your average small to medium startup company does not need to store the entire internet in a database, so unstructured databases burden us with unneeded, inconvenient over-simplifications. As Einstein said:
Things should be as simple as possible, but not simpler.
Unstructured databases: the upside
And yet there is a proto-trend towards using these unstructured databases. The reason is that they are much, much easier to offer “as a service”, and having someone else manage your database makes a lot of sense to startups. Database maintenance is a huge time sink, and small companies should strive to spend as much time as possible on differentiating features and as little time as possible on mundane admin tasks and overhead.
That SimpleDB and Data Store liberate you from the need to be a DBA is a big enough draw for a lot of people to live with the downsides that the unstructured product brings. You don’t need to configure backups, implement master / slave replication, tweak performance parameters, set up sharding; the list goes on and on. The surfeit of “how do I set up replication” questions on MySQL forums is a testament to how hard that is for the inexperienced.
Surely, then, the ideal situation would be a pay-as-you-go, managed database provider with true relational capabilities? WebMynd have been lucky enough to be using just such a database over the last year: FathomDB, which launches in private beta today.
FathomDB: relational and managed
To us, FathomDB just looks like a normal MySQL database. There’s no time wasted figuring out how to convert your data model into a denormalised form, and existing databases can be easily converted to run on FathomDB.
However, that normal-looking MySQL database is fully managed, so that we don’t have to worry about backups, monitoring, replication: the very same things that SimpleDB and Data Store relieve you from worrying about.
The time savings have been huge. On the backend, we’ve been able to spend our time working on new search technology without having to worry about database admin tasks at the same time. We’ve been able to focus on valuable features, relevant to our business and company, and been liberated from spending time playing the DBA.
Scaling
As we’ve grown over the last few months, FathomDB have used us as a proving tool for their scalability, and we are now inserting 5 million new rows every day; that’s roughly 60 rows per second on average, although we spike up to around 100 per second. For comparison, there are 2.8 million documents, total, in the English Wikipedia: being able to handle this scale should be more than sufficient for the vast majority of startups.
FathomDB bills itself as “databases as a service”. The difference between FathomDB and databases like SimpleDB and Data Store is that it really is a database. All the features of RDBMSes that you know and want are available, with the added benefit of a pay-as-you-go pricing model and a fully managed service.
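As a quick check of the arithmetic, the average rate follows directly from the daily row count; the variable names below are just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

rows_per_day = 5_000_000
average_rate = rows_per_day / SECONDS_PER_DAY

print(f"{average_rate:.1f} rows per second")  # prints "57.9 rows per second"
```

So "60 rows per second" is the average rounded up slightly, and a spike to around 100 per second is a bit under twice that average.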
Filed under: technical | 11 Comments
“user HAS-MANY, a manager IS-A employee, a blog comment BELONGS-TO a blog post, and so on. Those relationships are just not representable in non-relational databases; you have to synthesise them yourself in software.”
That’s either stunningly ignorant or you have a different definition of “not representable” than I do.
@Alan: This point was not fully explored for brevity’s sake. It really boils down to the fact that databases like SimpleDB are missing two heavily used RDBMS features: the JOIN operator and foreign keys.
Even though JOIN is a fundamental relational theory operator, you can simulate a good deal of its functionality by, for example, throwing several object types into one domain, then applying a bunch of UNION and INTERSECT operations. But this massive denormalisation is non-trivial to get right, involves an intermediate step of translating your schema into a denormalised form, and most importantly, you have to write, run and manage all that code yourself. It’s all just there in RDBMSes.
Foreign keys are still harder to fake out. Restrictions and cascades on write operations are not available, so you need to implement them yourself. Even harder to get right are multi-table transactions.
So, it’s not as fundamental a limitation as, for example, Turing complete vs. non-Turing complete; you can simulate the full set of relational theory operators on top of a non-relational database. My point is that you have to do all that simulation yourself: relations aren’t representable in a non-relational database; it’s left to the programme using the database to do this.
@Alan: It’s probably the latter. Key-value data stores like BigTable can implicitly represent these relationships, but the onus is on the application programmer to set them up, maintain them, enforce them, etc. I think that’s the point he’s making.
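For contrast, here is a small sketch of what is "just there" in an RDBMS: a JOIN, a foreign key restriction and a cascading delete. It uses Python's built-in sqlite3 module purely for convenience (the post itself is about MySQL), and the `posts` and `comments` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys once this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute(
    """CREATE TABLE comments (
           id INTEGER PRIMARY KEY,
           post_id INTEGER NOT NULL REFERENCES posts(id) ON DELETE CASCADE,
           body TEXT)"""
)
conn.execute("INSERT INTO posts VALUES (1, 'Databases as a service')")
conn.execute("INSERT INTO comments VALUES (10, 1, 'First!')")

# JOIN: the database pairs each comment with its post.
joined = conn.execute(
    "SELECT p.title, c.body FROM comments c JOIN posts p ON p.id = c.post_id"
).fetchall()

# Restriction: a comment pointing at a missing post is rejected.
try:
    conn.execute("INSERT INTO comments VALUES (11, 99, 'orphan')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Cascade: deleting the post removes its comments as well.
conn.execute("DELETE FROM posts WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]

print(joined, orphan_rejected, remaining)
```

Each of these three behaviours is declared in the schema rather than written and maintained as application code.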
More importantly though, the piece of info missing from this article is relative pricing for the service. How much does it cost for practical use?
@Marco: absolutely, pricing is very important – that’s why I’m leaving it for FathomDB to answer that question… I’m just a happy user
Well, the data model (e.g. key-value vs relational) is orthogonal to whether it’s automagically managed. In terms of data model, I think key-value stores have at least three benefits over an RDBMS:
1. A flexible data model that can easily capture semi-structured information (increasingly relevant in a web 2.0 / UGC / decentralized-info-creation world that generates more and more such datasets).
2. A schemaless model that allows bottom-up, data-first design rather than the pre-defined and static approach of the relational database.
3. A data model that lends itself extremely well to interfacing over REST and to JSON-like serialization. See CouchDB for an excellent example of this.
Having said that, I couldn’t agree more with James and Marco in terms of the drawbacks of a key-value store: no support[1] for expressing relationships. You can certainly encode that information in the values of objects, but as Marco said, the onus is then on the application layer to manage that state.
How entities are related to each other is an incredibly important part of the semantics of a domain. On a philosophical level, I guess one could even say that knowledge is all about how concepts are related to each other. I believe it’s fundamental to a data model to support that, in particular in this increasingly connected web 2.0+ day and age.
In a graph database like Neo4j (disclaimer: /me member), relationships are first-class citizens. They connect two nodes and both nodes and relationships can hold an arbitrary amount of key-value pairs. So you can look at a graph database as a key-value store, with full support for relationships.
http://neo4j.org
Neo4j has 1 (semi-structure) and 2 (schema-less-ness) from above, but not yet a standardized REST API. We’re working to figure that out, if you’re interested in such things feel free to join the discussion on the mailing list.
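To illustrate the shape of that data model, here is a plain-Python sketch of nodes and relationships that both carry key-value properties. This is not Neo4j's API; the `Node`, `Relationship` and `Graph` classes are hypothetical and exist only to show relationships as first-class objects.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    props: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node
    end: Node
    rel_type: str
    props: dict = field(default_factory=dict)


class Graph:
    def __init__(self):
        self.relationships = []

    def relate(self, start, end, rel_type, **props):
        # The relationship is stored as an object in its own right.
        rel = Relationship(start, end, rel_type, props)
        self.relationships.append(rel)
        return rel

    def neighbours(self, node, rel_type):
        # Traverse outgoing relationships of one type from a node.
        return [r.end for r in self.relationships
                if r.start is node and r.rel_type == rel_type]


graph = Graph()
post = Node({"title": "Databases as a service"})
comment = Node({"body": "First!"})
rel = graph.relate(comment, post, "BELONGS_TO", position=1)
print(graph.neighbours(comment, "BELONGS_TO")[0].props["title"])
```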
[1] As defined by Stroustrup: A data model (he said language) is said to support a feature if it provides facilities that make it convenient (reasonably easy, safe and efficient) to use the feature.
-EE
Careful. It’s confusing because you seem to be using “non-relational” and “unstructured like SimpleDB/BigTable” interchangeably. But there are plenty of (kinds of) databases which are both non-relational, and still have structure.
For example, those of us who have used object- or navigational databases are frustrated/confused that SQL forces you to do a table JOIN just to traverse between records. (I need an ORM just to be able to say “blog.author.name”?) In that, I would say relational databases (at least SQL-based ones) are far *worse* at modeling relationships than some other types of databases.
@Ken: Good point – I was initially using the name “denormalised data stores”, but went for the pithier “non-relational” label. You’re quite right though – object databases definitely don’t fall into the same category as those I’m referring to – I’m going to steal your “un-structured” suggestion instead!
The post raises some good points. The Google and Amazon examples are very important because they demonstrate that, in order to shard, you need simplicity and denormalization *anyway*, with caching on top (which was already implemented as in-memory key-value stores), so you arrive at the KV idea quite naturally.
I do accept that maintaining relationships in code can have its downsides. For example, in order to add a new relationship to an existing 1:N, you read the entire list, add one, write the entire list back. Could this get bad if N > 5000? Sure it would, but an N that high tells me the programmer should be dividing up this query.
App logic must be careful when maintaining denormalized data. This should also be manageable if the app programmer has a certain discipline in designing his/her code. I plan to set up a pub/sub style event notification system between various DAOs in my next project utilizing a KV store, so inserts and updates trigger additional actions in other parts of the code without explicit connections.
A big upside to KV stores is this: “the representation defines how you access your data”. There are no surprises. I have to say, after years of SQL, I do like getting this kind of control back. It also allows me to be more creative in defining my data, my representation. I can have Maps of Lists pointing to other Maps with objects in them. It all gets streamed in the Value part of the KV.
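A rough Python sketch of that read-modify-write cycle follows, together with one possible way of "dividing it up": splitting the list into fixed-size chunks so each write touches only a small value. The dict-backed store, the key names and the `CHUNK_SIZE` value are all assumptions made for illustration.

```python
import json

store = {}          # stand-in for the KV store; values are JSON strings
CHUNK_SIZE = 1000   # hypothetical limit on items per stored value


def add_child_naive(parent, child):
    key = f"{parent}:children"
    children = json.loads(store.get(key, "[]"))  # read the entire list
    children.append(child)
    store[key] = json.dumps(children)            # write the entire list back


def add_child_chunked(parent, child):
    count = int(store.get(f"{parent}:count", "0"))
    key = f"{parent}:children:{count // CHUNK_SIZE}"
    chunk = json.loads(store.get(key, "[]"))     # read at most one chunk
    chunk.append(child)
    store[key] = json.dumps(chunk)
    store[f"{parent}:count"] = str(count + 1)


for n in range(2500):
    add_child_chunked("post:1", n)
print(len(json.loads(store["post:1:children:2"])))  # prints 500
```

The chunked version keeps each write small, but reading all children now takes several lookups, and neither version is safe against concurrent writers without extra coordination.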
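One way such a pub/sub arrangement could look is sketched below in Python. The `EventBus`, `CommentDao` and `PostStatsDao` names and the "comment.inserted" topic are hypothetical; the point is only that the comment DAO never refers to the code that maintains the denormalized count.

```python
from collections import defaultdict


class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in self._subscribers[topic]:
            handler(payload)


class CommentDao:
    def __init__(self, store, bus):
        self.store, self.bus = store, bus

    def insert(self, comment_id, post_id, body):
        self.store[f"comment:{comment_id}"] = {"post_id": post_id, "body": body}
        # Announce the write; the DAO does not know who is listening.
        self.bus.publish("comment.inserted", {"post_id": post_id})


class PostStatsDao:
    def __init__(self, store, bus):
        self.store = store
        bus.subscribe("comment.inserted", self.on_comment_inserted)

    def on_comment_inserted(self, event):
        # Keep a denormalized comment count up to date.
        key = f"post_stats:{event['post_id']}"
        stats = self.store.get(key, {"comment_count": 0})
        stats["comment_count"] += 1
        self.store[key] = stats


store = {}
bus = EventBus()
comments = CommentDao(store, bus)
stats = PostStatsDao(store, bus)
comments.insert(10, 1, "First!")
comments.insert(11, 1, "Nice post")
print(store["post_stats:1"])
```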
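As a small illustration of streaming a nested structure into the value, here is a Python sketch that serializes maps of lists of maps as JSON. JSON is just one possible encoding, and the dict-backed store and the "user:alice" key are made up for the example.

```python
import json

store = {}  # stand-in for the KV store

# Maps of lists pointing to other maps, all under a single key.
profile = {
    "name": "alice",
    "posts": [
        {"title": "Databases as a service", "tags": ["db", "hosting"]},
        {"title": "Scaling", "tags": ["sharding"]},
    ],
    "settings": {"notifications": {"email": True}},
}

store["user:alice"] = json.dumps(profile)   # whole structure in one value
loaded = json.loads(store["user:alice"])    # one read brings it all back

print(loaded["posts"][0]["tags"])
```

The shape in which the data is stored is exactly the shape in which it is read back, which is the "no surprises" property described above.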