How can we bring the query to the data?
Baron recently wrote about sending the query to the data, looking at distributed systems like Cassandra. I want to take a look at simpler systems like MySQL and see how we’re doing in this space.
It is obvious that getting computations as close to the data as possible is the most efficient as we will likely have less data to work with on the higher level in this case. Internally, MySQL starts add optimizations which help in this regard, such as Index Condition Pushdown allows the storage engine to do most rudimentary data filtering, which improves efficiency.
The more important case, though, is the Application-Database interaction. Modern applications often have quite complicated logic which might not map to SQL very well. The framework and the practices developers follow can only add to this problem. As a result, the Application may be issuing a lot of queries to the database doing computation inside the application and paying a lot of inefficiency and latency for transferring data back and forth. Latency is really a key here as accessing data through the network is thousands of times slower than accessing data in memory, so many simple data processing algorithms that access data row by row simply do not work.
For some tasks, simply learning SQL (including some voodoo practices with user variables, etc.) and using it correctly is good enough to do efficient computations with single round trip, for others – it might not map to SQL very well.
There is a known solution to this problem which existed in many database systems for decades – stored procedures. These would allow you to store programs inside database servers so they can work with data locally, often accessing it with much lower latency so you can implement a much broader set of algorithms. Stored procedures also offer other advantages such as giving more security and more control over what the application does to the DBA, but they have limitations too.
There are MySQL limitations – Stored Procedures restrict transparency, are hard to debug, do not perform very well and need to be implemented in a programming language from the 70s. All of this can be fixed in time. The design limitation, though, is stored procedures that do not support some of the modern development and operational practices very well.
If you look at a lot of modern applications with database backends, they have the “code,” which is something that lives in your version control system, changed quickly and such changes can be developed in production many times a day – this approach has proven to be successful for many modern web application. This is all as long as no database change is needed… because the database does not like you to be agile. Changing database schema, indexes, etc., takes significant effort and can’t be taken lightly. They also require a different process as you can’t just “deploy changed database structure” as you do with data; you have to have a migration process which can be ranging from trivial (such as adding a column) to rather complicated – such as a major database redesign. In fact, reducing pain from database maintenance by having no schema is one of the major draws for NoSQL systems, which offer a lot more flexibility.
The code also can often be deployed in a rolling fashion to make it downtime-free, then more than one version of the code is allowed to run in production for a short or long period of time. For example, when deploying the performance-optimized version of code I can deploy it only on one web server for a few days to ensure there are no surprises before full-scale deployment
The problem with Stored Procedures is that they take some middle place. They are really the code, which is part of the application, but they live in the database and update through changes to the database, which is global for all application servers. This makes it so that Developers can’t use the same editing tool to just edit code, save it, and see how it runs in the test system without doing extra work. The second problem is solved by some people by implementing some versioning in the database, so instead of CALL MAKE_PAYMENT(…) they will use CALL MAKE_PAYMENT_(…) which is however also quite the pain in the butt.
So what would be the alternative ?
If implementing something like this one, we would need to be extra careful as it would allow application developers to kill database servers either more effectively than they do now – the proper limit on resource usage (CPU, memory, etc.) would need to be enforced. We also would need to be extra careful with security, as the ability to change the “program” allows hackers a lot more ways to express their creativity than SQL injection currently does.
The downside of this approach compared to really “stored” procedures from a performance standpoint is their dynamic nature – each would need to be compiled for execution from scratch. I think this could be substantially optimized with modern technologies and it is a small price to pay for the power to avoid many round trips and get a lot more power on the data processing local to the database.
In the end I think something along those lines could be quite helpful in expending usability of relational databases for modern applications. What are your thoughts?
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)