
Hi all,

I'm seeking feedback for a project I'd like to start. I have a bit of experience developing large scale systems containing many microservices, databases, message queues, and caches over many VMs. Time and time again I find myself confronted with the same problems:

1. It is difficult to trace events through the system: Consider an HTTP request made by a customer to a public API. Which microservices were impacted by that request? What SQL queries were run as a result of that request? What 3rd party APIs were consulted during the request's fulfillment? Answers to these questions are essential to fixing bugs quickly, and yet they are so difficult to answer (at least in my experience).

2. Problems are difficult to reproduce: When Customer Success walks in and says, "I have an angry customer on the phone. They want to know why [FOO] wasn't properly [BAR]" it is often impossible to give an answer without interactive troubleshooting and hours of grepping through unstructured log files. Troubleshooting may incur additional expenses too, since (for instance) you may hit your API request limit for a 3rd party service.

3. Business and non-business logic are not well encapsulated: Often I see code related to (for example) RabbitMQ interwoven with core business logic when calls need to be made to other microservices. The fact that RabbitMQ facilitates communication between microservices is an implementation detail that I shouldn't have to think about.

4. Resource consumption is non-uniform: Some microservices are more demanding than others in terms of CPU, memory, and disk usage. Achieving optimal "packing" is difficult. In other words, some VMs will have a high load and others will remain idle. Auto scaling groups can help with this in theory, but I don't think they can achieve the kind of density I would like to see. Moreover, what constitutes a "resource"?
If a 3rd party service rate limits requests by IP address, couldn't each request be considered a resource unit which needs to be properly load balanced, just as you would with CPU?

Given these motivations, I would like to flesh out some ideas for a framework/platform which addresses these issues. These ideas are half-baked and may not tie in well with one another. I envision a distributed system as follows:

1. One kind of VM: DevOps people have a saying: "Treat your VMs like cattle, not pets". In practice, "cattle" becomes "cows, chickens, pigs, and lobster". VMs typically have an assigned role, and they become part of a group which may or may not be auto-scaling. For a given instantiation of this hypothetical platform, I would like to see a single kind of VM. That is, every VM is identical to every other VM, and they all run the same Haskell application.

2. Strict separation of business and non-business logic: The framework should handle all aspects of communication between nodes (like Cloud Haskell does) in a pluggable and transparent way, but that's not all. The framework should have first class support for other integrations (such as PagerDuty alerting, performance monitoring, etc) which are described below.

3. Pool coordination via DSL: The entire pool of VMs is orchestrated/coordinated by one or more "scripts" written in a DSL, which is implemented as a Free Monad. Every single "operation" or "primitive" in your AST data type is Serializable, and when the framework interprets the DSL, it serializes the instruction and sends it over the network to a node for execution. The particular node on which the instruction gets executed is chosen by the platform, not the developer.

4. Smart resource consumption: Each node brings with it a set of resources. It is *not* my intention to create a system which views CPU, memory, etc as a contiguous unit.
Rather, each primitive instruction in the AST is viewed as a "black box" which can only consume as much CPU and memory as the node has available to it. The framework is responsible for profiling each instruction and scheduling future instructions to a node for which resources are predicted to be available. The developer should be able to define new resources such as 3rd party API calls, bandwidth, database connections, etc, all of which are profiled just as CPU and memory would be.

5. Browser based control panel: Engineers should have a GUI at their disposal which allows them to watch -- in real time -- the execution flow of the DSL script.

6. Structured logs with advanced filtering: All log output should be structured with first class support for shipping the data to Logstash/ElasticSearch. The aforementioned GUI should be able to selectively filter output based on certain pre-defined predicates and display them to the developer. For example, if you're building an email virus scanning system (which may see millions of emails per day), you may want to limit the real-time debugging output to only a specific customer.

7. First class integration with modern tools and services: The system should integrate with Consul, PagerDuty, statsd, RabbitMQ, memcache, DataDog, Logstash, and Slack, with new integrations being easy to add. This is vital for clean separation of business and non-business logic. For example, the developer should be able to cache certain bits of data at will, without having to worry about opening and managing a TCP connection to memcache.

This is my vision, and I want to build it completely in Haskell. What do you all think?

-- Alex
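
P.S. To make points 3 and 4 of the design a bit more concrete, here is a very rough sketch of a Free Monad DSL whose primitives are serialized and handed to a node chosen by the platform. Everything in it is hypothetical and only for illustration: the names (Op, Instr, Node, pickNode, execute, freeSlots) are made up, the free monad is hand-rolled so that the example needs nothing beyond base, Show/Read stand in for a real wire format (something like the Serializable class in Cloud Haskell), and "sending over the network" is simulated by a pure function.

```haskell
{-# LANGUAGE DeriveFunctor #-}

import Data.List (minimumBy)
import Data.Ord (comparing)

-- A hand-rolled free monad, so this sketch depends only on base.
data Free f a = Pure a | Free (f (Free f a))

instance Functor f => Functor (Free f) where
  fmap g (Pure a)  = Pure (g a)
  fmap g (Free fa) = Free (fmap (fmap g) fa)

instance Functor f => Applicative (Free f) where
  pure = Pure
  Pure g  <*> x = fmap g x
  Free fg <*> x = Free (fmap (<*> x) fg)

instance Functor f => Monad (Free f) where
  Pure a  >>= k = k a
  Free fa >>= k = Free (fmap (>>= k) fa)

-- Hypothetical primitives. Only this payload crosses the wire.
data Op = FetchUrl String | RunQuery String
  deriving (Show, Read, Eq)

-- One instruction: a serializable Op plus a continuation on its result.
data Instr next = Instr Op (String -> next)
  deriving Functor

type Script = Free Instr

fetchUrl, runQuery :: String -> Script String
fetchUrl u = Free (Instr (FetchUrl u) Pure)
runQuery q = Free (Instr (RunQuery q) Pure)

-- A node with one abstract, developer-defined resource counter.
data Node = Node { nodeName :: String, freeSlots :: Int }
  deriving (Show, Eq)

-- The platform, not the script author, picks the node: most free slots,
-- ties broken by name. Assumes a non-empty pool.
pickNode :: [Node] -> Node
pickNode = minimumBy (comparing (\n -> (negate (freeSlots n), nodeName n)))

-- Stand-in for real execution on the remote node.
execute :: Op -> String
execute (FetchUrl u) = "body of " ++ u
execute (RunQuery q) = "rows for " ++ q

-- Interpret a script: serialize each Op, "ship" it to the chosen node,
-- run it there, and record (node, wire message) as a trace.
run :: [Node] -> Script a -> (a, [(String, String)])
run _     (Pure a)            = (a, [])
run nodes (Free (Instr op k)) =
  let node   = pickNode nodes
      wire   = show op                  -- serialize
      result = execute (read wire)      -- deserialize and run on the node
      nodes' = [ if n == node then n { freeSlots = freeSlots n - 1 } else n
               | n <- nodes ]
      (a, trace) = run nodes' (k result)
  in  (a, (nodeName node, wire) : trace)

example :: Script String
example = do
  page <- fetchUrl "http://example.com"
  rows <- runQuery "SELECT 1"
  pure (page ++ " / " ++ rows)

main :: IO ()
main = do
  let (out, trace) = run [Node "a" 1, Node "b" 1] example
  putStrLn out
  mapM_ print trace
```

With two nodes that each have a single free slot, the first instruction lands on node "a" and the second on node "b". The trace returned by run is also roughly the kind of structured record that the tracing and control panel ideas would need, though a real implementation would obviously involve actual networking, profiling, and failure handling that this sketch ignores.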