Project idea, seeking feedback

Hi all,

I'm seeking feedback for a project I'd like to start. I have a bit of experience developing large-scale systems containing many microservices, databases, message queues, and caches spread over many VMs. Time and time again I find myself confronted with the same problems:

1. It is difficult to trace events through the system: Consider an HTTP request made by a customer to a public API. Which microservices were impacted by that request? What SQL queries were run as a result of that request? What 3rd party APIs were consulted during the request's fulfillment? Answers to these questions are essential to fixing bugs quickly, and yet they are so difficult to answer (at least in my experience).

2. Problems are difficult to reproduce: When Customer Success walks in and says, "I have an angry customer on the phone. They want to know why [FOO] wasn't properly [BAR]", it is often impossible to give an answer without interactive troubleshooting and hours of grepping through unstructured log files. Troubleshooting may incur additional expenses too, since (for instance) you may hit your API request limit for a 3rd party service.

3. Business and non-business logic are not well encapsulated: Often I see code related to (for example) RabbitMQ interwoven with core business logic when calls need to be made to other microservices. The fact that RabbitMQ facilitates communication between microservices is an implementation detail that I shouldn't have to think about.

4. Resource consumption is non-uniform: Some microservices are more demanding than others in terms of CPU, memory, and disk usage. Achieving optimal "packing" is difficult; in other words, some VMs will have a high load while others remain idle. Auto scaling groups can help with this in theory, but I don't think they can achieve the kind of density I would like to see. Moreover, what constitutes a "resource"? If a 3rd party service rate-limits requests by IP address, couldn't each request be considered a resource unit which needs to be properly load-balanced, just as you would with CPU?

Given these motivations, I would like to flesh out some ideas for a framework/platform which addresses these issues. These ideas are half-baked and may not tie in well with one another.

I envision a distributed system as follows:

1. One kind of VM: DevOps people have a saying: "Treat your VMs like cattle, not pets". In practice, "cattle" becomes "cows, chickens, pigs, and lobster". VMs typically have an assigned role, and they become part of a group which may or may not be auto-scaling. For a given instantiation of this hypothetical platform, I would like to see a single kind of VM. That is, every VM is identical to every other VM, and they all run the same Haskell application.

2. Strict separation of business and non-business logic: The framework should handle all aspects of communication between nodes (like Cloud Haskell does) in a pluggable and transparent way, but that's not all. The framework should have first class support for other integrations (such as PagerDuty alerting, performance monitoring, etc.), which are described below.

3. Pool coordination via DSL: The entire pool of VMs is orchestrated/coordinated by one or more "scripts" written in a DSL, which is implemented as a free monad. Every single "operation" or "primitive" in your AST data type is Serializable, and when the framework interprets the DSL, it serializes the instruction and sends it over the network to a node for execution. The particular node on which the instruction gets executed is chosen by the platform, not the developer. (A rough sketch is in the P.S. below.)

4. Smart resource consumption: Each node brings with it a set of resources. It is *not* my intention to create a system which views the cluster's CPU, memory, etc. as one contiguous pool. Rather, each primitive instruction in the AST is viewed as a "black box" which can only consume as much CPU and memory as the node has available to it. The framework is responsible for profiling each instruction and scheduling future instructions onto a node for which resources are predicted to be available. The developer should be able to define new resources such as 3rd party API calls, bandwidth, database connections, etc., all of which are profiled just as CPU and memory would be.

5. Browser-based control panel: Engineers should have a GUI at their disposal which allows them to watch -- in real time -- the execution flow of the DSL script.

6. Structured logs with advanced filtering: All log output should be structured, with first class support for shipping the data to Logstash/ElasticSearch. The aforementioned GUI should be able to selectively filter output based on certain pre-defined predicates and display it to the developer. For example, if you're building an email virus-scanning system (which may see millions of emails per day), you may want to limit the real-time debugging output to only a specific customer.

7. First class integration with modern tools and services: The system should integrate with Consul, PagerDuty, statsd, RabbitMQ, memcache, DataDog, Logstash, and Slack, with new integrations being easy to add. This is vital for clean separation of business and non-business logic. For example, the developer should be able to cache certain bits of data at will, without having to worry about opening and managing a TCP connection to memcache.

This is my vision, and I want to build it completely in Haskell. What do you all think?

-- Alex
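
P.S. To make points 3 and 7 a bit more concrete, here is a very rough sketch of the kind of DSL I have in mind. All names, primitives, and types are purely illustrative, and the genuinely hard part -- serializing instructions whose continuations are functions -- is hand-waved entirely:

    {-# LANGUAGE DeriveFunctor #-}
    module DSLSketch where

    -- Purely illustrative sketch of the DSL idea from point 3.
    import Control.Monad.Free (Free, liftF)
    import Data.ByteString (ByteString)

    -- Each constructor is one primitive instruction. In the real system the
    -- payloads would carry Binary/Serializable instances so an interpreter
    -- could ship them to whichever node the platform picks.
    data Op next
      = CallService String ByteString (ByteString -> next) -- talk to another service
      | CacheGet    String (Maybe ByteString -> next)      -- memcache, behind the scenes
      | CachePut    String ByteString next
      | LogEvent    [(String, String)] next                 -- structured key/value log line
      deriving Functor

    type Script = Free Op

    callService :: String -> ByteString -> Script ByteString
    callService name payload = liftF (CallService name payload id)

    cacheGet :: String -> Script (Maybe ByteString)
    cacheGet key = liftF (CacheGet key id)

    cachePut :: String -> ByteString -> Script ()
    cachePut key value = liftF (CachePut key value ())

    logEvent :: [(String, String)] -> Script ()
    logEvent fields = liftF (LogEvent fields ())

    -- Business logic mentions no RabbitMQ, no memcache connections, and no
    -- logging library; those are the interpreter's problem, and the
    -- interpreter also decides which node runs each instruction.
    handleRequest :: String -> Script ByteString
    handleRequest customerId = do
      logEvent [("event", "request.start"), ("customer", customerId)]
      cached <- cacheGet customerId
      case cached of
        Just result -> pure result
        Nothing     -> do
          result <- callService "billing" mempty
          cachePut customerId result
          pure result

The point is only that handleRequest contains nothing but business logic; everything else lives in the interpreter.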

Hi Alex,
this is obviously highly ambitious if you want to get it right. If you actually plan to start on it: there are already plenty of DSLs that eventually run on one machine, so where I would start is the distributed part. I.e., make something that passes around an Int, and have it deploy to any number of machines. Then gradually add complexity: distributed queues and workers, ways to enforce ordering on when the results of work are to be submitted or accepted. Implementing precedence graphs would be interesting. Then there is limiting congestion, and probably many more kinds of limits you will want to add at different points in the graph.

Just some random ideas. But start by building and deploying something very simple.
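
For example, a first milestone could be as small as this (purely illustrative, using distributed-process on a single node; note that the createTransport signature differs a little between network-transport-tcp versions):

    module Main where

    -- Purely illustrative: two processes on one node passing an Int around.
    -- Distributing the worker to other machines later is a matter of
    -- spawning it on a remote NodeId instead of locally.
    import Control.Distributed.Process
    import Control.Distributed.Process.Node (initRemoteTable, newLocalNode, runProcess)
    import Control.Monad.IO.Class (liftIO)
    import Network.Transport.TCP (createTransport, defaultTCPParameters)

    main :: IO ()
    main = do
      -- Pre-0.6 network-transport-tcp signature; newer versions differ slightly.
      Right transport <- createTransport "127.0.0.1" "10501" defaultTCPParameters
      node <- newLocalNode transport initRemoteTable
      runProcess node $ do
        self <- getSelfPid
        -- Worker: wait for an Int, increment it, send it back.
        worker <- spawnLocal $ do
          n <- expect :: Process Int
          send self (n + 1)
        send worker (41 :: Int)
        reply <- expect :: Process Int
        liftIO (putStrLn ("worker replied: " ++ show reply))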
-- Markus Läll

Hi Alex,

sounds ambitious. But you might be able to reduce the scope massively by relying on existing tools. Examples:

* Let something like Nagios do the monitoring. I know there are tools to control Nagios from Haskell. What I don't know is how up-to-date they are, and I haven't seen something that reports internal performance data of a Haskell app to Nagios, but that should be simple if necessary.

* Let something like Cassandra handle both the heaviest parts of messaging between your node controllers and the storage of their config data. If you base your WUI on top of the DB, you can separate it from the controllers as well.

* Coordination of resources is a variant of scheduling, which is a "solved" problem. So there should be libraries you can use.

* Logging has been worked on by many a commercial Haskeller. My guess is that filtering is just a matter of looking at one of the libraries from the right angle.

This leaves orchestration, API connectors, and the DSL as the missing parts. Which sounds way more doable than having your tool do all the lifting itself.

Or just use Kubernetes. Whichever is easier. ;)

Cheers,
MarLinn

On Wed, 15 Nov 2017 15:30:37 +0100, MarLinn wrote:

> Hi Alex,
> sounds ambitious. But you might be able to reduce the scope massively by relying on existing tools.

Yes! I do not wish to reinvent the wheel.

> Examples:
> * Let something like Nagios do the monitoring. I know there are tools to control Nagios from Haskell. What I don't know is how up-to-date they are, and I haven't seen something that reports internal performance data of a Haskell app to Nagios, but that should be simple if necessary.

I don't think Nagios is a good fit, because I want to do more than monitor the performance of the interpreter. I want to feed that performance data back into scheduling so that resources are used more effectively. For example, I want to know the load average of a particular node, and then rely on historical performance data for the DSL primitives to decide whether the next instruction should be scheduled on that node or on a different one.
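
Roughly, the decision I have in mind looks something like this (a sketch only; the types, names, and numbers are made up):

    module Scheduler where

    -- Sketch only: pick a node by combining its live stats with a historical
    -- profile of how expensive the primitive has been in the past.
    import           Data.List       (maximumBy)
    import qualified Data.Map.Strict as M
    import           Data.Ord        (comparing)

    type NodeId    = String
    type Primitive = String

    data NodeStats = NodeStats
      { cpuFree :: Double -- fraction of CPU currently free, 0..1
      , memFree :: Double -- fraction of memory currently free, 0..1
      }

    -- Average fraction of a node's CPU/memory that one execution of this
    -- primitive has needed historically.
    data Profile = Profile { cpuCost :: Double, memCost :: Double }

    headroom :: Profile -> NodeStats -> Double
    headroom p s = min (cpuFree s - cpuCost p) (memFree s - memCost p)

    -- Nothing means no node is predicted to fit; the framework could then
    -- queue the instruction or fall back to the least-loaded node.
    pickNode :: M.Map Primitive Profile -> M.Map NodeId NodeStats -> Primitive -> Maybe NodeId
    pickNode profiles nodes prim = do
      profile <- M.lookup prim profiles
      let scored = [ (n, headroom profile s) | (n, s) <- M.toList nodes ]
      case filter ((> 0) . snd) scored of
        []   -> Nothing
        fits -> Just (fst (maximumBy (comparing snd) fits))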
> * Let something like Cassandra handle both the heaviest parts of messaging between your node controllers and the storage of their config data. If you base your WUI on top of the DB, you can separate it from the controllers as well.
> * Coordination of resources is a variant of scheduling, which is a "solved" problem. So there should be libraries you can use.

For cluster coordination/configuration I was thinking of using Consul.

> * Logging has been worked on by many a commercial Haskeller. My guess is that filtering is just a matter of looking at one of the libraries from the right angle.

I intend to leverage existing libraries where possible. I want to create an environment in which the commercial Haskeller never has to choose and wire in a logging library. The decision is already made by the framework; they just need to insert logging statements where appropriate.

> Or just use Kubernetes. Whichever is easier. ;)

Kubernetes is a great tool, but it doesn't do what I envision.

-- Alex

Don't forget http://zipkin.io/. It's awesome. :)
Cheers,
Ben
participants (4):
- Alex
- Ben Kolera
- Markus Läll
- MarLinn