Project idea, seeking feedback

8 Nov 2017

      Hi all,

I'm seeking feedback for a project I'd like to start. I have a bit of
experience developing large scale systems containing many
microservices, databases, message queues, and caches over many VMs. Time
and time again I find myself confronted with the same problems:

1. It is difficult to trace events through the system: Consider an HTTP
request made by a customer to a public API. Which microservices were
impacted by that request? What SQL queries were run as a result of that
request? What 3rd party APIs were consulted during the request's
fulfillment? Answers to these questions are essential to fixing bugs
quickly, and yet they are so difficult to answer (at least in my
experience).

2. Problems are difficult to reproduce: When Customer Success walks in
and says, "I have an angry customer on the phone. They want to know why
[FOO] wasn't properly [BAR]" it is often impossible to give an answer
without interactive troubleshooting and hours of grepping through
unstructured log files. Troubleshooting may incur additional expenses
too, since (for instance) you may hit your API request limit for a 3rd
party service.

3. Business and non-business logic are not well encapsulated: Often I
see code related to (for example) RabbitMQ interwoven with core business
logic when calls need to be made to other microservices. The fact that
RabbitMQ facilitates communication between microservices is an
implementation detail that I shouldn't have to think about.

4. Resource consumption is non-uniform: Some microservices are more
demanding than others in terms of CPU, memory, and disk usage.
Achieving optimal "packing" is difficult. In other words, some VMs
will have a high load and others will remain idle. Auto scaling groups 
can help with this in theory, but I don't think they can achieve the 
kind of density I would like to see.
   Moreover, what constitutes a "resource"? If a 3rd party service
rate limits requests by IP address, couldn't each request be considered
a resource unit which needs to be properly load balanced, just as you
would with CPU?

Given these motivations, I would like to flesh out some ideas for a 
framework/platform which addresses these issues. These ideas are 
half-baked and may not tie in well with one another.

I envision a distributed system as follows:

1. One kind of VM:
    DevOps people have a saying: "Treat your VMs like cattle, not
pets". In practice, "cattle" becomes "cows, chickens, pigs, and
lobster". VMs typically have an assigned role, and they become part of
a group which may or may not be auto-scaling. For a given instantiation
of this hypothetical platform, I would like to see a single kind of VM.
That is, every VM is identical to every other VM, and they all run the
same Haskell application.

2. Strict separation of business and non-business logic: The framework 
should handle all aspects of communication between nodes (like Cloud 
Haskell does) in a pluggable and transparent way, but that's not all. 
The framework should have first class support for other integrations 
(such as PagerDuty alerting, performance monitoring, etc) which are
described below.

3. Pool coordination via DSL: The entire pool of VMs is
orchestrated/coordinated by one ore more "scripts" written in a DSL,
which is implemented as a Free Monad. Every single "operation" or
"primitive" in your AST data type is Serializable, and when the
framework interprets the DSL, it serializes the instruction and sends
it over the network to a node for execution. The particular node on
which the instruction gets executed is chosen by the platform, not the
developer.

4. Smart resource consumption: Each node brings with it a set of
resources. It is *not* my intention to create a system which views CPU,
memory, etc as a contiguous unit. Rather, each primitive instruction in
the AST is viewed as a "black box" which can only consume as much CPU
and memory as the node has available to it. The framework is
responsible for profiling each instruction and scheduling future
instructions to a node for which resources are predicted to be
available.
   The developer should be able to define new resources such as 3rd
party API calls, bandwidth, database connections, etc, all of which are
profiled just as CPU and memory would.

5. Browser based control panel: Engineers should have a GUI at their
disposal which allows them to watch -- in real time -- the execution
flow of the DSL script.

6. Structured logs with advanced filtering: All log output should be
structured with first class support for shipping the data to
Logstash/ElasticSearch. The aforementioned GUI should be able to
selectively filter output based on certain pre-defined predicates and
display them to the developer. For example, if you're building an email
virus scanning system (which may see millions of emails per day), you
may want to limit the real-time debugging output to only a specific
customer.

7. First class integration with modern tools and services: The system
should integrate with Consul, PagerDuty, statsd, RabbitMQ, memcache,
DataDog, Logstash, and Slack, with new integrations being easy to add.
This is vital for clean separation of business and non-business logic.
For example, the developer should be able to cache certain bits of data
at will, without having to worry about opening and managing a TCP
connection to memcache.

This is my vision, and I want to build it completely in Haskell. What
do you all think?

-- 
Alex

Alex

Markus Läll

MarLinn

Alex

Ben Kolera

tags

participants (4)