My kingdom for cheap service discovery on AWS
I've spent the last month going down the rabbit hole of service discovery options for ECS-hosted solutions on AWS. This is for a personal project consisting of microservices, which I'm in the process of building.
Goals
- Keep it cheap - since it is a personal pet project, keeping hosting costs low is a key driver.
- Minimal code changes - I'd like to not have to write specific code to handle service discovery, and have it be invisible to the application as much as possible.
- Polyglot support - One of the best things about microservices is the ability to use different technologies for different services and tie them together. Forgoing this in service discovery would limit future extensibility.
- Can replicate in dev - I like to test and build locally using docker - so the solution needs to be either supported in docker locally or I should be able to emulate it for local dev purposes.
- Works on ECS/EC2 - since it's being hosted on ECS/EC2, the solution needs to be supported there.
- Runs on a single host - again for local development, ideally it would be something that still works when I'm running on a single host, i.e. my laptop.
Some exclusions off the bat
- Why not ECS Fargate? - Fargate charges per vCPU and GB per hour. The cheapest vCPU option in the CA-Central region is about $8/month per task, excluding data transfer. With just 5 microservices that's about $40/month before data transfer. For an unlaunched startup, this cost is too much.
- Why not Serverless? - good question - again it comes down to being able to code and test offline, then deploy when ready. I feel that at this time serverless architectures are quite specific to individual cloud providers. I also started writing this thing a while ago using a Node.js server model and didn't want to go back and refactor the whole thing.
General Approaches
There are some common approaches out there
- Custom central API - services register with a centralized service that polls service health. Requires custom code and exposes a single point of failure.
- DNS A Records - a DNS A record is registered for each service. Port numbers need to be known ahead of time. This is ideal as it is handled at the infrastructure level and keeps services 'dumb' from a code perspective.
- ECS supports this, but only for containers running in AWSVPC network mode. That requires an ENI, and there are limits to the number of ENIs per EC2 instance. So the more services you register, the bigger the EC2 instance needs to be, which drives up cost.
- DNS SRV Records - a DNS SRV record (host and port) is registered for each service. Some additional code is required to query SRV records, and some sort of agent that inspects the containers and registers them with DNS is also required.
- ECS supports this using their provided service discovery and Route 53 entries.
- Replicating this locally for docker requires an agent that queries the containers and records SRV records in a local DNS provider. The only way to do this currently with docker is to have something querying the /var/run/docker.sock interface - which poses a bit of a security risk. Still, it is workable for replicating service discovery in development.
- Consul / etcd / Zookeeper - these run one or more master nodes that manage service discovery, plus agents on containers that report to them. This seems to be the general approach for large microservice meshes.
- Requires multiple EC2 instances - at least one for the master node and one for the application node.
AWS Specific approaches
- AWS ALB - application load balancing. Costs can be high because each service needs its own ALB, and in AWS that per-ALB charge drives up the cost quickly.
- AWS AppMesh - the newest offering from AWS, which reached General Availability in March 2019.
- From my testing there are a few current limitations - it requires AWSVPC networking mode for containers, and a NAT Gateway so that the Envoy (proxy) sidecar containers can access the region's management server. This drives up costs.
- This is a very new service so I'm hopeful that they will address some of this in the near future.
Findings
As can be seen below, there is no real silver bullet - not if you want to keep things cheap and simple.
| | Minimal code dependency | Polyglot | Replicate in local docker | ECS/EC2 Supported | Single Host |
| --- | --- | --- | --- | --- | --- |
| Custom Central API | No | Yes | Yes | Yes | Yes |
| DNS A Records | Yes | Yes | Yes | No | Yes |
| DNS SRV Records | No | Yes | Yes | Yes | Yes |
| SenecaJS | No | No | Yes | Yes | Yes |
| CoteJS | No | No | Yes | Yes | Yes |
| Consul / Etcd / Zookeeper | No | Yes | Yes | Yes | No |
| ALB | Yes | Yes | No | Yes | No |
| AWS AppMesh | Yes | Yes | No | Yes (Cost) | No |
Other approaches (to be investigated)
- Istio - this is a service mesh solution that AWS AppMesh appears to be based on.
- Envoy - the reverse proxy used by AppMesh and Istio - it appears to support service discovery itself, though it may need a separate master node.
- Traefik - Traefik acts as a reverse proxy that requires the docker host interface to work - it may be possible to use it as a reverse proxy for services - however, once multiple hosts are involved, it likely requires a central service repository. (It may be something to use in dev.)
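As an illustration of the dev-only Traefik idea, a minimal docker-compose sketch (service names are placeholders, labels follow Traefik v1.7 conventions, and note it mounts the same docker.sock interface discussed above):

```yaml
version: "3"
services:
  traefik:
    image: traefik:1.7
    command: --docker
    ports:
      - "80:80"
    volumes:
      # Traefik watches the docker socket to auto-discover containers
      - /var/run/docker.sock:/var/run/docker.sock
  whoami:
    image: containous/whoami
    labels:
      # Route requests for this host to the whoami container
      - "traefik.frontend.rule=Host:whoami.localhost"
      - "traefik.port=80"
```

With this running, `curl -H "Host: whoami.localhost" http://localhost` would be routed to the `whoami` container without the caller knowing its port - no discovery code in the services themselves.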
Conclusion
Docker Swarm seems like the logical thing to do at this point. This is making me question why I would even host it on AWS - why not just do this in a cheaper, more generic cloud provider such as DigitalOcean?
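For reference, Swarm's answer to service discovery is built-in DNS on an overlay network - a sketch of the commands (service names and images are placeholders):

```shell
# Turn this single host (a laptop, or one cheap cloud VM) into a swarm
docker swarm init

# Create an overlay network for the services to share
docker network create --driver overlay --attachable appnet

# Each service becomes resolvable on the network by its service name
docker service create --name api --network appnet my-api-image
docker service create --name web --network appnet my-web-image

# Inside any container on appnet, "api" now resolves via Swarm's internal
# DNS (e.g. http://api:3000) - no service discovery code required.
```

This ticks the minimal-code, polyglot, local-replication, and single-host boxes in the table above - it just isn't an AWS-managed service.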