Knowledge Transfer

Ethickfox kb page with all notes



Reliable, Maintainable and Scalable Applications

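An application has to meet various requirements in order to be useful. There are functional requirements (what the application should do) and nonfunctional requirements (general properties such as reliability, scalability, and maintainability).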

Reliability

If you have an application-managed caching layer (using Memcached or similar), or a full-text search server (such as Elasticsearch or Solr) separate from your main database, it is normally the application code’s responsibility to keep those caches and indexes in sync with the main database.

The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).

Fault Tolerance

The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient.

A fault is not the same as a failure. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.

It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures.
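As one illustration of a mechanism that keeps a fault from becoming a failure, here is a minimal retry sketch in Python. The names `call_with_retries` and `FlakyComponent` are hypothetical and exist only for this example; retrying is just one of many fault-tolerance techniques and only helps with transient faults.

```python
import time


def call_with_retries(operation, attempts=3, delay_seconds=0.0):
    """Run operation(), retrying when it raises; re-raise after the last attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:  # a real system would catch narrower error types
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    raise last_error


class FlakyComponent:
    """Hypothetical component that faults on its first `faults` calls."""

    def __init__(self, faults):
        self.remaining_faults = faults
        self.calls = 0

    def read(self):
        self.calls += 1
        if self.remaining_faults > 0:
            self.remaining_faults -= 1
            raise ConnectionError("transient fault")
        return "ok"
```

With two transient faults and three attempts, the caller still receives "ok": the component faulted, but the system as a whole did not fail. With more faults than attempts, the error reaches the caller and the fault becomes a failure.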

Set up detailed and clear monitoring, such as performance metrics and error rates. In other engineering disciplines this is referred to as telemetry. Monitoring can show us early warning signals and allow us to check whether any assumptions or constraints are being violated. When a problem occurs, metrics can be invaluable in diagnosing the issue.
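A minimal sketch of an error-rate metric, assuming a fixed-size sliding window over the most recent requests. `ErrorRateMonitor`, the window size, and the alert threshold are illustrative choices, not part of any particular monitoring tool.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the error rate over the most recent `window_size` requests."""

    def __init__(self, window_size=100, threshold=0.05):
        # deque(maxlen=...) silently drops the oldest outcome when full
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, succeeded):
        self.outcomes.append(bool(succeeded))

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes)

    def should_alert(self):
        # Early warning signal: the recent error rate exceeds the threshold
        return self.error_rate() > self.threshold
```

In practice such counters would usually be exported to a metrics system rather than kept in process memory; the sketch only shows the calculation.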

Scalability

As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.

Scalability is the term we use to describe a system’s ability to cope with increased load.

Example

Simply handling 12,000 writes per second (the peak rate for posting tweets) would be fairly easy. However, Twitter’s scaling challenge is not primarily due to tweet volume but due to fan-out: each user follows many people, and each user is followed by many people.

There are broadly two ways of implementing these two operations (posting a tweet and reading the home timeline):

Approach 1: Posting a tweet simply inserts the new tweet into a global collection of tweets. When a user requests their home timeline, look up all the people they follow, find all the tweets for each of those users, and merge them (sorted by time). In a relational database, this could be written as a query such as:

SELECT tweets.*, users.*
FROM tweets
  JOIN users   ON tweets.sender_id    = users.id
  JOIN follows ON follows.followee_id = users.id
WHERE follows.follower_id = current_user

Approach 2: Maintain a cache for each user’s home timeline. When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches. The request to read the home timeline is then cheap, because its result has been computed ahead of time.
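The two approaches can be contrasted in a small in-memory sketch. `TimelineService` and its method names are hypothetical, the data structures stand in for a real database and cache, and this is not how Twitter actually implements it.

```python
from collections import defaultdict


class TimelineService:
    """Toy model of both home-timeline approaches, held entirely in memory."""

    def __init__(self):
        self.tweets = []                         # global collection: (timestamp, sender, text)
        self.following = defaultdict(set)        # follower -> set of followees
        self.followers = defaultdict(set)        # followee -> set of followers
        self.timeline_cache = defaultdict(list)  # user -> precomputed home timeline

    def follow(self, follower, followee):
        self.following[follower].add(followee)
        self.followers[followee].add(follower)

    def post_tweet(self, sender, text, timestamp):
        tweet = (timestamp, sender, text)
        # Approach 1: a single cheap insert into the global collection
        self.tweets.append(tweet)
        # Approach 2: fan out on write, one cache insert per follower
        for follower in self.followers[sender]:
            self.timeline_cache[follower].append(tweet)

    def home_timeline_on_read(self, user):
        # Approach 1: find and merge followees' tweets at read time (newest first)
        followees = self.following[user]
        return sorted((t for t in self.tweets if t[1] in followees), reverse=True)

    def home_timeline_from_cache(self, user):
        # Approach 2: the result was already assembled at write time
        return sorted(self.timeline_cache[user], reverse=True)
```

One simplification to note: the cache only receives tweets posted after a follow relationship exists, so the two reads agree only when users follow before tweets are posted.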

Approach 2 works better because the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads, and so in this case it’s preferable to do more work at write time and less at read time. The downside of approach 2 is that posting a tweet now requires a lot of extra work. On average, a tweet is delivered to about 75 followers, so the average posting rate of 4.6k tweets per second becomes 345k writes per second to the home timeline caches.
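The write amplification is a simple product of the two figures quoted above. The function name below is made up for illustration.

```python
def timeline_cache_writes_per_second(tweets_per_second, average_followers):
    """Each posted tweet causes one cache write per follower."""
    return tweets_per_second * average_followers


# Figures from the text: 4.6k tweets/s on average, about 75 followers per tweet
cache_writes = timeline_cache_writes_per_second(4_600, 75)
print(cache_writes)  # 345000
```

Because this uses the average follower count, it hides the distribution: accounts with far more followers than average generate proportionally more cache writes per tweet.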

Load investigation

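Once you have described the load on your system, you can investigate what happens when the load increases. You can look at it in two ways: when you increase a load parameter and keep the system resources unchanged, how is performance affected? And when you increase a load parameter, how much do you need to increase the resources to keep performance unchanged?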

Latency and Response time

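Latency and response time are often used synonymously, but they are not the same. The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays. Latency is the duration that a request spends waiting to be handled.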

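Usually it is better to use percentiles than the mean, because the mean does not tell you how many users actually experienced a given delay. If you take your list of response times and sort it from fastest to slowest, then the median (the 50th percentile) is the halfway point: half of the requests are faster than it and half are slower. High percentiles of response times, also known as tail latencies, are important because they directly affect users’ experience of the service.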
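A minimal sketch of computing percentiles from a list of measured response times, using the nearest-rank method (one of several common percentile definitions). The sample values are invented for illustration.

```python
import math


def percentile(response_times_ms, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not response_times_ms:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(response_times_ms)          # fastest to slowest
    rank = math.ceil(p * len(ordered) / 100)     # 1-based rank
    return ordered[rank - 1]


# Hypothetical samples: mostly fast, with two slow outliers
samples_ms = [12, 15, 11, 14, 13, 250, 16, 12, 900, 13]
median_ms = percentile(samples_ms, 50)   # 13
p95_ms = percentile(samples_ms, 95)      # 900
mean_ms = sum(samples_ms) / len(samples_ms)  # 125.6
```

Here the mean (125.6 ms) describes almost no actual request: the typical request takes about 13 ms, while the slowest ones take hundreds of milliseconds. The median and the high percentiles show both facts separately.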

Even if only a small percentage of backend calls are slow, the chance of getting a slow call increases if an end-user request requires multiple backend calls, and so a higher proportion of end-user requests end up being slow. This effect is known as tail latency amplification.
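The effect can be quantified under a simplifying assumption: each backend call is slow independently with the same probability, and the end-user request is slow if any one of its calls is slow. Real backends are often correlated, so treat this as a rough sketch; the function name is hypothetical.

```python
def probability_request_is_slow(slow_call_fraction, backend_calls):
    """P(at least one of n independent backend calls is slow) = 1 - (1 - p)^n."""
    if not 0 <= slow_call_fraction <= 1:
        raise ValueError("slow_call_fraction must be between 0 and 1")
    if backend_calls < 0:
        raise ValueError("backend_calls must be non-negative")
    return 1 - (1 - slow_call_fraction) ** backend_calls


# With 1% of backend calls slow, compare requests that need 1, 10, and 100 calls
for calls in (1, 10, 100):
    print(calls, round(probability_request_is_slow(0.01, calls), 3))
```

Under these assumptions, a request that fans out to 100 backend calls is slow roughly 63% of the time even though only 1% of individual calls are slow.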

Maintainability

Over time, many different people will work on the system, and they should all be able to work on it productively.

It is well known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance.

Operability

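Make it easy for operations teams to keep the system running smoothly. A good operations team is typically responsible for tasks such as monitoring the health of the system and quickly restoring service when it degrades, tracking down the cause of problems, keeping software and platforms up to date, and handling deployment and configuration management.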

Good operability means making routine tasks easy, allowing the operations team to focus their efforts on high-value activities.

Simplicity

Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system.

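There are various possible symptoms of complexity, such as an explosion of the state space, tight coupling of modules, tangled dependencies, inconsistent naming and terminology, and special-casing to work around issues elsewhere.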

When complexity makes maintenance hard, budgets and schedules are often overrun. In complex software, there is also a greater risk of introducing bugs when making a change: when the system is harder for developers to understand and reason about, hidden assumptions, unintended consequences, and unexpected interactions are more easily overlooked.

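Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing accidental complexity, that is, complexity that arises from the implementation rather than from the problem the software solves.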

Evolvability

Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. This property is also known as extensibility, modifiability, or plasticity.