Task:
Assemble a working outline or template from which design decisions can be made.
- Core Summary (5 mins - hrs)
- High-level
- What is it for? What need does it meet?
- Where is it needed?
- Why is it needed?
- Feature Expectations: (5 mins - 1 hr)
- Use cases
- Use cases/scenarios not covered by this app
- Who will use?
- How many will use?
- Usage Patterns (times and methods)
- Roles or types of users (readers/consumers, content creators, approvers, etc.)
- Estimations (15 mins - hrs)
- Throughput of NICs Receive/Send Ratio
- Throughput by Queries/second (QPS)
- Throughput by Size, Upload and Download, of those queries/second (MB/s or GB/s)
- Throughput Read/Write Ratio
- Write (QPS, volume of data)
- Read (QPS, volume of data)
- Latency
- Expected latency for read/write queries
- Storage Estimates
- Memory Estimates
- If a cache, how much cached in memory?
- Is horizontal scaling an option?
- If a disk cache, why and how much to store?
- Types of disks/SSDs needed based on answers above
- CPU Estimates
- Number of CPUs/node/VM
- Number of CPUs total across nodes/VMs (horizontal scaling)
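The estimation items above can be turned into a quick back-of-envelope calculation. The following is a minimal sketch, not a sizing tool: the function name `estimate`, its parameters, and every input number are hypothetical placeholders to be replaced with figures from the Feature Expectations step.

```python
# Back-of-envelope capacity estimate; all inputs below are hypothetical.
SECONDS_PER_DAY = 86_400


def estimate(daily_active_users, requests_per_user_per_day, read_write_ratio,
             avg_payload_bytes, retention_days, peak_factor=2.0):
    """Return rough average/peak QPS, write bandwidth, and storage."""
    total_requests = daily_active_users * requests_per_user_per_day
    avg_qps = total_requests / SECONDS_PER_DAY
    # read_write_ratio = reads per write (e.g. 10 means 10 reads per write).
    write_qps = avg_qps / (read_write_ratio + 1)
    read_qps = avg_qps - write_qps
    write_bytes_per_sec = write_qps * avg_payload_bytes
    # Storage assumes every write is retained for the full retention window.
    storage_bytes = write_bytes_per_sec * SECONDS_PER_DAY * retention_days
    return {
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_factor,
        "read_qps": read_qps,
        "write_qps": write_qps,
        "write_mb_per_sec": write_bytes_per_sec / 1e6,
        "storage_tb": storage_bytes / 1e12,
    }


result = estimate(
    daily_active_users=1_000_000,
    requests_per_user_per_day=20,
    read_write_ratio=10,
    avg_payload_bytes=2_000,
    retention_days=365,
)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

The peak factor is a guess until real usage patterns are known; memory and CPU estimates can be derived from the same QPS figures once per-request costs are measured.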
- Design Performance Goals (15 mins)
- Latency and Throughput minimum/maximum requirements for scaling
- Consistency vs. Availability
- Weak/strong --> eventual consistency
- Failover/replication --> availability
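To make the availability side of the trade-off concrete, the sketch below combines component availabilities in sequence and in parallel (failover or replication). It assumes component failures are independent, which real systems often violate; the function names and the 99.9% figures are illustrative only.

```python
def serial_availability(*components):
    """Availability when every component in the chain must be up."""
    total = 1.0
    for availability in components:
        total *= availability
    return total


def parallel_availability(*replicas):
    """Availability when any one replica is enough (failover/replication)."""
    all_down = 1.0
    for availability in replicas:
        all_down *= 1.0 - availability
    return 1.0 - all_down


def downtime_minutes_per_year(availability):
    """Expected downtime per 365-day year for a given availability."""
    return (1.0 - availability) * 365 * 24 * 60


# Two 99.9% components in sequence are less available than either alone;
# two 99.9% replicas in parallel are more available.
chain = serial_availability(0.999, 0.999)
pair = parallel_availability(0.999, 0.999)
print(f"serial: {chain:.6f}, parallel: {pair:.6f}")
print(f"downtime at 99.9%: {downtime_minutes_per_year(0.999):.1f} min/year")
```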
- High-Level Design (1 hr - hrs)
- APIs for CRUD scenarios (read, create/write, update, delete) for main design elements (e.g. doc forms, views, reports, metrics, static content vs dynamic, etc.)
- Containerization Platform Considerations
- Database Schema
- If performing Data Normalization/Machine Learning, design algorithms
- Divide and Conquer (independent smaller subproblems) or Dynamic Programming (overlapping subproblems whose results are reused)
- Streaming Algorithms when data too big to fit all in memory or data is a continuous feed
- Hashing and Indexing Algorithms for large data lookups and insertions
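As one example of a streaming algorithm for data normalization, the sketch below computes a running mean and standard deviation in a single pass with constant memory, using Welford's algorithm. This is an illustrative choice, not a technique prescribed by this template; the class name `RunningStats` is hypothetical.

```python
import math


class RunningStats:
    """Single-pass mean/variance (Welford's algorithm) in O(1) memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance; 0.0 until at least one value has been seen.
        return self._m2 / self.count if self.count else 0.0

    @property
    def stddev(self):
        return math.sqrt(self.variance)


stats = RunningStats()
for value in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
    stats.update(value)  # values could equally arrive from a continuous feed
print(stats.mean, stats.stddev)
```

Each value is seen once and then discarded, so the same code works whether the input is a file too large for memory or an unbounded feed.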
- High-level design (workflow process) for primary read scenario
- High-level design (workflow process) for primary read/write scenario
- High-level design (workflow process) for data management roles (e.g. approvers, data librarians, analysts performing searches for discoveries, etc.)
- Data archiving policies and where
- Deeper-Dive Design (1 hr - hrs)
- Scaling Code
- Algorithms
- App code base(s) ensuring they can scale horizontally/asymmetrically and vertically
- Scaling App Components for Code
- Availability, Consistency, for each App Component
- Retries, Observability, and Reliability
- Patterns of above, across all or sections of the App Components
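For the retries item, a common pattern is exponential backoff with jitter. The sketch below is a minimal version under stated assumptions: the helper name `retry`, its defaults, and the `flaky` operation are hypothetical, and a production version would usually retry only on specific transient error types.

```python
import random
import time


def retry(operation, max_attempts=5, base_delay=0.1, max_delay=2.0,
          sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time up to the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


calls = {"n": 0}


def flaky():
    """Hypothetical operation that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


slept = []  # record delays instead of actually sleeping
outcome = retry(flaky, sleep=slept.append)
print(outcome, calls["n"], len(slept))
```

Injecting `sleep` keeps the example fast to run and makes the delays observable, which also helps when wiring retries into metrics.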
- App Components to Cover
- DNS (internal and external)
- CDN (Push vs. Pull)
- Load Balancers (Active-Passive, Active-Active, Layer 4, Layer 7)
- Reverse Proxy
- Application Layer Scaling (Microservices, Service Discovery)
- Database options:
- RDBMS: ACID Properties, Primary-Secondary, Primary-Primary, Federation, Sharding, Denormalization, SQL Tuning - Postgres
- Use-cases: Structured data with relationships
- Index Scaling
- NoSQL: Key-Value, Wide-Column, Document - Domino NSF, MongoDB, DynamoDB
- Use-cases: Unstructured or semi-structured data
- Graph: Neo4j, Amazon Neptune
- Use-cases: Social networks, knowledge graphs, recommendation systems, and bioinformatics
- NewSQL: Distributed SQL with ACID Properties - CockroachDB, Google Spanner, VoltDB
- Use-cases: Transaction processing, real-time analytics and IoT device data
- Time Series: Time-stamped data points - InfluxDB, TimescaleDB, Prometheus
- Use-cases: IoT sensor data, financial market data, system metrics, and logs
- High-dimensional Vector Data - Pinecone, Weaviate, KDB.AI
- Use-cases: Machine learning, similarity search, and recommendation systems
- Fast lookups:
- RAM (Bounded size) => Redis, Memcached.
- AP (Unbounded size) => Cassandra, Riak, Voldemort, DynamoDB (default mode)
- CP (Unbounded size) => HBase, MongoDB, Couchbase, DynamoDB (consistent read setting)
- Caches:
- Cache Types:
- Client caching
- CDN caching
- Web server caching
- Database caching
- Application caching
- Query level caching
- Object level caching
- Cache Update Strategies:
- Cache aside
- Write through
- Write behind
- Refresh ahead
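Cache aside and write through, two of the items listed above, differ mainly in what happens on a write. The sketch below contrasts them with an in-memory dict standing in for the database; the class `CacheAsideStore` and its method names are hypothetical, and it omits expiry, eviction, and concurrency control.

```python
class CacheAsideStore:
    """Cache-aside reads with two write options over a backing store."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # stand-in for a database (a dict)
        self.cache = {}
        self.misses = 0

    def read(self, key):
        # Cache aside: check the cache first, load from the store on a miss.
        if key in self.cache:
            return self.cache[key]
        self.misses += 1
        value = self.backing_store[key]
        self.cache[key] = value
        return value

    def write_invalidate(self, key, value):
        # Cache-aside write: update the store, drop the stale cache entry.
        self.backing_store[key] = value
        self.cache.pop(key, None)

    def write_through(self, key, value):
        # Write through: update the store and the cache together.
        self.backing_store[key] = value
        self.cache[key] = value


store = CacheAsideStore({"user:1": "Ada"})
first = store.read("user:1")   # miss, loaded from the backing store
second = store.read("user:1")  # hit, served from the cache
print(first, second, store.misses)
```

Write behind and refresh ahead need a background worker or scheduler and are left out of this sketch.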
- Asynchronism:
- Message queues
- Task queues
- Back pressure
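Back pressure can be illustrated with a bounded queue that rejects work when full instead of growing without limit. This is a minimal single-process sketch using Python's standard `queue` module; the `submit` helper and the queue size of 2 are hypothetical, and a real service might answer a rejected request with HTTP 503 and a retry hint.

```python
import queue


def submit(task_queue, task):
    """Try to enqueue a task; return False to signal back pressure."""
    try:
        task_queue.put_nowait(task)
        return True
    except queue.Full:
        # The queue is at capacity: tell the caller to slow down or retry.
        return False


tasks = queue.Queue(maxsize=2)  # bounded: at most 2 pending tasks
accepted = [submit(tasks, n) for n in range(3)]
print(accepted)

tasks.get_nowait()               # a worker drains one task...
accepted_later = submit(tasks, 99)  # ...so there is room again
print(accepted_later)
```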
- Network Communication:
- Client to Server, Server to Server Communication Protocols:
- TCP - REST/API
- TCP - RPC
- TCP - WebSockets
- Justify (15 mins)
- Throughput of Each Layer
- Latency Caused Between Each Layer
- Overall Latency Justification
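The justification step can be sketched as a small check: sum the per-layer latencies and find the layer with the lowest throughput. The layer names and numbers below are hypothetical, and the sum assumes every request passes through each layer once, in sequence, with no cache short-circuit.

```python
# Hypothetical per-layer figures: (latency in ms, maximum sustainable QPS).
layers = {
    "load_balancer": (1.0, 50_000),
    "app_server": (20.0, 5_000),
    "cache": (2.0, 100_000),
    "database": (15.0, 3_000),
}


def justify(layers, latency_budget_ms, required_qps):
    """Compare summed latency and bottleneck throughput against the goals."""
    total_latency = sum(latency for latency, _ in layers.values())
    # The slowest layer caps the throughput of the whole request path.
    bottleneck = min(layers, key=lambda name: layers[name][1])
    max_qps = layers[bottleneck][1]
    return {
        "total_latency_ms": total_latency,
        "bottleneck": bottleneck,
        "max_qps": max_qps,
        "meets_latency": total_latency <= latency_budget_ms,
        "meets_throughput": max_qps >= required_qps,
    }


report = justify(layers, latency_budget_ms=50.0, required_qps=2_000)
print(report)
```

If either check fails, the bottleneck layer is the first candidate for scaling or caching.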
- Key Metrics to Measure (15 mins - 1 hr)
- Identify Key Metrics relevant to your system's design:
- Availability
- Latency
- Throttling
- Request Patterns/Volume
- Measure Customer Experience
- App Component/Feature Specific Metrics
- Search - What keyword searches = empty (failure from the user/customer perspective)
- Define metrics for infrastructure and tools/resources:
- Grafana with Prometheus
- AppDynamics
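Availability and latency metrics are usually derived from request records. The sketch below computes a success ratio and nearest-rank latency percentiles from an in-memory list; the record layout and function names are hypothetical, and tools such as Prometheus compute these from histograms rather than raw samples.

```python
def availability(records):
    """Fraction of requests that succeeded; records are (latency_ms, ok)."""
    if not records:
        return 0.0
    return sum(1 for _, ok in records if ok) / len(records)


def percentile(latencies_ms, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(latencies_ms)
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical sample: 100 requests with latencies 1..100 ms, 2 of them failed.
records = [(float(ms), ms not in (50, 100)) for ms in range(1, 101)]
latencies = [ms for ms, _ in records]

print(f"availability: {availability(records):.2%}")
print(f"p50: {percentile(latencies, 50)} ms, p99: {percentile(latencies, 99)} ms")
```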
- System Health Monitoring (15 mins - 1 hr)
- Measure app index and latency of microservices:
- Monitoring health and performance:
- Grafana with Prometheus
- AppDynamics
- Simulate Customer Experience:
- Canaries - Pro-active detection of service degradation
- Log Systems (15 mins - 1 hr)
- Implement metrics gathering and visualization dashboards
- Implement Log Collection and Analysis:
- Elasticsearch, Logstash, Kibana (ELK)
- Splunk
- Logtail
- Security (15 mins - 1 hr)
- Firewall
- TLS encryption in transit
- Data encryption at rest
- Authentication / Authorization
- Limited Egress/Ingress Rules
- Implementation of Least Privilege for Roles