Surviving Success: Architecting Web Sites and Services for Rapid Growth

Slide Content
  1. 2-642

    Slide 1 - 2-642

    • Surviving Success: Architecting Web Sites and Services for Rapid Growth
    • Mark Simms (@mabsimms)
    • Principal Group Program Manager
    • //build/ content is being presented by Microsoft Office Mix. The video for this session will be available shortly.
  2. Designing sites and services that can survive rapid growth and demand spikes requires careful design and architecture choices

    Slide 2 - Designing sites and services that can survive rapid growth and demand spikes requires careful design and architecture choices

    • In this session we’ll explore preparing a web site for growth, then handling extreme success when it suddenly arrives
    • Microsoft Azure Customer Advisory Team (AzureCAT)
    • Works with internal and external customers to build out some of the largest applications on Azure
    • Get our hands dirty on all aspects of delivery: design, implementation and, all too often, firefighting
    • Setting the Stage
    • This is meant to be an interactive discussion – if you don’t ask questions, we will!
    • Note: please use the mic.
    • This session will be an exploration of the journey to scale for a mostly fake web site and service
  3. This will not be a discussion of features

    Slide 3 - This will not be a discussion of features

    • Focus on the journey, design and architecture choices and their impact on scalability and availability
    • Challenge: how to leave the door open for success, without “overbuilding” in advance
    • Warning: there will be anonymized or mashed up customer stories, code mockery and a high incidence of sarcasm.
    • Agenda and Expectations
  4. Today has been a complete debacle.  I do not quite understand how a website can crash almost immediately upon receiving traffic.

    Slide 4 - Today has been a complete debacle. I do not quite understand how a website can crash almost immediately upon receiving traffic.

    • 30k RPS
    • 1 IaaS VM
    • File-based SQL CE
  5. Slide 5

    • AzureCAT: Framing a Customer Discussion
    • Scalability. The ability to add additional capacity to the service to handle increases in load and demand, together with efficient and effective use of resources allocated.
    • Availability.
    • Manageability.
    • Feasibility.
  6. Slide 6

    • AzureCAT: Framing a Customer Discussion
    • Scalability.
    • Availability. The ability of the solution to continue to provide value in the face of transient and enduring faults in the application and underlying service dependencies.
    • Manageability.
    • Feasibility.
  7. Slide 7

    • AzureCAT: Framing a Customer Discussion
    • Scalability.
    • Availability.
    • Manageability. The ability to understand the health and performance of the live system and manage site operations.
    • Feasibility.
  8. Slide 8

    • AzureCAT: Framing a Customer Discussion
    • Scalability.
    • Availability.
    • Manageability.
    • Feasibility. The ability to deliver and maintain the system, on time (ish) and under budget (ish).
  9. Adding more Capacity

    Slide 9 - Adding more Capacity

    • Identifying and breaking contention and choke points
    • How to add additional capacity to a solution?
    • There are subtle constraints to consider...
    • Using Capacity more Efficiently
    • Traditional performance tuning
    • Avoiding common anti-patterns traditionally hidden by capitalized infrastructure
    • Identify unbalanced workloads (read vs. write)
    • Scalability == Capacity * Density
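As a rough illustration of "Scalability == Capacity * Density", the sketch below treats capacity as the number of instances and density as the requests per second each instance can sustain. The function name, the headroom parameter and the per-instance figures are hypothetical; only the 30k RPS figure comes from the deck.

```python
import math


def instances_needed(target_rps: float, rps_per_instance: float, headroom: float = 0.0) -> int:
    """Return how many instances are needed to serve target_rps.

    headroom is the fraction of each instance kept in reserve (0.25 = 25%).
    """
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    usable = rps_per_instance * (1.0 - headroom)
    return math.ceil(target_rps / usable)


# Doubling density halves the capacity you have to buy for the same load.
print(instances_needed(30_000, 500))    # 60
print(instances_needed(30_000, 1_000))  # 30
```

The point of the arithmetic is that the same target can be met either by adding instances or by making each instance do more work.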
  10. What type of application or service are you building?

    Slide 10 - What type of application or service are you building?

    • What proportion of your budget is allocated for non-functional service fundamentals?
    • If you are designing a multi-release platform, and this number is zero…
    • Design Considerations
  11. Balance – You Ain’t Gonna Need It (YAGNI) with Oh, You Did Need That (OYDNT?)

    Slide 11 - Balance – You Ain’t Gonna Need It (YAGNI) with Oh, You Did Need That (OYDNT?)

    • Balance – testing before production vs. testing in production
    • Hint: you’re always testing in production.
    • Design Considerations
  12. Web site and Mobile Application for tracking sports games

    Slide 12 - Web site and Mobile Application for tracking sports games

    • Burst mode during “events”. Application experiences high and unpredictable load during scheduled events.
    • Viral growth potential. Have to be ready for rapid adoption (triggers – social inflection point, successful ad?).
    • Design Scenario
  13. Workload Decomposition

    Slide 13 - Workload Decomposition

    • Key to design choices is understanding the inherent workloads in the system
    • Pay careful attention to state and consistency
    • Workload: characterization
    • List of events: read workload; (mostly) scheduled updates; minimal consistency requirements (minutes)
    • Status of active event: read workload, continuous concurrent updates during events; ~ 3-5 second read consistency
    • Status of active event (mobile): read workload, continuous concurrent updates during events; ~ 3-5 second read consistency; push notification on “interesting” update
  14. Slide 14

    • “Classic” 3-tier enterprise relational design
    • Three VM configuration:
    • IIS VM
    • App tier VM
    • SQL Server VM
    • Software stack:
    • ASP.NET, WebApi, Entity Framework
    • .NET 4.5
    • Stage 1 – It Works Great on my Dev Box
  15. Capacity.  Challenging to add additional capacity to front-end (need to manually configure VMs, deploy config+software, integrate with load balancer).

    Slide 15 - Capacity. Challenging to add additional capacity to front-end (need to manually configure VMs, deploy config+software, integrate with load balancer).

    • Density. Application tier adds latency. VMs are tuned by default to protect the VM, not the application. Unbalanced workloads (read/write) use a single store (SQL).
    • Stage 1 – It Works Great on my Dev Box (Scale)
  16. Failure points.  Everything is a single point of failure 

    Slide 16 - Failure points. Everything is a single point of failure 

    • Stage 1 – It Works Great on my Dev Box (Availability)
  17. Operational insight and visibility.  See next slide.

    Slide 17 - Operational insight and visibility. See next slide.

    • Stage 1 – It Works Great on my Dev Box (Manageability)
  18. Operational Visibility

    Slide 18 - Operational Visibility

    • This slide left intentionally blank, as this is probably your operational monitoring experience
  19. Limited resources (time, people and money)

    Slide 19 - Limited resources (time, people and money)

    • How to prioritize?
    • Frame the door – enable deployment of additional resources and capacity
    • Turn the lights on – operational visibility
    • Use data to drive investment – system response under load
    • Mapping the First Stage of the Journey
  20. Need a higher semantic level for components – make it easy to trade $$ for capacity

    Slide 20 - Need a higher semantic level for components – make it easy to trade $$ for capacity

    • Money is always faster to spend than engineering time.
    • Migrate / rehost application components to their PaaS equivalents
    • IIS VMs (front end / mid tier) -> Web Apps (formerly Web Sites)
    • SQL Server -> Azure SQL DB (we’ll get to sizing based on data in a bit)
    • Frame the Door – Enable “Adding”
  21. Psychic debugging is not a recipe for success

    Slide 21 - Psychic debugging is not a recipe for success

    • Rent your way to victory (through insight and data)
    • Evaluate options against your workload, “test for ergonomics”
    • You may need more logging/diag later – prove the need with data
    • Turn the Lights On – Data Driven Engineering
  22. Every system has a breaking strain – find it or your users will

    Slide 22 - Every system has a breaking strain – find it or your users will

    • Use to-destruction load testing to determine the stress curve of the system
    • Do you need to do performance optimization, or can you simply throw more resources at the problem?
    • If you need to optimize, can you target specific improvements?
    • Evaluate Current State – System Response
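A minimal sketch of to-destruction load testing: ramp the offered load in steps and report the first level at which latency exceeds a limit. The function names are hypothetical, and `simulated_latency_ms` is a toy stand-in for a real load-test measurement, not data from any real system.

```python
from typing import Callable, Optional


def find_breaking_point(measure_latency_ms: Callable[[int], float],
                        start_rps: int, step_rps: int, max_rps: int,
                        latency_limit_ms: float) -> Optional[int]:
    """Ramp load upward; return the first RPS whose latency exceeds the limit.

    Returns None if the limit is never exceeded up to max_rps.
    """
    rps = start_rps
    while rps <= max_rps:
        if measure_latency_ms(rps) > latency_limit_ms:
            return rps
        rps += step_rps
    return None


def simulated_latency_ms(rps: int, saturation_rps: int = 1_000, base_ms: float = 20.0) -> float:
    """Toy curve: latency climbs sharply as load approaches saturation."""
    if rps >= saturation_rps:
        return float("inf")
    return base_ms / (1.0 - rps / saturation_rps)


print(find_breaking_point(simulated_latency_ms, 100, 100, 2_000, 150.0))  # 900
```

In a real test the measurement callable would drive load against the system and return an observed latency; the shape of the resulting curve is what tells you whether to optimize or simply add resources.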
  23. Stage 2 – Baseline Established

    Slide 23 - Stage 2 – Baseline Established

    • Use insight against live system to understand load profile
    • Pay attention to the “this looks weird”. Those are hindsight moments waiting to happen.
    • Once you have telemetry, you have to look at it. Seriously.
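One simple way to "look at" telemetry is to reduce raw request latencies to a few percentiles, since averages hide the slow tail. The sketch below uses the nearest-rank method; the function names are hypothetical and the sample values are synthetic, generated only to show the call.

```python
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) using integers
    return ordered[rank - 1]


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    """Return the median and tail latencies for a batch of samples."""
    return {name: percentile(samples, p) for name, p in (("p50", 50), ("p95", 95), ("p99", 99))}


# Synthetic latencies of 1..100 ms, purely for illustration.
print(summarize(list(range(1, 101))))  # {'p50': 50, 'p95': 95, 'p99': 99}
```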
  24. Limited resources (time, people and money)

    Slide 24 - Limited resources (time, people and money)

    • How to prioritize?
    • Identify availability points and mitigations
    • Identify scale bottlenecks and contention points
    • Mapping the Second Stage of the Journey
  25. Slide 25

    • What are the metered resources?
    • 1. Compute instances for front-end / back-end web sites: 10 dedicated instances (call support for more)
    • 2. Concurrent active sockets (e.g. WebSocket): 350 / dedicated instance (can be increased)
    • 3. Requests/sec per VM: metered by efficiency of implementation
    • 4. Connections / database: capped at 180 (default ADO.NET connection pool size in an ASP.NET app is 100)
    • 5. Database throughput: multiple metered resources, efficiency of implementation
    • http://azure.microsoft.com/en-us/documentation/articles/azure-subscription-service-limits/
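The connection limit interacts with the instance count: every web instance keeps its own connection pool, so the worst case is instances times pool size. The sketch below does that arithmetic with the figures quoted on the slide (10 instances, a cap of 180, a default pool of 100). The function names and the `reserved` parameter are hypothetical, and current service limits may differ from the slide.

```python
def worst_case_connections(instances: int, pool_size: int) -> int:
    """Connections opened if every instance fills its pool."""
    return instances * pool_size


def max_pool_size_per_instance(db_connection_cap: int, instances: int, reserved: int = 0) -> int:
    """Largest per-instance pool that keeps the total under the database cap.

    reserved sets aside connections for admin tools or background jobs.
    """
    if instances <= 0:
        raise ValueError("instances must be positive")
    usable = db_connection_cap - reserved
    if usable < instances:
        raise ValueError("cap is too small to give each instance one connection")
    return usable // instances


print(worst_case_connections(10, 100))      # 1000, far above the cap of 180
print(max_pool_size_per_instance(180, 10))  # 18
```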
  26. Slide 26

    • What are the failure / availability points?
    • 1. Internal logic errors or exceptions in application code: by and large, IIS will save you from intermittent errors (or at least limit them to a single request)
    • 2. Single SQL DB instance: any transient or enduring errors here will have a drastic effect on the web experience
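A common mitigation for transient faults against a single database is to retry with exponential backoff and jitter, while letting enduring faults surface quickly. This is a minimal sketch, not something the deck prescribes: `TransientError` and `call_with_retry` are hypothetical names, and real code would map the database driver's transient error codes onto the retryable case.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for faults that are worth retrying."""


def call_with_retry(operation: Callable[[], T], max_attempts: int = 4,
                    base_delay_s: float = 0.1, max_delay_s: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation, retrying on TransientError with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # enduring fault: give up and let the caller decide
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))  # jitter avoids synchronized retries
    raise AssertionError("unreachable")
```

Any exception other than `TransientError` propagates immediately, so logic errors are not masked by retries.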
  27. Can we reduce/shift critical work for differential workloads (read/write/consistency) away from single-point resources?

    Slide 27 - Can we reduce/shift critical work for differential workloads (read/write/consistency) away from single-point resources?

    • Primarily read workload – very suitable for caching
    • Note: this is something you need to invest engineering time in, in advance.
    • Adding caching during a live event is not something I want anybody else to experience 
    • Targeting Efficiency
  28. Caching is not a magical solution...

    Slide 28 - Caching is not a magical solution...

    • Unless you have a primarily read workload against a slow-changing state store... then it is pretty magical.
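For a read-heavy workload that tolerates a few seconds of staleness (the deck mentions ~3-5 second read consistency for active events), a read-through cache with a short time-to-live is one possible shape. The sketch below is in-process and not thread-safe; `TtlCache` and its parameters are hypothetical, and a production system would more likely use a shared cache service.

```python
import time
from typing import Callable, Dict, Generic, Hashable, Tuple, TypeVar

V = TypeVar("V")


class TtlCache(Generic[V]):
    """Read-through cache: serve a stored value until it is older than the TTL."""

    def __init__(self, loader: Callable[[Hashable], V], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, V]] = {}

    def get(self, key: Hashable) -> V:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh enough: no trip to the backing store
        value = self._loader(key)  # missing or expired: reload from the store
        self._entries[key] = (now, value)
        return value
```

With a 3 second TTL, the backing store sees at most one read per key every 3 seconds from each process, regardless of how many requests arrive.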
  29. Chatty I/O

    Slide 29 - Chatty I/O

    • Extraneous fetching
    • Improper instantiation
    • No caching
    • Synchronous I/O
    • Etc.
    • Common Density Barriers
    • Read more at github.com/mspnp
    • For more on scaling and density approaches, check out:
    • Lessons From Scale: Building Applications for Azure
    • Mark Russinovich, April 30th @ 11:30am in Hall 1B
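To make the "Chatty I/O" barrier concrete, the sketch below counts round trips for the same read done one key at a time and as a single batch. `FakeStore` is a hypothetical in-memory stand-in for a remote store; it only counts calls and does not model real network latency.

```python
from typing import Dict, List


class FakeStore:
    """Hypothetical stand-in for a remote store that counts round trips."""

    def __init__(self, rows: Dict[int, str]) -> None:
        self._rows = rows
        self.round_trips = 0

    def get_one(self, key: int) -> str:
        self.round_trips += 1
        return self._rows[key]

    def get_many(self, keys: List[int]) -> Dict[int, str]:
        self.round_trips += 1
        return {k: self._rows[k] for k in keys}


def load_chatty(store: FakeStore, keys: List[int]) -> List[str]:
    """One round trip per key: cost grows with the number of keys."""
    return [store.get_one(k) for k in keys]


def load_batched(store: FakeStore, keys: List[int]) -> List[str]:
    """One round trip in total, then results are put back in request order."""
    rows = store.get_many(keys)
    return [rows[k] for k in keys]
```

Both functions return the same data; the difference is how many times the application crosses the network to get it.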
  30. If you have high fidelity load testing, you can use your telemetry data to find workflows to optimize.

    Slide 30 - If you have high fidelity load testing, you can use your telemetry data to find workflows to optimize.

    • If your load profiles are not based around extrapolating real customer load – you are telling yourself lies
    • Don’t try to optimize everything. Optimize the primary path(s), observe, measure, react.
  31. Blueprint for success in a single data center

    Slide 31 - Blueprint for success in a single data center

    • How to get ready for the world? How to go to multiple data centers?
    • Mapping the Third Stage of the Journey
  32. Only three numbers: 0, 1 and N.  How do we go from 1 -> N?

    Slide 32 - Only three numbers: 0, 1 and N. How do we go from 1 -> N?

    • Resources, state and affinity:
    • Front end / mid tier web resources. Low state, need code replication approach.
    • Back end database. Highly stateful, need eventually consistent data replication
    • Routing. Need performance/locality based DNS routing.
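The idea behind performance/locality based routing can be sketched as "send the client to the healthy endpoint with the lowest measured latency". This is only an illustration of the concept, not how any particular DNS routing service is implemented; `Endpoint` and `pick_endpoint` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Endpoint:
    region: str
    latency_ms: float  # latency as measured from the client's vantage point
    healthy: bool


def pick_endpoint(endpoints: Iterable[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the lowest latency, or None if all are down."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.latency_ms)
```

Filtering on health first means a nearby but failed data center is skipped in favor of a more distant one that is still serving.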
    • Moving beyond one data center
  33. Slide 33

    • Moving to N - Baseline
    • Start from your baseline 1 data center deployment.
    • Ensure that you can build a production environment from automation.
  34. Slide 34

    • Moving to N - ALM
    • Enable automated git publishing (or other ALM approach) to your staging environment.
    • Automated global deployment to production creates “learning moments”
  35. Slide 35

    • Moving to N - Telemetry
    • Ensure that your telemetry service(s) are enabled and flowing data.
    • Ensure that your operations and dev staff are comfortable with working with the data.
  36. Slide 36

    • Moving to N – Global Routing
    • Enable Azure Traffic Manager with a single region endpoint.
    • Set the stage to go to N deployments.
  37. Slide 37

    • Moving to N – State Replication
    • Use your deployment scripts to roll out additional data centers.
    • Connect the additional data centers to git publishing
    • Do NOT connect them to traffic manager (yet)
  38. Slide 38

    • Moving to N – State Replication
    • Enable Azure SQL DB geo replication to create readable secondaries in other data centers (note: requires Premium).
    • Enable the other data centers via Azure Traffic Manager.
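With readable secondaries in place, one way to picture the resulting data path is that writes always go to the primary while reads prefer the secondary in the caller's own region. The sketch below is an assumption about how an application might choose a target, not something the deck specifies; `DatabaseTopology` and the connection names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DatabaseTopology:
    primary: str                 # connection name of the writable primary
    secondaries: Dict[str, str]  # region -> connection name of a readable secondary

    def target_for(self, region: str, is_write: bool) -> str:
        """Writes go to the primary; reads use the local secondary when one exists."""
        if is_write:
            return self.primary
        return self.secondaries.get(region, self.primary)
```

Because replication is eventually consistent, a read from a secondary can briefly lag behind a write that was just made to the primary.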
  39. None of these changes involved writing code 

    Slide 40 - None of these changes involved writing code 

    • When you need to grow and go quickly, this is a good thing.
    • But it won’t work unless you have a strong foundation to build on.
    • No psychic debugging!
    • Moving to N - Recap
  40. Success is exhilarating and terrifying.  If you can expect wild and sudden success, lay the foundations.

    Slide 41 - Success is exhilarating and terrifying. If you can expect wild and sudden success, lay the foundations.

    • Insight is life. Psychic debugging during a crisis leads to flailing. Rent your telemetry – and look at it!
    • Be ready to spend your way to victory. Design your system to allow adding more resources without rewriting (too much) code.
    • Takeaways
  41. Azure Clinic powered by Microsoft AzureCAT

    Slide 42 - Azure Clinic powered by Microsoft AzureCAT

    • 1) Talk to the folks who build world class, highly scalable, highly available systems on Azure today
    • 2) Bring your ideas for your application of the future and have them design it with you right there
    • 3) Bring your questions and your problems and get them fixed in the clinic on the spot
    • 4) Learn about Azure implementation best practices
  42. Follow our patterns & practices guidance on github at github.com/mspnp (contributions welcomed!)

    Slide 43 - Follow our patterns & practices guidance on github at github.com/mspnp (contributions welcomed!)

    • Cloud pattern guidance - https://github.com/mspnp/azure-guidance
    • Common performance related anti-patterns - https://github.com/mspnp/performance-optimization
    • Visit us at the AzureCAT clinic to discuss your scenarios and architecture
    • Call to Action
  43. Improve your skills by enrolling in our free cloud development courses at the Microsoft Virtual Academy.

    Slide 44 - Improve your skills by enrolling in our free cloud development courses at the Microsoft Virtual Academy.

    • Try Microsoft Azure for free and deploy your first cloud solution in under 5 minutes!
    • Easily build web and mobile apps for any platform with Azure App Service for free.
    • Resources