BRK3460: Operating the Microsoft Cloud Platform System

[Speaker: Daniel Savage, Justin Incarnato] Come learn about how customers will operate the Microsoft Cloud Platform System (CPS) – the Azure-consistent private cloud solution for enterprise or service provider environments. In this session, we show you the investments we have made to dramatically reduce the total cost of ownership of running the Microsoft Cloud Platform System (CPS).

Created 2 years ago

Duration 1:15:20
Slide Content
  1. Operating the Microsoft Cloud Platform System

    Slide 1 - Operating the Microsoft Cloud Platform System

    • Daniel Savage, Justin Incarnato, Thomas Roettinger
    • BRK3460
  2. Agenda

    Slide 2 - Agenda

    • Why converged solutions?
    • CPS Architecture Review
    • Management Cluster and Services Overview
    • Monitoring
    • Automation
    • Patch and Update (P&U)
    • Backup & Disaster Recovery
  3. Converged Systems

    Slide 3 - Converged Systems

    • Do-it-yourself
    • Reference Architecture
    • Enabling datacenter deployment options
    • Faster time to value
    • More
    • Customization
    • Standardization
    • Analytics Platform System (APS)
    • Cloud Platform System (CPS)
    • Fast Track Solutions
    • System Center
  4. Source: IDC, 2015 (http://idcdocserv.com/IDC_Solution_Brief_254796)

    Slide 4 - Source: IDC, 2015 (http://idcdocserv.com/IDC_Solution_Brief_254796)

    • Why Integrated Systems matter
  5. Networking

    Slide 5 - Networking

    • 5 x Force 10 – S4810P
    • 1 x Force 10 – S55
    • Compute Scale Unit (32 x Hyper-V hosts)
    • Dell PowerEdge C6220ii – 4 Nodes per 2U
    • Dual socket Intel IvyBridge (E5-2650v2 @ 2.6GHz)
    • 256 GB memory
    • 2 x 10 GbE Mellanox NICs (LBFO Team, NVGRE offload)
    • 2 x 10 GbE Chelsio (iWARP/RDMA)
    • 1 local SSD 200 GB (boot/paging)
    • Storage Scale Unit (4 x File servers, 4 x JBODs)
    • Dell PowerEdge R620v2 Servers (4 servers for Scale-Out File Server)
    • Dual socket Intel IvyBridge (E5-2650v2 @ 2.6GHz)
    • 2 x LSI 9207-8E SAS Controllers (shared storage)
    • 2 x 10 GbE Chelsio T520 (iWARP/RDMA)
    • PowerVault MD3060e JBODs (48 HDD, 12 SSD)
    • 4 TB HDDs and 800 GB SSDs
    • CPS - Integrated solution for HW and SW
    • One Rack
    • 512 Cores
    • 8TB RAM
    • 282 TB usable storage
    • 1360 Gb/s internal connectivity
    • 560 Gb/s inter-rack connectivity
    • 60 Gb/s external
    • 2,322 lbs
    • 42U
    • 16.6 kW Maximum
    • 1 – 4 racks
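The per-rack totals on this slide can be cross-checked against the per-node figures. The sketch below does that arithmetic in Python. The value of 8 cores per socket is an assumption taken from the published E5-2650 v2 specification; it is not stated on the slide, and the totals only line up if the 512 cores count the 32 compute hosts alone.

```python
# Cross-check of the per-rack compute totals quoted on the slide.
HOSTS_PER_RACK = 32        # Hyper-V hosts in one compute scale unit (from the slide)
SOCKETS_PER_HOST = 2       # dual-socket nodes (from the slide)
CORES_PER_SOCKET = 8       # assumption: E5-2650 v2 core count, not on the slide
MEMORY_GB_PER_HOST = 256   # from the slide

cores_per_rack = HOSTS_PER_RACK * SOCKETS_PER_HOST * CORES_PER_SOCKET
memory_tb_per_rack = HOSTS_PER_RACK * MEMORY_GB_PER_HOST / 1024

print(f"cores per rack:  {cores_per_rack}")
print(f"memory per rack: {memory_tb_per_rack:.0f} TB")
```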
  6. Cloud Platform System

    Slide 6 - Cloud Platform System

    • Networking
    • Compute
    • 32 servers per rack
    • 128 maximum
    • Storage
    • 282 TB per rack
    • 1.1 PB maximum
    • Rack 4
    • Agg Switches
    • Mgmt Switches
    • Edge cluster
    • Compute cluster
    • Storage cluster
    • Access switches
    • Rack 3
    • Agg Switches
    • Mgmt Switches
    • Edge cluster
    • Compute cluster
    • Storage cluster
    • Access switches
    • Rack 1
    • Agg Switches
    • Mgmt Switches
    • Mgmt cluster
    • Edge cluster
    • Compute cluster
    • Storage cluster
    • Access switches
    • Rack 2
    • Agg Switches
    • Mgmt Switches
    • Edge cluster
    • Compute cluster
    • Storage cluster
    • Access switches
    • Management host group
    • Edge host group
    • Compute host group
    • Storage host group
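The maximums on this slide follow from multiplying the per-rack figures by the four-rack limit. A minimal sketch of that scaling, assuming decimal units (1 PB = 1000 TB) for the storage figure:

```python
# Scale the per-rack figures from the slide across 1 to 4 racks.
SERVERS_PER_RACK = 32      # from the slide
USABLE_TB_PER_RACK = 282   # from the slide
MAX_RACKS = 4              # the deck describes 1 to 4 racks

for racks in range(1, MAX_RACKS + 1):
    servers = racks * SERVERS_PER_RACK
    storage_tb = racks * USABLE_TB_PER_RACK
    print(f"{racks} rack(s): {servers} servers, {storage_tb} TB usable")

max_servers = MAX_RACKS * SERVERS_PER_RACK
# Assumption: decimal petabytes, which is what makes 4 x 282 TB read as ~1.1 PB.
max_storage_pb = MAX_RACKS * USABLE_TB_PER_RACK / 1000
```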
  7. Cloud Platform System

    Slide 7 - Cloud Platform System

    • Networking
    • Compute
    • 32 servers per rack
    • 128 maximum
    • Storage
    • 282 TB per rack
    • 1.1 PB maximum
    • Rack 4
    • Agg Switches
    • Mgmt Switches
    • Edge cluster
    • Compute cluster
    • Storage cluster
    • Access switches
    • Rack 3
    • Agg Switches
    • Mgmt Switches
    • Edge cluster
    • Compute cluster
    • Storage cluster
    • Access switches
    • Rack 1
    • Agg Switches
    • Mgmt Switches
    • Mgmt cluster
    • Edge cluster
    • Compute cluster
    • Storage cluster
    • Access switches
    • Rack 2
    • Agg Switches
    • Mgmt Switches
    • Edge cluster
    • Compute cluster
    • Storage cluster
    • Access switches
    • Management host group
    • Edge host group
    • Compute host group
    • Storage host group
  8. Management Cluster

    Slide 8 - Management Cluster

    • VMM (2)
    • SQL Cluster for DBs (4 VMs)
    • (VMM, OM, SPF, WSUS, SMA, WAP, DW, Analysis Services)
    • VMM Library (1)
    • Service Templates
    • AD/DNS/DHCP (3)
    • Mgmt AD
    • DPM (1) (for backing up Management cluster)
    • IaaS RP (SPF) (2)
    • SMA (3)
    • WDS (1)
    • WSUS (1)
    • WAP Tenant API (2)
    • WAP Admin Portal/API, Service Reporting (3)
    • ADFS (2)
    • OM (3)
    • Console (4)
    • 32 VMs on a 6-node Hyper-V failover cluster
    • Deployed for HA
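The instance counts in parentheses on this slide add up to the stated 32 VMs. A short tally in Python, using the role names and counts exactly as listed:

```python
# Instance counts per management role, as listed on the slide.
management_vms = {
    "VMM": 2,
    "SQL cluster": 4,
    "VMM Library": 1,
    "AD/DNS/DHCP": 3,
    "DPM": 1,
    "IaaS RP (SPF)": 2,
    "SMA": 3,
    "WDS": 1,
    "WSUS": 1,
    "WAP Tenant API": 2,
    "WAP Admin Portal/API, Service Reporting": 3,
    "ADFS": 2,
    "OM": 3,
    "Console": 4,
}

total_vms = sum(management_vms.values())
print(f"total management VMs: {total_vms}")
```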
  9. CPS – Management Services

    Slide 9 - CPS – Management Services

    • Monitoring
    • Automation
    • Backup (Tenant/MC)
    • Patching & Upgrade
    • Disaster Recovery
    • Configure/deploy
    • Service Administration
  10. Monitoring

    Slide 10 - Monitoring

  11. CPS 1.0 Monitoring Goals

    Slide 11 - CPS 1.0 Monitoring Goals

    • Deploy and configure all required monitoring infrastructure and agents via automated methods
    • Provide current, actionable health state of Hosts, Infrastructure VMs (Management Cluster, Edge Cluster, WAP Tenant Services), and the CPS software components.
    • When problems occur with the CPS Stamp, as a Fabric Admin, I receive actionable alerts.
    • During CPS Stamp maintenance, I can pause monitoring using documented maintenance mode steps to avoid false alerting.
  12. Integrated testing of management packs during CPS development.  What did we find out?

    Slide 12 - Integrated testing of management packs during CPS development. What did we find out?

    • Management Packs can be noisy!
    • Understanding fabric health is difficult
    • Making sense of CPS monitoring
    • Tuned management packs to reduce health and alert noise (overrides + MP fixes)
    • Create centralized dashboard visualizations for health and alerts
    • Results?
    • Eliminated need for customer tuning of OpsMgr MPs
    • Reduced time spent investigating noisy alerts
    • Decreased time to effective monitoring
    • OpsMgr in Cloud Platform System (CPS)
  13. Management Cluster

    Slide 13 - Management Cluster

    • All Hosts and VMs
    • All Management SW Components (Windows, SC, WAP)
    • Compute and Storage Clusters
    • All Hosts and VMs supporting WAP Tenant Portal
    • WAP Tenant Portal
    • Windows Storage Spaces (SOFS)
    • Edge Cluster
    • All Hosts and VMs
    • Multi-tenant RAS
    • Hardware
    • Dell Servers, Switches, JBODs (via Storage Spaces MP)
    • HBA Adapters
    • F5 Load Balancer
    • What’s monitored in CPS?
  14. Demo: CPS Monitoring

    Slide 14 - Demo: CPS Monitoring

    • Daniel Savage
  15. Automation (Justin Incarnato)

    Slide 15 - Automation (Justin Incarnato)

  16. Why Automation?

    Slide 16 - Why Automation?

    • Save Time
    • Increase Efficiency
    • Decrease Risk
    • Decrease Downtime
  17. What operational tasks did we automate in CPS?

    Slide 17 - What operational tasks did we automate in CPS?

    • Service Account Password Reset
    • BMC Password Reset Automation
    • Data Consistency
  18. Password Reset

    Slide 18 - Password Reset

    • The Challenge
    • System Center Components and WAP require domain and database accounts to be setup and run
    • Separation of privileges requires 34 service accounts to run the CPS management stack
    • Passwords expire as per domain policies
    • CPS is an E2E offering which requires an easy and non-disruptive way to reset the service account passwords
    • The Feature
    • Password Reset which automates service account password rotation (SQL, VMM, OM, SPF, SMA, SR, all WAP components and ADFS)
    • Implementation
    • PowerShell & SMA runbooks orchestrating reset of all domain service account passwords and service restarts.
    • Reduced number of accounts requiring password reset through use of Group Managed Service Accounts (gMSAs)
  19. Password Reset Script Process

    Slide 19 - Password Reset Script Process

    • User Action
    • SMA
    • PowerShell
    • All critical components OK - Proceed with Password Reset
    • If any VMs or Services are not running, allow user to terminate the script
    • Logs for each step are saved locally
    • A list of all the runbooks and status is printed on the screen
    • The last step is to also save the logs in the SMA database
    • If any failures have occurred direct user to remediation instructions
    • Passwords are changed in the following order
    • SQL, VMM, OM, ADFS, WAP, SMA, SPF
    • Password Reset starts
    • Credentials Input
    • Health Check
    • Check if dependent components are running (SMA, SQL, VMM)
    • If any of the dependent components are not running: terminate the script
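The flow on this slide (health check, terminate if a dependent component is down, then reset in a fixed order with per-step logging) can be sketched as below. This is an illustrative Python outline, not the CPS PowerShell/SMA implementation: `is_running` and `reset_password` are hypothetical callables standing in for the real checks and runbooks, and continuing after a failed reset is an assumption, since the slide only says the user is directed to remediation instructions.

```python
from typing import Callable, Dict, List

# Both lists come from the slide; everything else is a hypothetical stand-in.
RESET_ORDER = ["SQL", "VMM", "OM", "ADFS", "WAP", "SMA", "SPF"]
DEPENDENT_COMPONENTS = ["SMA", "SQL", "VMM"]


def run_password_reset(
    is_running: Callable[[str], bool],
    reset_password: Callable[[str], bool],
) -> Dict[str, object]:
    """Run the health check, then reset passwords in the documented order."""
    log: List[str] = []

    # Health check: terminate before changing anything if a dependency is down.
    down = [c for c in DEPENDENT_COMPONENTS if not is_running(c)]
    if down:
        log.append("health check failed: " + ", ".join(down) + " not running")
        return {"completed": False, "failed": down, "log": log}
    log.append("health check passed")

    # Reset each component in order and record the outcome of every step.
    failed: List[str] = []
    for component in RESET_ORDER:
        ok = reset_password(component)
        log.append(f"{component}: {'reset' if ok else 'FAILED'}")
        if not ok:
            failed.append(component)  # assumption: keep going, report at the end

    if failed:
        log.append("see remediation instructions for: " + ", ".join(failed))
    return {"completed": not failed, "failed": failed, "log": log}


# Example run with stand-in callables that always succeed.
calls: List[str] = []


def fake_reset(component: str) -> bool:
    calls.append(component)
    return True


result = run_password_reset(lambda c: True, fake_reset)
print("\n".join(result["log"]))
```

Passing the checks in as callables keeps the ordering logic testable without touching any real service accounts.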
  20. Demo: CPS Password Reset

    Slide 20 - Demo: CPS Password Reset

    • Justin Incarnato
  21. Demo: BMC Password Reset

    Slide 22 - Demo: BMC Password Reset

    • Justin Incarnato
  22. Password Reset

    Slide 24 - Password Reset

    • Value Proposition
    • Seamless automatic service account password reset with minimal downtime
    • Enables Password Resync when DPM restores are necessary
    • The Benefits
    • Time - Reduced password reset process time from many hours to minutes
    • Efficiency – Able to perform other tasks while automation is running
    • Risk – Repeatable safe process with no manual steps
    • Downtime – No Tenant downtime and minimal management stack downtime
  23. Patch & Update

    Slide 25 - Patch & Update

  24. Error prone, tedious task that no one likes

    Slide 26 - Error prone, tedious task that no one likes

    • Disruption of tenant workload or management functions = not good for business
    • Multiple, complex tools & processes going from staging to QA to pre-prod to prod…
    • Wasted time looking for patches from multiple sources with a multitude of release cadences
    • Some of today’s update problems…
  25. P&U provides automated CPS servicing using a single, simple tool

    Slide 27 - P&U provides automated CPS servicing using a single, simple tool

    • Fingerprints a CPS stamp including hardware, firmware & software
    • Automatic update of all pre-validated Windows Server, System Center, SQL & Windows Azure Pack updates
    • Validates all workloads after servicing to ensure availability
    • Designed to support “offline/submarine” scenario
    • Enables production-ready update release cadence
    • Component sequencing & dependency aware to avoid disruptions in tenant workloads or management functions
    • So, what did we build? Enter CPS P&U
  26. Importance of dependencies & sequence

    Slide 28 - Importance of dependencies & sequence

  27. Not everything is for CPS so we implemented a Points/Scoring system for:

    Slide 29 - Not everything is for CPS so we implemented a Points/Scoring system for:

    • Security updates
    • Critical updates
    • General Distribution Releases (GDR)
    • Limited Distribution Releases (LDR)
    • Update Rollups
    • Cumulative Updates
    • Cadence: Build minor and major CPS P&U Update Packages for customers
    • Minor updates as-needed, major updates quarterly
    • What to leave in…what to leave out?
  28. Enable faster update cadence

    Slide 30 - Enable faster update cadence

    • Quicker feature turnaround
    • Increased reliability
    • P&U as a controlled process
    • Pre-validated patch bundles
    • Halted process in case of update failure
    • Simpler validation matrix
    • Uniform installation base
    • Limited set of possible configurations
    • Improve Customer Support Service
    • Customer update level awareness
    • Customer setup easily reproducible
    • Uniform logging and validation approach
    • Additional benefits
  29. Lastly, here is the P&U “rack flow”

    Slide 31 - Lastly, here is the P&U “rack flow”

    • Compute Node
    • Highly Available Virtual Machine Workloads
    • Compute Node
    • Orchestration
    • Compute Node
    • Running Workloads Keep Running
    • Live Migration
    • Maintenance Mode
    • CPS Update 0.1
    • Inventory
    • Update Windows Server
    • Validate Windows Server
    • Update System Center
    • Validate System Center
    • Update Windows Azure Pack
    • Validate Windows Azure Pack
    • CPS Update
    • 0.2
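The update sequence in this rack flow, combined with the "halted process in case of update failure" point from the earlier slide, can be sketched as an ordered list of steps that stops at the first failure. The step names come from the slide; `run_step` is a hypothetical callable, and the real P&U tooling is not shown in the deck.

```python
from typing import Callable, List, Tuple

# Step names follow the slide; how each step is executed is hypothetical.
SEQUENCE = [
    "Inventory",
    "Update Windows Server",
    "Validate Windows Server",
    "Update System Center",
    "Validate System Center",
    "Update Windows Azure Pack",
    "Validate Windows Azure Pack",
]


def run_update(run_step: Callable[[str], bool]) -> Tuple[List[str], str]:
    """Run steps in order and halt at the first failure.

    Returns the completed steps and the name of the failed step
    (an empty string if every step succeeded).
    """
    completed: List[str] = []
    for step in SEQUENCE:
        if not run_step(step):
            return completed, step
        completed.append(step)
    return completed, ""


# Example: simulate a validation failure after the System Center update.
done, failed_step = run_update(lambda step: step != "Validate System Center")
print("completed:", done)
print("halted at:", failed_step)
```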
  30. Backup & Disaster Recovery (Thomas Roettinger)

    Slide 32 - Backup & Disaster Recovery (Thomas Roettinger)

  31. "As a service provider, I can backup the management cluster and provide Backup-as-a-Service to my tenant VMs"

    Slide 33 - "As a service provider, I can backup the management cluster and provide Backup-as-a-Service to my tenant VMs"

    • "As a service provider, I can optimize my backup storage by using backup de-duplication"
    • CPS Backup with DPM
  32. CPS Backup Workflow

    Slide 34 - CPS Backup Workflow

    • Protection Environment is pre-configured in CPS
    • DPM with De-Dup enabled Backup storage
    • MC protection pre-configured
    • Daily back-up for predictable SLA
    • Fabric level Consistency during MC restore
    • Central DPM console for management
    • Included Automation of backup admin tasks
    • Protect newly provisioned tenant VMs with a click
    • Clean up of protected data for the deleted VMs
    • Manage the Backup intent across DPM servers (retention range, backup windows)
    • Exclusion list – (VMs that don’t need to be protected based on patterns)
    • Rack 1
    • Agg Switches
    • Mgmt Switches
    • Mgmt cluster
    • Edge cluster
    • Compute cluster
    • Storage cluster
    • Access switches
    • Management Cluster
    • SQL Cluster for DBs (4 VMs)
    • DPM (1)
    • (backing up Management cluster)
    • WAP Tenant (2)
    • WAP Admin Portal/API, Service Reporting (3)
    • Console (2)
    • SMA (3)
    • VMM (2)
    • WDS (1)
    • OM (3)
    • ADFS (2)
    • VMM Library (1)
    • WSUS (1)
    • SPF (2)
    • AD/DNS/DHCP (3)
    • Windows Server Backup
    • Infra Cluster
    • System Center
    • WAP
    • SQL
    • Compute Cluster
    • DPM (2) (for backing up Host Tenant VMs)
    • 2000
    • 8 DPM Servers
    • Tenant VMs
    • (Tenant’s Production workload)
    • Tenant VMs
    • (Tenant’s Production workload)
    • Tenant VMs
    • (Tenant’s Production workload)
  33. Schedule & Retention (MC)

    Slide 35 - Schedule & Retention (MC)

    • Component | Schedule | Retention Period
    • All VMs (WSUS; WDS; all System Center components; Console VMs; Windows Azure Pack; components running in management cluster; components running in compute cluster) | Once a week (8:00 PM every Saturday) | Two weeks
    • All databases (WSUS; all System Center components; Windows Azure Pack) | Express full backup every four hours (starting at 12:00 AM) | Five days
    • Active Directory/DNS/DHCP using Windows Server Backup | Once a day | Five days
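As a worked example of the database schedule above, the sketch below lists the daily express full backup times and an upper bound on retained recovery points. The bound assumes every scheduled backup succeeds and yields exactly one recovery point; the deck does not describe how DPM prunes them.

```python
from datetime import datetime, timedelta

INTERVAL_HOURS = 4   # express full backup every four hours (from the slide)
RETENTION_DAYS = 5   # database retention period (from the slide)

# Arbitrary date; only the 12:00 AM (00:00) starting time matters here.
start = datetime(2015, 1, 1, 0, 0)
daily_times = [
    (start + timedelta(hours=h)).strftime("%H:%M")
    for h in range(0, 24, INTERVAL_HOURS)
]

backups_per_day = len(daily_times)
# Assumption: one recovery point per successful backup, none pruned early.
max_recovery_points = backups_per_day * RETENTION_DAYS

print("daily backup times:", ", ".join(daily_times))
print("recovery points kept (upper bound):", max_recovery_points)
```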
  34. Tenant (CC) - Schedule & Retention

    Slide 36 - Tenant (CC) - Schedule & Retention

    • Component | Schedule | Retention Period
    • Tenant VMs | Daily 10 PM to 6 AM | 7 Days
    • VMM Tenant Library Share | Saturday 10 PM | 2 Backups
    • Adding VMs to PG | Defined by Service Provider | NA
    • Pre-scheduled Runbook for VMM Tenant Library Share
    • Service Provider schedules Tenant Protection Runbook
    • Runbook defines what to exclude, run by Service Provider
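The deck says the tenant protection runbook takes an exclusion list of VMs "based on patterns" but does not show the pattern syntax. The sketch below assumes shell-style wildcards and case-insensitive matching purely for illustration; the VM names and patterns are made up.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def vms_to_protect(
    vm_names: Iterable[str], exclusion_patterns: Iterable[str]
) -> List[str]:
    """Return the VMs that match none of the exclusion patterns.

    Assumption: patterns are shell-style wildcards compared case-insensitively.
    """
    patterns = [p.lower() for p in exclusion_patterns]
    return [
        name
        for name in vm_names
        if not any(fnmatchcase(name.lower(), p) for p in patterns)
    ]


# Hypothetical tenant VM names and exclusion list.
tenant_vms = ["web-01", "web-02", "test-db-01", "Scratch-build", "sql-01"]
exclusions = ["test-*", "scratch-*"]

protected = vms_to_protect(tenant_vms, exclusions)
print(protected)
```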
  35. Storage Spaces with Dual Parity

    Slide 37 - Storage Spaces with Dual Parity

    • 72 Physical Disks Total
    • 8 SSDs marked as Journal
    • 64 HDDs
    • Total Space for Backup 115 TB
    • DPM VM (Total 8)
    • 20 VHDs each 1 TB
    • 8 x 20 = 160 TB
    • Deduplication enabled
    • Savings 45 TB
    • Deduplication
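The storage figures on this slide are consistent with each other if the 45 TB of savings is read as the gap between the capacity provisioned to the DPM VMs and the physical backup space. That reading is an inference from the numbers, sketched here:

```python
DPM_SERVERS = 8           # DPM VMs (from the slide)
VHDS_PER_SERVER = 20      # VHDs per DPM VM (from the slide)
VHD_SIZE_TB = 1           # size of each VHD (from the slide)
PHYSICAL_BACKUP_TB = 115  # total space for backup (from the slide)

provisioned_tb = DPM_SERVERS * VHDS_PER_SERVER * VHD_SIZE_TB
# Assumption: "savings" means provisioned capacity minus physical capacity.
savings_tb = provisioned_tb - PHYSICAL_BACKUP_TB
savings_pct = savings_tb / provisioned_tb * 100

print(f"provisioned: {provisioned_tb} TB")
print(f"savings:     {savings_tb} TB ({savings_pct:.1f}% of provisioned)")
```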
  36. System can become inconsistent after Restore

    Slide 38 - System can become inconsistent after Restore

    • Restore of individual DBs
    • Scenario
    • VMM Database is restored
    • New Tenant created in WAP
    • Tenant missing in VMM DB
    • Solution
    • Data Consistency Runbook
    • Detect & Fix Inconsistencies
    • Recovery - Data Consistency
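The detection half of the scenario above (a tenant created in WAP after the restored VMM backup was taken) amounts to comparing two sets of tenant identifiers. A minimal sketch, with made-up tenant names and no claim about how the actual Data Consistency runbook queries WAP or VMM:

```python
from typing import Dict, Iterable, List


def find_inconsistencies(
    wap_tenants: Iterable[str], vmm_tenants: Iterable[str]
) -> Dict[str, List[str]]:
    """Report tenants that exist on only one side."""
    wap, vmm = set(wap_tenants), set(vmm_tenants)
    return {
        "missing_in_vmm": sorted(wap - vmm),
        "missing_in_wap": sorted(vmm - wap),
    }


# Hypothetical state: "northwind" was created in WAP after the VMM backup.
wap_tenants = ["contoso", "fabrikam", "northwind"]
vmm_tenants_after_restore = ["contoso", "fabrikam"]

report = find_inconsistencies(wap_tenants, vmm_tenants_after_restore)
print(report)
```

The fix step would then act on each reported entry; how it does so is not described in the deck.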
  37. Slide 39

    • Off Stamp Backup
    • Management Cluster
    • SQL Cluster for DBs (4 VMs)
    • DPM (1)
    • (backing up Management cluster)
    • WAP Tenant (2)
    • WAP Admin Portal/API, Service Reporting (3)
    • Console (2)
    • SMA (3)
    • VMM (2)
    • WDS (1)
    • OM (3)
    • ADFS (2)
    • VMM Library (1)
    • WSUS (1)
    • SPF (2)
    • AD/DNS/DHCP (3)
    • Windows Server Backup
    • Infra Cluster
    • System Center
    • WAP
    • SQL
    • Compute Cluster
    • DPM (2) (for backing up Host Tenant VMs)
    • 2000
    • 8 DPM Servers
    • Tenant VMs
    • (Tenant’s Production workload)
    • Tenant VMs
    • (Tenant’s Production workload)
    • Tenant VMs
    • (Tenant’s Production workload)
    • CPS STAMP
    • Off Stamp DPM Servers
    • Can even be Off Site
  38. Demo: Backup

    Slide 40 - Demo: Backup

    • Thomas Roettinger
  39. “As a service provider I can offer DR as a managed service for deployed multi-tier tenant workloads using my data centers”

    Slide 41 - “As a service provider I can offer DR as a managed service for deployed multi-tier tenant workloads using my data centers”

    • Topology
    • CPS Stamp to CPS Stamp (E2E)
    • CPS Stamp to Azure (E2A)
    • Disaster Recovery Overview
  40. DR Orchestration

    Slide 42 - DR Orchestration

    • SCVMM
    • Compute
    • Storage
    • Networks
    • DRP
    • CPS Stamp 1
    • CPS DR with Azure Site Recovery (E2E)
    • Data Channel on Customer Network
    • (Hyper-V Replica)
    • CPS Stamp 2
    • Azure Site Recovery
    • Azure Pack
    • SCVMM
    • Compute
    • Storage
    • Networks
    • DRP
    • Azure Pack
    • DR as a Plan/Add-On
    • SMA runbook to enable/disable DR protection for all VMs in subscription
    • Full integration of replica VMs with recovery site Azure Pack
    • DR Orchestration
  41. Demo: Disaster Recovery

    Slide 43 - Demo: Disaster Recovery

    • Thomas Roettinger
  42. DR Orchestration

    Slide 44 - DR Orchestration

    • DR Orchestration
    • CPS Stamp
    • DR to Azure (E2A)
    • Extensible Data Channel
    • Secondary Site
    • Microsoft Azure
    • Compute
    • Storage
    • Networks
    • SCVMM
    • Compute
    • Storage
    • Networks
    • DRP
    • Azure Pack
    • Azure Site Recovery
  43. In Review

    Slide 45 - In Review

    • The CPS operational experience is optimized for the lowest TCO of any Microsoft Cloud solution
  44. Slide 46

    • Ignite Azure Challenge Sweepstakes
    • Attend Azure sessions and activities, track your progress online, win raffle tickets for great prizes!
    • Aka.ms/MyAzureChallenge
    • Enter this session code online: BRK3460
    • NO PURCHASE NECESSARY. Open only to event attendees. Winners must be present to win. Game ends May 9th, 2015. For Official Rules, see The Cloud and Enterprise Lounge or myignite.com/challenge
  45. Slide 47

    • Learn more with FREE IT Pro Resources
    • Free technical training resources:
    • On-demand online training: http://aka.ms/moderninfrastructure
    • Expand your Modern Infrastructure Knowledge
    • Free ebooks:
    • Deploying Hyper-V with Software-Defined Storage & Networking: http://aka.ms/deployinghyperv
    • Microsoft System Center: Integrated Cloud Platform: http://aka.ms/cloud-platform-ebook
    • Join the IT Pro community:
    • Twitter @MS_ITPro
    • Get hands-on: Free virtual labs:
    • Microsoft Virtualization with Windows Server and System Center: http://aka.ms/virtualization-lab
    • Windows Azure Pack: Install and Configure: http://aka.ms/wap-lab