Data Warehousing and Analysis at Facebook


Created 2 years ago

Slide Content
  1. Data Warehousing and Analytics Infrastructure at Facebook

    Slide 1 - Data Warehousing and Analytics Infrastructure at Facebook

    • Naveen Reddy Thumma
    • Y00749755
  2. Facebook is a popular social networking platform used by millions of people around the world

    Slide 2 - Facebook is a popular social networking platform used by millions of people around the world

    • It generates 10 to 15 TB of data every day, and this rate keeps increasing
    • With so much data collected every day, a strong platform is needed to process and store it
    • To do this job, Facebook uses two open source platforms, Hadoop and Hive, to process this huge quantity of data
    • Facebook also uses Scribe, a platform that aggregates the logs collected from its servers
  3. Hive is a platform created by Facebook to support its data processing and analysis; it is built on top of the Hadoop framework to provide more advanced features

    Slide 3 - Hive is a platform created by Facebook to support its data processing and analysis; it is built on top of the Hadoop framework to provide more advanced features

    • Hadoop is a popular data processing framework developed by Apache that uses the MapReduce technique
    • Scribe is a framework that collects logs from different servers and aggregates them
    • Data Flow Architecture
    • The data flow architecture shows how the two sources of data, the federated MySQL tier and the web tier, feed the warehouse and how data processing is done on Facebook's large collection of data
  4. DATA FLOW ARCHITECTURE

    Slide 4 - DATA FLOW ARCHITECTURE

  5. Data Flow Architecture

    Slide 5 - Data Flow Architecture

    • Data from the various web servers is pushed to a set of Scribe-Hadoop clusters; the Scribe servers aggregate the logs coming from the different servers
    • Copier jobs then compress this data and transfer it to the Hive-Hadoop cluster. The copiers run at 5 to 15 minute intervals and copy out all the new files created in the Scribe-Hadoop clusters
    • Data from the federated MySQL tier is also loaded into the Hive-Hadoop cluster by daily scrape processes
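
The copier step described on this slide can be sketched roughly as follows. This is only an illustration using local directories in place of the two Hadoop clusters; the function name `copy_new_files`, the directory layout, and the in-memory `already_copied` set are assumptions made for the example, not Facebook's actual copier implementation.

```python
from __future__ import annotations

import gzip
import shutil
import tempfile
from pathlib import Path


def copy_new_files(source_dir: Path, dest_dir: Path, already_copied: set[str]) -> list[str]:
    """Gzip every file in source_dir that has not been copied yet into dest_dir.

    Returns the names of the files copied during this pass.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(source_dir.iterdir()):
        if not src.is_file() or src.name in already_copied:
            continue  # skip directories and files handled by an earlier pass
        target = dest_dir / (src.name + ".gz")
        with src.open("rb") as fin, gzip.open(target, "wb") as fout:
            shutil.copyfileobj(fin, fout)  # stream the file through gzip
        already_copied.add(src.name)
        copied.append(src.name)
    return copied


if __name__ == "__main__":
    # Hypothetical stand-ins for the source and destination clusters.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "scribe_hadoop"
        dest = Path(tmp) / "hive_hadoop"
        source.mkdir()
        (source / "web01.log").write_text("GET /home 200\n" * 1000)
        seen: set[str] = set()
        print(copy_new_files(source, dest, seen))  # first pass copies the new file
        print(copy_new_files(source, dest, seen))  # second pass finds nothing new
```

In a real deployment a pass like this would be triggered on a schedule (every 5 to 15 minutes, per the slide), and the record of copied files would need to survive restarts.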
  6. Storage

    Slide 6 - Storage

    • Because of this huge quantity of data, Facebook compresses it with gzip; Hadoop allows data to be compressed using a user-specified codec
    • Facebook also uses row columnar compression in Hive for many tables
    • Facebook's compression factor is around 6-7
    • Scaling HDFS NameNode
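
To make the idea of a compression factor concrete, here is a minimal sketch that measures it as raw size divided by gzip-compressed size. The log lines are synthetic and made up for the example, so the factor it prints says nothing about Facebook's real data.

```python
import gzip


def compression_factor(raw: bytes) -> float:
    """Return raw size divided by gzip-compressed size."""
    compressed = gzip.compress(raw)
    return len(raw) / len(compressed)


# Synthetic, repetitive log lines (invented for illustration only).
sample = b"".join(
    b"2010-01-01 00:00:%02d GET /profile.php?id=%d 200\n" % (i % 60, i)
    for i in range(10_000)
)

factor = compression_factor(sample)
print(f"raw={len(sample)} bytes, compression factor={factor:.1f}")
```

A factor of 6 would mean that every 6 bytes of raw data occupy about 1 byte on disk.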
  7. DATA DISCOVERY AND ANALYSIS

    Slide 7 - DATA DISCOVERY AND ANALYSIS

  8. DATA DISCOVERY AND ANALYSIS

    Slide 8 - DATA DISCOVERY AND ANALYSIS

    • At Facebook, querying and analysis of data is done predominantly through Hive. The data sets are published in Hive as tables with daily or hourly partitions
    • Hive tables and partitions are stored as HDFS directories. The mapping to these directories and the structural information about these objects are stored in the Hive metastore; the driver uses this information to convert HiveQL queries into MapReduce jobs
    • Interactive Ad hoc queries
    • At Facebook, users run ad hoc queries through either HiPal, a web-based graphical interface to Hive, or the Hive CLI, a command line interface similar to the MySQL shell
    • Data Discovery
    • Periodic Batch Jobs
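
The mapping from a partitioned table to HDFS directories mentioned on this slide can be illustrated with a small sketch. It follows Hive's usual `key=value` directory naming; the warehouse root, the table name `page_views`, and the partition columns `ds` and `hr` are hypothetical values chosen for the example.

```python
from __future__ import annotations


def partition_path(warehouse_root: str, table: str, partition: dict[str, str]) -> str:
    """Build the directory path for one partition of a table.

    Each partition column becomes one "key=value" directory level,
    in the order the columns are given.
    """
    levels = [f"{key}={value}" for key, value in partition.items()]
    return "/".join([warehouse_root.rstrip("/"), table, *levels])


# A daily partition further split by hour (hypothetical table and columns).
print(partition_path("/user/hive/warehouse", "page_views", {"ds": "2010-06-01", "hr": "13"}))
# /user/hive/warehouse/page_views/ds=2010-06-01/hr=13
```

Because each partition is its own directory, a query restricted to one day or hour only needs to read the files under the matching path.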
  9. RESOURCE SHARING

    Slide 9 - RESOURCE SHARING

    • The co-existence of interactive ad hoc queries and periodic batch jobs on the same set of cluster resources has many implications on how resources are shared between different jobs in the cluster
    • Ad hoc users need minimal response time
    • Periodic batch jobs require a predictable execution time, as they are more concerned with data being available before a certain deadline
    • Facebook has been instrumental in the development of the Hadoop Fair Share Scheduler, which lets users have their own pool of resources even when they share the same cluster
    • Operations that take place at Facebook
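
The idea of giving each group its own pool on a shared cluster can be sketched with a simple max-min fair split of task slots. This is a simplified illustration of fair sharing in general, not the Hadoop scheduler's actual algorithm; the pool names, slot counts, and the function `fair_shares` are assumptions made for the example.

```python
from __future__ import annotations


def fair_shares(total_slots: float, demands: dict[str, float]) -> dict[str, float]:
    """Split total_slots among pools using max-min fairness.

    Pools that want less than an equal cut get their full demand;
    the leftover capacity is divided equally among the remaining pools.
    """
    shares = {pool: 0.0 for pool in demands}
    remaining = float(total_slots)
    unsatisfied = dict(demands)
    while unsatisfied and remaining > 1e-9:
        equal_cut = remaining / len(unsatisfied)
        satisfied = {p: d for p, d in unsatisfied.items() if d <= equal_cut}
        if not satisfied:
            # Every remaining pool wants more than an equal cut: split evenly.
            for pool in unsatisfied:
                shares[pool] = equal_cut
            break
        for pool, demand in satisfied.items():
            shares[pool] = demand
            remaining -= demand
            del unsatisfied[pool]
    return shares


# Hypothetical pools: a light ad hoc pool and two heavier batch pools.
print(fair_shares(100, {"adhoc": 20, "batch": 70, "reports": 50}))
# {'adhoc': 20, 'batch': 40.0, 'reports': 40.0}
```

In this example the ad hoc pool gets everything it asks for, so its queries are not starved by the larger batch pools, while the two batch pools split what is left evenly.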