Cloud & Big Data

Course description

Data volumes are ever growing, across a broad spectrum of applications ranging from traditional database applications and scientific simulations to emerging applications such as Web 2.0 and online social networks. To cope with this growth of Big Data, we have recently witnessed a paradigm shift in the way data is processed, through the MapReduce model. First promoted by Google, MapReduce has become, thanks to the popularity of its open-source implementation Hadoop, the de facto programming paradigm for Big Data processing in large-scale data centers and clouds.

Course objectives

The goal of this course is to serve as a first step towards exploring the data analytics models and technologies used to handle Big Data, such as MapReduce (and what comes after it), Hadoop, Spark, and Flink. We will first present an overview of Big Data, including definitions, the sources of Big Data, and the main challenges it introduces. We will then present MapReduce as an important programming model for Big Data processing in the Cloud, discuss the Hadoop ecosystem and some of its major features, and review several approaches and methods used to optimize the performance of Hadoop in the Cloud. Finally, we will discuss the limitations of Hadoop and introduce newer Big Data systems, including Spark.

Several hands-on sessions will be provided to study the operation of Hadoop and post-Hadoop systems (Spark), along with the implementation of MapReduce applications.


Course Schedule and Resources (a work in progress)

Lectures

  • [Jan 9th] Introduction to Big Data [pdf]
    The goal of this lecture is to provide an overview of Big Data, including definitions, the sources of Big Data, and the main challenges it introduces.
  • [Jan 9th, Jan 23rd] MapReduce System: Google File System (GFS) and MapReduce programming model [pdf]
    The goal of these two lectures is to introduce the main design goals and features of the Google File System (GFS) and the Google MapReduce programming model. For both systems we will cover issues related to fault tolerance, data access, and so on. A short Python sketch of the MapReduce model is given at the end of this list.
  • [Feb 27th, March 1st] Hadoop Ecosystem [pdf]
    We will dedicate two lectures to introduce Hadoop, the most widely used open-source implementation of MapReduce. We will discover the Hadoop ecosystem and discuss several optimizations in Hadoop, including straggler mitigation and job scheduling.
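
    As a first taste of the programming model covered in these lectures, here is a minimal, framework-free Python sketch of a MapReduce-style word count. It only illustrates the map, shuffle, and reduce phases; the function names are ours, not part of any Hadoop API.

    from collections import defaultdict

    def map_phase(record):
        # map: emit (key, value) pairs -- here, (word, 1) for every word
        for word in record.split():
            yield (word, 1)

    def shuffle(pairs):
        # shuffle: group all values by key, as the framework does between phases
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups.items()

    def reduce_phase(key, values):
        # reduce: aggregate all values observed for one key
        return (key, sum(values))

    records = ["the quick brown fox", "the lazy dog jumps over the fox"]
    pairs = (pair for record in records for pair in map_phase(record))
    print(sorted(reduce_phase(k, vs) for k, vs in shuffle(pairs)))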

Practical Sessions

  • [Jan 26th] Getting started with Hadoop [pdf] [Presentation]
    The goal of this practical session is to study the implementation and operation of the Hadoop platform. We will see how to deploy the platform and how to write data to and retrieve data from HDFS. Finally, we will run simple examples using the MapReduce paradigm. A small Python sketch of the HDFS interaction is given below, after this item.
    You can download Hadoop HERE and TP-resources HERE
    • Hadoop Log files and GUI [PDF]
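
    A minimal sketch of driving HDFS from Python, assuming the hadoop command is on your PATH and HDFS is already running; the directory and file names are placeholders:

    import subprocess

    def hdfs(*args):
        # run "hadoop fs <args>" and fail loudly on errors
        subprocess.run(["hadoop", "fs"] + list(args), check=True)

    hdfs("-mkdir", "/user/student/input")                   # create a directory in HDFS
    hdfs("-put", "local-file.txt", "/user/student/input")   # upload a local file
    hdfs("-ls", "/user/student/input")                      # check that it arrived
    hdfs("-get", "/user/student/input/local-file.txt", "copy.txt")  # retrieve it
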
  • [Feb 9th] Deploy Hadoop on Grid'5000, Utilize and configure HDFS
    • Deploy Hadoop on One Node
      • You can find the Grid'5000 image and the environment file in /sibrahim/public/
      • Check if you have Java installed and install it if needed.
      • Copy the TP-resources to your deployed environment.
    • Remember the following commands to reserve and deploy your environment:
      • oarsub -t deploy -l nodes=4,walltime=1:00 -I
      • kadeploy3 -a /home/yourUserName/public/MyImage.env -f $OAR_FILE_NODES -k .ssh/id_rsa.pub
    • Configure your HDFS [PDF]
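
    As a sketch of what this configuration step involves, the snippet below writes the two minimal files from Python. The property names are the Hadoop 1.x ones (see the r1.0.4 manual linked further down); the namenode address and replication value are placeholders, and the script assumes it runs from the Hadoop directory so that conf/ exists.

    CORE_SITE = """<?xml version="1.0"?>
    <configuration>
      <property>
        <name>fs.default.name</name>      <!-- address of the namenode -->
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>
    """

    HDFS_SITE = """<?xml version="1.0"?>
    <configuration>
      <property>
        <name>dfs.replication</name>      <!-- 1 replica on a single node -->
        <value>1</value>
      </property>
    </configuration>
    """

    for path, body in [("conf/core-site.xml", CORE_SITE),
                       ("conf/hdfs-site.xml", HDFS_SITE)]:
        with open(path, "w") as f:
            f.write(body)
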
  • [March 2nd] Deploy Hadoop on Grid'5000 (multiple nodes), Utilize and configure Hadoop
    • Deploy Hadoop on N nodes (4-5 nodes) - replication factor = 3
    • Use elinks (a text-mode browser) to access the Hadoop web GUI on Grid'5000
    • To run Grep: bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
    • Randomwriter: bin/hadoop jar hadoop-examples-*.jar randomwriter -Dtest.randomwrite.total_bytes=<value> output
    • Sort: bin/hadoop jar hadoop-examples-*.jar sort input output (a Python sketch that runs and times these three jobs is given at the end of this item)
    • Configure and optimize the Hadoop cluster [PDF]
    • Running multiple MapReduce applications (job scheduling) [PDF]
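
    A minimal sketch that runs the three example jobs above and reports their wall-clock times, assuming it is run from the Hadoop installation directory with HDFS started and an input directory already uploaded:

    import glob
    import subprocess
    import time

    # the examples jar shipped with the Hadoop distribution
    jar = glob.glob("hadoop-examples-*.jar")[0]

    def run_job(*args):
        # submit one example job and measure its wall-clock time
        start = time.time()
        subprocess.run(["bin/hadoop", "jar", jar] + list(args), check=True)
        print("%s took %.1f s" % (args[0], time.time() - start))

    run_job("grep", "input", "grep-out", "dfs[a-z.]+")
    run_job("randomwriter", "-Dtest.randomwrite.total_bytes=1073741824",  # ~1 GB
            "rw-out")
    run_job("sort", "rw-out", "sort-out")
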
  • [March 13th] Utilize and configure Hadoop: Practical Evaluation [PDF]
  • [March 23rd] Developing MapReduce applications [PDF] [Source]
    • Try to do this exercise on your own machine
  • [March 30th] Writing a Hadoop MapReduce Program in Python [PDF]
    • mapper.py

    #!/usr/bin/env python
    """A more advanced Mapper, using Python iterators and generators."""

    import sys

    def read_input(file):
        for line in file:
            # split the line into words
            yield line.split()

    def main(separator='\t'):
        # input comes from STDIN (standard input)
        data = read_input(sys.stdin)
        for words in data:
            # write the results to STDOUT (standard output);
            # what we output here will be the input for the
            # Reduce step, i.e. the input for reducer.py
            #
            # tab-delimited; the trivial word count is 1
            for word in words:
                print('%s%s%d' % (word, separator, 1))

    if __name__ == "__main__":
        main()

    • reducer.py

    #!/usr/bin/env python
    """A more advanced Reducer, using Python iterators and generators."""

    from itertools import groupby
    from operator import itemgetter
    import sys

    def read_mapper_output(file, separator='\t'):
        for line in file:
            yield line.rstrip().split(separator, 1)

    def main(separator='\t'):
        # input comes from STDIN (standard input)
        data = read_mapper_output(sys.stdin, separator=separator)
        # groupby groups multiple word-count pairs by word,
        # and creates an iterator that returns consecutive keys and their group:
        #   current_word - string containing a word (the key)
        #   group - iterator yielding all ["<current_word>", "<count>"] items
        for current_word, group in groupby(data, itemgetter(0)):
            try:
                total_count = sum(int(count) for current_word, count in group)
                print("%s%s%d" % (current_word, separator, total_count))
            except ValueError:
                # count was not a number, so silently discard this item
                pass

    if __name__ == "__main__":
        main()
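
    Before submitting the two scripts to Hadoop, you can sanity-check them locally by reproducing the streaming pipeline (mapper, sort, reducer) in Python; this sketch assumes mapper.py and reducer.py are in the current directory:

    import subprocess

    text = b"foo foo quux labs foo bar quux\n"

    # run the mapper on the sample text
    mapped = subprocess.run(["python", "mapper.py"], input=text,
                            stdout=subprocess.PIPE, check=True).stdout
    # Hadoop sorts the map output between the two phases; emulate that here
    shuffled = b"".join(sorted(mapped.splitlines(True)))
    # run the reducer on the sorted pairs
    reduced = subprocess.run(["python", "reducer.py"], input=shuffled,
                             stdout=subprocess.PIPE, check=True).stdout
    print(reduced.decode())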

  • [March 30th] Getting started with Spark [PDF] [Spark-distribution] [Source]
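
    For contrast with the Hadoop Streaming version above, here is a minimal word count in Spark's Python API (PySpark); the input and output paths are placeholders:

    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")
    counts = (sc.textFile("input")                    # one record per line
                .flatMap(lambda line: line.split())   # map: emit the words
                .map(lambda word: (word, 1))          # pair each word with 1
                .reduceByKey(lambda a, b: a + b))     # reduce: sum per word
    counts.saveAsTextFile("output")
    sc.stop()
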
  • [March 30th] Getting started with Yarn [PDF] [Yarn-distribution]
  • Optional: write your own script to deploy Hadoop on Grid'5000 (a possible skeleton is sketched below)
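
    A possible skeleton for such a script, assuming a Hadoop 1.x installation under /opt/hadoop on the deployed image; the ssh user, paths, and configuration layout are placeholders to adapt:

    import os
    import subprocess

    # the node file written by oarsub (see the reservation commands above)
    with open(os.environ["OAR_FILE_NODES"]) as f:
        nodes = sorted(set(line.strip() for line in f))
    master = nodes[0]

    def ssh(host, command):
        subprocess.run(["ssh", "root@" + host, command], check=True)

    # push the same configuration directory to every node
    for host in nodes:
        subprocess.run(["scp", "-r", "conf", "root@%s:/opt/hadoop/" % host],
                       check=True)

    # format HDFS once on the master, then start the whole cluster
    ssh(master, "/opt/hadoop/bin/hadoop namenode -format")
    ssh(master, "/opt/hadoop/bin/start-all.sh")
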
  • For more Hadoop commands, see: https://hadoop.apache.org/docs/r1.0.4/commands_manual.html
  • Remember the Pad:
      • https://pad.inria.fr/p/g.9PS7LipkkreH3HKB$rkUhko7Eko3qADLc ("Cbd-ens-")

Note: If you are using Windows (or prefer to use a VM on your Linux system), you can download a ready-to-use virtual machine from the following link:

https://filesender.renater.fr/?s=download&token=6660cc5c-3ace-0cf5-20d6-d7f7c2f10574 (This Link is valid from 25/01/2017 to 15/02/2017).

Download the file (Ubuntu-without.zip), unzip it, and use VMware Player on Windows or VMware Workstation on Linux to run the VM.

Note 2: If you are deploying Hadoop on a Mac, you may need to add the following line to your hadoop-env.sh file:

export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"


Reading

  • The Google File System [PDF]
  • MapReduce: Simplified Data Processing on Large Clusters [PDF]
  • Apache Hadoop YARN: Yet Another Resource Negotiator [PDF]
  • Discretized Streams: Fault-Tolerant Streaming Computation at Scale [PDF]
  • Hadoop’s Adolescence [PDF]

Presentation Topics

  • Fault-Tolerance in MapReduce
    • Florin Dinu and T.S. Eugene Ng. 2012. Understanding the effects and implications of compute node related failures in hadoop. In Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing (HPDC '12).
    • Steven Y. Ko, Imranul Hoque, Brian Cho, and Indranil Gupta. 2010. Making cloud intermediate data fault-tolerant. In Proceedings of the 1st ACM symposium on Cloud computing (SoCC '10).
      • [Short Version] Steven Y. Ko, Imranul Hoque, Brian Cho, and Indranil Gupta. 2009. On availability of intermediate data in cloud computations. In Proceedings of the 12th conference on Hot topics in operating systems (HotOS'09).
    • J. A. Quiané-Ruiz, C. Pinkel, J. Schad and J. Dittrich. 2011. RAFTing MapReduce: Fast recovery on the RAFT. In Proceedings of 2011 IEEE 27th International Conference on Data Engineering (ICDE 2011).
    • Orcun Yildiz, Shadi Ibrahim, Gabriel Antoniu. 2016. Enabling fast failure recovery in shared Hadoop clusters: Towards failure-aware scheduling. Future Generation Computer Systems (FGCS).
  • Data skew in MapReduce
    • Yongchul Kwon and Magdalena Balazinska and Bill Howe and Jerome Rolia. 2011. A study of skew in mapreduce applications. In the 5th Open Cirrus Summit.
    • Shadi Ibrahim, Hai Jin, Lu Lu, Bingsheng He, Gabriel Antoniu, et al. 2013. Handling Partitioning Skew in MapReduce using LEEN. Peer-to-Peer Networking and Applications (PPNA), Springer.
    • YongChul Kwon, Magdalena Balazinska, Bill Howe, and Jerome Rolia. 2012. SkewTune: mitigating skew in mapreduce applications. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD '12).
    • Benjamin Gufler, Nikolaus Augsten, Angelika Reiser, Alfons Kemper. 2011. Handling Data Skew in MapReduce. In CLOSER 2011.
  • Energy-Efficiency in MapReduce
    • Jacob Leverich and Christos Kozyrakis. 2010. On the energy (in)efficiency of Hadoop clusters. SIGOPS Oper. Syst. Rev. (March 2010).
    • Willis Lang and Jignesh M. Patel. 2010. Energy management for MapReduce clusters. Proc. VLDB Endow. (September 2010).
    • Rini T. Kaushik, Milind Bhandarkar, and Klara Nahrstedt. 2010. Evaluation and Analysis of GreenHDFS: A Self-Adaptive, Energy-Conserving Variant of the Hadoop Distributed File System. In Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science (CLOUDCOM '10).
      • [Short Version] Rini T. Kaushik and Milind Bhandarkar. 2010. GreenHDFS: towards an energy-conserving, storage-efficient, hybrid Hadoop compute cluster. In Proceedings of the 2010 international conference on Power aware computing and systems (HotPower'10).
    • Íñigo Goiri, Kien Le, Thu D. Nguyen, Jordi Guitart, Jordi Torres, and Ricardo Bianchini. 2012. GreenHadoop: leveraging green energy in data-processing frameworks. In Proceedings of the 7th ACM european conference on Computer Systems (EuroSys '12).
  • All the papers are available at the following link: https://filesender.renater.fr/?s=download&token=00f3d76c-af30-8962-e135-1a939caaa767
    • This Link is valid from 15/02/2017 to 18/02/2017.

Grading

The final course mark will be based on:

  • Presentations
  • Final practical exam.