Has Spark made MapReduce obsolete?

Qudosoft goes Big Data Part 2

"Qudosoft goes Big Data" is an ongoing series of blog posts in which we describe our experience of our first steps and best practices that have been identified.

In this blog post we present the various technologies we have used so far to implement our projects and share our experiences with them.

Hadoop

As described in the first blog post, we opted for the Cloudera cluster software, which comes with many technologies pre-installed and preconfigured. These are, in particular, Hadoop and other technologies built on top of Hadoop, which is why one usually speaks of the "Hadoop ecosystem". Hadoop itself is a framework for distributed systems and consists of a distributed file system called HDFS (Hadoop Distributed File System), which is based on the Google File System published by Google, and a computation platform called MapReduce, which is likewise based on a technology published by Google. MapReduce enables distributed computing in the form of batch processing for large amounts of data. The processing is split into two phases, which are worked off in many small steps. This forced division into two phases seems unnecessarily complicated at first, but offers a decisive advantage: each of the phases can be fully parallelized and thus distributed over several computing nodes, which drastically shortens the computing time.

The first phase is the "map phase", which prepares the data; the second, the "reduce phase", then aggregates it. Larger data analyses consist of several MapReduce passes. Counting the words in a large file is the standard example for introducing MapReduce: in the map phase every word is emitted individually; in the reduce phase all occurrences of the same word are passed to one instance, which sums them up, so that the result is a table of words and their counts. A visualization of the workflow can be seen in the following illustration.

 

One can imagine that MapReduce requires a certain mindset in order to carry out various analyses on the data.
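To make this a little more concrete, the following is a minimal sketch of such a word count as a classic MapReduce job in Java. Class names are freely chosen and error handling is reduced to the bare essentials:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit every single word with the count 1
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reduce phase: all counts for one word arrive at the same reducer and are summed up
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable count : counts) {
            sum += count.get();
          }
          context.write(word, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Even this toy example already shows how much boilerplate a trivial analysis requires.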

The HDFS distributed file system is at the heart of Hadoop, and many technologies build on it. A large file or set of files is stored redundantly and distributed across several machines. Imagine that a very large file is broken up into many small pieces, which can then be handled independently of each other. These fragments are not all stored on a single node but distributed across the cluster in such a way that the entire file can be reconstructed even if one node fails. An example of the fragmentation of a file and its distributed storage can be found in the following figure.


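For the sake of completeness, here is a small sketch of how HDFS can be accessed from a program via the Hadoop Java API. The host name and paths are made up; block placement and replication are handled transparently by HDFS according to the cluster configuration:

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "namenode" is a placeholder for the NameNode of the cluster
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Write a file: HDFS transparently splits it into blocks
        // and distributes the replicas across the DataNodes
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file)) {
          out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // List the directory and show the replication factor of each file
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
          System.out.println(status.getPath() + " replication=" + status.getReplication());
        }
        fs.close();
      }
    }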
Abstraction from MapReduce

As already mentioned, larger tasks require a certain way of thinking in order to express them in MapReduce. The ability to chain multiple MapReduce jobs makes such larger tasks manageable, but you can also imagine how difficult it can become to track down the cause of an error. Many jobs boil down to very simple queries that merely provide a first insight into the data; think of very simple SQL queries, such as counting the number of records that match a certain criterion. Writing a MapReduce job for this is time-consuming:

The job has to be

  • written in a supported programming language,
  • compiled into an executable file,
  • uploaded to the cluster,
  • executed, and
  • its result has to be saved
    • and, in the worst case, downloaded locally in order to view it.

Facebook and Yahoo each gave this some thought and came up with their own abstraction so that MapReduce jobs no longer have to be written by hand.

Pig

Pig is a framework from Yahoo that was developed in the same period as Hive. Scripts are written in the scripting language Pig Latin; Pig translates these scripts into MapReduce jobs and runs them in the background. We actually haven't used Pig yet, because most of our simple queries can be expressed in Hive.

Hive

Hive is a framework from Facebook that allows a meta structure to be defined on top of data, which can then be queried using its own SQL dialect (HiveQL). A schema is laid over the corresponding data, which makes it possible to query the data as in a real database. In the background, the SQL query is translated into one or more optimized MapReduce jobs and executed. Hive is ideal for many simple queries or for aggregating data.
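To illustrate this, the following sketch shows how such a table definition and a simple aggregation could look from Java via the HiveServer2 JDBC interface. Host, table and column names are made up; in practice we usually type such queries directly into a Hive shell:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
      public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; "hiveserver" is a placeholder host name
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver:10000/default", "", "");
             Statement stmt = con.createStatement()) {

          // Lay a schema over files that already exist in HDFS
          stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS clicks ("
              + " user_id STRING, url STRING, ts BIGINT)"
              + " ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'"
              + " LOCATION '/user/demo/clicks'");

          // A simple aggregation; Hive translates it into MapReduce jobs in the background
          try (ResultSet rs = stmt.executeQuery(
                   "SELECT url, COUNT(*) AS cnt FROM clicks GROUP BY url")) {
            while (rs.next()) {
              System.out.println(rs.getString("url") + "\t" + rs.getLong("cnt"));
            }
          }
        }
      }
    }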

Hive is not intended for executing more complex logic within queries. We mainly use Hive, and a follow-up technology called Impala, to explore the data. Hive can also be extended with your own functions, which can then be used in queries. These functions are called UDFs (user-defined functions).
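Such a UDF is, by the way, also written in Java. The following is a minimal sketch based on the classic UDF interface; the class name, the JAR path and the function name in the registration are chosen freely:

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // A trivial UDF that normalizes a string to lower case.
    // After packaging it into a JAR it can be registered and used in HiveQL:
    //   ADD JAR /path/to/udfs.jar;
    //   CREATE TEMPORARY FUNCTION normalize AS 'LowerCaseUdf';
    //   SELECT normalize(url) FROM clicks;
    public class LowerCaseUdf extends UDF {
      public Text evaluate(Text input) {
        if (input == null) {
          return null;
        }
        return new Text(input.toString().toLowerCase());
      }
    }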

Hive has the advantage that you don't have to familiarize yourself with a new language but can start directly in SQL. In addition, other technologies (Impala and Spark) use the Hive Metastore. We did reach Hive's limits, though: execution happens as MapReduce jobs, and these take noticeably longer than, for example, Impala or Spark. When we had to write a job involving logic that was difficult to express in SQL, we chose Spark.

Further development of Hadoop

YARN was introduced in the second version of Hadoop. YARN itself handles resource management in a cluster system, and it lets you decide which resources should be reserved and how a computation should be carried out (scheduling, node management, etc.). Until then, Hadoop - as already described above - consisted only of HDFS and a single computation platform, MapReduce, with which more and more developers were dissatisfied. YARN made it possible to plug in your own computing engines. As a developer you can also write your application directly against the YARN API, but the larger goal was to give framework developers the opportunity to build other computing engines besides MapReduce. The following graph shows the change in the Hadoop stack.

P.S. Tez is another computing engine, which takes care of translating work into YARN jobs. However, it is optional and not a standard part of the Hadoop ecosystem in the second version.
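To give a rough impression of the YARN API: the following sketch asks the ResourceManager for the active nodes and the applications currently known to YARN. This only scratches the surface; submitting your own application via the API requires considerably more boilerplate:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnInfo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Which nodes does the cluster currently consist of?
        for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
          System.out.println(node.getNodeId() + " capacity=" + node.getCapability());
        }

        // Which applications (MapReduce, Spark, ...) does YARN currently know about?
        for (ApplicationReport app : yarnClient.getApplications()) {
          System.out.println(app.getApplicationId() + " " + app.getName()
              + " (" + app.getYarnApplicationState() + ")");
        }
        yarnClient.stop();
      }
    }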

As an alternative to YARN (or in addition to it) you can also use Mesos. YARN assigns fixed resources to a job and does not release them until the job is completed. This poses major problems for streaming jobs and most real-time applications on Hadoop under YARN, because they are supposed to run permanently and therefore block resources continuously. As I understand it, Mesos goes in a slightly different direction: it only claims resources when they are needed and releases them again as soon as a task within a job is completed. For the next task it asks again whether enough resources are available.

Impala

Impala is a native analytics database for Hadoop developed by Cloudera. It is based on a technology introduced by Google called Dremel. It uses the same metastore database as Hive. Instead of executing queries as MapReduce jobs, Impala brings its own computing engine, specialized for distributed systems, which according to Cloudera comes close to commercial parallel database systems. We can personally attest to the speed boost over Hive.

We primarily use Impala in parallel with Hive to explore and aggregate data. Impala does not yet offer the wealth of functionality that Hive has (e.g. JSON generation or window functions), but Cloudera promises to catch up.

Spark

Spark is the new star in the Big Data sky and by now offers so much functionality that we want to cover Spark in a separate blog post. At this point we can already reveal that our larger and more complex jobs are written in Spark.
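As a small foretaste: the following sketch shows how compact a job can become with Spark's Java API compared to a hand-written MapReduce program. The file path and the filter criterion are made up:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkTeaser {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("error count");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read a file from HDFS, filter it and count the matching lines;
        // the equivalent in MapReduce would be a complete job like the word count above.
        JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/logs/app.log");
        long errors = lines.filter(line -> line.contains("ERROR")).count();
        System.out.println("Number of error lines: " + errors);

        sc.stop();
      }
    }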