Science Hadoop Real World Solutions Cookbook Pdf


Wednesday, October 9, 2019

Hadoop MapReduce Cookbook helps readers learn to process large and complex datasets in a straightforward manner, with step-by-step instructions and real-world examples. Hadoop Real-World Solutions Cookbook - Second Edition is also listed on ResearchGate.

Language:English, Spanish, Hindi
Published (Last):03.06.2016
ePub File Size:30.67 MB
PDF File Size:18.27 MB
Distribution:Free* [*Registration Required]
Uploaded by: KARY

Hadoop Real-World Solutions Cookbook - Second Edition is aimed at developers. Big data is the current requirement: most organizations produce huge amounts of data every day.

Hadoop is a free, Java-based programming framework that enables the processing of large data sets in a distributed computing environment. It is part of the Apache open source project sponsored by the Apache Software Foundation. So we bring you the top 10 eBooks available on Hadoop that will help you get your concepts clear. This book is a concise guide to getting started with Hadoop and getting the most out of your Hadoop clusters.

Through his innovative thinking and dynamic leadership, he has successfully completed various projects. He regularly blogs on his website http: You can connect with him on LinkedIn at https:


Over 90 hands-on recipes to help you learn and master the intricacies of Apache Hadoop 2.

Tanmay Deshpande, March

Book Description

Big data is the current requirement.

Table of Contents

Chapter 1: Getting Started with Hadoop 2.
  Executing the balancer command for uniform data distribution.
  Entering and exiting from the safe mode in a Hadoop cluster.

Chapter 2: Exploring HDFS.
  Changing the replication factor of an existing file in HDFS.
  Setting the HDFS block size for all the files in a cluster.
  Setting the HDFS block size for a specific file in a cluster.

Chapter 3: Mastering Map Reduce Programs.
  Writing the Map Reduce program in Java to analyze web log data.
  Implementing a user-defined counter in a Map Reduce program.
  Map Reduce program to partition data using a custom partitioner.

Chapter 4:
  Storing and processing Hive data in a sequential file format.
  Storing and processing Hive data in the ORC file format.
  Storing and processing Hive data in the Parquet file format.
  Executing the MapReduce programming with an HBase table.

Chapter 5: Advanced Data Analysis Using Hive.

Chapter 6:

Chapter 7: Automation of Hadoop Tasks Using Oozie.

Chapter 8:
  Creating an item-based recommendation engine using Mahout.
  Creating a user-based recommendation engine using Mahout.

Chapter 9: Integration with Apache Spark.
  Creating Twitter trending topics using Spark Streaming.

Hadoop itself is written in Java, and Java is, of course, a preferred way to write Map Reduce programs, but this does not restrict you to only using Java.

It provides libraries such as Hadoop Streaming and Hadoop Pipes so that you can write map reduce programs in most popular languages.

Writing the Map Reduce program in Java to analyze web log data

In this recipe, we are going to take a look at how to write a map reduce program to analyze web logs. Web logs are data generated by web servers for the requests they receive. There are various web servers, such as Apache, Nginx, Tomcat, and so on, and each one logs data in a specific format. In this recipe, we are going to use data from the Apache Web Server, which is in the combined access log format. To read more on combined access logs, refer to http:

Getting ready

To perform this recipe, you should already have a running Hadoop cluster as well as an IDE such as Eclipse.

How to do it

We can write map reduce programs to analyze various aspects of web log data. In this recipe, we are going to write a map reduce program that reads a web log file and reports page views and their counts.

The input for our program is sample web log data. These combined Apache access logs are in a specific format, in which each entry is a fixed sequence of components. One of them is the identity of the user as determined by an identifier; this is not usually used since it's not reliable.

Now, let's start writing a program to get the page views of each unique URL that we have in our web logs. First, we will write a mapper class where we read each line and parse it to extract the page URL. Here, we will use Java's pattern-matching utility to extract information.

In the mapper class, we read key-value pairs from the text file. By default, the key is the byte offset of the line within the file, and the value is the actual line of text. Next, we match the line against the Apache access log regex pattern so that we can extract the exact information we need. For a page view counter, we only need the URL. The mapper outputs the URL as the key and 1 as the value, so we can count these URLs in the reducer.

The reducer class sums up the output values of the mapper class. Finally, we need a driver class to call the mapper and the reducer. As the operation we are performing is an aggregation, we can also use a combiner here to optimize the job; the same reducer logic is used for the combiner.

To compile your program properly, you need to add two external JARs, hadoop-common2.

Make sure you add these two JARs to your build path so that your program can be compiled easily.
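The parsing step that the mapper performs can be sketched in plain Java, without any Hadoop dependency. This is only an illustration under assumptions: the class name AccessLogUrlExtractor, the exact regular expression, and the sample line are not taken from the book, and a production pattern may need to handle more edge cases (for example, escaped quotes in the user agent).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the requested URL from one combined access log line.
class AccessLogUrlExtractor {
    // Assumed layout: host ident user [time] "method url protocol" status bytes "referer" "agent"
    private static final Pattern COMBINED_LOG = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+) ?(\\S*)\" "
        + "(\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    // Returns the URL (group 6), or empty when the line does not match the pattern.
    static Optional<String> extractUrl(String line) {
        Matcher m = COMBINED_LOG.matcher(line);
        return m.matches() ? Optional.of(m.group(6)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Illustrative sample line, not data from the book.
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326 "
            + "\"http://www.example.com/start.html\" \"Mozilla/4.08\"";
        System.out.println(extractUrl(line).orElse("<no match>"));
    }
}
```

In a real mapper, a matching line would lead to emitting the URL as the key with 1 as the value, and a non-matching line would simply be skipped.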


How it works

The page view counter program helps us find the most popular pages, least accessed pages, and so on. Such information helps us make decisions about the ranking of pages, frequency of visits, and the relevance of a page.

When the program is executed, each line of the HDFS block is read individually and sent to the mapper. The mapper matches the input line against the log format, extracts its page URL, and emits key-value pairs of the form (URL, 1). These pairs are partitioned and shuffled across nodes so that all pairs for the same URL go to only one reducer. Once they are received by the reducers, we add up all the values for each key and emit them. This way, we get results in the form of a URL and the number of times it was accessed.
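The flow described above (emit, group by key, sum) can be imitated in memory to make the data movement concrete. The sketch below is not Hadoop code: the class PageViewFlow and its method names are made up for illustration, and a single sorted map stands in for the partitioning and shuffling that the cluster performs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of map -> shuffle -> reduce for page views.
class PageViewFlow {
    // "Map" + "shuffle": every URL contributes a 1, and the ones are grouped by URL.
    static Map<String, List<Integer>> shuffle(List<String> urls) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String url : urls) {
            grouped.computeIfAbsent(url, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // "Reduce": add up all the values collected for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            totals.put(entry.getKey(), sum);
        }
        return totals;
    }

    static Map<String, Integer> countPageViews(List<String> urls) {
        return reduce(shuffle(urls));
    }

    public static void main(String[] args) {
        System.out.println(countPageViews(Arrays.asList("/home", "/about", "/home")));
    }
}
```

Because summing is associative, the same reduce logic can also be applied locally to partial groups first, which is the role the combiner plays in the recipe.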


Executing the Map Reduce program in a Hadoop cluster

In the previous recipe, we took a look at how to write a map reduce program for a page view counter. In this recipe, we will explore how to execute this in a Hadoop cluster.

How to do it

To execute the program, we first need to create a JAR file of it. JAR stands for Java Archive, a file that contains compiled class files. To create a JAR file in Eclipse, right-click on the project where you've written your Map Reduce program, and then click on Export. Browse to the path where you wish to export the JAR file, and provide a proper name for the JAR file. Click on Finish to complete the creation of the JAR file. Now, copy this file to the Hadoop cluster.


If you don't already have your input log files in HDFS, copy them there first. Now, it's time to execute the map reduce program; submitting the JAR starts the Map Reduce execution on your cluster. In the command, logAnalyzer is the name of the JAR file we created through Eclipse. It is also important to provide the fully qualified name of the class, including its package name.

Once the job is submitted, it first creates the Application Client and Application Master in the Hadoop cluster. The application tasks for the mapper are initiated on each node where data blocks are present in the Hadoop cluster. Once the mapper phase is complete, the data is locally reduced by a combiner. Once the combiner finishes, the data is shuffled across the nodes in the cluster.

Reducers cannot start until all the mappers have finished. Output from the reducers is written to HDFS in the specified folder. The output folder must not already exist in HDFS; if it does, the program will give you an error. When all the tasks are finished for the application, you can take a look at the output in HDFS using the file system commands.

Adding support for a new writable data type in Hadoop

In this recipe, we are going to learn how to introduce a new data type in Map Reduce programs and then use it.

Getting ready

To perform this recipe, you should have a running Hadoop cluster as well as an IDE such as Eclipse.

How to do it

Hadoop allows us to add new custom data types, which are made up of one or more primitive data types. In the previous recipe, you must have noticed that when you handled the log data structure, you had to remember the sequence in which each data component was placed. This can get very nasty when it comes to complex programs. To avoid this, we will introduce a new data type that implements WritableComparable.

To add a new data type, we need to implement the WritableComparable interface, which is provided by Hadoop. This interface provides three methods, readFields(DataInput in), write(DataOutput out), and compareTo(T o), which we will need to override with our own custom implementation. Here, we are going to abstract the log parsing and pattern matching logic from the user of this data type by providing a method that returns parsed objects.

Next, we use the data type in our map reduce program by updating the same program that we used in the previous recipe. The mapper code is updated to use our own custom data type, while the reducer and driver code remain as they are. Refer to the previous recipe to know more about these two.

To execute this code, we need to bundle the data type class and the map reduce program into a single JAR so that at runtime the map reduce code is able to find our newly introduced data type.

We know that when we execute map reduce code, a lot of data gets transferred over the network when the shuffle takes place. Sometimes, the size of the keys and values that are transferred can be huge, which might affect network traffic. To avoid congestion, it's very important to send the data in a serialized format.

Hadoop's built-in data types are wrapper classes on top of primitive data types, which can be serialized and deserialized easily. The keys need to be WritableComparable, while the values need to be Writable. In practice, most of the built-in wrapper types are WritableComparable, so they can be used as either keys or values. Apart from the set of built-in data types, Hadoop also supports the introduction of custom data types that are WritableComparable.

This is done so that the handling of complex data becomes easy and serialization and deserialization are taken care of automatically. WritableComparable types are easy to serialize and deserialize, and can be compared in order to decide their order.
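The three methods named above can be sketched without Hadoop on the classpath, because Hadoop's Writable contract is expressed in terms of java.io.DataInput and java.io.DataOutput. The class PageHit, its two fields, and the roundTrip helper below are hypothetical and are not the book's data type; in a real job the class would declare that it implements WritableComparable rather than plain Comparable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical custom type; with Hadoop available it would implement WritableComparable<PageHit>.
class PageHit implements Comparable<PageHit> {
    private String url = "";
    private int status;

    PageHit() { }  // a no-argument constructor lets the framework create an empty instance first

    PageHit(String url, int status) {
        this.url = url;
        this.status = status;
    }

    // Serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeInt(status);
    }

    // Deserialize the fields in exactly the same order as write().
    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        status = in.readInt();
    }

    // Order by URL first, then by status code.
    @Override
    public int compareTo(PageHit other) {
        int byUrl = url.compareTo(other.url);
        return byUrl != 0 ? byUrl : Integer.compare(status, other.status);
    }

    String getUrl() { return url; }
    int getStatus() { return status; }

    // Writes the object to a byte array and reads it back into a fresh instance.
    static PageHit roundTrip(PageHit hit) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            hit.write(new DataOutputStream(bytes));
            PageHit copy = new PageHit();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PageHit copy = roundTrip(new PageHit("/index.html", 200));
        System.out.println(copy.getUrl() + " " + copy.getStatus());
    }
}
```

The important design point is that write and readFields must agree on field order, since the serialized form carries no field names.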

Implementing a user-defined counter in a Map Reduce program

In this recipe, we are going to learn how to add a user-defined counter so that we can keep track of certain events easily. After every map reduce execution, you will see a set of system-defined counters getting published, such as File System counters, Job counters, and Map Reduce Framework counters.

These counters help us understand the execution in detail. They give very detailed information about the number of bytes written to HDFS, read from HDFS, the input given to a map, the output received from a map, and so on. Similarly, we can add our own user-defined counters, which help us track the execution better. In earlier recipes, we considered the use case of log analytics. The input we receive might not always be in the format we expect.

So, it's very important to track such bad records and also avoid any failures because of them. In order to achieve this, in this recipe, we are going to add one custom counter that keeps track of such bad records without failing the task.
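The idea of counting bad records instead of failing on them can be sketched in plain Java. Everything named here is an assumption made for illustration: the enum LogCounters, the whitespace-based column check, and the EnumMap that stands in for the framework's counters. In a Hadoop mapper, the increment would go through the task context, roughly context.getCounter(LogCounters.INVALID_RECORDS).increment(1).

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical stand-in for a mapper that tracks malformed input with a counter.
class BadRecordCounter {
    // Counters are declared as an enum; the names are illustrative.
    enum LogCounters { VALID_RECORDS, INVALID_RECORDS }

    // Classifies each line by its column count and tallies the result without throwing.
    static Map<LogCounters, Long> process(String[] lines, int expectedColumns) {
        Map<LogCounters, Long> counters = new EnumMap<>(LogCounters.class);
        for (LogCounters c : LogCounters.values()) {
            counters.put(c, 0L);
        }
        for (String line : lines) {
            boolean valid = line.trim().split("\\s+").length >= expectedColumns;
            LogCounters which = valid ? LogCounters.VALID_RECORDS : LogCounters.INVALID_RECORDS;
            counters.merge(which, 1L, Long::sum);  // a bad record is counted, then skipped
        }
        return counters;
    }

    public static void main(String[] args) {
        String[] lines = { "a b c", "a b", "x y z w" };
        System.out.println(process(lines, 3));
    }
}
```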

First of all, we have to define the counter as an enum in our program. Next, we update our mapper code to use the defined counter. The reducer code remains as it is, while we update the driver code to print the final count of invalid records. Now, to demonstrate, I've added a few invalid records (records with fewer columns than expected) and added the log file to HDFS.

So, when I execute the program, I can see the invalid record count getting printed at the end of the execution. The job log also includes the warning "Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this." Custom counters are helpful in various situations, such as keeping track of bad records, counting outliers in the form of maximum and minimum values, summations, and so on.

The Hadoop framework imposes an upper limit on the number of these counters. Counters are referred to using their group and counter names. All the counters are managed at the Application Master level. Information about each increment or decrement is passed to the Application Master via heartbeat messages between the containers that run mappers and reducers.

It is better to keep the counters to a limited number, as they cause an overhead on the processing framework.

Map Reduce program to find the top X

In this recipe, we are going to learn how to write a map reduce program to find the top X records from a given set of values. A lot of the time, we might need to find the top X values from a given set of values. A simple example could be to find the top 10 trending topics from a Twitter dataset.

In this case, we will need to use two map reduce jobs. First of all, find all the words that start with # and the number of times each hashtag has occurred in a given set of data. The first map reduce program is quite simple and is pretty similar to the word count program. But for the second program, we need to use some logic. In this recipe, we'll explore how we can write a map reduce program to find the top X values from the given set.

Now, though, let's try to understand the logic behind this.

As shown in the preceding figure, our logic includes finding the top 10 words from each input split and then sending these records to only one reducer. In the reducer, we will again find the top 10 words to get the final result of the top 10 records. Now, let's understand the execution. First, let's prepare the input. Here, we will use the word count program provided along with the Hadoop binaries. First of all, let's put the data to be processed in HDFS.

The output of the word count program will be used as the input for our top 10 map reduce program. Let's take a look at the mapper code. In the mapper, we use a TreeMap to store the words and their counts.

TreeMap stores keys and values sorted by key. Here, we are using the count as the key and the word as the value. In each mapper iteration, we check whether the size is greater than 10. If it is, we remove the first key from the map, which is the lowest count of the set. This way, at the end of each mapper, we will emit the top 10 words to the reducer.
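The TreeMap trimming logic described above can be tried out in isolation. This is a minimal sketch with a hypothetical class name TopN; it keeps the count as the key and the word as the value, as the text describes, and the same method could serve for both the per-mapper pass and the final single-reducer pass.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper showing the "keep only the N largest keys" use of TreeMap.
class TopN {
    static TreeMap<Integer, String> topN(Map<String, Integer> wordCounts, int n) {
        TreeMap<Integer, String> top = new TreeMap<>();  // sorted by count, ascending
        for (Map.Entry<String, Integer> entry : wordCounts.entrySet()) {
            top.put(entry.getValue(), entry.getKey());
            if (top.size() > n) {
                top.remove(top.firstKey());  // drop the entry with the lowest count
            }
        }
        return top;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("#hadoop", 5);
        counts.put("#spark", 9);
        counts.put("#hive", 1);
        counts.put("#pig", 3);
        System.out.println(topN(counts, 2));
    }
}
```

One consequence of using the count as the key is that two words with the same count cannot both be kept, since the later one overwrites the earlier one. If ties matter, the map's values would need to hold a list of words rather than a single word.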

You can read more about TreeMap at http:

Now, let's take a look at the reducer code. In the reducer, we again use a TreeMap to find the top 10 of all the records collected from each mapper. Here, it is very important to use only one reducer for the complete processing; hence, we need to set this in the driver class. Now, when you execute the code, you will see the output in the form of the top 10 words by their frequency in the document.

You can modify the same program to get the top 5, 20, or any number.

How it works

Here, the logic is quite straightforward, as shown in the preceding diagram. The trick is using TreeMap, which stores data in sorted key order. It is also important to use only one reducer; if we don't, each reducer will emit its own set of top records, which will not give the correct overall output.

Map Reduce program to find distinct values

In this recipe, we are going to learn how to write a map reduce program to find distinct values from a given set of data.

Getting ready

To perform this recipe, you should have a running Hadoop cluster as well as an IDE such as Eclipse.

How to do it

Sometimes, the data you have may contain some duplicate values.

In SQL, we have the DISTINCT keyword, which helps us get distinct values.


In this recipe, we are going to take a look at how we can get distinct values using map reduce programs.

Let's consider a use case where we have some user data with us, which contains two columns, one of which is the user ID. Let's assume that the data we have contains duplicate records, and for our processing needs, we only need distinct records by user ID. In the sample data, columns are separated by ' '. The idea here is to use the default reducer behavior, where the same keys are sent to one reducer.

In this case, we make userId the key and emit it to the reducer. In the reducer, the same keys are reduced together, which avoids duplicates. Let's look at the mapper code. We only want distinct user IDs; hence, we emit only user IDs as keys and nulls as values. Now, let's look at the reducer code. Here, we only emit user IDs as they come.
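The effect of keying on the user ID can be illustrated with an in-memory sketch. The class DistinctUserIds, the sample records, and the choice of delimiter are all assumptions: the delimiter is passed in as a regular expression because the separator used in the book's sample data is not recoverable here. A sorted set plays the part of the shuffle, which brings identical keys together so that each is written once.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory analogue of "emit userId as key, null as value, write each key once".
class DistinctUserIds {
    static Set<String> distinct(String[] records, String delimiterRegex) {
        Set<String> userIds = new TreeSet<>();
        for (String record : records) {
            String[] columns = record.split(delimiterRegex);
            if (columns.length > 0 && !columns[0].isEmpty()) {
                userIds.add(columns[0]);  // the first column is assumed to be the user ID
            }
        }
        return userIds;
    }

    public static void main(String[] args) {
        // Illustrative records with an assumed '|' separator.
        String[] records = { "1|alice", "2|bob", "1|alice" };
        System.out.println(distinct(records, "\\|"));
    }
}
```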

The reduce step removes duplicates, as the reducer only treats the records by their keys and only one record per key is kept. The driver code remains simple. When we execute the code, the output contains each user ID only once.

When the mapper emits keys and values, the output is shuffled across the nodes in the cluster. Here, the partitioner decides which keys should be reduced on which node. The same partitioning logic is used on all the nodes, which makes sure that the same keys are grouped together. In the preceding code, we use this default behavior to find distinct user IDs.

Map Reduce program to partition data using a custom partitioner

In this recipe, we are going to learn how to write a map reduce program to partition data using a custom partitioner.

Getting ready

To perform this recipe, you should have a running Hadoop cluster as well as an IDE such as Eclipse.

During the shuffle and sort, if it's not specified, Hadoop by default uses a hash partitioner. We can also write our own custom partitioner with custom partitioning logic, so that we can partition the data into separate files.

Let's consider one example where we have user data along with the year of joining. Now, assume that we have to partition the users based on the year of joining that's specified in the record.
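The difference between the default hash partitioning and a year-based rule can be sketched as two plain functions. The class YearPartitioning, the baseYear parameter, and the wrap-around handling of out-of-range years are assumptions made for illustration, not the book's implementation. In Hadoop, such logic would live in a subclass of Partitioner, inside its getPartition(key, value, numPartitions) method.

```java
// Hypothetical partitioning rules expressed as plain functions.
class YearPartitioning {
    // The formula used by Hadoop's default HashPartitioner: clear the sign bit, then take the modulus.
    static int hashPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Assumed custom rule: consecutive joining years map to consecutive partitions.
    // floorMod keeps the result in [0, numReduceTasks) even for years before baseYear.
    static int yearPartition(int yearOfJoining, int baseYear, int numReduceTasks) {
        return Math.floorMod(yearOfJoining - baseYear, numReduceTasks);
    }

    public static void main(String[] args) {
        int reducers = 3;
        for (int year = 2010; year <= 2012; year++) {
            System.out.println(year + " -> partition " + yearPartition(year, 2010, reducers));
        }
    }
}
```

With one reducer per expected year, each output file would hold the users of a single joining year, whereas the hash rule spreads years across reducers with no such guarantee.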
