Cloud Computing: Resource Management and Data Stores in Distributed Systems

ACKNOWLEDGEMENT
Many people have contributed in a variety of ways to my summer training, without which it would not have been possible to complete it. First, I would like to express my sincere gratitude to Almighty God, who granted me the health and strength to complete the internship.
I would like to express my sincere gratitude to my supervisors on the Coursera team for providing their invaluable guidance, comments and suggestions throughout the course.
I am also thankful for the constant guidance, supervision and advice from all the lecturers of the course (taught by Prof. Umakishore Ramchandran), which helped me to successfully complete my internship.
I would also like to express my thanks to the rest of the Coursera team for their support and guidance, which helped me overcome the challenges I faced during the past six weeks of my internship, and for designing such fruitful content and such a smooth, user-friendly interface, making training possible in the tough times of Covid-19.
Last but not least, I would like to thank NIT Raipur for allowing me to join this training, and my parents for supporting me economically and emotionally.
ABSTRACT
This course provides an introduction to programming frameworks and their implementation issues in the Cloud. It covers multiple topics, including scalable distributed data stores, resource management (for supporting multi-tenancy and elasticity) and virtualization techniques. Guidance in the implementation of a basic version of the distributed runtime system for the Map-Reduce programming framework is also provided. We also learnt about the concept of virtualization and the associated mechanisms of virtualization in the Cloud. We further discussed storage systems for the Cloud, including key-value stores, and then focused on resource management issues, including automated provisioning, load balancing and scheduling. Finally, we discussed scalability, performance characterization, and benchmarking.

OBJECTIVES

  ● Explain and compare different popular programming frameworks for developing cloud applications, with a detailed description of the architecture of some of them.
  ● Describe the architecture of the various distributed data stores from the leading Cloud providers.
  ● Explain the semantics of the different data stores and their applicability for developing Cloud applications.
  ● Describe the architecture of resource management systems in the Cloud, and distinguish between different approaches for resource management in the Cloud.
  ● Explain how the resource management systems interact with the frameworks.
  ● Describe the mechanisms for virtualizing the different subsystems.

INDEX

PARTICULARS                  REMARKS
Title                        -
Approval Certificate         -
Certificate of Training      -
Acknowledgement              -
Abstract                     -
Index                        -
Learning Objectives:
  Parallel programs          Week-
  Programming frameworks     Week-
  Storage systems            Week-
  Resource management        Week-
  Virtual technologies       Week-
  Device virtualization      Week-
Conclusion                   -
Bibliography                 -

The runtime system worries about the distribution and scheduling of the computation, respecting the data dependencies that have been put forth by the developer. Automatic parallelization, taking a sequential program and trying to parallelize it automatically, is not practical. Therefore, the developer-supplied application component is made the unit of scheduling and distribution, and multiple copies of that unit run in parallel on many different data sets. This is how programming models are developed to run on large-scale data centers. Failure is a fundamental feature of data center applications, and failures can come in various ways: of the computational elements, of the networking fabric, and so on. Despite all of these failures, a deterministic computation must be delivered by the applications developed by domain experts. That sets the stage. Some of the proposed programming frameworks are Map-Reduce, Dryad, Spark, Pig Latin, Hive and Apache Tez.

MAP REDUCE FRAMEWORK:

In the MapReduce programming model proposed by Google, the input and output to each of the two functions, map and reduce, appear as key-value pairs. There are several instances of the map function, which is written by the programmer, and they work on different data sets. The map function processes a key-value pair and returns key-value pairs as output; the key is a unique name that is generated, and the value is a number. The interaction between every component is in terms of key-value pairs; it is just that the input key-value pair is different from the output key-value pair coming out of the map function. The key-value pair generated as the output of the reduce function has the same form as the output of the map function. That is essentially what the MapReduce programming framework asks the programmer to do. Several processing steps in giant-scale services can be expressed as MapReduce functions, so domain experts are asked to write just two functions: a map function and a reduce function. They know exactly what the semantics of those functions are: the input to the map function is a key-value pair, and the output is a key-value pair which serves as the intermediate result for the reduce function to work on. All the details of scheduling and plumbing, instantiating the number of mappers and reducers, and effecting the data movement between the mappers and the reducers, are done by the runtime.

SHARDING:

Sharding is a term that has come up in data center applications. Sharding the data means taking a data set and breaking it into slices that can be given to different mappers; that is the responsibility of the system. A distributed file system is used for communication between the mappers and reducers. There may be a default policy implemented in the system on how to shard the data, the system may decide based on the number of mappers, or the developer can provide the necessary details.
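As a concrete illustration of the sharding idea, the following is a toy Java sketch (not Hadoop's actual InputFormat logic) that splits an input of a given size into roughly equal byte ranges, one per mapper:

import java.util.ArrayList;
import java.util.List;

public class Sharder {
    // Returns [start, end) byte ranges, one shard per mapper.
    public static List<long[]> shard(long totalBytes, int numMappers) {
        List<long[]> shards = new ArrayList<>();
        long shardSize = (totalBytes + numMappers - 1) / numMappers; // ceiling division
        for (long start = 0; start < totalBytes; start += shardSize) {
            long end = Math.min(start + shardSize, totalBytes);
            shards.add(new long[] {start, end});
        }
        return shards;
    }

    public static void main(String[] args) {
        // Split a 1 MB input among 3 mappers.
        for (long[] s : shard(1_000_000L, 3)) {
            System.out.println(s[0] + " .. " + s[1]);
        }
    }
}

A real system would additionally align shard boundaries with record boundaries so that no key-value record is split across two mappers.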

HEAVY-LIFTING BY MAP REDUCE RUNTIME:

The heavy lifting is done by the runtime. The master manages a pool of worker threads and assigns mappers to the worker threads. Each mapper works on a particular shard of the data, which the system has decided how to split. The mappers generate intermediate results and write them into intermediate files on local disks. The results are written to local disk for the following reason: it is possible that one of the nodes fails, and if it fails, that particular computation has to be regenerated on some other mapper; hence, the output remains immutable, and the new mapper generates its own result and writes it elsewhere. Finally, once everything has been written, the master is notified that the computation has completed successfully. The master traces the progress of the work and keeps a record of the local disks on which results are stored. Then there are a number of workers that perform the reduce function. They do RPCs on the storage servers to fetch the intermediate results they have to work on to produce the aggregated result; that is the purpose of the reducers. So there is a map phase and a reduce phase, and the output of the reduce phase is the final result that is written out. In terms of management, the master's functions are to spawn workers, assign mappers, assign reducers, and plumb the mappers to the reducers. The master also has to manage the machines depending on the number of available and required resources. Reducers do not need to wait until all mappers are done; they can work incrementally, fetching intermediate results by RPC as more mappers finish. Of course, every reducer knows it has to get intermediate results from all of the mappers, so it cannot complete its work until all of those are available, but it need not wait synchronously for the mappers to finish before starting. The map-reduce framework works really well for offline processing of data that requires computing some statistics, e.g. crawling the web, creating indices for search databases, or computing the ranks of pages found on the Internet. It is primarily focused on offline processing of a large corpus of data, which is the dominant workload that data center applications worry about today.

DISADVANTAGES OF MAP-REDUCE:

It aims for simplicity, which suits a large class of applications, but at the expense of generality and performance. In particular, it cannot express arbitrary computations: every computation has to be forced into the map-reduce framework, which is pretty artificial for some kinds of applications. It simply uses files for communication among the application components, and it is a strict two-level graph: there is a mapper and a reducer, and a single input and single output channel from a mapper to the reducer. These are the restrictions of the map-reduce framework. It is nevertheless extremely popular because a large class of applications fit that mold.

The Dryad job manager consults a name server to find out which nodes are available at any point in time, and then launches a portion of the application graph on the available compute nodes. The only interaction the job manager has with the computation is in the control plane, launching the application components of the subgraph onto the computation nodes. All of the actual inter-subroutine communication, the data flow, is specified entirely through the data plane as files, TCP/IP connections, shared memory and so on. Dryad gives more flexibility than the map-reduce framework: the ability to construct arbitrary graphs, and to specify the kind of transport to use for communication among the members of the data flow graph. The complexity of data center management lives entirely in the job manager; it is not something the application developer has to worry about.
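To make the contrast with map-reduce concrete, here is a minimal Java sketch of a Dryad-style dataflow graph. The class and channel names are hypothetical, not Dryad's real API, but they show how arbitrary vertices, edges and transports can be specified:

import java.util.ArrayList;
import java.util.List;

public class DataflowGraph {
    enum Channel { FILE, TCP, SHARED_MEMORY }  // transports an edge may use

    static class Edge {
        final String from, to;
        final Channel channel;
        Edge(String from, String to, Channel channel) {
            this.from = from; this.to = to; this.channel = channel;
        }
    }

    final List<String> vertices = new ArrayList<>();  // application subroutines
    final List<Edge> edges = new ArrayList<>();       // data flow between them

    void addVertex(String name) { vertices.add(name); }
    void connect(String from, String to, Channel c) { edges.add(new Edge(from, to, c)); }

    public static void main(String[] args) {
        DataflowGraph g = new DataflowGraph();
        g.addVertex("parse");
        g.addVertex("aggregate");
        // Unlike map-reduce's fixed two-level graph, any edge and any
        // transport can be specified here.
        g.connect("parse", "aggregate", Channel.TCP);
        System.out.println(g.edges.size() + " edge(s)");
    }
}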

PIG LATIN, HIVE AND APACHE TEZ:

Pig Latin was developed at Yahoo. It provides a programming framework that sits in between the declarative style of SQL, where, because it is a database query language, the computation has to be expressed declaratively as a sequence of queries, and the procedural style of MapReduce. The motivation is to break the rigidity of MapReduce. Pig Latin supports user-defined functions (UDFs) as first-class entities, and it has primitives for grouping, joining, filtering and so on; these are user-defined functions that are defined and integrated into the Pig Latin framework. That makes it more powerful than the rigid framework of MapReduce. It also provides a nested data model with three levels of nesting: the first level is called an atom, so individual data elements are atoms; then there are tuples; and a bag is a collection of tuples (see the sketch after this section).

Hive is another programming framework, made popular by Facebook. Facebook built a system for querying and managing structured data on top of Hadoop, the open-source implementation of MapReduce. The key features of Hive are that queries can be expressed in an SQL-like declarative language, and that custom MapReduce scripts can be embedded within it. The nice thing is that a Hive query can be compiled completely into MapReduce and then run as a MapReduce job on data centers: a high-level layer on top of MapReduce that combines the declarative features of SQL with the custom features of MapReduce. In that sense it has many similarities to Pig Latin. It uses the Hadoop file system (HDFS) for storage, because MapReduce requires that intermediate results be stored in stable storage.

The last one is Apache Tez, a programming framework intended to deliver performance; a small company started the product. It is similar in spirit to Dryad in expressing data processing applications as dataflow graphs. It is also built on top of the Hadoop open-source MapReduce ecosystem, on the resource management framework called YARN, which is what Tez itself is built on. Today, a lot of Pig and Hive applications use Tez as the execution engine to launch on clusters, because it gives performance advantages over and above what MapReduce can do. That is the reason Pig and Hive use Tez as an execution engine for running their applications.
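The three levels of nesting in Pig Latin's data model can be sketched as follows. This is a toy Java model of the concepts (atom, tuple, bag), not Pig's actual classes:

import java.util.Arrays;
import java.util.List;

public class PigDataModel {
    interface Datum {}                       // common supertype of all data

    static class Atom implements Datum {     // level 1: a single data element
        final Object value;
        Atom(Object value) { this.value = value; }
    }

    static class Tuple implements Datum {    // level 2: an ordered list of data
        final List<Datum> fields;
        Tuple(Datum... fields) { this.fields = Arrays.asList(fields); }
    }

    static class Bag implements Datum {      // level 3: a collection of tuples
        final List<Tuple> tuples;
        Bag(Tuple... tuples) { this.tuples = Arrays.asList(tuples); }
    }

    public static void main(String[] args) {
        // A bag of (user, page) click tuples.
        Bag clicks = new Bag(
            new Tuple(new Atom("alice"), new Atom("/home")),
            new Tuple(new Atom("bob"), new Atom("/search")));
        System.out.println(clicks.tuples.size() + " tuples");
    }
}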

WEEK-

MODULE 3: STORAGE SYSTEMS FOR THE CLOUD

This module discusses storage systems for the cloud. Everything relating to the cloud is about scale, availability, reliability, and performance, but often the goals of reliability and performance are at odds with each other. Cloud storage systems address this conundrum with a paradigm shift that has swept the industry: NoSQL storage systems.

OBJECTIVES:

  1. Describe the architecture of the various distributed data stores from the leading Cloud providers.
  2. Distinguish between the capabilities of the different data store offerings.
  3. Explain the semantics of the different data stores and their applicability for developing Cloud applications.

AMAZON DYNAMO:

PROBLEM                              TECHNIQUE                                                ADVANTAGE
Partitioning                         Consistent hashing                                       Incremental scalability
High availability for writes         Vector clocks with reconciliation during reads           Version size is decoupled from update rates
Handling temporary failures          Sloppy quorum and hinted handoff                         Provides high availability and durability guarantee when some of the replicas are not available
Recovering from permanent failures   Anti-entropy using Merkle trees                          Synchronizes divergent replicas in the background
Membership and failure detection     Gossip-based membership protocol and failure detection   Preserves symmetry and avoids a centralized registry for storing membership and node liveness information
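The partitioning row of the table above can be made concrete with a small sketch of consistent hashing. This is illustrative only; Dynamo's real implementation layers replication and placement policies on top of the same ring idea. Nodes and keys hash onto the same ring, a key lives on the first node clockwise from its position, and adding or removing a node only moves the keys in its arc (incremental scalability):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    private final SortedMap<Long, String> ring = new TreeMap<>();

    private long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);
            return h;
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    // Several virtual nodes per physical node smooth out the load.
    public void addNode(String node, int virtualNodes) {
        for (int i = 0; i < virtualNodes; i++) ring.put(hash(node + "#" + i), node);
    }

    // First node clockwise from the key's hash, wrapping around the ring.
    public String nodeFor(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        ConsistentHashRing r = new ConsistentHashRing();
        r.addNode("A", 100);
        r.addNode("B", 100);
        System.out.println(r.nodeFor("user:42"));
    }
}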

GOOGLE BIGTABLE:

…of the abstraction and adding new features. Bigtable is just a big table. The data model of Bigtable is sparse and distributed, because the table has to be partitioned and stored on several different nodes; persistent, because we are dealing with data that needs longevity; and multi-dimensional and sorted. The table is indexed by a row key, a column key, and a timestamp, and the value stored in the cell is an uninterpreted array of bytes. The data model is organized into rows, every row has a number of columns, and you index into a particular entry of the table by giving the triple (row key, column key, timestamp). What "uninterpreted" means is that the system does not care about and does not know the contents of the value. The only things meaningful to the system are the row key, column key and timestamp, the distinguished attributes of an entry that the system knows about; the rest is uninterpreted.
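This data model can be sketched as a nested sorted map, a simplification for illustration only; the real system shards this map across many tablet servers:

import java.util.TreeMap;

public class BigtableModel {
    // row key -> column key -> timestamp -> uninterpreted bytes
    private final TreeMap<String, TreeMap<String, TreeMap<Long, byte[]>>> table =
            new TreeMap<>();

    public void put(String row, String column, long ts, byte[] value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(column, c -> new TreeMap<>())
             .put(ts, value);
    }

    public byte[] get(String row, String column, long ts) {
        TreeMap<String, TreeMap<Long, byte[]>> cols = table.get(row);
        if (cols == null) return null;
        TreeMap<Long, byte[]> versions = cols.get(column);
        return versions == null ? null : versions.get(ts);
    }

    public static void main(String[] args) {
        BigtableModel t = new BigtableModel();
        t.put("com.example/www", "anchor:home", 3L, "hello".getBytes());
        System.out.println(new String(t.get("com.example/www", "anchor:home", 3L)));
    }
}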

FACEBOOK CASSANDRA:

Facebook Cassandra basically has the same data model as Bigtable. It combines the good features of Dynamo and Bigtable: a data model similar to Bigtable's, and, for distribution and replication, ideas from Dynamo. Basically, it uses consistent hashing on the row key in order to do the distribution and replication. Other than that, the APIs are much the same as in Bigtable. It is also simpler than Bigtable: it allows us to get a particular row, delete a particular row and write to a particular row, and accessing a row works just as in Bigtable. Facebook implemented this primarily for managing messages: sending messages, looking for specific messages from a specific individual, organizing them, and getting at a particular message.

GOOGLE SPANNER:

Spanner actually evolved from Bigtable. Many of the applications that Google supports do not require strong consistency semantics; they are perfectly happy with weaker notions of consistency. But there are certain applications that need stronger semantics, especially in the presence of wide-area and multi-site replication. One concrete example is ad management, because the revenue stream for all of these companies is ad revenue, and the ad servers require certain strong semantics. At the same time, they did not want to go to a traditional database, because many of their data models are very similar to Bigtable's; it is just that the consistency semantics they wanted is stronger than what Bigtable provides, since Bigtable gives consistent semantics only for a single row. It is important to have a semi-relational data model, not a purely relational one, so that some rows are related to one another, with consistent operations that span a set of rows. They are dealing with globally distributed, multi-version databases, and they want to replicate at very fine grain for disaster recovery and so on. What they want is externally consistent reads, meaning the results obtained from everywhere are the same, and similarly externally consistent writes, at a particular timestamp: for a particular timestamp, what we read should be the same as what anyone else reads. They have to provide globally consistent reads across replicated databases. So the features in Spanner are a schematized, semi-relational data model; a very rich query language, very similar to SQL; and general-purpose transactions with the ACID properties associated with traditional databases. The way they accomplish externally consistent reads is through a mechanism called interval-based global time (TrueTime).
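A minimal sketch of the interval-based global time idea follows, with a made-up uncertainty bound: the clock returns an interval guaranteed to contain true time, and "commit wait" delays a commit until its timestamp has definitely passed everywhere, which is what makes writes externally consistent:

public class IntervalClock {
    static class TTInterval {
        final long earliest, latest;   // true time lies within [earliest, latest]
        TTInterval(long earliest, long latest) {
            this.earliest = earliest; this.latest = latest;
        }
    }

    // Clock uncertainty bound; in a real system this is maintained by the
    // time service (GPS/atomic clocks), the value here is illustrative.
    private final long epsilonMillis = 5;

    TTInterval now() {
        long t = System.currentTimeMillis();
        return new TTInterval(t - epsilonMillis, t + epsilonMillis);
    }

    long commit() throws InterruptedException {
        long s = now().latest;            // choose the commit timestamp
        while (now().earliest <= s) {     // commit wait: until s is surely past
            Thread.sleep(1);
        }
        return s;                         // safe to make the commit visible
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("commit timestamp: " + new IntervalClock().commit());
    }
}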

WEEK-

MODULE 4: RESOURCE MANAGEMENT FOR THE CLOUD

This module discusses the state of the art in Cloud resource management, from both research and practice, proposed for increasing the utilization of data center resources. The Cloud is a multi-tenant environment with quality-of-service guarantees for the different applications, and for the different programming frameworks those applications may have been written in. Computational resources are not just CPUs; memory footprint and network capacities, both bandwidth and latency, are equally if not more important to take into account while scheduling resources for the applications in a data center.

OBJECTIVES:

  1. Describe the architecture of resource management systems in the Cloud.
  2. Distinguish between different approaches for resource management in the Cloud.
  3. Explain how the resource management systems interact with the frameworks.

We are concerned with CPU scheduling, memory footprint, network bandwidth, storage bandwidth, and the effective memory access time. Memory footprint is the amount of space a program occupies, and the latency of memory access is something that is well hidden by memory hierarchies. One important metric in the context of cloud computing is the utilization of a resource: the percentage of time a particular resource is actively used by the applications running on it. In reality, because all of these different resources exist inside the data center, an application may use all or most of the resources allocated to it, because that is the nature of these kinds of applications. The CPU scheduling disciplines include first-come first-served, shortest job first, round robin, shortest remaining time first, and priority queues, along with multi-level priority queues and variants of all of these, both preemptive and non-preemptive. The metrics used for evaluating how good a scheduler is are typically the turnaround time seen by an application, the response time, meaning how quickly the operating system responds to events, and the variance of these metrics. The notion of fairness is also very important: it is the user's perception of the resource allocation policy he or she sees from the system. For instance, a round-robin scheduler gives a feeling of fairness because all processes think they are created equal; essentially, each process gets one n-th of the processor resource, where n is the number of processes.
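The round-robin discipline just described can be sketched in a few lines. This is a toy simulation, not an OS scheduler:

import java.util.ArrayDeque;
import java.util.Deque;

public class RoundRobin {
    static class Process {
        final String name;
        int remaining;    // time units of CPU still needed
        Process(String name, int remaining) { this.name = name; this.remaining = remaining; }
    }

    public static void run(Deque<Process> readyQueue, int quantum) {
        while (!readyQueue.isEmpty()) {
            Process p = readyQueue.pollFirst();
            int slice = Math.min(quantum, p.remaining);
            p.remaining -= slice;                        // "run" for one quantum
            System.out.println(p.name + " ran for " + slice);
            if (p.remaining > 0) readyQueue.addLast(p);  // preempt and requeue
        }
    }

    public static void main(String[] args) {
        Deque<Process> q = new ArrayDeque<>();
        q.add(new Process("A", 5));
        q.add(new Process("B", 3));
        // With n runnable processes, each perceives roughly 1/n of the CPU.
        run(q, 2);
    }
}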

In Mesos, each framework scheduler knows its application: the number of tasks, how they interact with one another, what data locality they have, and how much sharing there is. Those are things that are known to the framework schedulers and not to the resource manager itself. Therefore, the binding of specific tasks to computation elements is bumped up to each of the frameworks. What the allocation module does is take the information it gets from the slaves and pick the framework to offer the resources to; the framework decides how to map the offered resources to its computations. An offer is made to a framework, the framework accepts the offer, and it hands its allocation back to the Mesos master, which can then map it onto the slaves.
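The offer/accept interaction can be sketched as follows. The interfaces and names here are hypothetical stand-ins for the real Mesos API, but the division of labor is the same: the allocation module picks which framework to offer resources to, and the framework's own scheduler decides what to launch on the offer:

import java.util.ArrayList;
import java.util.List;

public class TwoLevelScheduling {
    static class Offer {
        final String slave; final double cpus; final double memGb;
        Offer(String slave, double cpus, double memGb) {
            this.slave = slave; this.cpus = cpus; this.memGb = memGb;
        }
    }
    static class Task {
        final String name;
        Task(String name) { this.name = name; }
    }
    interface FrameworkScheduler {
        // Returns the tasks to launch on this offer; an empty list declines it.
        List<Task> resourceOffer(Offer offer);
    }

    static void allocate(List<Offer> offers, List<FrameworkScheduler> frameworks) {
        int next = 0;
        for (Offer offer : offers) {
            // Allocation module policy: here, simple round robin over frameworks.
            FrameworkScheduler fw = frameworks.get(next++ % frameworks.size());
            List<Task> accepted = fw.resourceOffer(offer);
            // The master would now launch the accepted tasks on offer.slave.
            System.out.println(accepted.size() + " task(s) launched on " + offer.slave);
        }
    }

    public static void main(String[] args) {
        FrameworkScheduler greedy = offer -> {
            List<Task> tasks = new ArrayList<>();
            if (offer.cpus >= 1) tasks.add(new Task("t1"));
            return tasks;
        };
        allocate(List.of(new Offer("slave1", 4, 16)), List.of(greedy));
    }
}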

WEEK-

MODULE 5: VIRTUAL TECHNOLOGIES

This module gives a detailed description of virtualizing the different hardware components of a computer system, including CPU, memory, and I/O devices. Virtualization had its roots in the IBM systems dating back to the 60s and the 70s. It was resurrected in its new form in the late 90s and 2000s due to the need for resource consolidation, first in universities and private enterprises, and evolved into cloud computing and data centers.

OBJECTIVES:

  1. Describe the mechanisms for virtualizing the different subsystems.
  2. Distinguish between different types of virtualization.
  3. Compare the mechanisms for virtualizing different subsystems.
  4. Explain the virtualization techniques used by the Cloud providers.

Data centers run several different operating systems and have accountability for all of the applications that run on those different operating systems. That is why virtualization has taken such strong root in data centers. Virtualization gives every operating system the illusion that it is completely in charge of the shared hardware resources. There is a thin layer right between the hardware resources and the operating systems that serves as a mediator between the operating systems and the physical resources, called a hypervisor. There are two ways to realize this hypervisor:
  1. One is a native hypervisor, meaning the hypervisor runs directly on bare metal, and the guest operating systems live on top of the hypervisor, whether Windows, Linux or what have you.
  2. The second approach is called hosted, used in VirtualBox and VMware, where there is a host operating system, the hypervisor runs as a process on top of the host operating system, and it supports virtualization of all the guest operating systems running on top of it.

BARE-METAL HYPERVISOR:

Bare-metal or native virtualization can operate as full virtualization or as paravirtualization.

  1. In full virtualization, we take the operating system binary and run it directly on top of the hypervisor. We do not touch the operating system itself; it runs unchanged, and all the processes on top of it also run unchanged. The pro is that the operating system vendor does not have to do anything; the OS can simply be put in a data center. The con is that there can be some performance disadvantages, because of a mismatch between what each operating system assumes about the physical hardware below and what is actually available through the hypervisor. This shows up especially in I/O subsystem performance; the operating system may see some inefficiency if it is fully virtualized. Example: VMware.
  2. In paravirtualization, there is a well-defined interface made visible to the operating system through the hypervisor. Each operating system knows it is not running on bare metal but on top of a hypervisor, and therefore a minimal change has to be made wherever the operating system needs something that requires intervention by the hypervisor. Example: Xen.

These are, at the highest level, the two distinct ways to realize virtualization on top of bare metal: either full virtualization, where the operating system does not have to be changed at all, or paravirtualization, where a small change is needed; in a typical paravirtualized setup, the change to the operating system is necessary to run efficiently on top of the hypervisor. We have to virtualize the hardware; every process already embodies a notion of virtualization of the physical resources.

MEMORY VIRTUALIZATION:

Every operating system supports virtual memory for the processes running on top of it, and there is physical memory backing up the virtual memory assumed by each of the processes. Under a hypervisor, the physical memory assumed by each operating system is now an illusion: virtual memory is already an illusion from the point of view of the processes, and now the physical memory is also an illusion, because it is not really the physical memory the guest has. The actual memory available in the system is shared, through the hypervisor, among the multiple operating systems executing on top of it. Hence the term machine memory is used for the actual hardware memory managed by the hypervisor, as distinct from the "physical" memory each guest believes it has.

When the hypervisor asks the balloon driver in a guest to inflate, the guest operating system may have to throw some pages out to the disk so that it frees up physical memory to allocate to the balloon driver, which hands it to the hypervisor; the hypervisor can then give it to the starving VM that requested memory in the first place. On the other hand, if the host needs less memory, it can tell the balloon driver to deflate its balloon, meaning release whatever memory it had allocated. When it is released, more memory becomes available to the guest operating system, so the guest can increase its memory footprint; in doing so, it may have to page in the memory frames it had put on disk, because it now has more memory than before the deflation of the balloon. Dynamically inflating and deflating the balloon is thus possible according to the SLAs. Based on that, the hypervisor can keep an accounting of actual usage, so that billing is appropriate with respect to the agreement. Hypervisors know the actual use of memory in each of the VMs and can then decide which one to take memory away from.
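A minimal sketch of the balloon driver's two operations follows; the page-allocation machinery inside a real guest kernel is elided:

public class BalloonDriver {
    private long balloonPages = 0;      // pages currently pinned by the balloon

    // Hypervisor needs `pages` back from this guest.
    public long inflate(long pages) {
        // Allocating these pages pressures the guest OS: it may page out to
        // disk to satisfy the balloon, freeing physical memory that the
        // hypervisor can hand to another VM.
        balloonPages += pages;
        return pages;                   // pages surrendered to the hypervisor
    }

    // Hypervisor has spare memory; let the guest grow its footprint again.
    public long deflate(long pages) {
        long released = Math.min(pages, balloonPages);
        balloonPages -= released;       // guest may now page frames back in
        return released;
    }
}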

MEMORY ALLOCATION POLICIES:

Memory allocation policies are the SLAs that have been agreed upon between the VMs and the hypervisor. One approach is a pure share-based approach: you pay on a per-use basis, or memory is booked in advance and paid for accordingly, or the rate is negotiated. Another is a working-set-based approach: since the working set of a VM is known, memory beyond it can be taken away and that VM charged less for its memory usage. Those are policies that can be baked into the hypervisor. A further technique is a dynamic, idle-adjusted share approach, where the idea is to take away idle memory from VMs that are not using it actively: idle pages are taxed more than active pages. A zero percent taxation rate is the pure share-based approach; it is completely plutocratic, taking nothing away from what you paid for. The other extreme, a 100 percent taxation rate, is very socialistic. Typically something in the middle is used, maybe about a 75 percent taxation rate. The reason not to use 100 percent is that the workload might change: there can be a long hysteresis in the performance observed on a virtual machine if there is a sudden burst of load on that machine. So a certain amount of spare capacity for growth is always left in every virtual machine; that is the idea behind this policy, and it allows for working-set increases that might happen.
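A simple way to model the idle-adjusted share policy is shown below. This is an illustrative simplification, not VMware's exact formula: idle pages are charged at the tax rate, so a VM holding idle memory sees its effective claim shrink and the hypervisor can reallocate the difference:

public class IdleTax {
    // shares: what the VM paid for; activeFraction: fraction of its pages
    // recently used; taxRate: 0.0 = pure share-based, 1.0 = fully reclaim idle.
    public static double effectiveShares(double shares, double activeFraction,
                                         double taxRate) {
        double idleFraction = 1.0 - activeFraction;
        return shares * (activeFraction + idleFraction * (1.0 - taxRate));
    }

    public static void main(String[] args) {
        // With a 75 percent tax rate, a VM actively using only 40% of its
        // memory keeps 40% + 60% * 0.25 = 55% of its nominal claim.
        System.out.println(effectiveShares(100, 0.4, 0.75));  // prints 55.0
    }
}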

DEVICE VIRTUALIZATION:

If a device is fully virtualized, the model is trap-and-emulate: the hypervisor emulates the device operation on the guest's behalf. When the device is paravirtualized, there is more opportunity for innovation, because the guest can explicitly request services from the hypervisor and get things done more efficiently. Typically, paravirtualized operating systems use shared buffers between the guest operating system and the hypervisor. Using the shared buffers, the guest operating system can cooperate with the hypervisor to communicate what it needs done at a higher level of privilege than it has on its own; there are other things it can do on its own.
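The shared-buffer idea can be sketched as a single-producer, single-consumer ring, in the spirit of Xen's I/O rings (illustrative, not Xen's actual layout): the guest produces request descriptors, the hypervisor consumes them, and no trap is needed per request:

public class SharedRing {
    private final long[] slots;
    private volatile int producerIndex = 0;  // advanced only by the guest
    private volatile int consumerIndex = 0;  // advanced only by the hypervisor

    public SharedRing(int capacity) { slots = new long[capacity]; }

    // Guest side: enqueue a request descriptor; returns false if the ring is full.
    public boolean produce(long descriptor) {
        if (producerIndex - consumerIndex == slots.length) return false;
        slots[producerIndex % slots.length] = descriptor;
        producerIndex++;   // volatile write publishes the slot to the consumer
        return true;
    }

    // Hypervisor side: dequeue the next request, or -1 if the ring is empty.
    public long consume() {
        if (consumerIndex == producerIndex) return -1;
        long d = slots[consumerIndex % slots.length];
        consumerIndex++;
        return d;
    }
}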

VM MIGRATION:

VM migration is used for rebalancing resources, i.e., moving some VMs from one physical machine to another; for incremental upgrades of the physical resources; or to deal with faults that happen and correct them. The challenge is that there are SLAs with individual users of the data center, so downtime has to be minimized and disruption of active services avoided; active services are those with a human in the loop. In terms of options for migration, there is a push phase, a stop-and-copy phase, and a pull phase. As for the generic steps in migration (there are lots of details associated with VM migration and how to do it efficiently), there is a pre-migration step, where some resources are reserved. When most of the immutable state has been migrated, the evolving state has to be checkpointed and migrated; that is the time to stop and copy. A number of things may have to be done for network redirection and file redirection, everything associated with the environment in which the virtual machine operates, which may require contacting DNS servers and setting up routing tables. Then all of the VM state is synced up and copied over entirely to the target machine. The resources on the source machine are released, the VM is activated on the target machine, and the migration is complete. The way to reduce downtime is iterative pre-copy: move most of the working set over while the VM keeps running, let only the pages that are changing keep changing, and at some point suspend the VM, move the small remainder over, reset the redirections, release the resources and activate the target. Those are the generic steps in migration.
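The iterative pre-copy loop can be sketched as follows; the Vm interface here is a hypothetical stand-in for the hypervisor's dirty-page-tracking machinery:

import java.util.HashSet;
import java.util.Set;

public class PreCopyMigration {
    interface Vm {
        Set<Long> allPages();
        Set<Long> dirtyPagesSinceLastCopy();   // tracked via hardware dirty bits
        void suspend();
    }

    static void migrate(Vm vm, int maxRounds, int stopThreshold) {
        Set<Long> toCopy = new HashSet<>(vm.allPages());
        for (int round = 0; round < maxRounds; round++) {
            copyToTarget(toCopy);                        // VM keeps running
            toCopy = vm.dirtyPagesSinceLastCopy();       // pages dirtied meanwhile
            if (toCopy.size() <= stopThreshold) break;   // converged enough
        }
        vm.suspend();                                    // stop-and-copy phase
        copyToTarget(vm.dirtyPagesSinceLastCopy());      // final small delta
        // ...then redirect network/DNS, release the source, activate the target.
    }

    private static void copyToTarget(Set<Long> pages) {
        System.out.println("copied " + pages.size() + " pages");
    }
}

The downtime is only the final stop-and-copy of the residual dirty set, which is why the iterative rounds matter.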

CODE IMPLEMENTATION FOR MAP-REDUCE:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Text word = new Text();
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Tokenize the input line and emit (word, 1) for every token.
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all the counts for this word.
            int sum = 0;
            for (IntWritable x : values) {
                sum += x.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "My Word Count Program");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Configuring the input/output paths from the file system into the job.
        Path outputPath = new Path(args[1]);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, outputPath);

        // Deleting the output path automatically from HDFS so that we don't
        // have to delete it explicitly before re-running the job.
        outputPath.getFileSystem(conf).delete(outputPath, true);

        // Exiting with 0 only if the job completes successfully.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
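Assuming the class above is compiled and packaged into a jar (the jar name and the HDFS paths below are placeholders), the job can be submitted with the standard Hadoop launcher; the first argument is the input path and the second is the output path:

hadoop jar wordcount.jar WordCount /user/hduser/input /user/hduser/output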

EXAMPLE:

The text is: "When there is a will, there is a way."

Mapper's input (key = byte offset of the line, value = the line):
  0, When there is a will
  20, there is a way

Mapper's output (one key-value pair per word):
  (When, 1), (there, 1), (is, 1), (a, 1), (will, 1), (there, 1), (is, 1), (a, 1), (way, 1)

Reducer's input (values grouped by key):
  When, [1]
  there, [1, 1]
  is, [1, 1]
  a, [1, 1]
  will, [1]
  way, [1]

Reducer's output (the final word counts):
  When, 1
  there, 2
  is, 2
  a, 2
  will, 1
  way, 1