Big Data Analytics Notes for Students and Beginners

A concise and beginner-friendly guide to Big Data Analytics, covering key concepts, tools, and real-world applications. Ideal for students and aspiring data professionals looking to build a solid foundation in data-driven decision-making.

UNIT 1:

What is Data?

Data refers to pieces of information. These bits of information can be facts, details, or statistics collected or observed about various things. In the world of computing and technology, data is often used to describe information that can be processed and analyzed by machines. Imagine data as tiny building blocks of information. It can be numbers, words, or even pictures that we collect and use to understand things better. Just like how you gather LEGO blocks to create something cool, we gather data to learn and make decisions. Example: Let's say you have a list of temperatures for each day of the week. Each temperature value is a piece of data. When you look at the entire list, you can find patterns like which days were warmer or colder. This information is valuable, and it's a simple example of using data to understand and make sense of things.
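As a tiny, hedged illustration (not from the notes themselves), the weekly-temperature example can be written out in a few lines of Python; the day names and values here are made up purely for demonstration:

```python
# A week of temperatures (in °C) -- each value is one piece of data.
temperatures = {
    "Mon": 21.0, "Tue": 23.5, "Wed": 19.0,
    "Thu": 24.0, "Fri": 22.5, "Sat": 18.0, "Sun": 20.0,
}

# Looking at the whole collection reveals simple patterns.
warmest_day = max(temperatures, key=temperatures.get)
coldest_day = min(temperatures, key=temperatures.get)
average = sum(temperatures.values()) / len(temperatures)

print(f"Warmest day: {warmest_day} ({temperatures[warmest_day]} °C)")
print(f"Coldest day: {coldest_day} ({temperatures[coldest_day]} °C)")
print(f"Average temperature: {average:.1f} °C")
```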

Types of Data

  1. Quantitative Data: Quantitative data represents quantities and is measured on a numeric scale. It can be further divided into discrete and continuous data.
    • Discrete Data: This type of data is counted and usually consists of whole numbers. It's essentially data that can only take certain values and cannot be divided into smaller parts. For example: the number of students in a class.
    • Continuous Data: Continuous data can take any value within a given range. It's measured and can be broken down into smaller parts, even infinitely. For example: the height of students in a class.
  2. Qualitative Data: Qualitative data describes qualities or characteristics and cannot be measured on a numeric scale. It's typically descriptive and subjective.
    • Nominal Data: This type of data categorizes variables into distinct groups or categories with no intrinsic order. For example: types of fruits (apple, banana, orange).
    • Ordinal Data: Ordinal data categorizes variables into ordered groups or categories. There is a clear order, but the difference between each category is not necessarily uniform. For example: education levels (high school, bachelor's, master's, PhD).

Quantitative data deals with numbers, measurements, and quantities, while qualitative data deals with qualities, characteristics, and descriptions. Both types of data are essential in fields like statistics, research, and decision-making. A short Python sketch below illustrates the four categories.
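The following is a small, hedged Python sketch (using pandas, a library choice of this example rather than something the notes prescribe) showing how the four categories might be represented; the column names and values are illustrative only:

```python
import pandas as pd

# Illustrative records for a class of students (made-up values).
df = pd.DataFrame({
    "num_siblings": [0, 2, 1, 3],                 # discrete (counted, whole numbers)
    "height_cm": [151.2, 163.8, 170.1, 158.4],    # continuous (any value in a range)
    "favourite_fruit": ["apple", "banana", "orange", "apple"],      # nominal (no order)
    "education": ["high school", "bachelor's", "master's", "PhD"],  # ordinal (ordered)
})

# Nominal: an unordered categorical type.
df["favourite_fruit"] = pd.Categorical(df["favourite_fruit"])

# Ordinal: a categorical type with an explicit order.
df["education"] = pd.Categorical(
    df["education"],
    categories=["high school", "bachelor's", "master's", "PhD"],
    ordered=True,
)

print(df.dtypes)
print(df["education"].min())  # the ordering makes comparisons meaningful
```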


Computer Data as Information

Computer data is like the building blocks of information that a computer uses. It can be anything from words, pictures, sounds, to even programs that make the computer do different things. Computer data is information processed or stored by a computer. This information may be in the form of text documents, images, audio clips, software programs, or other types of data. Imagine your computer is like a super-smart friend. You can give this friend all sorts of things to play with, like letters, pictures, music, and more. All these things you give are like the data for your computer. Now, your friend is not just playing around; it's actually doing some work with these toys. This work is done by the computer's brain, called the CPU. The CPU processes or works on the data you gave, just like your friend playing with toys. After playing and working, your friend might want to keep things organized. So, it puts the toys in different boxes or folders. In the computer world, these boxes or folders are like files and folders where the processed information is stored on the computer's hard disk. Example: Let's say you have a computer and you want to write a story. The words you type on the computer are the data. Now, when you hit the "save" button, the computer's brain (CPU) does some work to remember what you wrote. That saved story is now stored in a file on your computer, just like putting it in a special box. So, in simple terms, computer data is like the toys you give to your computer friend, and the processed information is like the organized collection of those toys stored in boxes on your computer.
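As a very small, hedged illustration of the "data in, processed, then stored in a file" idea above (the filename and text are invented for the example):

```python
# The words you type are the data.
story = "Once upon a time, a computer saved a story."

# "Saving" is the computer working on that data and storing it in a file.
with open("my_story.txt", "w", encoding="utf-8") as f:
    f.write(story)

# Later, the stored data can be read back and processed again.
with open("my_story.txt", encoding="utf-8") as f:
    print(f.read())
```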

What is Big Data?

Big Data is a massive collection of data that continues to grow dramatically over time. It is a data set so huge and complex that typical data management technologies cannot effectively store or process it. Big Data is like regular data, but on a much larger scale: normally we work with data in the range of megabytes (Word documents, Excel sheets) or at most gigabytes (movies, large code files), whereas Big Data is on the order of petabytes, i.e. about 10^15 bytes. Here's a fun fact: around 90% of the data we have today has been created in just the last three years. That gives you an idea of how rapidly data is growing. In simple terms, Big Data is like trying to manage a library with billions of books, each containing an immense amount of information. Regular methods of cataloging and organizing won't work, and that's where specialized technologies and tools come in to make sense of this vast sea of information. Example: Think about a social media platform like Facebook. It processes and stores an enormous amount of data every second: posts, photos, likes, comments, and more. The sheer volume and variety of this data make it a classic example of Big Data. Traditional databases and tools would not be able to handle the scale and complexity of information generated by millions (or even billions) of users worldwide, so specialized Big Data technologies are used to manage and analyze this data effectively.
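To make the size comparison concrete, here is a small, hedged Python sketch (not from the notes) that prints how many bytes each unit represents, using decimal (powers of 1000) units:

```python
# Decimal (SI) data-size units: each step is a factor of 1000.
units = {
    "KB (kilobyte)": 10**3,
    "MB (megabyte)": 10**6,   # a typical Word/Excel file
    "GB (gigabyte)": 10**9,   # a movie or a large code base
    "TB (terabyte)": 10**12,
    "PB (petabyte)": 10**15,  # the scale usually associated with Big Data
}

for name, size in units.items():
    print(f"{name:15s} = {size:>20,} bytes")

# One petabyte is a million times larger than a gigabyte.
print(10**15 // 10**9, "GB in a PB")
```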


Example: Imagine you run a pizza shop. Big Data for your business could be the huge amount of information you get from online orders, customer reviews, delivery routes, and even social media mentions. By analyzing this data (Volume), you can offer personalized deals to your customers (Value), ensure that the information you have is reliable (Veracity), represent your sales trends in easy-to-read charts (Visualization), handle different types of data like text orders and delivery locations (Variety), quickly respond to changing demands (Velocity), and make your promotions go viral on social media (Virality).

Challenges of Conventional Systems

There are three main challenges of conventional systems, which are as follows:

  1. Volume of Data
    • The volume of data is increasing day by day, especially data generated from machines, telecommunication services, airline services, sensors, etc.
    • The rapid growth in data every year comes with new sources of data that keep emerging.
    • As per surveys, the growth in data volume is so rapid that IBM estimated that by 2020 around 35 zettabytes of data would be stored in the world.
  2. Processing and Analyzing
    • Processing such a large volume of data is a major challenge and is very difficult.
    • Organizations analyze such large volumes of data in order to achieve their business goals.
    • Extracting insights from such a large amount of data is time consuming and takes a lot of effort.
    • Processing and analyzing the data is also costly, since the data comes in different formats and is complex.
  3. Management of Data
    • As the data gathered comes in different formats (structured, semi-structured, and unstructured), it is very challenging to manage such a variety of data.


Types of Big Data

  1. Unstructured
    • Any data with an unknown form or structure is classified as unstructured data.
    • Unstructured data is like a messy room where things are scattered around and you're not sure where everything is. It doesn't have a predefined format or structure.
    • In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value from it.
    • A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, and videos, like the results of a Google search.
    • Nowadays organizations have a wealth of data available to them, but unfortunately they don't know how to derive value from it, since this data is in its raw, unstructured form.
    • Example: Think of Google Search – it deals with various types of information like text, images, and videos. All this data is raw and not neatly organized.
  2. Semi-structured
    • Semi-structured data is a bit like a room where things are not neatly arranged but have some labels or tags on them. It contains both structured and unstructured elements.
    • To be precise, it refers to data that has not been classified under a particular repository (database), yet contains vital information or tags that segregate individual elements within the data.
    • Web application data, which is unstructured, consists of log files, transaction history files, etc.
    • Online transaction processing systems, by contrast, are built to work with structured data, where data is stored in relations (tables).
    • Users may see semi-structured data as structured in form, but it is not actually defined by, for example, a table definition in a relational DBMS.
    • Example: personal data stored in an XML file, with tagged records carrying a name, sex, and age, such as (Prashant Rao, Male, 35), (Seema R., Female, 41), and (Satish Mane, Male, 29). A short parsing sketch follows this list.
  3. Structured
    • Any data that can be stored, accessed, and processed in a fixed format is termed "structured" data.
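The sketch below is a hedged illustration of semi-structured data in Python: the XML tag names (people, person, name, sex, age) are assumptions chosen for the example, not taken from the notes, but they show how tags let a program pick individual elements out of otherwise loosely organized data:

```python
import xml.etree.ElementTree as ET

# Semi-structured: no database schema, but tags identify each element.
xml_data = """
<people>
  <person><name>Prashant Rao</name><sex>Male</sex><age>35</age></person>
  <person><name>Seema R.</name><sex>Female</sex><age>41</age></person>
  <person><name>Satish Mane</name><sex>Male</sex><age>29</age></person>
</people>
"""

root = ET.fromstring(xml_data)
for person in root.findall("person"):
    name = person.findtext("name")
    age = int(person.findtext("age"))
    print(f"{name} is {age} years old")
```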


What do you mean by Big Data Analytics?

Big Data Analytics refers to the process of examining large and complex datasets to uncover hidden patterns, correlations, and insights that can help businesses and organizations make better-informed decisions. It involves using various tools, techniques, and technologies to analyze massive volumes of data that traditional data processing applications might struggle to handle. Imagine you have a huge pile of puzzle pieces. Each piece represents a piece of information, like customer data, online activities, or sensor readings. Big Data Analytics is like putting these puzzle pieces together to reveal the bigger picture—insights and trends that can be valuable for making smarter decisions. Big Data Analytics is the method of examining vast and diverse sets of data, often in real-time, to extract meaningful patterns, trends, and information that can guide decision-making processes. Example: Let's say a company collects data on how customers interact with their website, including what products they view, how much time they spend on each page, and whether they make a purchase. By using Big Data Analytics, the company can analyze this large dataset to identify patterns, such as which products are popular, what factors influence purchasing decisions, and how to optimize the website for a better user experience. This valuable information can then be used to enhance marketing strategies, improve customer satisfaction, and ultimately boost sales.
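As a hedged, self-contained sketch of the website example above (using pandas; every column name and value here is invented for illustration):

```python
import pandas as pd

# Simplified clickstream records: one row per product page view.
views = pd.DataFrame({
    "product":   ["shoes", "shoes", "hat", "bag", "shoes", "hat"],
    "seconds":   [30, 45, 10, 60, 25, 15],        # time spent on the page
    "purchased": [True, False, False, True, True, False],
})

# Which products are popular, and how often does a view turn into a sale?
summary = views.groupby("product").agg(
    views=("product", "size"),
    avg_seconds=("seconds", "mean"),
    conversion_rate=("purchased", "mean"),
)
print(summary.sort_values("views", ascending=False))
```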

Importance of Big Data

  • Complex or massive data sets that are impractical to manage using traditional database systems and software tools are referred to as big data.
  • Big data is utilized by organizations in one way or another; it is the supporting technology that makes it possible to realize big data's value.
  • It is a voluminous amount of both multi-structured and unstructured data.

Case Study : Netflix

  • Netflix is a very popular entertainment company that offers online, on-demand, web-based video streaming to its customers.
  • With Big Data, it has been able to predict precisely what its customers will enjoy watching.
  • Recently, Netflix began positioning itself as a content creator, not simply a distribution medium, a move said to be solidly based on data analytics.
  • Its recommendation engines are driven by data such as what customers watch, where playback is regularly paused or halted, ratings, and so on.
  • It has incorporated Hadoop, Hive, and Pig alongside other traditional business intelligence tools.


Intelligent Data Analysis

  • Intelligent Data Analysis (IDA) is one of the major areas in the field of artificial intelligence and information.
  • Intelligent Data Analysis (IDA) is like a smart detective for information hidden in a massive pile of data. Imagine you have a big box of puzzle pieces, and IDA is the clever friend who not only puts the puzzle together but also discovers new patterns and stories you didn't know were there.
  • It helps us make better decisions by uncovering hidden and valuable information in large sets of data.
  • IDA is a smart way of using computers to dig into a ton of information and find things we didn't know before.
  • It includes three major steps:

  1. Data Preparation: getting the relevant information from different sources and organizing it into a usable data set.
  2. Rules Finding or Data Mining: figuring out patterns or rules in the data using special methods or algorithms.
  3. Result Validation and Explanation: checking whether the discovered patterns make sense and explaining them in a way that's easy for people to understand.

  • To extract useful knowledge, the IDA process demands a combination of extraction, analysis, conversion, classification, organization, reasoning, and so on.
  • IDA can use concepts like machine learning and deep learning to make the process even smarter.
  • It helps in many areas, for example:
    • Banking & Securities; Communications, Media, & Entertainment
    • Healthcare Providers

Example: Let's say you work in a library, and there's a huge collection of books. IDA is like a magical librarian who not only helps organize the books but also discovers interesting connections. For example, it might notice that people who borrow mystery novels on Fridays also tend to like adventure books. This kind of insight can help the library make better decisions, like suggesting new books to readers based on their preferences. In the same way, IDA works with much larger and more complex sets of data, helping businesses, researchers, and decision-makers find hidden patterns and make smarter choices. A short Python sketch of the three IDA steps follows.
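Below is a minimal, hedged Python sketch of the three IDA steps using scikit-learn (a library choice of this example, not named in the notes); the data is synthetic, and the "rules finding" step uses just one possible mining technique (a decision tree):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

# 1. Data preparation: gather and organize a usable data set
#    (here: a synthetic table of 500 rows with 4 features).
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. Rules finding / data mining: learn patterns from the prepared data.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# 3. Result validation and explanation: check the patterns on unseen data
#    and print them in a human-readable form.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Validation accuracy: {accuracy:.2f}")
print(export_text(model, feature_names=[f"feature_{i}" for i in range(4)]))
```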


  • Distributed Architecture: Big Data systems often use distributed architectures, meaning they spread the workload across multiple machines. This helps with scalability and performance. Example: Think of a traditional data system like an old-fashioned filing cabinet. It has drawers and folders, each with a specific structure. You can organize papers neatly, but if you suddenly have too many papers, it becomes hard to manage. On the other hand, imagine Big Data as a modern, highly efficient library. It can handle not only an enormous number of books but also different types - novels, magazines, audio books, and more. It's flexible, efficient, and can provide information quickly, making it more suited for the vast amounts and varied types of data we deal with today.


Case Study of Big Data Solution

  • Undoubtedly, Big Data has become a major game changer in most of the cutting-edge industries over the last few years.
  • As Big Data keeps growing day by day, the number of organizations adopting it keeps expanding.
  • Let's discuss an example: an e-commerce site XYZ (having 100 million users) wants to offer a gift voucher of $100 to its top 10 customers who spent the most in the previous year. Moreover, it wants to find the buying trends of these customers so that the company can suggest more items related to them.
  • Issue: a huge amount of unstructured data which needs to be stored, processed, and analyzed.
  • Solution:
    • Storage: to store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System), which uses commodity hardware to form clusters and store data in a distributed fashion. It works on the "write once, read many times" principle.
    • Processing: the MapReduce paradigm is applied to the data distributed over the network to find the required output (a hedged sketch of this idea follows below).
    • Analyze: Pig and Hive can be used to analyze the data.
    • Cost: Hadoop is open source, so cost is no longer an issue.
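The following is a small, hedged, pure-Python imitation of the map-shuffle-reduce idea for the XYZ example; it is not actual Hadoop code, and all customer IDs and amounts are invented:

```python
from collections import defaultdict
import heapq

# Each record is (customer_id, amount_spent) -- invented sample data.
orders = [("c1", 120.0), ("c2", 35.5), ("c1", 80.0), ("c3", 500.0), ("c2", 15.0)]

# Map: emit (key, value) pairs, here (customer_id, amount).
mapped = [(customer, amount) for customer, amount in orders]

# Shuffle: group all values by key, as Hadoop does between map and reduce.
grouped = defaultdict(list)
for customer, amount in mapped:
    grouped[customer].append(amount)

# Reduce: aggregate each group, here summing the spend per customer.
totals = {customer: sum(amounts) for customer, amounts in grouped.items()}

# Final step: pick the top spenders (top 10 in the real scenario).
top_customers = heapq.nlargest(10, totals.items(), key=lambda kv: kv[1])
print(top_customers)
```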


History of Hadoop

Hadoop's story begins in the mid-2000s. Doug Cutting, along with Mike Cafarella, started working on an open-source project inspired by Google's MapReduce and Google File System (GFS) papers; Cutting later joined Yahoo!, where much of the early development took place. They named the project Hadoop, after a toy elephant belonging to Cutting's son.

❖ Core Components: Hadoop comprises two core components:

  • Hadoop Distributed File System (HDFS) : Inspired by Google's GFS, HDFS is a distributed file system that stores data across multiple machines.
  • MapReduce : Inspired by Google's MapReduce, this programming model processes large datasets across a distributed cluster.

❖ Early Development: Hadoop's early versions focused on scalability and fault-tolerance. It allowed users to store and process huge amounts of data across clusters of commodity hardware.

❖ Apache Hadoop Project: In 2006, Hadoop was released as an open-source project under the Apache Software Foundation, with Yahoo! as a major early contributor. This move spurred widespread adoption and collaboration within the tech community.

❖ Evolution: Over the years, Hadoop has seen significant evolution. Various sub-projects have been added to enhance its capabilities, including:
  • Hadoop Common : Libraries and utilities used by other Hadoop modules.
  • Hadoop YARN (Yet Another Resource Negotiator) : Introduced in Hadoop 2.0, YARN serves as a resource management layer allowing different data processing engines to run on Hadoop.
  • Apache Hive : A data warehousing infrastructure built on top of Hadoop, providing a SQL-like query language called HiveQL.
  • Apache Pig : A high-level scripting language designed for querying and analyzing large datasets.
  • Apache Spark : While not part of the original Hadoop project, Spark is often used alongside Hadoop for data processing due to its speed and ease of use.
  • Apache HBase : A distributed, scalable NoSQL database that runs on top of Hadoop.

❖ Industry Adoption: Hadoop quickly gained popularity across various industries due to its ability to handle Big Data challenges. Companies like Facebook, Twitter, LinkedIn, and Netflix adopted Hadoop for data storage, processing, and analysis.

❖ Use Cases: Hadoop has been used for a wide range of use cases, including:
  • Log Processing : Analyzing log files from web servers, applications, and systems.
  • Data Warehousing : Storing and querying large volumes of structured data.
  • Machine Learning : Training and deploying machine learning models on large datasets.
  • Recommendation Systems : Generating personalized recommendations based on user behavior.
  • Fraud Detection : Identifying fraudulent activities by analyzing large transaction datasets.

Example: Let's say a company wants to analyze its customer data, which includes millions of records of purchases, interactions, and demographics. They can use Hadoop to store this data across a cluster of inexpensive servers using HDFS. Then, they can leverage MapReduce or Apache Spark to process this data in parallel, extracting insights such as customer preferences, trends, and patterns (a hedged Spark sketch of this follows below).

In summary, Hadoop revolutionized the way we handle Big Data, providing a scalable, cost-effective solution for storage, processing, and analysis. Its open-source nature and robust ecosystem have made it a cornerstone of modern data infrastructure.
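As a hedged sketch of that example using PySpark (the file path, column names, and schema are assumptions made up for illustration; the notes do not prescribe them):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("customer-insights").getOrCreate()

# Assume customer purchase records stored on HDFS as CSV with
# columns customer_id, product, amount (illustrative schema).
purchases = spark.read.csv(
    "hdfs:///data/purchases.csv", header=True, inferSchema=True
)

# Process the data in parallel across the cluster:
# total spend and number of purchases per customer.
spend = purchases.groupBy("customer_id").agg(
    F.sum("amount").alias("total_spend"),
    F.count("*").alias("num_purchases"),
)
spend.orderBy(F.desc("total_spend")).show(10)

spark.stop()
```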


Apache Hadoop

Apache Hadoop is an open-source software framework used for distributed storage and processing of large datasets on clusters of commodity hardware. In simpler terms, it's like a massive storage and computing system spread across many computers working together.

Components of Hadoop:

  1. Hadoop Distributed File System (HDFS): HDFS is the storage part of Hadoop. It breaks large files into smaller pieces and distributes them across different machines in a cluster. This allows for scalable, reliable, and high-throughput access to data. Example: Think of HDFS like a library where books are divided into chapters, and each chapter is stored in a different room. When you need to read a book, you gather all the chapters from different rooms and read them together.
  2. MapReduce: MapReduce is the processing part of Hadoop. It allows users to write programs to process vast amounts of data in parallel across a distributed cluster. It consists of two main phases: the Map phase, where data is divided into smaller chunks and processed independently, and the Reduce phase, where the results from the Map phase are combined to produce the final output. Example: Imagine you have a huge pile of papers with words written on them. In the Map phase, you sort these papers by the first letter of each word. Then, in the Reduce phase, you count how many papers start with each letter to determine the frequency of words starting with that letter.

Why use Hadoop?

  1. Scalability: Hadoop can scale horizontally, meaning you can add more machines to your cluster as your data grows, providing virtually limitless storage and processing power.
  2. Fault-tolerance: Hadoop is designed to handle hardware failures gracefully. If a machine in the cluster fails, Hadoop ensures that the data stored on that machine is replicated on other machines, so no data is lost and processing can continue without interruption.

Example of Hadoop in action: Let's say you're a social media company analyzing user interactions. You have terabytes of data containing user posts, comments, likes, and shares. With Hadoop, you can store all this data in HDFS and then use MapReduce to analyze it. For example, you could use MapReduce to:
  • Count the number of likes each post received.
  • Identify trending topics by analyzing the frequency of certain keywords in posts.
  • Recommend friends to users based on their past interactions.

Overall, Apache Hadoop provides a powerful and cost-effective solution for storing, processing, and analyzing large-scale data, making it invaluable for organizations dealing with Big Data.
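Below is a hedged sketch of the first task above (counting likes per post), written in the spirit of Hadoop Streaming, where the mapper and reducer are plain Python functions over text records; the input format (post_id and action, tab-separated) and the sample events are assumptions for illustration:

```python
from itertools import groupby

def mapper(lines):
    # Each input line: "<post_id>\t<action>"; emit (post_id, 1) for every like.
    for line in lines:
        post_id, action = line.rstrip("\n").split("\t")
        if action == "like":
            yield post_id, 1

def reducer(pairs):
    # Hadoop sorts map output by key before the reducer sees it,
    # so all likes for the same post arrive together.
    for post_id, group in groupby(pairs, key=lambda kv: kv[0]):
        yield post_id, sum(count for _, count in group)

if __name__ == "__main__":
    # Local stand-in for what Hadoop Streaming would run across a cluster.
    events = ["p1\tlike", "p2\tview", "p1\tlike", "p3\tlike", "p1\tview"]
    shuffled = sorted(mapper(events))          # the "shuffle & sort" phase
    for post_id, likes in reducer(shuffled):
        print(f"{post_id}\t{likes} likes")
```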


Hadoop Ecosystem

Hadoop is an open-source framework that allows for the distributed storage and processing of large data sets using a cluster of commodity hardware. Its ecosystem consists of various tools and components that work together to handle big data processing tasks.

  1. HDFS (Hadoop Distributed File System): HDFS is like a big virtual storage space that can spread across many computers. It's designed to hold really huge amounts of data and break it into small pieces, storing them on different machines. Example: Imagine you have a giant puzzle. Instead of keeping all the pieces in one box, you distribute them in multiple boxes. Each box is like a computer in HDFS.
  2. HBase: HBase is like a smart, big table where you can store lots of organized information. It's good for storing data with lots of rows and columns, and it works closely with HDFS. Example: Think of it like an Excel sheet where each row and column can hold a massive amount of data, and it can handle many, many rows and columns.
  3. MapReduce: MapReduce is a way to process (analyze or manipulate) big data by breaking it into smaller tasks and working on them separately. Example: Imagine you have a thousand-page book, and you want to count the number of times a word appears. MapReduce helps you by assigning different pages to different people, and each person counts the words on their assigned pages.
  4. YARN: YARN is like the manager of a busy restaurant. It keeps track of who is doing what and makes sure that each task gets the resources it needs. Example: In a restaurant, the manager (YARN) makes sure the chefs (tasks) have the right ingredients and kitchen space to cook their dishes.


  5. HIVE: Hive is like a translator that helps you ask questions about your data in a language that Hadoop understands. Example: If you have a friend who only speaks French (Hadoop), Hive helps you by translating your questions (queries) from English into French so your friend can understand.
  6. PIG: Pig is another friend who helps you analyze and process your data, but in a different way. It uses a language called Pig Latin. Example: If you want to analyze a big pile of fruits, Pig (using Pig Latin) helps you sort them, group them, or do other things with them.
  7. Sqoop: Sqoop is like a bridge that helps move data between Hadoop and traditional databases. Example: If you have data in a regular database and want to bring it into Hadoop, Sqoop helps by building a bridge so the data can travel easily between the two.
  8. Oozie: Oozie is like a scheduler that plans the order in which different tasks in Hadoop should run. Example: Think of Oozie as a manager who decides when the chef should start preparing ingredients, when the waiter should set the tables, and so on.
  9. Flume: Flume is like a pipeline that efficiently collects and delivers data from many sources to Hadoop. Example: Imagine a system of pipes collecting water from different places and channeling it to a central reservoir. Flume does something similar but with data.
  10. Zookeeper: Zookeeper is like a secretary that helps Hadoop systems coordinate and work together smoothly. Example: In a big office, a secretary (Zookeeper) keeps track of everyone's schedules, making sure meetings happen on time and tasks are coordinated properly.

In summary, the Hadoop Ecosystem consists of various tools that together provide a comprehensive solution for storing, processing, and analyzing large-scale data. Each component plays a specific role, contributing to the efficiency and scalability of big data processing tasks.

Unit 2

Cluster

  • A cluster is a group of servers or computers and other resources that act like a single system and enable high availability and, in some cases, load balancing and parallel processing. For example, a 6-node cluster means 6 computers connected in a network.
  • A Hadoop cluster is like a team of computers (servers) that work together to store and process large amounts of data. It's a way to use the power of many machines to handle big tasks.
  • Imagine you have a gigantic puzzle that you want to solve. If you try to solve it alone, it might take a very long time. But if you have a bunch of friends helping you, each working on a different part of the puzzle, you can finish it much faster.
  • In the tech world, data is like the puzzle, and a Hadoop cluster is your team of computers working together. Each computer in the cluster, called a node, has its own piece of the data to handle. By dividing the work among these computers, you can process huge amounts of data much more quickly.

Node:

  • A node refers to a server or computer; for example, 10 nodes means 10 computers or servers.


Commodity Hardware:

  • Commodity hardware is a term for affordable devices that are generally compatible with other such devices. Commodity hardware is like the everyday stuff you can find easily and is not super fancy or specialized. It's the affordable and common kind of equipment that works well with other similar devices. So, it's not about being low quality; it's about being practical and not tied to any specific brand.
  • A commodity server is a commodity computer that is dedicated to running server programs and carrying out associated tasks. Now, think of a commodity server as a regular computer, but it's mainly dedicated to doing server jobs. Servers are like traffic controllers for the internet – they manage requests and share information between different devices. So, a commodity server is just an affordable computer that does this job without needing to be super high-end or fancy.
  • Commodity hardware does not imply low quality, but rather affordability and not being tied to any particular vendor.
  • Imagine you're building a small website for your hobby. You don't need a super expensive, top-of-the-line server. Instead, you can use a regular computer that's good enough to handle the website's traffic. That regular computer is your commodity server, and the parts you use, like the basic processor, memory, and storage, are all examples of commodity hardware.

Rack awareness:

Rack awareness in Hadoop is a way of organizing and managing data storage in a distributed computing environment. In simpler terms, imagine you have a big system with lots of computers (nodes) working together to process and store data. These nodes are grouped into racks, which are like large shelves holding several computers. Rack awareness is the idea that Hadoop should know which rack each node sits on, because communication between nodes on the same rack is faster and more efficient than communication between nodes on different racks.

By default, when you install Hadoop, it assumes that all the computers (nodes) are on the same rack. However, in a real-world scenario, you might have multiple racks of nodes. Rack awareness helps Hadoop make smarter decisions about where to place and access data: it keeps copies of data spread across more than one rack so the data survives a whole-rack failure, while preferring nearby (same-rack) nodes when reading data or moving it between machines, to take advantage of their proximity.

Example: Let's say you have a Hadoop cluster with three racks: Rack A, Rack B, and Rack C, and each rack contains several nodes. When data is stored from a node in Rack A, Hadoop keeps a copy close by in Rack A and places another copy on a different rack for safety. Later, when that data is needed, Hadoop reads it from the closest copy, so the communication happens more quickly because the nodes involved are physically closer to each other.
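To make the idea concrete, here is a small, hedged Python sketch (purely illustrative; the rack names, node names, and policy details are assumptions for this example, not Hadoop's actual code) of how a client might prefer a same-rack replica when reading a block:

```python
# Map each node to the rack it lives on (illustrative topology).
rack_of = {
    "node1": "rackA", "node2": "rackA",
    "node3": "rackB", "node4": "rackB",
    "node5": "rackC",
}

def pick_replica(client_node, replica_nodes):
    """Prefer a replica on the client's own rack; otherwise take any replica."""
    client_rack = rack_of[client_node]
    same_rack = [n for n in replica_nodes if rack_of[n] == client_rack]
    return same_rack[0] if same_rack else replica_nodes[0]

# A block replicated on two racks (one copy nearby, one on another rack for safety).
replicas = ["node2", "node4"]

print(pick_replica("node1", replicas))  # -> node2 (same rack as the client)
print(pick_replica("node5", replicas))  # -> node2 (no same-rack copy; any replica)
```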