A concise and beginner-friendly guide to Big Data Analytics, covering key concepts, tools, and real-world applications. Ideal for students and aspiring data professionals looking to build a solid foundation in data-driven decision-making.
Data refers to pieces of information. These bits of information can be facts, details, or statistics collected or observed about various things. In the world of computing and technology, data is often used to describe information that can be processed and analyzed by machines.

Imagine data as tiny building blocks of information. It can be numbers, words, or even pictures that we collect and use to understand things better. Just as you gather LEGO blocks to create something cool, we gather data to learn and make decisions.

Example: Let's say you have a list of temperatures for each day of the week. Each temperature value is a piece of data. When you look at the entire list, you can find patterns, like which days were warmer or colder. This information is valuable, and it's a simple example of using data to understand and make sense of things.
Computer data is like the building blocks of information that a computer uses. It can be anything from words, pictures, and sounds to programs that make the computer do different things. Computer data is information processed or stored by a computer. This information may be in the form of text documents, images, audio clips, software programs, or other types of data.

Imagine your computer is like a super-smart friend. You can give this friend all sorts of things to play with, like letters, pictures, music, and more. All these things you give are like the data for your computer. Now, your friend is not just playing around; it's actually doing some work with these toys. This work is done by the computer's brain, called the CPU. The CPU processes or works on the data you gave, just like your friend playing with toys. After playing and working, your friend might want to keep things organized. So, it puts the toys in different boxes or folders. In the computer world, these boxes or folders are like the files and folders where the processed information is stored on the computer's hard disk.

Example: Let's say you have a computer and you want to write a story. The words you type on the computer are the data. Now, when you hit the "save" button, the computer's brain (CPU) does some work to remember what you wrote. That saved story is now stored in a file on your computer, just like putting it in a special box. So, in simple terms, computer data is like the toys you give to your computer friend, and the processed information is like the organized collection of those toys stored in boxes on your computer.
Big Data is a massive collection of data that continues to grow dramatically over time. It is a data set so huge and complicated that no typical data management technology can effectively store or process it. Big Data is like regular data, but on a much larger scale: normally we work with data of megabyte size (Word documents, Excel sheets) or at most gigabytes (movies, code), but data on the order of petabytes, i.e. 10^15 bytes, is called Big Data.

Because this data is so massive and intricate, regular methods of handling and processing data are not effective for it. Here's a fun fact: around 90% of the data we have today has been created in just the last three years, which gives you an idea of how rapidly data is growing. In simple terms, Big Data is like trying to manage a library with billions of books, each containing an immense amount of information. Regular methods of cataloguing and organizing won't work, and that's where specialized technologies and tools come in to make sense of this vast sea of information.

Example: Think about a social media platform like Facebook. It processes and stores an enormous amount of data every second: posts, photos, likes, comments, and more. The sheer volume and variety of this data make it a classic example of Big Data. Traditional databases and tools wouldn't be able to handle the scale and complexity of the information generated by millions (or even billions) of users worldwide, so specialized Big Data technologies are used to manage and analyze this data effectively.
Example: Imagine you run a pizza shop. Big Data for your business could be the huge amount of information you get from online orders, customer reviews, delivery routes, and even social media mentions. By analyzing this data (Volume), you can offer personalized deals to your customers (Value), ensure that the information you have is reliable (Veracity), represent your sales trends in easy-to-read charts (Visualization), handle different types of data like text orders and delivery locations (Variety), quickly respond to changing demands (Velocity), and make your promotions go viral on social media (Virality).
There are three main challenges with conventional systems, which are as follows:
1. Volume of Data
- The volume of data is increasing day by day, especially the data generated from machines, telecommunication services, airline services, sensors, etc.
- The rapid growth in data every year comes with new sources of data that keep emerging.
- As per one survey, the growth in the volume of data is so rapid that IBM expected that by 2020 around 35 zettabytes of data would be stored in the world.
2. Processing and Analyzing
- Processing such a large volume of data is a major challenge and is very difficult.
- Organizations make use of such large volumes of data by analyzing them in order to achieve their business goals.
- Extracting insights from such a large amount of data is time consuming and also takes a lot of effort.
- Processing and analyzing the data is also costly, since the data comes in different formats and is complex.
3. Management of Data
- As the data gathered comes in different formats (structured, semi-structured and unstructured), it is very challenging to manage such a variety of data.
1. Unstructured
- Any data with an unknown form or structure is classified as unstructured data.
- Unstructured data is like a messy room where things are scattered around and you're not sure where everything is. It doesn't have a predefined format or structure.
- In addition to its huge size, unstructured data poses multiple challenges when it comes to processing it to derive value from it.
- A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images and videos, like the results of a search in the Google engine.
- Nowadays organizations have a wealth of data available to them, but unfortunately they don't know how to derive value from it, since this data is in its raw, unstructured form.
- Example: Think of Google Search; it deals with various types of information like text, images, and videos. All this data is raw and not neatly organized.
2. Semi-structured
- Semi-structured data is a bit like a room where things are not neatly arranged but have some labels or tags on them. It contains both structured and unstructured elements.
- Semi-structured data pertains to data containing both of the formats mentioned above, that is, structured and unstructured data. To be precise, it refers to data that has not been classified under a particular repository (database), yet contains vital information or tags that segregate individual elements within the data.
- Web application data, which is unstructured, consists of log files, transaction history files, etc., whereas online transaction processing systems are built to work with structured data, where the data is stored in relations (tables).
- A user can see semi-structured data as structured in form, but it is actually not defined by, for example, a table definition in a relational DBMS.
- A typical case is personal data stored in an XML file, as in the small sample shown below.
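For illustration only, a tiny XML record of this kind might look like the following (the person and the field names are made up for the example and are not taken from the notes):

<person>
  <name>Asha Patel</name>
  <age>29</age>
  <city>Pune</city>
  <email>asha.patel@example.com</email>
</person>

The tags (name, age, city, email) are what make this semi-structured: the file is plain text with no table definition behind it, yet each element is labelled, so a program can still pick out the individual fields.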
Big Data Analytics refers to the process of examining large and complex datasets to uncover hidden patterns, correlations, and insights that can help businesses and organizations make better-informed decisions. It involves using various tools, techniques, and technologies to analyze massive volumes of data that traditional data processing applications might struggle to handle.

Imagine you have a huge pile of puzzle pieces. Each piece represents a piece of information, like customer data, online activities, or sensor readings. Big Data Analytics is like putting these puzzle pieces together to reveal the bigger picture: insights and trends that can be valuable for making smarter decisions. In short, Big Data Analytics is the method of examining vast and diverse sets of data, often in real time, to extract meaningful patterns, trends, and information that can guide decision-making processes.

Example: Let's say a company collects data on how customers interact with its website, including what products they view, how much time they spend on each page, and whether they make a purchase. By using Big Data Analytics, the company can analyze this large dataset to identify patterns, such as which products are popular, what factors influence purchasing decisions, and how to optimize the website for a better user experience. This valuable information can then be used to enhance marketing strategies, improve customer satisfaction, and ultimately boost sales.
Complex or massive data sets that are impractical to manage using traditional database systems and software tools are referred to as big data. Organizations utilize big data in one way or another, and it is big data technology that makes it possible to realize big data's value. Big data is a voluminous amount of both multi-structured and unstructured data.
Netflix is a very popular entertainment company that provides online, on-demand, web-based video streaming to its customers. With Big Data, it has been able to predict precisely what its customers will enjoy watching. Recently, Netflix began positioning itself as a content creator, not simply a distribution medium, a move that is solidly based on data analytics. Data such as what customers watch, where playback is regularly halted, ratings and so on feed its recommendation engines. Netflix has incorporated Hadoop, Hive and Pig along with other traditional business intelligence tools.
Intelligent Data Analysis (IDA) is one of the major issues in the field of artificial intelligence and information. IDA is like a smart detective for information hidden in a massive pile of data. Imagine you have a big box of puzzle pieces, and IDA is the clever friend who not only puts the puzzle together but also discovers new patterns and stories you didn't know were there. It helps us make better decisions by uncovering hidden and valuable information in large sets of data. IDA is a smart way of using computers to dig into a ton of information and find things we didn't know before. It includes three major steps: preparing the data, finding patterns or rules in it (the data mining step), and validating and explaining the results.
Undoubtedly, Big Data has become a major game changer in most of the cutting-edge industries over the last few years. As Big Data keeps growing day by day, the number of organizations adopting it keeps expanding. Let's discuss an example: an e-commerce site XYZ (having 100 million users) wants to offer a gift voucher of $100 to its top 10 customers who have spent the most in the previous year. Moreover, it wants to find the buying trend of these customers so that the company can suggest more items related to them.

Issue: a huge amount of unstructured data which needs to be stored, processed and analyzed.

Solution:
- Storage: For this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System), which uses commodity hardware to form clusters and store data in a distributed fashion. It works on a "write once, read many times" principle.
- Processing: The MapReduce paradigm is applied to the data distributed over the network to find the required output (a minimal sketch follows below).
- Analyze: Pig and Hive can be used to analyze the data.
- Cost: Hadoop is open source, so cost is no longer an issue.
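As a rough illustration of the processing step, here is a minimal MapReduce sketch in Java that sums up how much each customer spent. It assumes, purely for the example, that each order is stored as a text line of the form customerId,amount (e.g. C1023,49.90); the class names, field layout and paths are made up and are not part of the notes.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CustomerSpend {

    // Map: parse one order line ("customerId,amount") and emit (customerId, amount).
    public static class SpendMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");
            if (parts.length == 2) {
                context.write(new Text(parts[0]),
                              new DoubleWritable(Double.parseDouble(parts[1])));
            }
        }
    }

    // Reduce: add up all the amounts for one customer.
    public static class SpendReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text customer, Iterable<DoubleWritable> amounts, Context context)
                throws IOException, InterruptedException {
            double total = 0;
            for (DoubleWritable amount : amounts) {
                total += amount.get();
            }
            context.write(customer, new DoubleWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "customer spend");
        job.setJarByClass(CustomerSpend.class);
        job.setMapperClass(SpendMapper.class);
        job.setReducerClass(SpendReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /orders/last-year
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /reports/spend
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The output is one total per customer, which is small enough that the top 10 can then be picked with a second, tiny job or with a short Hive or Pig query.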
Hadoop's story begins at Yahoo! in the mid-2000s. Doug Cutting, along with Mike Cafarella, started working on an open-source project inspired by Google's MapReduce and Google File System (GFS) papers. They named it Hadoop, after a toy elephant belonging to Cutting's son. ❖ Core Components: Hadoop is made up of two core components: HDFS (the Hadoop Distributed File System), which provides distributed storage, and MapReduce, which provides distributed processing.
Apache Hadoop is an open-source software framework used for distributed storage and processing of large datasets on clusters of commodity hardware. In simpler terms, it's like a massive storage and computing system spread across many computers working together. Components of Hadoop: HDFS (the storage layer), MapReduce (the processing layer), YARN (the resource-management layer introduced in Hadoop 2) and Hadoop Common (the shared libraries and utilities the other components rely on).
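To make the storage side a little more concrete, the short sketch below uses Hadoop's Java FileSystem API to write a file into HDFS and read it back; the file path and the text written are illustrative, not taken from the notes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster settings (core-site.xml etc.) from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");  // illustrative path

        // Write a small file into HDFS...
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello from HDFS");
        }

        // ...and read it back.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}

Behind the scenes, HDFS splits files into blocks and replicates each block across several DataNodes, which is what makes the "write once, read many times" pattern fault tolerant.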
Hadoop is an open-source framework that allows for the distributed storage and processing of large data sets using a cluster of commodity hardware. Its ecosystem consists of various tools and components that work together to handle big data processing tasks.
Unit 2
Rack awareness in Hadoop is a way of organizing and managing data storage in a distributed computing environment. In simpler terms, imagine you have a big system with lots of computers (nodes) working together to process and store data. These nodes are grouped into racks, which are like large shelves holding several computers.

Now, rack awareness is the idea that when you're storing or retrieving data, it's better to use nodes that are on the same rack, or shelf, rather than picking nodes randomly. This is because communication between nodes on the same rack is faster and more efficient than communication between nodes on different racks. By default, when you install Hadoop, it assumes that all the computers (nodes) are on the same rack. However, in a real-world scenario, you might have multiple racks of nodes. Rack awareness helps Hadoop make smarter decisions about where to place and access data so that it can take advantage of the proximity of nodes on the same rack.

Example: Let's say you have a Hadoop cluster with three racks: Rack A, Rack B, and Rack C. Each rack contains several nodes. Now, if you want to store a piece of data, rack awareness will try to place that data on nodes within the same rack. This way, when you need to retrieve the data, the communication happens more quickly because the nodes are physically closer to each other.
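For a rough idea of how this looks in practice, rack awareness is usually enabled by pointing Hadoop at a topology script through the net.topology.script.file.name property in core-site.xml; the script is given node addresses and prints the rack each one belongs to. The file path and rack names below are illustrative, not taken from the notes.

<!-- core-site.xml (illustrative): register a script that maps nodes to racks -->
<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/rack-topology.sh</value>
</property>

The script itself simply prints a rack id such as /rack-a or /rack-b for every node address it receives, and Hadoop then uses those ids when deciding where to place block replicas and which node to read from.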