DATA ANALYTICS: Data analytics is the process of examining raw data to uncover patterns, insights, and trends that can drive decision-making. It plays a critical role across industries, helping businesses optimise operations, understand customer behaviour, and predict future outcomes.

STRUCTURED, SEMI-STRUCTURED, AND UNSTRUCTURED DATA:
- Structured Data. Definition: Highly organised and formatted data that follows a predefined data model. Sources: SQL databases (MySQL, PostgreSQL); spreadsheets (Excel, Google Sheets); ERP systems (financial data). Uses: BI (sales and inventory analysis); CRM (storing customer details); reporting and automation.
- Semi-Structured Data. Definition: Data that does not conform strictly to a structure but contains tags or markers. Sources: XML/JSON files; NoSQL databases (MongoDB, Cassandra); emails (metadata); APIs; log files. Uses: web data exchange between services; log files for system monitoring; Big Data pipelines.
- Unstructured Data. Definition: Data that lacks a specific format, making it harder to analyse or process. Sources: text documents (Word, PDFs); social media posts; emails (body content); IoT devices; media files (images, videos, audio); free-form logs. Uses: sentiment analysis (public opinion); NLP (text insights); image recognition; voice assistants (speech analysis).

QUASI-STRUCTURED DATA: Definition: Loosely structured data with inconsistent formatting but some organisational properties that allow partial processing or categorisation. Sources: Clickstream Data: user navigation patterns. Web-Scraped Data. Event Logs: software logs. Metadata. Uses: behavioural analysis and journey mapping; web analytics (identifying trends or patterns); data mining.

QUANTITATIVE VS QUALITATIVE DATA:
- Quantitative Data. Definition: Numerical information dealing with measurable, objective dimensions such as height, width, length, temperature, prices, area, and volume. Types: Continuous Data: measurements that can be infinitely divided (e.g., height in metres or centimetres); Discrete Data: items that can be counted and take finite whole-number values (e.g., the number of people in a group of 10). Examples: dimensions (height, width, length), temperature, prices, area, volume.
- Qualitative Data. Definition: Descriptive information dealing with characteristics and descriptors observed subjectively (e.g., smells, tastes, textures, attractiveness, colour). Types: Categorical Data: characteristics (e.g., gender, marital status) that may use numerical labels (e.g., "1" for male, "2" for female); Binary Data: two exclusive categories (e.g., true/false); Nominal Data: labels without quantitative value (e.g., hair colour, school name); Ordinal Data: categories with a meaningful numerical order (e.g., a restaurant rating on a scale of 0 to 4). Examples: smells, tastes, textures, attractiveness, colour; rating systems with meaningful order (ordinal data).

TYPES OF DATA BY TIME DIMENSION (see the sketch after this list):
1. Time Series Data. Definition: Observations on a variable collected at regular intervals over time. Examples: GDP over years, stock prices, daily sales, money supply. Uses: forecasting, trend analysis, financial and economic analysis.
2. Cross-Section Data. Definition: Data collected on many units at a single point in time, without a time dimension. Examples: opinion polls, income surveys, GNP per capita for a set of countries. Uses: analysing relationships between variables, sample studies.
3. Pooled Data. Definition: A combination of cross-section and time-series data in which different units are observed over time. Example: GNP per capita for European countries over ten years.
4. Panel Data. Definition: Time-series data on the same cross-sectional units over time. Examples: the wage and employment history of individuals over ten years, company performance over time. Uses: tracking changes within units, studying behaviour or trends over time, assessing the impact of policy or social change.
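To make the time-dimension shapes concrete, here is a minimal sketch in Python with pandas (the cheat sheet names Python as an analysis language below); all figures are invented for illustration.

```python
import pandas as pd

# Time series: one variable observed at regular intervals over time.
ts = pd.Series([2.1, 2.4, 2.3, 2.8],
               index=pd.period_range("2020", periods=4, freq="Y"),
               name="gdp_growth_pct")

# Cross-section: many units observed at a single point in time.
cs = pd.DataFrame({"country": ["AT", "BE", "DK"],
                   "gnp_per_capita": [53_000, 51_000, 61_000]})  # one year only

# Panel (and pooled) data: units crossed with time, here as a (unit, year) index.
panel = pd.DataFrame(
    {"wage": [30_000, 32_000, 28_000, 29_500]},
    index=pd.MultiIndex.from_product(
        [["person_1", "person_2"], [2020, 2021]], names=["unit", "year"]),
)

print(ts, cs, panel, sep="\n\n")
```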
DATA QUALITY CHARACTERISTICS:
1. Accuracy: Data is correct in every detail. Accurate data reflects real-world situations, preventing significant errors and poor decisions.
2. Reliability: The collection process is consistent, with no contradictions against other trusted sources. Reliable data builds trust, while unreliable data can lead to costly mistakes and reputational damage.
3. Relevance: Data is pertinent to its intended purpose. Relevant data is essential; unnecessary data wastes time and resources.
4. Timeliness: Data is up-to-date and available when needed. Timely data supports prompt decisions, saving time and money while protecting the organisation's reputation.
5. Completeness: Information is comprehensive and meets data needs. Complete data ensures usability; incomplete data can hinder processes like mailings or analyses.

WHAT IS BIG DATA? Big Data refers to high-volume, high-velocity, and/or high-variety information assets that cannot be efficiently stored or processed using traditional computing within a specific timeframe. It requires advanced processing to enable decision-making, insight discovery, and process optimisation. Example of Big Data: attempting to attach a 100-megabyte document to an email. Most email systems cannot handle such large attachments, illustrating how data that exceeds a system's processing limits can be classified as Big Data.

WHERE DOES BIG DATA COME FROM? Big Data is generated from various fields, including: 1. Social Networking Sites: Platforms like Facebook and Twitter collect user data, posts, and shared links. 2. Search Engines: Retrieve data from databases, generating information based on user queries. 3. Medical History: Hospitals create extensive medical documentation of health issues and treatments. 4. Online Shopping: E-commerce platforms collect data on customer preferences and behaviour. 5. Stock Exchange: The trading of shares generates data on market activities and investor behaviour.

WHAT IS BIG DATA ANALYTICS? Big Data Analytics is the process of extracting meaningful insights from large datasets to uncover patterns, trends, and preferences. Methodology: It involves extracting useful information through data management and subsequent analysis, which includes cleaning, transforming, and identifying useful information to assist decision-making. Importance: Big Data Analytics can lead to new revenue opportunities, effective marketing, enhanced customer service, improved operational efficiency, and competitive advantages. Tools: Once data is prepared, it can be analysed using tools like Data Mining (identifying patterns), Predictive Analytics (forecasting behaviour), Machine Learning (analysing large datasets), and Deep Learning (complex data analysis).

CHARACTERISTICS OF BIG DATA: 1. Volume: The amount of data generated, requiring assessment and growth forecasting across various sources (e.g., databases, spreadsheets). 2. Velocity: The speed at which data is generated and processed, driven by digital channels, real-time analytics, social media, and IoT devices. 3. Variety: The different types of data generated, including structured, semi-structured, and unstructured data; integration techniques are used for analysis. 4. Veracity: The quality of data, which can be inconsistent and uncertain, affecting analysis. 5. Value: The worth of converting data into useful insights. Data must be processed to extract meaningful information, making Value the most critical characteristic.

SIGNIFICANCE OF THE V'S OF BIG DATA: 1. Informed Decision-Making: Enables data-driven decisions by identifying patterns. 2. Customisation: Allows real-time personalisation of user experiences. 3. Security Concerns: Highlights the need for safeguarding sensitive data. 4. Innovation: Drives improvements in data storage and processing solutions. 5. Skill Demand: Creates a need for professionals skilled in Big Data Analytics. 6. Problem-Solving: Helps tackle global issues like climate forecasting. 7. Ethics: Raises concerns about bias and inequality in AI algorithms.

DATA ANALYSIS PROCESS: The data analysis process consists of four basic steps (sketched in code after this list): 1. Gathering Data - Collect data from various sources and organise it for better analysis. 2. Data Storage - Store and manage the data in databases and Excel sheets. 3. Statistical Analysis - Interpret the stored data using models to identify trends, utilising programming languages like Python and R. 4. Data Presentation - Format the data for accessibility and understanding, particularly for those responsible for growth, analysis, efficiency, and operations.
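A minimal sketch of the four steps, assuming Python with pandas and matplotlib; the file name `sales.csv` and its columns are hypothetical placeholders, not part of the cheat sheet.

```python
import pandas as pd
import matplotlib.pyplot as plt

# 1. Gathering: read raw data from a source (hypothetical CSV file).
raw = pd.read_csv("sales.csv")  # columns assumed: date, region, revenue

# 2. Storage/management: persist a cleaned copy (a database would also work).
clean = raw.dropna().assign(date=lambda d: pd.to_datetime(d["date"]))
clean.to_csv("sales_clean.csv", index=False)

# 3. Statistical analysis: summarise revenue by region to identify trends.
summary = clean.groupby("region")["revenue"].agg(["mean", "sum"])

# 4. Presentation: format the result for the audience.
summary.plot(kind="bar", title="Revenue by region")
plt.tight_layout()
plt.savefig("revenue_by_region.png")
print(summary)
```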
NEED FOR DATA ANALYTICS: Data analytics enhances efficiency and performance across industries by identifying patterns. It provides organisations with a competitive edge through data-driven decision-making. By combining big data with AI and machine learning, companies can accurately predict outcomes and tailor targeted marketing strategies. Additionally, analytics offers insights into customer experiences, allowing for personalisation, optimised operations, and improved employee productivity.

# MAPREDUCE: MapReduce is a programming model used for large-scale data processing in distributed systems. It splits tasks into smaller sub-tasks and processes them in parallel, making it an essential component of big data technologies like Hadoop.
Key Concepts: - Distributed Computing: MapReduce processes data distributed across multiple nodes, ensuring scalability and fault tolerance. - Divide and Conquer: Tasks are divided into chunks and processed using Map and Reduce functions. - Integration: Works with Hadoop, HDFS, and Spark.
Phases (mirrored by the sketch below): 1. Input Splitting: Input data is divided into chunks. 2. Mapping: The Map function creates intermediate key-value pairs. 3. Shuffling and Sorting: Key-value pairs are grouped and sorted by keys. 4. Reducing: The Reduce function aggregates the grouped data. 5. Output: The final result is written to the output file (e.g., on HDFS).
Core Components: - Mapper: Generates key-value pairs from input data. - Reducer: Aggregates key-value pairs based on keys. - Master Node: Manages task scheduling. - Worker Nodes: Execute Map and Reduce tasks.
Advantages: - Scalability, fault tolerance, simplicity, cost-effectiveness.
Limitations: - Batch processing only, heavy disk I/O, limited iterative processing, rigid programming model.
Modern Alternatives: - Apache Spark: Enhances performance by keeping data in memory. - Tez: Reduces latency in workflows. - Flink: Supports both batch and real-time processing.
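As an illustration of the five phases, here is a minimal pure-Python word-count sketch; no Hadoop is required, and the in-memory sort stands in for the distributed shuffle.

```python
from itertools import groupby
from operator import itemgetter

documents = ["big data needs big tools", "map reduce splits big jobs"]

# 1. Input splitting: each document is one chunk.
# 2. Mapping: emit an intermediate (word, 1) pair per occurrence.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# 3. Shuffling and sorting: order the pairs by key so equal words sit together.
mapped.sort(key=itemgetter(0))

# 4. Reducing: aggregate the counts within each key group.
reduced = {word: sum(count for _, count in pairs)
           for word, pairs in groupby(mapped, key=itemgetter(0))}

# 5. Output: print instead of writing to HDFS.
print(reduced)  # {'big': 3, 'data': 1, ...}
```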
# APACHE HBASE: Apache HBase is a distributed, scalable NoSQL database designed for real-time read and write operations in big data applications. It is built on Hadoop's HDFS and is ideal for handling sparse, large datasets.
Key Features: - Column-oriented storage for efficient access. - Scales horizontally across commodity servers. - Real-time reads and writes with strong consistency. - Schema-less, with support for semi-structured data.
Architecture: - HBase Master: Manages the cluster and regions. - ZooKeeper: Coordination and configuration management. - RegionServers: Store and manage data regions. - MemStore: In-memory write cache before flushing to HDFS.
Data Model: - Tables with rows identified by row keys. - Columns grouped into column families. - Timestamps enable versioned cells.
Use Cases: - Time-series data, real-time analytics, content management, social media feeds.
Advantages: - Scalable, flexible schema, fast, and fault-tolerant. Integrates well with the Hadoop ecosystem.
Limitations: - Complex to configure, high memory usage, no built-in query language.

# MAPR: MapR is a unified data platform that combines distributed file storage, a NoSQL database, and event streaming, engineered for enterprise-grade performance, scalability, and high availability.
Key Features: - Unified platform integrating file storage, databases, and event streaming. - Real-time capabilities for dynamic applications. - Multi-cloud and hybrid environment support. - High availability with data replication. - Integration with big data tools like Spark and TensorFlow.
Core Components: 1. MapR-FS: Distributed POSIX-compliant filesystem. 2. MapR-DB: NoSQL database for real-time analytics. 3. MapR Streams: Event streaming system. 4. MCS: Web interface for managing clusters. 5. MEP: Integrates Hadoop, Spark, Hive, and Drill.
Key Architectural Features: - Data replication for availability and disaster recovery. - Container support with Kubernetes. - Security features like encryption and access control.
Use Cases: - IoT sensor data processing. - Fraud detection in finance. - Retail personalisation. - Healthcare genomics.
Advantages: - High performance and scalability. - Integrates cloud and big data tooling. - Simplified management.
Challenges: - High cost and vendor dependency. - Requires technical expertise for advanced features.

# APACHE HADOOP: Apache Hadoop is an open-source framework for distributed processing of large datasets across clusters. It scales from a single server to thousands of machines, enabling both structured and unstructured data processing.
Core Components: 1. HDFS: Distributed storage with data replication (default replication factor: 3). 2. MapReduce: Parallel computation for batch processing, divided into Map, Shuffle, and Reduce phases. 3. YARN: Manages resources and job scheduling across nodes. 4. Hadoop Common: Provides shared utilities and libraries.
Architecture: - Master-Slave: NameNode (HDFS) and ResourceManager (YARN) as masters; DataNodes (HDFS) and NodeManagers (YARN) as slaves. - Data Flow: Data is stored in HDFS, processed via MapReduce, and results are returned to HDFS.
Advantages: - Scalable, fault-tolerant, cost-effective, flexible, and open-source.
Limitations: - High latency, disk I/O overhead, complex to use, inefficient for iterative processing.
Ecosystem: Includes tools like Pig (scripting over MapReduce), Hive (SQL-like query engine), HBase (NoSQL), Spark (real-time analytics), and more.
Use Cases: Used in retail, healthcare, finance, social media, and telecom for analytics and resource optimisation.

# SHARDING: Sharding is a technique for distributing data across multiple servers to improve scalability, performance, and fault tolerance. Data is split into smaller pieces called shards, each operating as an independent database.
Key Concepts: - Shard: A data partition. - Shard Key: Determines how data is distributed. - Horizontal Partitioning: Distributes rows across multiple nodes. - Replication: Copies shards for fault tolerance.
Why Use Sharding? - Scalability: Enables horizontal scaling. - Performance: Improves latency and throughput. - Fault Tolerance: Allows partial availability if a shard fails. - Manageability: Easier to manage smaller datasets.
Sharding Strategies (see the sketch below): - Range-Based: Divides data by value ranges. - Hash-Based: Uses a hash function for even data distribution. - Geo-Based: Distributes data by location (e.g., by IP). - Composite: Combines multiple strategies.
Sharding vs Replication: Sharding distributes data across nodes, while replication creates copies of data across nodes. Sharding allows horizontal scaling; replication offers fault tolerance through data redundancy.
Advantages: - Scalability, faster performance, fault isolation, and cost-effective scaling.
Challenges: - Complexity, rebalancing, hot spots, and consistency issues.
Use Cases: - Web apps, time-series data, and high-traffic databases.
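A minimal sketch of hash-based shard routing in Python; the four-shard setup, key names, and the choice of MD5 are illustrative assumptions, not any specific database's implementation.

```python
import hashlib

NUM_SHARDS = 4  # assumed cluster size, for illustration only
shards = {i: {} for i in range(NUM_SHARDS)}  # each dict stands in for one server

def shard_for(key: str) -> int:
    """Hash the shard key so rows spread evenly across shards."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key: str, value: dict) -> None:
    shards[shard_for(key)][key] = value  # route the write to its shard

def get(key: str) -> dict:
    return shards[shard_for(key)][key]  # the same hash finds the read

put("user:1001", {"name": "Ada"})
put("user:1002", {"name": "Grace"})
print(get("user:1001"), "lives on shard", shard_for("user:1001"))
```

Note that this plain modulo scheme forces a large rebalancing whenever NUM_SHARDS changes, which is one reason production systems often use consistent hashing instead.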
# APACHE PIG: Apache Pig is a high-level platform for processing large datasets in Hadoop using a scripting language called Pig Latin. It simplifies MapReduce programming and supports data transformation, aggregation, and analysis.
Key Features: - High-Level Language: Pig Latin, optimised for big data, similar to SQL. - Data Handling: Supports text, JSON, XML, and binary formats. - Optimisation: Automatically optimises execution plans. - Extensibility: Custom functions in Java and Python. - Hadoop Integration: Converts scripts into MapReduce jobs. - Fault Tolerance: Inherits Hadoop's fault tolerance.
Architecture: 1. Pig Latin Script: The user writes tasks. 2. Parser: Checks syntax and creates a logical plan. 3. Optimizer: Optimises the plan. 4. Compiler: Converts the plan to MapReduce jobs. 5. Execution Engine: Executes the jobs.
Data Model: - Atom: A single data value. - Tuple: A collection of fields. - Bag: An unordered collection of tuples. - Map: Key-value pairs.

# APACHE HIVE: Apache Hive is a data warehouse infrastructure built on Hadoop that enables querying and managing large datasets using a SQL-like language called HiveQL. It simplifies Hadoop's complexity by providing a user-friendly interface for those familiar with SQL, making it ideal for data analysis, warehousing, and report generation.
Key Features: - SQL-like interface (HiveQL). - Integration with Hadoop (HDFS for storage, MapReduce for execution). - Schema on read and data partitioning. - Extensibility with custom UDFs. - Supports various storage formats (e.g., ORC, Parquet).
Architecture: - User interfaces for interaction (CLI, JDBC/ODBC, web). - Driver and Compiler for query planning and execution. - Execution Engine supports MapReduce, Tez, or Spark.
Advantages: - Easy for SQL users, scalable, and extensible. - Integrates with BI tools (Tableau, Power BI).
Limitations: - Not real-time, high latency, and limited ACID transaction support.
Use Cases: - Log processing, data warehousing, business intelligence, and ETL pipelines.

# AMAZON S3: Amazon S3 (Simple Storage Service) is a scalable, high-speed, and cost-effective object storage service by AWS designed for storing and retrieving data. It offers high availability, durability, and security.
Key Features: - Scalability: Adjusts to growing data needs. - Durability: 99.999999999% (eleven nines) durability with replication across multiple Availability Zones. - Availability: 99.99% uptime, with higher-availability options. - Security: AES-256 encryption, SSL/TLS, and fine-grained access control. - Flexibility: Stores any object type, up to 5 TB per object. - Cost-Effective: Offers tiered pricing like S3 Standard, Intelligent-Tiering, and Glacier.
How S3 Works: - Buckets: Containers for data with globally unique names. - Objects: Data with metadata and a unique key. - Keys: Unique identifiers for objects. - Regions: Buckets are created in specific AWS regions.
Core Concepts: - Storage Classes: Options for frequently or infrequently accessed data. - Versioning: Protects against accidental overwrites and deletions. - Lifecycle Policies: Automate object transitions and expiry. - Replication: Cross-Region or Same-Region replication for redundancy. - Access Control: Managed via policies, IAM roles, and ACLs.
Use Cases: - Backup and Recovery, Data Archiving, Big Data Analytics, Content Delivery, Application Hosting.
Benefits: - High Performance, Global Access, Integration with AWS services, Customisable Security.
Limitations: - Not suited for relational databases, requires understanding of AWS services, network latency.
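A minimal sketch of the bucket/object/key workflow using AWS's boto3 SDK for Python; the bucket name and region are placeholders, and valid AWS credentials are assumed to be configured.

```python
import boto3

s3 = boto3.client("s3", region_name="eu-central-1")  # region is an assumption

bucket = "my-example-bucket-name"  # bucket names must be globally unique

# Create a bucket; outside us-east-1 a LocationConstraint is required.
s3.create_bucket(Bucket=bucket,
                 CreateBucketConfiguration={"LocationConstraint": "eu-central-1"})

# Objects: data plus metadata, addressed by a unique key within the bucket.
s3.put_object(Bucket=bucket, Key="reports/2024/summary.txt",
              Body=b"quarterly summary", Metadata={"owner": "analytics"})

# Retrieve the object by bucket + key.
obj = s3.get_object(Bucket=bucket, Key="reports/2024/summary.txt")
print(obj["Body"].read().decode())
```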