DATA ANALYTICS: Data analytics is the process of examining raw data to uncover patterns, insights, and trends that can drive decision-making. It plays a critical role across industries, helping businesses optimise operations, understand customer behaviour, and predict future outcomes.

STRUCTURED, SEMI-STRUCTURED, AND UNSTRUCTURED DATA:
- Structured Data. Definition: Highly organised and formatted data that follows a predefined data model. Sources: SQL databases (MySQL, PostgreSQL); spreadsheets (Excel, Google Sheets); ERP systems (financial data). Uses: BI (sales and inventory analysis); CRM (storing customer details); reporting and automation.
- Semi-Structured Data. Definition: Data that does not conform strictly to a structure but contains tags or markers. Sources: XML/JSON files; NoSQL databases (MongoDB, Cassandra); emails (metadata); APIs; log files. Uses: web data exchange between services; log files for system monitoring; Big Data pipelines.
- Unstructured Data. Definition: Data that lacks a specific format, making it harder to analyse or process. Sources: text documents (Word, PDFs); social media posts; emails (body content); IoT devices; media files (images, videos, audio); free-form logs. Uses: sentiment analysis (public opinion); NLP (text insights); image recognition; voice assistants (speech analysis).

QUASI-STRUCTURED DATA: Definition: Loosely structured data with inconsistent formatting but some organisational properties that allow partial processing or categorisation. Sources: Clickstream Data: user navigation patterns. Web-Scraped Data. Event Logs: software logs. Metadata. Uses: behavioural analysis and journey mapping; web analytics (identifying trends or patterns); data mining.

QUANTITATIVE VS QUALITATIVE DATA:
- Quantitative Data. Definition: Numerical information dealing with measurable, objective dimensions such as height, width, length, temperature, prices, area, and volume. Types: Continuous Data: measurements that can be infinitely divided (e.g., height in metres or centimetres); Discrete Data: items that can be counted and take finite whole-number values (e.g., the number of people in a group of 10). Examples: dimensions (height, width, length), temperature, prices, area, volume.
- Qualitative Data. Definition: Descriptive information dealing with characteristics and descriptors observed subjectively (e.g., smells, tastes, textures, attractiveness, colour). Types: Categorical Data: characteristics (e.g., gender, marital status) that may use numerical labels (e.g., "1" for male, "2" for female); Binary Data: two exclusive categories (e.g., true/false); Nominal Data: labels without quantitative value (e.g., hair colour, school name); Ordinal Data: categories with a meaningful numerical order (e.g., a restaurant rating on a scale of 0 to 4). Examples: smells, tastes, textures, attractiveness, colour; rating systems with meaningful order (ordinal data).

TYPES OF DATA BY TIME DIMENSION (see the sketch after this list):
1. Time Series Data. Definition: Observations on a variable collected at regular intervals over time. Examples: GDP over years, stock prices, daily sales, money supply. Uses: forecasting, trend analysis, financial and economic analysis.
2. Cross-Section Data. Definition: Data collected on many units at a single point in time, without a time dimension. Examples: opinion polls, income surveys, GNP per capita for a set of countries. Uses: analysing relationships between variables, sample studies.
3. Pooled Data. Definition: A combination of cross-section and time-series data in which different units are observed over time. Example: GNP per capita for European countries over ten years.
4. Panel Data. Definition: Time-series data on the same cross-sectional units over time. Examples: the wage and employment history of individuals over ten years, company performance over time. Uses: tracking changes within units, studying behaviour or trends over time, assessing the impact of policy or social change.
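To make the time-dimension shapes concrete, here is a minimal sketch in Python with pandas (the cheat sheet names Python as an analysis language below); all figures are invented for illustration.

```python
import pandas as pd

# Time series: one variable observed at regular intervals over time.
ts = pd.Series([2.1, 2.4, 2.3, 2.8],
               index=pd.period_range("2020", periods=4, freq="Y"),
               name="gdp_growth_pct")

# Cross-section: many units observed at a single point in time.
cs = pd.DataFrame({"country": ["AT", "BE", "DK"],
                   "gnp_per_capita": [53_000, 51_000, 61_000]})  # one year only

# Panel (and pooled) data: units crossed with time, here as a (unit, year) index.
panel = pd.DataFrame(
    {"wage": [30_000, 32_000, 28_000, 29_500]},
    index=pd.MultiIndex.from_product(
        [["person_1", "person_2"], [2020, 2021]], names=["unit", "year"]),
)

print(ts, cs, panel, sep="\n\n")
```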
DATA QUALITY CHARACTERISTICS:
1. Accuracy: Data is correct in every detail. Accurate data reflects real-world situations, preventing significant errors and poor decisions.
2. Reliability: The collection process is consistent, with no contradictions against other trusted sources. Reliable data builds trust, while unreliable data can lead to costly mistakes and reputational damage.
3. Relevance: Data is pertinent to its intended purpose. Relevant data is essential; unnecessary data wastes time and resources.
4. Timeliness: Data is up-to-date and available when needed. Timely data supports prompt decisions, saving time and money while protecting the organisation's reputation.
5. Completeness: Information is comprehensive and meets data needs. Complete data ensures usability; incomplete data can hinder processes like mailings or analyses.

WHAT IS BIG DATA? Big Data refers to high-volume, high-velocity, and/or high-variety information assets that cannot be efficiently stored or processed using traditional computing within a specific timeframe. It requires advanced processing to enable decision-making, insight discovery, and process optimisation. Example of Big Data: attempting to attach a 100-megabyte document to an email. Most email systems cannot handle such large attachments, illustrating how data that exceeds a system's processing limits can be classified as Big Data.

WHERE DOES BIG DATA COME FROM? Big Data is generated from various fields, including: 1. Social Networking Sites: Platforms like Facebook and Twitter collect user data, posts, and shared links. 2. Search Engines: Retrieve data from databases, generating information based on user queries. 3. Medical History: Hospitals create extensive medical documentation of health issues and treatments. 4. Online Shopping: E-commerce platforms collect data on customer preferences and behaviour. 5. Stock Exchange: The trading of shares generates data on market activities and investor behaviour.

WHAT IS BIG DATA ANALYTICS? Big Data Analytics is the process of extracting meaningful insights from large datasets to uncover patterns, trends, and preferences. Methodology: It involves extracting useful information through data management and subsequent analysis, which includes cleaning, transforming, and identifying useful information to assist decision-making. Importance: Big Data Analytics can lead to new revenue opportunities, effective marketing, enhanced customer service, improved operational efficiency, and competitive advantages. Tools: Once data is prepared, it can be analysed using tools like Data Mining (identifying patterns), Predictive Analytics (forecasting behaviour), Machine Learning (analysing large datasets), and Deep Learning (complex data analysis).

CHARACTERISTICS OF BIG DATA: 1. Volume: The amount of data generated, requiring assessment and growth forecasting across various sources (e.g., databases, spreadsheets). 2. Velocity: The speed at which data is generated and processed, driven by digital channels, real-time analytics, social media, and IoT devices. 3. Variety: The different types of data generated, including structured, semi-structured, and unstructured data; integration techniques are used for analysis. 4. Veracity: The quality of data, which can be inconsistent and uncertain, affecting analysis. 5. Value: The worth of converting data into useful insights. Data must be processed to extract meaningful information, making Value the most critical characteristic.

SIGNIFICANCE OF THE V'S OF BIG DATA: 1. Informed Decision-Making: Enables data-driven decisions by identifying patterns. 2. Customisation: Allows real-time personalisation of user experiences. 3. Security Concerns: Highlights the need for safeguarding sensitive data. 4. Innovation: Drives improvements in data storage and processing solutions. 5. Skill Demand: Creates a need for professionals skilled in Big Data Analytics. 6. Problem-Solving: Helps tackle global issues like climate forecasting. 7. Ethics: Raises concerns about bias and inequality in AI algorithms.

DATA ANALYSIS PROCESS: The data analysis process consists of four basic steps (sketched in code after this list): 1. Gathering Data - Collect data from various sources and organise it for better analysis. 2. Data Storage - Store and manage the data in databases and Excel sheets. 3. Statistical Analysis - Interpret the stored data using models to identify trends, utilising programming languages like Python and R. 4. Data Presentation - Format the data for accessibility and understanding, particularly for those responsible for growth, analysis, efficiency, and operations.
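A minimal sketch of the four steps, assuming Python with pandas and matplotlib; the file name `sales.csv` and its columns are hypothetical placeholders, not part of the cheat sheet.

```python
import pandas as pd
import matplotlib.pyplot as plt

# 1. Gathering: read raw data from a source (hypothetical CSV file).
raw = pd.read_csv("sales.csv")  # columns assumed: date, region, revenue

# 2. Storage/management: persist a cleaned copy (a database would also work).
clean = raw.dropna().assign(date=lambda d: pd.to_datetime(d["date"]))
clean.to_csv("sales_clean.csv", index=False)

# 3. Statistical analysis: summarise revenue by region to identify trends.
summary = clean.groupby("region")["revenue"].agg(["mean", "sum"])

# 4. Presentation: format the result for the audience.
summary.plot(kind="bar", title="Revenue by region")
plt.tight_layout()
plt.savefig("revenue_by_region.png")
print(summary)
```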
NEED FOR DATA ANALYTICS: Data analytics enhances efficiency and performance across industries by identifying patterns. It provides organisations with a competitive edge through data-driven decision-making. By combining big data with AI and machine learning, companies can accurately predict outcomes and tailor targeted marketing strategies. Additionally, analytics offers insights into customer experiences, allowing for personalisation, optimised operations, and improved employee productivity.

# MAPREDUCE: MapReduce is a programming model used for large-scale data processing in distributed systems. It splits tasks into smaller sub-tasks and processes them in parallel, making it an essential component of big data technologies like Hadoop.
Key Concepts: - Distributed Computing: MapReduce processes data distributed across multiple nodes, ensuring scalability and fault tolerance. - Divide and Conquer: Tasks are divided into chunks and processed using Map and Reduce functions. - Integration: Works with Hadoop, HDFS, and Spark.
Phases (mirrored by the sketch below): 1. Input Splitting: Input data is divided into chunks. 2. Mapping: The Map function creates intermediate key-value pairs. 3. Shuffling and Sorting: Key-value pairs are grouped and sorted by keys. 4. Reducing: The Reduce function aggregates the grouped data. 5. Output: The final result is written to the output file (e.g., on HDFS).
Core Components: - Mapper: Generates key-value pairs from input data. - Reducer: Aggregates key-value pairs based on keys. - Master Node: Manages task scheduling. - Worker Nodes: Execute Map and Reduce tasks.
Advantages: - Scalability, fault tolerance, simplicity, cost-effectiveness.
Limitations: - Batch processing only, heavy disk I/O, limited iterative processing, rigid programming model.
Modern Alternatives: - Apache Spark: Enhances performance by keeping data in memory. - Tez: Reduces latency in workflows. - Flink: Supports both batch and real-time processing.
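As an illustration of the five phases, here is a minimal pure-Python word-count sketch; no Hadoop is required, and the in-memory sort stands in for the distributed shuffle.

```python
from itertools import groupby
from operator import itemgetter

documents = ["big data needs big tools", "map reduce splits big jobs"]

# 1. Input splitting: each document is one chunk.
# 2. Mapping: emit an intermediate (word, 1) pair per occurrence.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# 3. Shuffling and sorting: order the pairs by key so equal words sit together.
mapped.sort(key=itemgetter(0))

# 4. Reducing: aggregate the counts within each key group.
reduced = {word: sum(count for _, count in pairs)
           for word, pairs in groupby(mapped, key=itemgetter(0))}

# 5. Output: print instead of writing to HDFS.
print(reduced)  # {'big': 3, 'data': 1, ...}
```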
# APACHE HBASE: Apache HBase is a distributed, scalable NoSQL database designed for real-time read and write operations in big data applications. It is built on Hadoop's HDFS and is ideal for handling sparse, large datasets.
Key Features: - Column-oriented storage for efficient access. - Scales horizontally across commodity servers. - Real-time reads and writes with strong consistency. - Schema-less, with support for semi-structured data.
Architecture: - HBase Master: Manages the cluster and regions. - ZooKeeper: Coordination and configuration management. - RegionServers: Store and manage data regions. - MemStore: In-memory write cache before flushing to HDFS.
Data Model: - Tables with rows identified by row keys. - Columns grouped into column families. - Timestamps enable versioned cells.
Use Cases: - Time-series data, real-time analytics, content management, social media feeds.
Advantages: - Scalable, flexible schema, fast, and fault-tolerant. Integrates well with the Hadoop ecosystem.
Limitations: - Complex to configure, high memory usage, no built-in query language.

# MAPR: MapR is a unified data platform that combines distributed file storage, a NoSQL database, and event streaming, engineered for enterprise-grade performance, scalability, and high availability.
Key Features: - Unified platform integrating file storage, databases, and event streaming. - Real-time capabilities for dynamic applications. - Multi-cloud and hybrid environment support. - High availability with data replication. - Integration with big data tools like Spark and TensorFlow.
Core Components: 1. MapR-FS: Distributed POSIX-compliant filesystem. 2. MapR-DB: NoSQL database for real-time analytics. 3. MapR Streams: Event streaming system. 4. MCS: Web interface for managing clusters. 5. MEP: Integrates Hadoop, Spark, Hive, and Drill.
Key Architectural Features: - Data replication for availability and disaster recovery. - Container support with Kubernetes. - Security features like encryption and access control.
Use Cases: - IoT sensor data processing. - Fraud detection in finance. - Retail personalisation. - Healthcare genomics.
Advantages: - High performance and scalability. - Integrates cloud and big data tooling. - Simplified management.
Challenges: - High cost and vendor dependency. - Requires technical expertise for advanced features.

# APACHE HADOOP: Apache Hadoop is an open-source framework for distributed processing of large datasets across clusters. It scales from a single server to thousands of machines, enabling both structured and unstructured data processing.
Core Components: 1. HDFS: Distributed storage with data replication (default replication factor: 3). 2. MapReduce: Parallel computation for batch processing, divided into Map, Shuffle, and Reduce phases. 3. YARN: Manages resources and job scheduling across nodes. 4. Hadoop Common: Provides shared utilities and libraries.
Architecture: - Master-Slave: NameNode (HDFS) and ResourceManager (YARN) as masters; DataNodes (HDFS) and NodeManagers (YARN) as slaves. - Data Flow: Data is stored in HDFS, processed via MapReduce, and results are returned to HDFS.
Advantages: - Scalable, fault-tolerant, cost-effective, flexible, and open-source.
Limitations: - High latency, disk I/O overhead, complex to use, inefficient for iterative processing.
Ecosystem: Includes tools like Pig (scripting over MapReduce), Hive (SQL-like query engine), HBase (NoSQL), Spark (real-time analytics), and more.
Use Cases: Used in retail, healthcare, finance, social media, and telecom for analytics and resource optimisation.

# SHARDING: Sharding is a technique for distributing data across multiple servers to improve scalability, performance, and fault tolerance. Data is split into smaller pieces called shards, each operating as an independent database.
Key Concepts: - Shard: A data partition. - Shard Key: Determines how data is distributed. - Horizontal Partitioning: Distributes rows across multiple nodes. - Replication: Copies shards for fault tolerance.
Why Use Sharding? - Scalability: Enables horizontal scaling. - Performance: Improves latency and throughput. - Fault Tolerance: Allows partial availability if a shard fails. - Manageability: Easier to manage smaller datasets.
Sharding Strategies (see the sketch below): - Range-Based: Divides data by value ranges. - Hash-Based: Uses a hash function for even data distribution. - Geo-Based: Distributes data by location (e.g., by IP). - Composite: Combines multiple strategies.
Sharding vs Replication: Sharding distributes data across nodes, while replication creates copies of data across nodes. Sharding allows horizontal scaling; replication offers fault tolerance through data redundancy.
Advantages: - Scalability, faster performance, fault isolation, and cost-effective scaling.
Challenges: - Complexity, rebalancing, hot spots, and consistency issues.
Use Cases: - Web apps, time-series data, and high-traffic databases.
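A minimal sketch of hash-based shard routing in Python; the four-shard setup, key names, and the choice of MD5 are illustrative assumptions, not any specific database's implementation.

```python
import hashlib

NUM_SHARDS = 4  # assumed cluster size, for illustration only
shards = {i: {} for i in range(NUM_SHARDS)}  # each dict stands in for one server

def shard_for(key: str) -> int:
    """Hash the shard key so rows spread evenly across shards."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key: str, value: dict) -> None:
    shards[shard_for(key)][key] = value  # route the write to its shard

def get(key: str) -> dict:
    return shards[shard_for(key)][key]  # the same hash finds the read

put("user:1001", {"name": "Ada"})
put("user:1002", {"name": "Grace"})
print(get("user:1001"), "lives on shard", shard_for("user:1001"))
```

Note that this plain modulo scheme forces a large rebalancing whenever NUM_SHARDS changes, which is one reason production systems often use consistent hashing instead.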
# APACHE PIG: Apache Pig is a high-level platform for processing large datasets in Hadoop using a scripting language called Pig Latin. It simplifies MapReduce programming and supports data transformation, aggregation, and analysis.
Key Features: - High-Level Language: Pig Latin, optimised for big data, similar to SQL. - Data Handling: Supports text, JSON, XML, and binary formats. - Optimisation: Automatically optimises execution plans. - Extensibility: Custom functions in Java and Python. - Hadoop Integration: Converts scripts into MapReduce jobs. - Fault Tolerance: Inherits Hadoop's fault tolerance.
Architecture: 1. Pig Latin Script: The user writes tasks. 2. Parser: Checks syntax and creates a logical plan. 3. Optimizer: Optimises the plan. 4. Compiler: Converts the plan to MapReduce jobs. 5. Execution Engine: Executes the jobs.
Data Model: - Atom: A single data value. - Tuple: A collection of fields. - Bag: An unordered collection of tuples. - Map: Key-value pairs.

# APACHE HIVE: Apache Hive is a data warehouse infrastructure built on Hadoop that enables querying and managing large datasets using a SQL-like language called HiveQL. It simplifies Hadoop's complexity by providing a user-friendly interface for those familiar with SQL, making it ideal for data analysis, warehousing, and report generation.
Key Features: - SQL-like interface (HiveQL). - Integration with Hadoop (HDFS for storage, MapReduce for execution). - Schema on read and data partitioning. - Extensibility with custom UDFs. - Supports various storage formats (e.g., ORC, Parquet).
Architecture: - User interfaces for interaction (CLI, JDBC/ODBC, web). - Driver and Compiler for query planning and execution. - Execution Engine supports MapReduce, Tez, or Spark.
Advantages: - Easy for SQL users, scalable, and extensible. - Integrates with BI tools (Tableau, Power BI).
Limitations: - Not real-time, high latency, and limited ACID transaction support.
Use Cases: - Log processing, data warehousing, business intelligence, and ETL pipelines.

# AMAZON S3: Amazon S3 (Simple Storage Service) is a scalable, high-speed, and cost-effective object storage service by AWS designed for storing and retrieving data. It offers high availability, durability, and security.
Key Features: - Scalability: Adjusts to growing data needs. - Durability: 99.999999999% (eleven nines) durability with replication across multiple Availability Zones. - Availability: 99.99% uptime, with higher-availability options. - Security: AES-256 encryption, SSL/TLS, and fine-grained access control. - Flexibility: Stores any object type, up to 5 TB per object. - Cost-Effective: Offers tiered pricing like S3 Standard, Intelligent-Tiering, and Glacier.
How S3 Works: - Buckets: Containers for data with globally unique names. - Objects: Data with metadata and a unique key. - Keys: Unique identifiers for objects. - Regions: Buckets are created in specific AWS regions.
Core Concepts: - Storage Classes: Options for frequently or infrequently accessed data. - Versioning: Protects against accidental overwrites and deletions. - Lifecycle Policies: Automate object transitions and expiry. - Replication: Cross-Region or Same-Region replication for redundancy. - Access Control: Managed via policies, IAM roles, and ACLs.
Use Cases: - Backup and Recovery, Data Archiving, Big Data Analytics, Content Delivery, Application Hosting.
Benefits: - High Performance, Global Access, Integration with AWS services, Customisable Security.
Limitations: - Not suited for relational databases, requires understanding of AWS services, network latency.
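A minimal sketch of the bucket/object/key workflow using AWS's boto3 SDK for Python; the bucket name and region are placeholders, and valid AWS credentials are assumed to be configured.

```python
import boto3

s3 = boto3.client("s3", region_name="eu-central-1")  # region is an assumption

bucket = "my-example-bucket-name"  # bucket names must be globally unique

# Create a bucket; outside us-east-1 a LocationConstraint is required.
s3.create_bucket(Bucket=bucket,
                 CreateBucketConfiguration={"LocationConstraint": "eu-central-1"})

# Objects: data plus metadata, addressed by a unique key within the bucket.
s3.put_object(Bucket=bucket, Key="reports/2024/summary.txt",
              Body=b"quarterly summary", Metadata={"owner": "analytics"})

# Retrieve the object by bucket + key.
obj = s3.get_object(Bucket=bucket, Key="reports/2024/summary.txt")
print(obj["Body"].read().decode())
```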