Big Data Unit 1 revision slides, Slides of Computer Science

These are the Unit 1 revision questions for Big Data from my 5th semester, prepared by Professor KSS.

Typology: Slides

2023/2024

Available from 12/03/2024

vedantika-sharma-pesu-rr-2022-2026-


REVISION

Which of the following statements about TestDFSIO is true?
A) TestDFSIO primarily measures the network bandwidth between DataNodes and the NameNode.
B) TestDFSIO is used to test the durability of HDFS by intentionally introducing faults into the system.
C) TestDFSIO can only perform read operations on HDFS to benchmark its performance.
D) TestDFSIO allows both read and write operations to benchmark the I/O performance of HDFS, including the option to clean up the test data after benchmarking.

What does running MRBench in a Hadoop cluster primarily help assess?
A) The maximum number of MapReduce jobs that can run simultaneously in a cluster.
B) The efficiency of HDFS in storing and retrieving large datasets.
C) The overhead associated with running a large number of small MapReduce jobs, including job setup and teardown times.
D) The capability of the NameNode to manage metadata for a large number of files and blocks.
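For reference, the write/read/clean cycle described in option D, plus an MRBench run, can be sketched as shell commands. The jar path and the file counts/sizes here are illustrative; the tests jar name varies by Hadoop version.

```shell
# TestDFSIO write phase: create 10 files of 128 MB each and record write throughput
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  TestDFSIO -write -nrFiles 10 -fileSize 128MB

# Read phase: benchmark reads against the files written above
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  TestDFSIO -read -nrFiles 10 -fileSize 128MB

# Clean up the benchmark data when done
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  TestDFSIO -clean

# MRBench: run 50 tiny MapReduce jobs to measure per-job setup/teardown overhead
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  mrbench -numRuns 50
```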

In Hadoop's HDFS, how is the block size configured, and what is its scope?
A) Block size is globally set at the NameNode and cannot be changed for individual files; all files in the HDFS share the same block size.
B) Block size is configured on a per-DataNode basis, allowing different DataNodes to store parts of the same file in blocks of varying sizes.
C) Block size can be specified per file at the time of file creation, overriding the default block size set in the HDFS configuration.
D) Block size is determined dynamically by the HDFS client at the time of file writing, based on the file size and network conditions.

What are the options for recovering accidentally deleted data in Hadoop's HDFS?
A) Deleted data in HDFS is irrecoverable, as HDFS permanently deletes files immediately without any backup or recovery option.
B) HDFS supports an "Undelete" feature that allows users to recover deleted data within a configurable time frame after deletion.
C) Accidentally deleted data can be recovered from the HDFS trash directory, provided the trash feature is enabled and the data has not been permanently deleted from the trash.
D) Data recovery in HDFS can only be achieved by restoring data from external backups manually maintained by the Hadoop administrators, as HDFS does not support any built-in data recovery mechanisms.
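Both mechanics can be sketched from the shell (file names and paths here are illustrative). The per-file block size override and trash recovery assume `fs.trash.interval > 0` in the cluster configuration:

```shell
# Override the default block size for one file at creation time (256 MB here);
# dfs.blocksize in hdfs-site.xml only sets the cluster-wide default
hdfs dfs -D dfs.blocksize=268435456 -put bigfile.dat /data/bigfile.dat

# With the trash feature enabled, -rm moves the file into the user's trash
# instead of deleting it immediately
hdfs dfs -rm /data/bigfile.dat

# Recover it from the trash before the checkpoint interval expires
hdfs dfs -mv /user/$USER/.Trash/Current/data/bigfile.dat /data/bigfile.dat
```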

Which of these file formats is supported by Sqoop?
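As context for the question above: Sqoop imports can be written as plain text (the default), SequenceFiles, Avro data files, or Parquet, selected by a command-line flag. The connection string, credentials, and table below are hypothetical:

```shell
# Import a table as Avro data files; other choices are --as-textfile (default),
# --as-sequencefile, and --as-parquetfile
sqoop import \
  --connect jdbc:mysql://db.example.com/shop \
  --username etl \
  --table orders \
  --as-avrodatafile \
  --target-dir /data/orders_avro
```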

Hive

Question: Which of the following statements is true about Apache Hive?
A) Hive supports real-time updates and deletions in the same manner as traditional relational database systems.
B) Hive enforces primary key and foreign key constraints during data insertion and manipulation to ensure data integrity.
C) Hive has built-in support for ACID transactions, enabling full-fledged UPDATE, DELETE, and MERGE operations as in traditional SQL databases from its inception.
D) Hive allows the definition of primary keys and foreign keys for informational purposes, but does not enforce these constraints for data integrity during operations.
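Option D describes Hive's informational constraints. A minimal sketch (table and column names are hypothetical) of how such a key is declared:

```shell
# The key is declared DISABLE NOVALIDATE: Hive stores it as metadata for
# planners and BI tools, but never checks it when rows are inserted
hive -e "
CREATE TABLE customers (
  customer_id INT,
  name        STRING,
  PRIMARY KEY (customer_id) DISABLE NOVALIDATE
)
STORED AS ORC;"
```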

Which data format among the following would LEAST benefit from a multi-threaded record reader in a MapReduce job?
(a) Large JSON files stored in separate Avro files per partition.
(b) Highly compressed log files stored sequentially in a single file.
(c) Parquet files containing millions of small records distributed across multiple files.
(d) CSV files split into chunks based on a specific record ID range.

What is the typical threading model of a record reader in MapReduce?
(a) Single-threaded
(b) Multi-threaded
(c) Depends on the specific implementation and framework

When processing a compressed archive with MapReduce, which approach is generally NOT recommended for determining input split size?
(a) Splitting based on the size of uncompressed data within the archive.
(b) Using the default split size configured in Hadoop.
(c) Manually setting the split size based on cluster resources and archive size.
(d) Splitting based on the number of files within the compressed archive.

In a MapReduce job processing a large text file, what factor is MOST important when determining the optimal input split size?
(a) Number of available mappers in the cluster
(b) Average record size within the file
(c) Network bandwidth between mappers and data nodes
(d) Total size of the input file

Where are the intermediate outputs (key-value pairs) generated by mappers in a MapReduce job typically stored?
(a) On the local file system of each mapper node.
(b) On the HDFS NameNode.
(c) In a dedicated temporary storage system specific to MapReduce.
(d) Directly written to the final output file in HDFS.

Where is the final output of a MapReduce job usually written?
(a) On the local file system of the reducer node.
(b) In the same directory as the input data on HDFS.
(c) Distributed across all data nodes in HDFS based on a partitioner function.
(d) Saved in a special directory managed by the JobTracker.
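Split size is normally tuned through job configuration rather than hard-coded. A sketch, assuming a hypothetical `wordcount.jar` driver that uses ToolRunner so `-D` options are picked up:

```shell
# Cap splits at 64 MB so a large text file produces more, smaller map tasks
hadoop jar wordcount.jar WordCount \
  -D mapreduce.input.fileinputformat.split.maxsize=67108864 \
  /input/big.txt /output/wc

# Note: map-side intermediate output spills to each node's LOCAL disks
# (mapreduce.cluster.local.dir), not to HDFS; only the final reducer
# output lands in the HDFS output directory.
```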

Scenario: You have a large directed graph stored in a distributed file system, and you need to find the shortest path between two specific nodes (source and destination). You plan to use MapReduce for parallel processing. Which of the following best describes the workflow for achieving this with MapReduce?
(a) Map: processes each node, emitting distances to all other nodes as key-value pairs (node, distance). Reduce: aggregates minimum distances for each node, identifying the shortest path from the source.
(b) Map: emits all possible paths from the source node to each other node as keys. Reduce: identifies the shortest path among all emitted paths for each destination node.
(c) Map: emits edges as key-value pairs (source node, destination node, edge weight). Reduce: iteratively updates distances based on received edges, eventually converging to the shortest paths.
(d) Map: not used. Reduce: directly analyzes the entire graph to find the shortest path between the specified nodes.

CREATE TABLE IF NOT EXISTS toy_products (
  ProductCategory STRING,
  ProductId       INT,
  ProductName     STRING,
  ProductPrice    FLOAT
)
COMMENT 'Toy details'
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
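The DDL above can be exercised from the shell; the input path `/data/toys.tsv` is a hypothetical tab-delimited file matching the table's row format:

```shell
# Load a tab-delimited file from HDFS into the toy_products table defined above
hive -e "LOAD DATA INPATH '/data/toys.tsv' INTO TABLE toy_products;"

# Quick sanity check on the loaded rows
hive -e "SELECT ProductCategory, ProductName, ProductPrice FROM toy_products LIMIT 5;"
```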

1. Performance Challenge: You have a Hive table with millions of customer records and need to frequently analyze purchase data for specific customer segments based on their country. Which combination of bucketing and partitioning would NOT significantly improve query performance for these segment-specific analyses?
(a) Bucketing by customer ID and partitioning by country.
(b) Partitioning by country and no bucketing.
(c) No bucketing or partitioning.
(d) Bucketing by country and partitioning by purchase date.
2. Data Skew Challenge: You have a large Hive table storing website clickstream data. The table is partitioned by date, but click activity is heavily skewed towards specific product categories on certain days. What bucketing strategy could help mitigate the impact of data skew on query performance?
(a) Bucketing by click timestamp within each date partition.
(b) Bucketing by product category within each date partition.
(c) Bucketing by user ID within each date partition.
(d) No bucketing is necessary as partitioning already addresses skew.
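A minimal sketch of a clickstream table combining date partitioning with bucketing inside each partition, as in option (b) above. Table and column names, and the bucket count of 32, are illustrative:

```shell
# Partition by date, then bucket rows by product_category within each
# partition so each day's data is hashed across 32 bucket files
hive -e "
CREATE TABLE clicks (
  user_id          STRING,
  product_category STRING,
  click_ts         TIMESTAMP
)
PARTITIONED BY (click_date DATE)
CLUSTERED BY (product_category) INTO 32 BUCKETS
STORED AS ORC;"
```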