Big Data — Unit 1 Revision Questions (5th semester, prepared by Professor KSS)
Which of the following statements about TestDFSIO is true?
A) TestDFSIO primarily measures the network bandwidth between DataNodes and the NameNode.
B) TestDFSIO is used to test the durability of HDFS by intentionally introducing faults into the system.
C) TestDFSIO can only perform read operations on HDFS to benchmark its performance.
D) TestDFSIO allows for both read and write operations to benchmark the I/O performance of HDFS, including the option to clean up the test data after benchmarking.

What does running MRBench in a Hadoop cluster primarily help in assessing?
A) The maximum number of MapReduce jobs that can be run simultaneously in a cluster.
B) The efficiency of HDFS in storing and retrieving large datasets.
C) The overhead associated with running a large number of small MapReduce jobs, including job setup and teardown times.
D) The capability of the NameNode to manage metadata for a large number of files and blocks.
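As a rough illustration of what TestDFSIO measures (aggregate HDFS I/O throughput, with an optional clean-up of the generated test data), here is a minimal standalone Java sketch that times one HDFS write and then deletes the test file. This is not the actual TestDFSIO implementation; the path and sizes below are arbitrary assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: time a single HDFS write and report MB/s, which is the
// kind of figure TestDFSIO aggregates across many files and map tasks.
public class TinyDfsWriteBench {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/benchmarks/tiny-write-test");  // hypothetical path
    long bytesToWrite = 128L * 1024 * 1024;              // 128 MB test file
    byte[] buffer = new byte[1 << 20];                   // 1 MB write buffer

    long start = System.nanoTime();
    try (FSDataOutputStream os = fs.create(out, true)) {
      for (long written = 0; written < bytesToWrite; written += buffer.length) {
        os.write(buffer);
      }
    }
    double seconds = (System.nanoTime() - start) / 1e9;
    System.out.printf("wrote %d MB in %.2f s -> %.2f MB/s%n",
        bytesToWrite >> 20, seconds, (bytesToWrite >> 20) / seconds);

    fs.delete(out, false);  // clean-up step, analogous to TestDFSIO's -clean
  }
}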
In Hadoop's HDFS, how is the block size configured, and what is its scope?
A) Block size is globally set at the NameNode and cannot be changed for individual files; all files in the HDFS share the same block size.
B) Block size is configured on a per-DataNode basis, allowing different DataNodes to store parts of the same file in blocks of varying sizes.
C) Block size can be specified per file at the time of file creation, overriding the default block size set in the HDFS configuration.
D) Block size is determined dynamically by the HDFS client at the time of file writing, based on the file size and network conditions.

What are the options for recovering accidentally deleted data in Hadoop's HDFS?
A) Deleted data in HDFS is irrecoverable as HDFS permanently deletes files immediately without any backup or recovery option.
B) HDFS supports an "Undelete" feature that allows users to recover deleted data within a configurable time frame after deletion.
C) Accidentally deleted data can be recovered from the HDFS trash directory, provided the trash feature is enabled and the data has not been permanently deleted from the trash.
D) Data recovery in HDFS can only be achieved by restoring data from external backups manually maintained by the Hadoop administrators, as HDFS does not support any built-in data recovery mechanisms.
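Both behaviors asked about here are visible directly in the FileSystem API: one create() overload takes an explicit per-file block size, and trash recovery (when fs.trash.interval is set above 0) is an ordinary rename out of the user's .Trash directory. A minimal Java sketch, with illustrative paths and values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeAndTrashDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Per-file block size: this create() overload overrides the cluster-wide
    // dfs.blocksize default (here 256 MB) for just this one file.
    long blockSize = 256L * 1024 * 1024;
    fs.create(new Path("/data/big.log"), true /* overwrite */,
              4096 /* io buffer */, (short) 3 /* replication */, blockSize)
      .close();

    // Trash recovery: with the trash feature enabled, a delete moves files
    // into /user/<name>/.Trash/Current/... instead of removing them outright.
    // Restoring is a plain rename back out of the trash directory.
    Path trashed  = new Path("/user/alice/.Trash/Current/data/big.log");  // illustrative
    Path restored = new Path("/data/big.log");
    if (fs.exists(trashed)) {
      fs.rename(trashed, restored);
    }
  }
}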
Which data format among the following would LEAST benefit from a multi-threaded record reader in a MapReduce job?
(a) Large JSON files stored in separate Avro files per partition.
(b) Highly compressed log files stored sequentially in a single file.
(c) Parquet files containing millions of small records distributed across multiple files.
(d) CSV files split into chunks based on a specific record ID range.

What is the typical threading model of a record reader in MapReduce?
(a) Single-threaded
(b) Multi-threaded
(c) Depends on the specific implementation and framework
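On the threading question: in stock Hadoop MapReduce the record reader is consumed by a single thread per map task. The framework's Mapper.run() is essentially the pull loop shown below, and concurrency appears only if you opt in via org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper. The subclass here is just a vehicle for displaying the loop; its type parameters are an assumption.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Override of Mapper.run() that mirrors the stock implementation, with
// comments marking where the (single-threaded) record reader is consumed.
public class PullLoopMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      // One thread pulls records from the RecordReader through the context;
      // nothing here runs concurrently unless you use MultithreadedMapper.
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }
}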
When processing a compressed archive with MapReduce, which approach is generally NOT recommended for determining input split size?
(a) Splitting based on the size of uncompressed data within the archive.
(b) Using the default split size configured in Hadoop.
(c) Manually setting the split size based on cluster resources and archive size.
(d) Splitting based on the number of files within the compressed archive.

In a MapReduce job processing a large text file, what factor is MOST important when determining the optimal input split size?
(a) Number of available mappers in the cluster
(b) Average record size within the file
(c) Network bandwidth between mappers and data nodes
(d) Total size of the input file

Where are the intermediate outputs (key-value pairs) generated by mappers in a MapReduce job typically stored?
(a) On the local file system of each mapper node.
(b) On the HDFS NameNode.
(c) In a dedicated temporary storage system specific to MapReduce.
(d) Directly written to the final output file in HDFS.

Where is the final output of a MapReduce job usually written?
(a) On the local file system of the reducer node.
(b) In the same directory as the input data on HDFS.
(c) Distributed across all data nodes in HDFS based on a partitioner function.
(d) Saved in a special directory managed by the JobTracker.
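The split-size questions above map onto concrete driver settings: FileInputFormat lets you bound split sizes (by default a split tracks the HDFS block size), while unsplittable codecs such as plain gzip force one split per file regardless of these bounds. A sketch, with placeholder paths and job name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "split-size-demo");

    // Bound split sizes explicitly; otherwise splits default to the HDFS
    // block size. With an unsplittable codec (e.g. .gz) these bounds are
    // moot: each compressed file yields exactly one split and one mapper.
    FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB

    FileInputFormat.addInputPath(job, new Path("/data/input"));  // placeholder
    // ... set mapper/reducer classes and the output path, then submit as usual.
  }
}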
Scenario: You have a large, directed graph stored in a distributed file system, and you need to find the shortest path between two specific nodes (source and destination). You plan to utilize MapReduce for parallel processing. Which of the following best describes the workflow for achieving this with MapReduce?
(a) Map: Processes each node, emitting distances to all other nodes as key-value pairs (node, distance). Reduce: Aggregates minimum distances for each node, identifying the shortest path from the source.
(b) Map: Emits all possible paths from the source node to each other node as keys. Reduce: Identifies the shortest path among all emitted paths for each destination node.
(c) Map: Emits edges as key-value pairs (source node, destination node, edge weight). Reduce: Iteratively updates distances based on received edges, eventually converging to the shortest paths.
(d) Map: Not used. Reduce: Directly analyzes the entire graph to find the shortest path between the specified nodes.

Hive DDL example:

CREATE TABLE IF NOT EXISTS toy_products (
    ProductCategory String,
    ProductId int,
    ProductName String,
    ProductPrice float)
COMMENT 'Toy details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
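Returning to the shortest-path scenario above: the standard MapReduce formulation is iterative breadth-first search, one job per frontier expansion. Below is a minimal Java sketch of one unweighted BFS round; the line format (node, tab, distance, tab, comma-separated neighbors) and the class names are assumptions for illustration, not part of the original question.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One BFS round over lines of the form: node \t distance \t n1,n2,...
// ("distance" is INF until the frontier reaches the node). The job is
// rerun, round after round, until distances stop changing.
public class BfsIteration {

  public static class BfsMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      String node = parts[0], dist = parts[1];
      String adj = parts.length > 2 ? parts[2] : "";

      // Pass the node's structure through so the reducer can re-emit it.
      ctx.write(new Text(node), new Text("GRAPH\t" + dist + "\t" + adj));

      // Expand the frontier: neighbors are one hop further than this node.
      if (!"INF".equals(dist) && !adj.isEmpty()) {
        String next = String.valueOf(Integer.parseInt(dist) + 1);
        for (String nbr : adj.split(",")) {
          ctx.write(new Text(nbr), new Text("DIST\t" + next));
        }
      }
    }
  }

  public static class BfsReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text node, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String adj = "";
      int best = Integer.MAX_VALUE;
      for (Text v : values) {
        String[] parts = v.toString().split("\t");
        if ("GRAPH".equals(parts[0]) && parts.length > 2) {
          adj = parts[2];  // recover this node's adjacency list
        }
        if (!"INF".equals(parts[1])) {
          best = Math.min(best, Integer.parseInt(parts[1]));
        }
      }
      String out = (best == Integer.MAX_VALUE) ? "INF" : String.valueOf(best);
      ctx.write(node, new Text(out + "\t" + adj));  // next round's input format
    }
  }
}

A driver would rerun this job, feeding each round's output back in as input, until a job counter shows that no distance changed in the last round.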