Data Mining - Data Mining is an area in computer science, Study notes of Data Mining

Query processing ,trees grid files,spatial data

Typology: Study notes

2017/2018

Uploaded on 10/22/2018

joonageorge 🇮🇳

Distributed Databases

Module 2

Introduction to Distributed Databases

  • Data in a distributed database system is stored across several sites.
  • Each site is typically managed by a DBMS that can run independently of the other sites.

Distributed data independence…

  • Users should be able to ask queries without specifying where the referenced relations, or copies or fragments of the relations, are located.
  • This principle is a natural extension of physical and logical data independence.
  • Queries that span multiple sites should be optimized systematically in a cost-based manner, taking into account communication costs and differences in local computation costs.
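
The cost-based choice described above can be sketched as follows. This is a minimal illustration, not an actual optimizer: the `Plan` class, the two candidate plans, and all the numbers are hypothetical, and the cost formula (a weighted sum of tuples shipped and local page I/Os) is an assumed simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    tuples_shipped: int   # tuples sent over the network (hypothetical estimate)
    local_io_pages: int   # pages read or written at the executing site


def plan_cost(plan, cost_per_tuple_shipped, cost_per_page_io):
    # Assumed cost model: communication cost plus local processing cost.
    return (plan.tuples_shipped * cost_per_tuple_shipped
            + plan.local_io_pages * cost_per_page_io)


def cheapest(plans, cost_per_tuple_shipped, cost_per_page_io):
    # Pick the plan with the lowest estimated total cost.
    return min(plans, key=lambda p: plan_cost(p, cost_per_tuple_shipped, cost_per_page_io))


# Two hypothetical ways to join relations stored at different sites.
plans = [
    Plan("ship Employees to the Departments site", tuples_shipped=100_000, local_io_pages=1_500),
    Plan("ship Departments to the Employees site", tuples_shipped=500, local_io_pages=1_500),
]
best = cheapest(plans, cost_per_tuple_shipped=0.01, cost_per_page_io=1.0)
print(best.name)
```

With equal local costs, the plan that ships the smaller relation wins, which is why communication cost has to be part of the estimate.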

Distributed transaction atomicity

  • Users should be able to write transactions that access and update data at several sites just as they would write transactions over purely local data.
  • The effects of a transaction across sites should continue to be atomic.
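
One well-known way to keep a multi-site transaction atomic is a two-phase commit protocol. The notes above do not name a protocol, so the sketch below is an assumption added for illustration. The `Participant` class and `run_transaction` function are hypothetical, and real protocols also need logging and failure handling, which are omitted here.

```python
class Participant:
    """A hypothetical site taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: the site votes yes (True) or no (False).
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def run_transaction(participants):
    # Phase 1: collect a vote from every site.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit everywhere only if every site voted yes.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"


ok_sites = [Participant("Chicago"), Participant("Madras")]
bad_sites = [Participant("Chicago"), Participant("Madras", can_commit=False)]
outcome_ok = run_transaction(ok_sites)
outcome_bad = run_transaction(bad_sites)
```

In both runs every site ends in the same state, which is the all-or-nothing behaviour that atomicity asks for.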

Heterogeneous systems (contd.)

  • The key to building heterogeneous systems is to have well-accepted standards for gateway protocols.
  • A gateway protocol is an API that exposes DBMS functionality to external applications.
  • Examples include ODBC and JDBC.
  • When database servers are accessed through gateway protocols, their differences (in capabilities, data formats, etc.) are masked, which largely bridges the differences between the servers in a distributed system.
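
To see how a common API masks server differences, consider Python's DB-API 2.0 (PEP 249), which plays a role comparable to ODBC and JDBC. The sketch below uses the standard `sqlite3` driver only because it ships with Python. The `employees` table and the `fetch_names` helper are hypothetical.

```python
import sqlite3


def fetch_names(conn):
    # Uses only the generic DB-API calls: cursor(), execute(), fetchall().
    cur = conn.cursor()
    cur.execute("SELECT name FROM employees ORDER BY name")
    return [row[0] for row in cur.fetchall()]


# An in-memory SQLite database stands in for one database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?)", [("Bob",), ("Alice",)])
names = fetch_names(conn)
print(names)
```

Because `fetch_names` touches only the shared interface, it should work with a connection from another DB-API driver, provided that server has the same table.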

Distributed DBMS Architectures

  • Client-Server
  • Collaborating Server
  • Middleware

Client-Server Systems (contd.)

  • The client-server architecture has become very popular for several reasons.
  • First, it is relatively simple to implement due to its clean separation of functionality and because the server is centralized.
  • Second, expensive server machines are not tied up with mundane user interactions, which are now relegated to inexpensive client machines.
  • Third, users can run a graphical user interface that they are familiar with, rather than the (possibly unfamiliar and unfriendly) user interface on the server.

Collaborating Server Systems

  • We can have a collection of database servers, each capable of running transactions against local data, which cooperatively execute transactions spanning multiple servers.
  • When a server receives a query that requires access to data at other servers, it generates appropriate subqueries to be executed by other servers and puts the results together to compute answers to the original query.
  • Ideally, the decomposition of the query should be done using cost-based optimization, taking into account the costs of network communication as well as local processing costs.
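
The decompose-and-combine step can be sketched with two in-memory SQLite databases standing in for two servers. Everything here is a hypothetical illustration: the `employees` schema, the sample rows, and the helper names. The global query is an average salary, which each site answers with a partial (sum, count) pair.

```python
import sqlite3


def make_site(rows):
    # Each "site" is an independent in-memory database with local data.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, city TEXT, salary REAL)")
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
    return conn


def local_subquery(conn):
    # Subquery executed at one site: returns a partial (sum, count).
    total, count = conn.execute(
        "SELECT COALESCE(SUM(salary), 0), COUNT(*) FROM employees"
    ).fetchone()
    return total, count


def global_average_salary(sites):
    # The receiving server combines the partial results into the final answer.
    partials = [local_subquery(site) for site in sites]
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


chicago = make_site([("Ann", "Chicago", 50000.0), ("Raj", "Chicago", 70000.0)])
madras = make_site([("Lee", "Madras", 60000.0)])
avg = global_average_salary([chicago, madras])
```

Shipping one (sum, count) pair per site instead of every tuple is the kind of saving a cost-based decomposition looks for.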

Middleware Systems (contd.)

  • We can think of the special server that manages multi-site queries as a layer of software that coordinates the execution of queries and transactions across one or more independent database servers; such software is often called middleware.
  • The middleware layer is capable of executing joins and other relational operations on data obtained from the other servers but typically does not itself maintain any data.
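
A middleware-style join can be sketched as below, assuming two independent servers that each answer only local queries. The schemas, rows, and function names are hypothetical. The middleware function stores nothing itself and simply joins what the servers return.

```python
import sqlite3


def make_server(ddl, insert_sql, rows):
    # Each server is an independent in-memory database.
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


def middleware_join(emp_server, dept_server):
    # Fetch department rows from one server and index them by did.
    depts = dict(dept_server.execute("SELECT did, dname FROM departments"))
    result = []
    # Fetch employee rows from the other server and join in the middleware.
    for name, did in emp_server.execute("SELECT name, did FROM employees ORDER BY name"):
        if did in depts:
            result.append((name, depts[did]))
    return result


emp_server = make_server(
    "CREATE TABLE employees (name TEXT, did INTEGER)",
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", 10), ("Lee", 20), ("Raj", 99)],
)
dept_server = make_server(
    "CREATE TABLE departments (did INTEGER, dname TEXT)",
    "INSERT INTO departments VALUES (?, ?)",
    [(10, "Sales"), (20, "Research")],
)
joined = middleware_join(emp_server, dept_server)
```

Neither server needs to know about the other; only the middleware layer sees both result sets.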

Storing Data in a Distributed DBMS

  • In a distributed DBMS, relations are stored across several sites.
  • Accessing a relation that is stored at a remote site incurs message-passing costs.
  • To reduce this overhead, a single relation may be partitioned or fragmented across several sites.
  • The fragments are stored at the sites where they are most often accessed, or replicated at each site where the relation is in high demand.

Contd…

  • The motivation for replication is twofold:
  • Increased availability of data:
    • If a site that contains a replica goes down, we can find the same data at other sites.
    • Similarly, if local copies of remote relations are available, we are less vulnerable to failure of communication links.
  • Faster query evaluation:
    • Queries can execute faster by using a local copy of a relation instead of going to a remote site.
  • There are two kinds of replication, called synchronous and asynchronous replication, which differ primarily in how replicas are kept current when the relation is modified.
  • Typically, the tuples that belong to a given horizontal fragment are identified by a selection query.
    • For example, employee tuples might be organized into fragments by city, with all employees in a given city assigned to the same fragment.
    • The horizontal fragment shown corresponds to Chicago. By storing fragments in the database site at the corresponding city, we achieve locality of reference.
    • Chicago data is most likely to be updated and queried from Chicago, and storing this data in Chicago makes it local (and reduces communication costs) for most queries.
  • Similarly, the tuples in a given vertical fragment are identified by a projection query.
  • The vertical fragment in the figure results from projection on the first two columns of the employees relation.
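
The two kinds of fragment described above can be sketched in a few lines. The figure from the notes is not available here, so the column names and rows below are hypothetical. A horizontal fragment is a selection (here by city), and a vertical fragment is a projection (here on the first two columns).

```python
# Hypothetical employees relation; columns and values are made up.
employees = [
    {"eid": 1, "name": "Ann", "city": "Chicago", "salary": 50000},
    {"eid": 2, "name": "Lee", "city": "Madras", "salary": 60000},
    {"eid": 3, "name": "Raj", "city": "Chicago", "salary": 70000},
]


def horizontal_fragment(relation, city):
    # Selection: keep whole tuples that satisfy the predicate.
    return [row for row in relation if row["city"] == city]


def vertical_fragment(relation, columns):
    # Projection: keep only the named columns of every tuple.
    return [{c: row[c] for c in columns} for row in relation]


chicago = horizontal_fragment(employees, "Chicago")
madras = horizontal_fragment(employees, "Madras")
first_two = vertical_fragment(employees, ["eid", "name"])

# The union of the horizontal fragments gives back the original relation.
rebuilt = sorted(chicago + madras, key=lambda r: r["eid"])
```

Storing `chicago` at the Chicago site is what gives the locality of reference mentioned above.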

Replication

  • Replication means that we store several copies of a relation or relation fragment.
  • An entire relation can be replicated at one or more sites.
  • Similarly, one or more fragments of a relation can be replicated at other sites.
  • For example, if a relation R is fragmented into R1, R2, and R3, there might be just one copy of R1, whereas R2 is replicated at two other sites and R3 is replicated at all sites.
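
The R1, R2, R3 example can be written down as a small replica catalog. This is a sketch under assumptions: the four site names, the reading of "two other sites" as a home site plus two more, and the `choose_replica` policy (prefer a local copy, otherwise take the first copy that is up) are all hypothetical.

```python
# Hypothetical sites and replica placement for fragments R1, R2, R3.
SITES = ["S1", "S2", "S3", "S4"]
catalog = {
    "R1": ["S1"],               # a single copy
    "R2": ["S1", "S2", "S3"],   # a home site plus two other sites
    "R3": list(SITES),          # a copy at every site
}


def choose_replica(fragment, local_site, down=frozenset()):
    # Ignore replicas at sites that are currently unreachable.
    candidates = [s for s in catalog[fragment] if s not in down]
    if not candidates:
        return None  # no copy is available
    # Prefer the local copy to avoid communication costs.
    return local_site if local_site in candidates else candidates[0]


local_read = choose_replica("R3", "S4")
remote_read = choose_replica("R2", "S4")
failover_read = choose_replica("R2", "S4", down={"S1"})
unavailable = choose_replica("R1", "S2", down={"S1"})
```

The unreplicated R1 becomes unavailable when its only site is down, while R2 and R3 stay readable, which is the availability argument for replication.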

Distributed Data Independence

  • Distributed data independence means that users should be able to write queries without regard to how a relation is fragmented or replicated.
  • It is the responsibility of the DBMS to compute the relation as needed.
  • This property implies that users should not have to specify the full name for the data objects accessed while evaluating a query.
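
A minimal sketch of this name resolution, assuming a global catalog that maps a relation name to the full (site, stored name) of each fragment. The catalog layout, site names, and rows are hypothetical. The user refers only to `employees`, and the `scan` function assembles the relation.

```python
# Hypothetical global catalog: relation name -> full names of its fragments.
catalog = {
    "employees": [
        ("chicago_site", "employees_chicago"),
        ("madras_site", "employees_madras"),
    ],
}

# Hypothetical per-site storage, keyed by the full (site, stored name).
storage = {
    ("chicago_site", "employees_chicago"): [("Ann", "Chicago"), ("Raj", "Chicago")],
    ("madras_site", "employees_madras"): [("Lee", "Madras")],
}


def scan(relation_name):
    # The DBMS, not the user, resolves the name and assembles the relation.
    rows = []
    for full_name in catalog[relation_name]:
        rows.extend(storage[full_name])
    return rows


rows = scan("employees")
```

If the relation is later fragmented differently, only the catalog changes and the query `scan("employees")` stays the same.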