Optimizing DDBS Query Performance – An In-depth Study
By
Rana Sohail
Rana Sohail
MSCS (Networking), MIT
Abstract— In a distributed
database system (DDBS) the data resides over different sites that are
geographically separated. Whenever required data is approached, the way is
through a query system. The process of the query is quite complicated because the data
is always distributed geographically and gathering it is costly. The cost is
in terms of communicating to the sites and then processing of data collection
according to the query. The involvement of the time factors plays an important role in
the whole process. There are sets of operations involved in the execution of
the query which gets complications once augmented by associated value parameters
of the query. These factors reduce the output of query execution and some means are
required to enhance the overall performance of query in a DDBS.
Since the data
is available at different locations, therefore, enhancing the timeframe or in
other words optimizing the query is not a that simple and easy task. The query
optimization depends upon certain information about the organization and
availability of index. Furthermore, data is saved in tabulated form, therefore, cardinality is another important factor which requires attention.
In this paper,
the problems related to query optimization are discussed in detail. Certain
available techniques are considered along with the comparison among each other
to identify the better option for adoption. In addition, the proposed technique is
also introduced which may address the problem in a befitting manner.
Keywords— DDBS, query, optimization, index, cardinality
I.
INTRODUCTION
In every field of
life, there is a need for information which should be handy and available all
the time. This has been made possible through the establishment of data
warehouses that are located in different places. Since a huge amount of data
is stored in these warehouses in the form of databases, therefore, getting
concerned with information is quite a complicated and complex task. Different sources
keep on feeding these databases with updates and new entries all the time. The
increase in data and demand of users decreases the performance of these
databases. There is also a possibility where required data is not completely
available at one location and may be other locations are also needed to be
searched out. This aspect makes the situation more difficult. The adopted searching
technique also plays a very vital role in this regard which is a distributed query
system [1].
A distributed database
management system (DDBMS) is a technique that is used to improve the
performance of these distributed databases. DDBMS provides a single platform to
the users where they can approach to the required data [2].
As the data should
be accessible to everyone so if it is placed centrally in one place then it
would be very easy for all to have direct access through the telecommunication
network. The centralized database is easy to handle in terms of updating,
editing or deleting, in short, the management is much easier. But the major
drawback is if it goes down then there is no access to it. This disadvantage pushes
the designers to adopt the decentralized database system which is also called
as distributed database (DDB) system as explained in [3].
The paper is
organized in sections. In section II, the fundamentals of DDBS are described. In
section III, the distributed query processing system is highlighted. In Section IV,
the system of query optimization is elaborated along with the proposed solution. In
section V, the discussion will be concluded.
II.
FUNDAMENTAL OF DDBS
Keeping in view the
disadvantage of the centralized database system, the database decentralization is
adopted. In this system the data is spread over a number of distant sites which
have their own independent computers and data stores. The beauty of the system
is that it gives a look of a single database. As the users approach through a
query for some data, the process of gathering information is in the back ground
and they get their desired data. The users even not in picture how and from
where the information is collected. It is called the transparency of the location
which is quite complicated but well managed through DBMS as in [4].
The DDB is managed
by DDBMS software which ensures its transparency and DDBS in their integrated
form. This integration is attained by amalgamation of networking and database
technologies. The merged state gives out a look of a single machine to the
users. The DDBS architecture is manufactured by combination of hardware (HW)
and software (SW). The details are elaborated as under:-
A. DDBS
Architecture – Hardware
There could be
number of topologies through which nodes are connected physically and form a
network. Two important topologies are explained here.
1) Client/ Server: As the users (clients) are connected to some
source (server) for data collection therefore an architecture of client –
server is the best one to be explained. The client-server shares functions of
DBMS among themselves. The query of the client is decomposed into local queries
and forwarded to concerned servers which in turn process and send back to the
related clients. Although it has certain
advantages as clients have full functionality and better performance but it has
disadvantages as well like server bottleneck and failures as in [5]. Their
relationship is depicted in figure 1 as explained in [3].
Fig. 1 – Client/
Server Architecture
2) DDBS: The data is saved at different sites/
location by means of three techniques. In the first technique the data is
fragmented into segments which are distributed among the sites, this is called
“no replication”. In the second one the duplication or replication of data is
carried out and each set is saved at different sites, this is called
“replication”. In third technique, data is fragmented and each segment is
placed at each site like first technique. A complete set of data is replicated
and placed at headquarters like second technique. This is called “selective
replication”. All these techniques are illustrated in figure 2 as explained in
[3].
Fig. 2 – DDBS
B. DDBS
Architecture – Software
The DDBS is dependent on three software
modules namely the server, the client and the communication software. The first
one manages the local data and second one deals with the distribution, logging,
request entertainment and non-local transactions. The third one provides the
means of communication among the client-server. Figure 3 gives out the details
as elaborated in [3].
Fig. 3 – Client/
Server Software
C. Fragmentation
or Replication
There are two
methods by which DDBS is implemented, fragmentation and replication. In the
former method, data is divided into segments and distributed among the sites
whereas in later, copy of complete data is placed among all the sites. The
fragmentation is of three types, their details are as under:-
1) Horizontal
Fragmentation: Since the saved data is in tabulated form
therefore a table is fragmented horizontally along its tuples. There are two kinds
of it; primary and derived horizontal fragmentation. Primary horizontal fragmentation
of a class relation is carried out by queries predicates of the same class
whereas derived horizontal fragmentation of a class relation is carried out
basing on horizontal fragmentation of another class as in [6], [7]. Figure 4
depicts the horizontal fragmentation where attributes (Attrib) are taken as a
whole and rows are partitioned as under:-
Attrib_Id
|
Attrib_1
|
Attrib_2
|
Attrib_3
|
Fragment 1
|
|||
Fragment 2
|
|||
Fragment 3
|
Fig. 4 – Horizontal
Fragmentation
2) Vertical
Fragmentation: A table is fragmented vertically through
columns. Number of required attributes are partitioned. This not only improves
the performance of the database but also reduces the time consumption as
explained in [8]. Figure 5 shows the vertical fragmentation where Attrib are
partitioned as under:-
Attrib_Id
|
Attrib_1
|
Attrib_2
|
Attrib_Id
|
Attrib_3
|
Attrib_4
|
Fragment
1
|
Fragment
2
|
Fig. 5 – Vertical Fragmentation
3) Hybrid Fragmentation:
It is defined as a simultaneous process of horizontal and vertical
fragmentation application over a relation. Now it can be achieved either
horizontal partitioning first and vertical later or other way round as
explained in [9]. Figure 6 shows the hybrid fragmentation where rows and
columns are partitioned as under:-
Attrib_Id
|
1
|
2
|
3
|
4
|
5
|
Fragment 1
|
Fragment 2
|
||||
Fragment 3
|
Fragment 4
|
Fig. 6 – Hybrid
Fragmentation
4) Replication:
It is a process through which complete data is copied and stored at
different sites. The replicated data is available to the users and they share
the same level of information. There are number of merits of this like
availability of data in case of any site break down, a faster query execution
is possible and a reduced data movement over a network is expected. Likewise
there are number of demerits like requirement of more disk storage space,
updating of all sites required and expensiveness in terms of concurrency
control/ recovery techniques as in [10], [11].
D. Query Processing
The
processing of queries in a centralized DBS is much easier than DDBS because
data is available centrally and easy to handle and inquire. Whereas data which
is distributed at different sites either fragmented or replicated is quite
difficult to inquire with. As the data
is distributed at different sites but users are ignorant of this fact therefore
once they forward a query, they don’t expect any delay. The main aim is to
minimize the cost of operations; processing and communication cost. It depends
upon two factors; query operations and query’s values as in [12].
III.
DISTRIBUTED QUERY PROCESSING SYSTEM
Query processing is a study
which deals with an algorithm design that caters for query analysis and then
converting it to certain operations which manipulate data. Algorithm’s task is
to transform the high-level query into lower-level query, in other words from
relational calculus to relational algebra. Unlike centralised system, in
distributed system, relational algebra requires separate operators along with
execution strategies for data exchange among sites. Users are unaware of any
distribution of data at different sites and they have a feeler of getting their
information from a single site. This aspect of assumed single site is made
possible by the distributed query processing system as in [13]. The details of
query processing is as under:-
A.
Query
Processing – Architecture
A universal architecture is
explained in the subsequent paragraph which may be used for all types of
database system like parallel, distributed or centralized. In this a structured
query language (SQL) is placed as input to the query processor. In turn query
is translated and optimized into executable plan which acquire the results. The
results of a dynamic SQL are offered to the users directly. Whereas executable
plan of embedded SQL is stored in database and results are produced on running
of application. Figure 7 illustrates the relationship and working of different
components as in [14], the details are as under:-
Fig. 7 – Query
Processing Procedure
1) Parser: It parses the query and translates to query
graph.
2) Query
Rewrite: It transforms the query and make
it ready for optimization by removing the redundant predicates. The expressions
are simplified and sub-queries are un-nested. Table partitions are selected in
order to answer the query.
3) Query
Optimizer: It optimizes the query and
allocates the memory for every operation’s execution. It also decides that
which operation is executed at which site.
4) Plan: The details of execution of operations are
specified in a plan. The plan is like a tree where each node acts as an
operator which executes one operation like sorting, scanning, join, semi-join
etc. The edges show the relationship of consumers and producers of
operators.
5) Plan
Refinement/Code Generation: It converts
the optimized query into executable plan. The code evaluates the predicates and
expressions.
6) Query
Execution Engine: A model is generated where all
operators behave like iterators with same interface. These iterators are
plugged in and plan can be executed. This model also facilitates pipelining
where results could be shared among operators thus enhances the performance.
7) Catalogue: It maintains a record of all happenings
during the process. It holds the database schema, partitioning schema, physical
information and statistics. In centralized database catalogue is saved at the
same place but in distributed database it is replicated at all the sites.
B. Query
Processor – Characteristics
As the processor behaviour
is different while dealing with a centralized database and distributed database
therefore evaluation is difficult to perform. However there are number of
characteristics which are quite helpful for evaluation. Following eight
characteristics gives out the working of distributed database as elaborated in
[13]. It is pertinent to mention that first four are common to centralized and
distributed database:-
1) Languages: Relational DBMS and object DBMS input
languages are based on relational calculus and object calculus respectively.
There is a need of accurate mapping from input to output language by query
processor.
2) Optimization
– Types: There is a need to reduce the
cost effect of every strategy. The optimizing is done through searching a
solution space, estimating the strategies cost and selecting strategy with less
cost. Another type is the usage of heuristics where intermediate relations are
minimized and semijoins are used instead of join operators.
3) Optimization
– Timing: The timing of optimization might
differ from query execution. It can be done statically before query execution
or dynamically after query execution.
4) Statistics: It is helpful in selection of operator’s
priority for both static and dynamic query optimization.
5) Decision
by Sites: A query may be answered by a
single or several sites in a static optimization. These sites participate in
strategy selection which are to be applied for replying the query. If a
strategy generation is done by single site then it is considered as centralized
system where knowledge of entire database should be available at that
particular site. If strategy generation is acquired through number of sites
then local information suffices the problem. There is another approach of
hybrid system where both decision at major level are taken by one site and
minor decisions (locally) are at taken by other sites.
6) Network
Topology Exploitation: The
distributed query processor exploits the topology of a network. In local area
network (LAN) the capability of broadcasting is exploited to reduce the
communication cost. Whereas in wide area network (WAN) the communication cost
is used to minimize the cost function of entire network.
7) Replicated
Fragments Exploitation: To achieve
optimized read performance and reliability at higher rate the fragmented
segments should be replicated and placed at all sites.
8) Semijoins: These are used to minimize the data size
which is exchanged among sites, but it increases the time of local processing.
Generally these are used in conjunction with join operands as in [15].
IV. QUERY OPTIMIZATION SYSTEM
Query
optimization plays an important role in the query processing system. The
database optimizer has to face more difficulty in distributed database
environment as compared to the centralized database system as in [16]. There are certain details of query processing
to obtain the optimization in a distributed database which are shown in figure
8 as follow:-
Fig. 8 – Query Optimization System
A. Optimization in Three Phases
There are commonly
three phases as explained in [17].
1) Phase – I: Fragmentation of database is done.
2) Phase – II: Allocation of fragments to different sites is
done.
3) Phase – III: Execution of task is done.
B. Optimization Steps
There are number of
execution strategies of a query and selecting the best among them is called
query optimization. The process of a distributed query has following steps as
in [18], [19]:-
1) Local Process: It has got two aspects; selection and
projection through which local processing is carried out.
2) Reduction: The relation size is reduced considerably
through the joins and semijoins operands. This is done by a sequence in which
reducers are used.
3) Final Process: All the relations which are produced as a
result of local processing are forwarded to the assembly site where they are
assembled into a final result. The answer to the query is constructed on the
basis of the final result as in [20].
4) Query Strategy Optimization: A single query contains
several strategies because the data is distributed among different sites. All
such strategies possess different level of cost effects. The I/O price and cost
of CPU are added with the communication cost of that particular strategy cost
as well, as explained in [21].
C. Optimizing
Techniques
There are number of ways
through which query performance can be enhanced. The details are follows:-
1) Query – Rewrite: A query can be written in many ways and
developers always try to adopt that way by which they achieve best performance.
Few tips are mentioned below as in [22]:-
·
Complete knowledge
of data
·
Query with
less multiple joins
·
No cursors
in query
·
Non
Correlated Scalar sub-query to be avoided
·
Schema to
be visited time and again
2) Indexing: It helps in reducing the consumption of I/O
and also speed up the performance of a query. The database indices are used to
retrieve data (object or record) in a quick manner according to the criteria. Few
tips are mentioned below:-
·
By
removing the clustered index and adding non-clustered index
·
Creation
of selective index
·
An index
with column
·
Drop
indexes not in use
·
Automatic
indexing and retrieval method as in [23]
·
Small
tables not to be indexed
·
Use of
ODRA indexing as in [24]
·
Data
streaming on scalable ordered indexing as in [25]
3) Response Time and Total Time: In a DBMS once a query is launched,
the important factor which matter is the response time and the total time. Here
response means the time taken to retrieve the first record and total means the
retrieval of complete query result. To increase the performance and reduced the
timing following functions are used:-
·
By sorting
operation.
·
By
constructing plurality of descendent query nodes and table nodes as in [26].
·
By
preparing the database and estimating the response timing of resource and query
as in [27].
·
By setting
the priorities for DBMS scheduling as in [28].
·
By using
algorithm GENERAL including PARALLEL and SERIAL algorithms as in [29].
4) Statistics: Once the query response time and total time
has to be enhanced then there is a need of statistical data for that purpose.
The database administrator (DBA) should know following statistical basic
concept for optimizing performance of DDBMS:-
·
Cardinality
is the way through which unique values are determined in a data set as in [30].
·
Density is
used to measure value’s uniqueness in a data set. The high density indices are
ignored by the query optimizer.
·
Selectivity
is the tool through which a number of rows are measured. It is used by the query
optimizer to determine the query plan’s cost as in [31].
5) Referential Integrity: The query gets a joined table only in case
its reference is specified in the query. This reduces the load of data and
enhance the speed of query. The referential integrity ensures that table having
concerned data in its joined table do exist as in [32].
6) Optimizing Data Model: The query optimization mainly depends upon
the data model in use. It has to be chosen with a lot of care. The model should
include those parameters which are needed in query processing.
D. Proposed
Solution
Nexus to above there could
be number of solutions to optimize the query process. Merging or combination of
following solutions will give out a viable solution:-
1) Search
Space and Search Strategy: It can be
enhance by adding latest and update features into the search space criteria
such as new languages of database and storage system (physical) as in [33]. By
this mean the query optimization is extended. In the same way the search
strategy should also be enhanced by separating it from search space.
2) Dynamic
Programming (DP): In [34] a solution
has been proposed in an environment of homogeneous DDBS where query is
optimized with an approach of DP. In a DP the optimization can be further
refined by generalization of SQL servers.
3) Reformulation
of Query: There are chances where the
query is not formulated in an accurate format by the users as they are mostly
ignorant. The optimizer is confused of the fact and keep on trying to trigger
the query but in vain as explained in [35] & [36]. There is a need of query
reformulation and if the query optimizer is programmed as such by the
programmers that in whatever form the query is asked, optimizer automatically
pursue in the right direction.
4) Data
Guides: There is another way of query
optimization as elaborated in [37]. It is done by data guides which is useful
for semi-structured databases. These data guides serve as schemas which act
dynamically and beneficial for storing/ retrieval of information, query
formulation and query optimization.
5) Avant-garde
Schema: A schema is devised where
those sub-trees which do not have relevant data structures are not searched. In
[38] two types of schemas are discussed. One is related to the relational
database and other deals with object-oriented database. By this mean the query
is optimized.
V. CONCLUSION
There are three
environments of DBMS; centralized, distributed and parallel database system. This
paper has been developed to highlight the query optimization procedure in an
environment of distributed DBMS. The optimization of a query is based upon the
speed of the processing. The speed is related to time which is the response
time and total time taken to complete the query from start till end. The query
processing system in DDBMS has been elaborated in detail. The number of techniques
has also been discussed which are very efficient and popular as well in the
optimizing process of query.
In a nutshell,
this paper encompasses the basics of DDBMS and distributed query processing system in general and optimization of
query processing in particular. The available and adopted techniques of query optimization
have also been elaborated in detail. A proposed solution has also been
highlighted in the light of available techniques.
References
[2]
C. Liu and C. Yu,
"Performance Issues in Distributed Query Processing," IEEE, TOPADS, vol.
4 no. 8, Aug. 1993, USA, pp. 889-905.
[3]
H. Rababaah, “Distributed
Databases Fundamentals and Research”, Indiana Uni. USA, 2005, pp. 1-15.
[4]
M.L. Gillenson, “Fundamentals
of Database Management Systems”, chap. 13, 2004, pp. 291-310.
[5]
M.T. Ozsu and P. Valduriez,
“Distributed and Parallel Database Systems”, ACM Comp. Surveys, vol. 28, no. 1,
Mar. 1996, USA, pp. 125-128.
[6]
L. Bellatreche et. al. “Algorithms
and Support for Horizontal Class Partitioning in Object-Oriented Databases” DAPD,
vol. 8, 2000, pp. 155–179.
[7]
F.A. Baiao et. al. “Horizontal
Fragmentation in Object DBMS: New Issues and Performance Evaluation”, proc. 19th
IEEE, intl. performance, comp. comm. Conf. Feb. 2000, pp. 2.
[8]
S.B. Navathe and M. Ra, “Vertical
partitioning for database design: A Graphical Algorithm”, ACM SIGMOD Record,
Uni. of Florida, 1989, USA, pp. 440-450.
[9]
S.B. Navathe et. al. “A
Mixed Fragmentation Methodology for Initial Distributed Database Design”, Journal
of Comp. and Software, USA, 1995, pp. 1-34.
[10]
S. Vasileva et. al. “Some
Models of a Distributed Database Management System with Data Replication”,
proc. Intl. Conf. on CompSysTech, 2007 pp. 1-6.
[11]
“Data Replication” website.
[Online]. Available: http://ecomputernotes.com/database-system/adv-database/data
replication
[12]
B.M.M. Alom, “Query
Processing and Optimization in Distributed Database”, IJCSNS, vol. 9, no. 9,
Sept. 2009, Australia, pp. 143-153.
[13]
M.T. Özsu and P. Valduriez,
“Principles of Distributed Database Systems”, 3rd Ed. Springer, New
York, 2011, pp. 206-220.
[14]
D. Kossmann, “The State of
the Art in Distributed Query Processing”, ACM Comp. Surveys, vol. 32, no. 4,
Dec. 2000, Passau Uni. pp. 425-428
[15]
M.S. Chen and P.S. Yu, “Combining
Join and Semi-Join Operations for Distributed Query Processing”, IEEE, Transac.
On knowledge and data engr. vol. 5,
no. 3, Jun. 1993, pp. 534-542.
[16]
D. Abdullah, “Query
Optimization in Distributed Databases”, 2003, pp. 1-12.
[17]
R.S. Virk and Dr. G. Singh,
“Optimizing Access Strategies for a Distributed Database Design using Genetic
fragmentation”, IJCSNS, vol. 11, no.6, Jun 2011, USA, pp. 180-183.
[18]
G. Mahajan et. al. “Query
Optimization in DDBS”, IJCAIT, vol. 1, no. 1, Jul. 2012, USA, pp. 4-6.
[19]
B.M.M. Alom, et. al. “Query
Processing and Optimization in Distributed Database Systems”, IJCSNS, vol. 9 no.
9, Sept. 2009, USA, pp. 143-152,
[20]
P.R. Bhuyar et. al. “Horizontal
Fragmentation Technique in Distributed Database”, IJSR, vol. 2, no. 5, May 2012,
pp. 1-7.
[21]
L. Zhou et. al. “The
Semi-join Query Optimization in Distributed Database System”, CITCS, 2012, pp.
606-609.
[22]
J. McHugh and J. Widom, “Query
optimization for XML”, ILPUBS, Stanford Uni. 1999, pp. 1-5.
[23]
S. Deerwester et. al. “Indexing
by Latent Semantic Analysis”, JASIS, vol. 41, no. 6, Sept. 1990, pp. 391.
[24]
T.M. Kowalski et. al. “Query
Optimization by Indexing in the ODRA OODBMS”, Annales UMCS Informatica, vol. 9,
no. 1, 2015, pp. 77-97.
[25]
S. Badiozamany and T. Risch,
“Scalable ordered indexing of streaming data”, 3rd Intl. Workshop
ADMS, VLDB 2012, Turkey, pp. 1-11.
[26]
W. Du et. al. “Reducing
query response time using tree balancing”, proc. ACM SIGMOD 1995, Intl. conf. USA, vol. 24, no. 2 May 1995, pp.
293-303.
[27]
N. Tomov et. al. “Practical
Response Time Estimation in Parallel Relational Database Systems”, pp. 1-35.
[28]
M.J. Carey et. al. “Priority
in DBMS Resource Scheduling”, Uni. of Wisconsin, CSTR no. 828, Mar. 1989, pp.
1-33.
[29]
P.M.G. Apers et. al. “Optimization
Algorithms for Distributed Queries”, IEEE, TOSE, vol. SE-9, no. 1, Jan. 1983,
pp. 57-68.
[30]
B. Schiefer et al. “Method
for estimating cardinalities for query processing in a relational database
management system”, US Patent, 2nd Jun. 1998, pp. 1-16.
[31]
A. Ailamaki et. al. “DBMSs on
a Modern Processor: Where Does Time Go?” proc. Of 25th VLDB conf.
Scotland, 1999, pp. 1-12.
[32]
D. Maier et. al. “Development
of an object-oriented DBMS”, ACM, SIGPLAN notices, proc. of conf. vol. 21, no.
11 Nov. 1986, pp. 472-482.
[33]
R.S.G. Lanzelotte et. al. “Extending
the Search Strategy in a Query Optimizer”, proc. of 17th Intl. conf.
on very large databases, Barcelona, Sept. 1991, pp. 363-373.
[34] A. Mishra et. al. “Dynamic Programming Solution for Query Optimization
in Homogeneous Distributed Databases”, IJERT, vol. 1, no. 6, Aug. 2012, pp.
1-17.
[35] C.S. Bruce, “Workplace Experiences of Information Literacy”.
IJIM, vol. 19, no. 1, Feb. 1999, pp. 33-37.
[36] A.V. Bitar and A.M. Melki, “A Proposed Query Optimization Method
to Boost Students use of Search Engines”, Proc. of ICICTE, 2008, pp. 615-624.
[37] R. Goldman and J. Widom, “DataGuides: Enabling Query Formulation
and Optimization in Semistructured Databases”, ILPUBS, Stanford Uni. 1997, pp.
1-21.
[38] P. Buneman et. al. “A Query Language and Optimization Techniques
for Unstructured Data”, ACM, SIGMOD proc. Of Intl. conf. on mgmt. of data, Jun.
1996, 1-26.