The Concept of Information Storage and Information Retrieval, Thesis of Computer Science


Typology: Thesis

2016/2017

Uploaded on 12/24/2017

murtala_saraki_musa
Chapter 1

Introduction to Information Storage and Retrieval

Information storage and retrieval

Information storage and retrieval is the systematic process of collecting and cataloging data so that they can be located and displayed on request. Computers and data processing techniques have made possible the high-speed, selective retrieval of large amounts of information for government, commercial, and academic purposes. There are several basic types of information-storage-and-retrieval systems.

Document-retrieval systems store entire documents, which are usually retrieved by title or by keywords associated with the document. In some systems, the text of documents is stored as data. This permits full-text searching, enabling retrieval on the basis of any words in the document. In others, a digitized image of the document is stored, usually on a write-once optical disc.
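
Full-text searching of the kind described above is often supported by an inverted index, which maps each word to the documents that contain it. The following is a minimal sketch only: the sample documents, the document ids, and the function names (tokenize, build_index, search) are illustrative assumptions and do not describe any particular system.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every word to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, word):
    """Return the sorted ids of the documents that contain the word."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical sample collection (not taken from the source text).
documents = {
    "d1": "Annual report on library acquisitions",
    "d2": "Library catalog of printed maps",
    "d3": "Report on optical disc storage",
}
index = build_index(documents)
print(search(index, "library"))  # ['d1', 'd2']
print(search(index, "report"))   # ['d1', 'd3']
```

Because every word of the text is indexed, a document can be found through any word it contains, not only through its title or assigned keywords.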

Database systems store the information as a series of discrete records that are, in turn, divided into discrete fields (e.g., name, address, and phone number); records can be searched and retrieved on the basis of the content of the fields (e.g., all people who have a particular telephone area code). The data are stored within the computer, either in main storage or auxiliary storage, for ready access.
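
The record-and-field organization can be illustrated with a short sketch. The Person record, the sample values, and the assumption that phone numbers are stored as "area-exchange-number" are all hypothetical and chosen only to mirror the area-code example in the text.

```python
from dataclasses import dataclass


@dataclass
class Person:
    """One discrete record divided into discrete fields."""
    name: str
    address: str
    phone: str  # assumed format: "AAA-EEE-NNNN", where AAA is the area code


def by_area_code(records, area_code):
    """Return the records whose phone field begins with the given area code."""
    return [r for r in records if r.phone.split("-")[0] == area_code]


# Hypothetical sample records.
records = [
    Person("Ada", "1 Main St", "212-555-0101"),
    Person("Ben", "2 Oak Ave", "415-555-0102"),
    Person("Cy", "3 Elm Rd", "212-555-0103"),
]
matches = by_area_code(records, "212")
print([p.name for p in matches])  # ['Ada', 'Cy']
```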

Reference-retrieval systems store references to documents rather than the documents themselves. Such systems, in response to a search request, provide the titles of relevant documents and frequently their physical locations. Such systems are efficient when large amounts of different types of printed data must be stored. They have proven extremely effective in libraries, where material is constantly changing.

Data

Data is a collection of raw facts from which conclusions may be drawn. Handwritten letters, a printed book, a family photograph, a movie on video tape, printed and duly signed copies of mortgage papers, a bank’s ledgers, and an account holder’s passbooks are all examples of data. Before the advent of computers, the procedures and methods adopted for data creation and sharing were limited to fewer forms, such as paper and film. Today, the same data can be converted into more convenient forms such as an e-mail message, an e-book, a bitmapped image, or a digital movie. This data can be generated using a computer and stored in strings of 0s and 1s, as shown in the figure below. Data in this form is called digital data and is accessible by the user only after it is processed by a computer.
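
As a small illustration of data stored as strings of 0s and 1s, the sketch below converts a short text into its bit pattern and back. The choice of UTF-8 encoding and the helper names (to_bits, from_bits) are assumptions made for the example.

```python
def to_bits(text):
    """Encode text as UTF-8 and render each byte as eight binary digits."""
    return " ".join(format(byte, "08b") for byte in text.encode("utf-8"))


def from_bits(bits):
    """Reverse to_bits: parse the binary digits and decode the bytes."""
    return bytes(int(chunk, 2) for chunk in bits.split()).decode("utf-8")


bits = to_bits("Hi")
print(bits)             # 01001000 01101001
print(from_bits(bits))  # Hi
```

The bit string is meaningless to a reader until a program decodes it, which is the sense in which digital data is accessible only after processing by a computer.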

With the advancement of computer and communication technologies, the rate of data generation and sharing has increased exponentially. The following is a list of some of the factors that have contributed to the growth of digital data:

Increase in data processing capabilities: Modern-day computers provide a significant increase in processing and storage capabilities. This enables the conversion of various types of content and media from conventional forms to digital formats.

Lower cost of digital storage: Technological advances and decrease in the cost of storage devices have provided low-cost solutions and encouraged the development of less expensive data storage devices. This cost benefit has increased the rate at which data is being generated and stored.

Affordable and faster communication technology: The rate of sharing digital data is now much faster than traditional approaches. A handwritten letter may take a week to reach its destination, whereas it only takes a few seconds for an e-mail message to reach its recipient.

Inexpensive and easier ways to create, collect, and store all types of data, coupled with increasing individual and business needs, have led to accelerated data growth, popularly termed the data explosion. Data has different purposes and criticality, so both individuals and businesses have contributed in varied proportions to this data explosion.

The importance and the criticality of data vary with time. Most of the data created holds significance in the short term but becomes less valuable over time. This governs the type of data storage solutions used. Individuals store data on a variety of storage devices, such as hard disks, CDs, DVDs, or Universal Serial Bus (USB) flash drives.

Businesses generate vast amounts of data and then extract meaningful information from this data to derive economic benefits. Therefore, businesses need to maintain data and ensure its availability over a longer period. Furthermore, the data can vary in criticality and may require special handling. For example, legal and regulatory requirements mandate that banks maintain account information for their customers accurately and securely. Some businesses handle data for millions of customers, and ensure the security and integrity of data over a long period of time. This requires high capacity storage devices with enhanced security features that can retain data for a long period.

Types of Data

Because information is critical to the success of a business, there is an ever-present concern about its availability and protection. Legal, regulatory, and contractual obligations regarding the availability and protection of data only add to these concerns. Outages in key industries, such as financial services, telecommunications, manufacturing, retail, and energy, cost millions of U.S. dollars per hour.

Information is increasingly important in our daily lives. We have become information dependents of the twenty-first century, living in an on-command, on-demand world in which we need information when and where it is required. We access the Internet every day to perform searches, participate in social networking, send and receive e-mails, share pictures and videos, and use scores of other applications. Because individuals are equipped with a growing number of content-generating devices, they now create more information than businesses do. Information created by individuals gains value when shared with others. When created, information resides locally on devices such as cell phones, cameras, and laptops. To be shared, it needs to be uploaded via networks to data centers. It is interesting to note that while the majority of information is created by individuals, it is stored and managed by a relatively small number of organizations. Figure 1-1 depicts this virtuous cycle of information.

The importance, dependency, and volume of information for the business world also continue to grow at astounding rates. Businesses depend on fast and reliable access to information critical to their success. Some of the business applications that process information include airline reservations, telephone billing systems, e-commerce, ATMs, product designs, inventory management, e-mail archives, Web portals, patient records, credit cards, life sciences, and global capital markets. The increasing criticality of information to businesses has amplified the challenges in protecting and managing the data. The volume of data that businesses must manage has driven strategies to classify data according to its value and create rules for the treatment of this data over its lifecycle. These strategies provide not only financial and regulatory benefits at the business level but also manageability benefits at the operational level.

Data centers now view information storage as one of their core elements, along with applications, databases, operating systems, and networks. Storage technology continues to evolve with technical advancements offering increasingly higher levels of availability, security, scalability, performance, integrity, capacity, and manageability.

Information storage

Introduction

Information storage is a central pillar of information technology. A large quantity of digital information is being created every moment by individual and corporate consumers of IT. This information needs to be stored, protected, optimized, and managed. Not long ago, information storage was seen as only a bunch of disks or tapes attached to the back of the computer to store data. Even today, only those in the storage industry understand the critical role that information storage technology plays in the availability, performance, integration, and optimization of the entire IT infrastructure. Over the last two decades, information storage has developed into a highly sophisticated technology, providing a variety of solutions for storing, managing, connecting, protecting, securing, sharing, and optimizing digital information. With the exponential growth of information and the development of sophisticated products and solutions, there is also a growing need for information storage professionals. IT managers are challenged by the ongoing task of employing and developing highly skilled information storage professionals. Many leading universities and colleges have started to include storage technology courses in their regular computer technology or information technology curricula. Yet many of today’s IT professionals, even those with years of experience, have not benefited from this formal education. As a result, many seasoned professionals, including application, systems, database, and network administrators, do not share a common foundation about how storage technology affects their areas of expertise.

Information retrieval

Information retrieval, commonly referred to as IR, is the process by which a collection of information is represented, stored, and searched in order to extract items that match the specific parameters of a user's request—or query—for information. Though information retrieval can be a manual process, as in using an index to find certain information within a book, the term is usually applied when the collection of information is in electronic form, and the process of matching query and document is carried out by computer. The collection usually consists of text documents (either bibliographic information such as title, citation and abstract, or the complete text of documents such as journal articles, magazines, newspapers, or encyclopedias). Collections of multimedia documents such as images, video clips, music, and sound are also becoming common, and information retrieval methods are being developed to search these types of collections as well.

The information retrieval process begins with an information need: someone (referred to as the user) requires certain information to answer a question or carry out a task. To retrieve the information, the user develops a query, which is the expression of the information need in concrete terms ("I need information on whitewater rafting in the Grand Canyon").

The query is then translated into the specific search strategy best suited to the document collection and search engine to be searched (for example, "whitewater ADJ rafting AND grand ADJ canyon" where ADJ means "adjacent" and AND means "and"). The search engine matches the terms of the search query against terms in documents in the collection, and it retrieves the items that match the user's request, based on the matching criteria used by that search engine. The retrieved documents can be viewed by the user, who decides whether they are relevant; that is, whether they meet the original information need.
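
The example search strategy can be sketched in a few lines. This is an illustration under stated assumptions: ADJ is interpreted here as "the first word is immediately followed by the second", the sample documents are invented, and real search engines evaluate such operators against a prebuilt index rather than by scanning each text.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def has_adjacent(tokens, first, second):
    """True if `first` is immediately followed by `second` in the token list."""
    return any(a == first and b == second for a, b in zip(tokens, tokens[1:]))


def matches(text):
    """Evaluate the query: whitewater ADJ rafting AND grand ADJ canyon."""
    tokens = tokenize(text)
    return (has_adjacent(tokens, "whitewater", "rafting")
            and has_adjacent(tokens, "grand", "canyon"))


# Hypothetical sample collection.
documents = {
    "d1": "Whitewater rafting trips through the Grand Canyon",
    "d2": "Rafting in whitewater near a grand old canyon",
    "d3": "Hiking the Grand Canyon rim",
}
hits = [doc_id for doc_id, text in documents.items() if matches(text)]
print(hits)  # ['d1']
```

Document d2 contains all four words but fails both adjacency tests, which shows how exact matching criteria decide what is retrieved.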

Information retrieval is a complex process because there is no infallible way to provide a direct connection between a user's query for information and documents that contain the desired information. Information retrieval is based on a match between the words used to formulate the query and the words used to express concepts or ideas in a document. A search may fail because the user does not correctly guess the words that a useful document would contain, so important material is missed. Or, the user's search terms may appear in retrieved documents that pertain to a subject other than the one intended by the user, so material is retrieved which is not useful.

Problems with Boolean Retrieval

Boolean searching has been criticized because it requires searchers to understand and apply basic Boolean logic in constructing their search strategies, rather than posing their queries in natural language. Another criticism is that Boolean searching requires that terms in the retrieved document exactly match the query terms, so potentially useful information may be missed because a document does not contain the specific term the searcher thought to use. A Boolean search essentially divides a database into two parts: documents that match and those that do not match the query. The number of documents retrieved may be zero, if the query was very specific, or it could be tens of thousands if very common terms were used. All documents retrieved are treated equally, so the system cannot make recommendations about the order in which they should be viewed. Because of its complexity, Boolean searching has often been carried out by information professionals such as librarians who act as research intermediaries for their patrons.

Boolean retrieval has also been criticized on the basis of performance. The standard measures of performance for IR systems are precision and recall. Precision is a measure of the ability of a system to retrieve only relevant documents (those which match the subject of the user's query). Recall is a measure of the ability of the system to retrieve all the relevant documents in the system. Using these measures, the performance of Boolean systems has been criticized as inadequate, leading to the continuing search for other ways to retrieve information electronically.
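
In quantitative terms, precision is usually computed as the fraction of retrieved documents that are relevant, and recall as the fraction of relevant documents that are retrieved. The sketch below assumes the set of relevant documents is known in advance; the document ids are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result: four documents retrieved, three relevant in total.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Here two of the four retrieved documents are relevant (precision 0.50), and two of the three relevant documents were found (recall about 0.67).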

Alternatives to Boolean Retrieval

Since the 1960s and 1970s, IR researchers have explored ways to improve the performance of information retrieval systems. Gerard Salton (1927–1995), a professor at Cornell University, was a key figure in this research. For more than thirty years, he and his students worked on the SMART system, a research environment that allowed them to explore the impact of varying parameters in the retrieval system. Using measures such as precision and recall, he and other researchers found that performance improvements can be made by implementing systems with features such as term weighting, ranked output based on the calculation of query-document similarity, and relevance feedback.

In these systems, documents are represented by the terms they contain. The list of terms is often referred to as a document vector and is used to position the document in N-dimensional space (where N is the number of unique terms in the entire collection of documents). This approach to IR is referred to as the "vector space model."

For each term, a weight is calculated from term-frequency statistics; the weight represents the importance of the term in the document. A common method is to calculate the tf×idf value (term frequency × inverse document frequency). In this model the weight of a term in a document is proportional to the frequency of occurrence of the term in the document, and inversely related to the frequency with which the term occurs in the entire document collection. In other words, a good index term is one that occurs frequently in a particular document but infrequently in the database as a whole.
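
The weighting scheme can be sketched as follows. Several tf×idf variants exist; this sketch assumes one common form, weight = tf × log(N / df), where tf is the count of the term in the document, N is the number of documents, and df is the number of documents containing the term. The formula choice, the whitespace tokenization, and the sample documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df)."""
    n_docs = len(documents)
    # Term frequency: how often each term occurs in each document.
    term_counts = {doc_id: Counter(text.lower().split())
                   for doc_id, text in documents.items()}
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    return {
        doc_id: {term: tf * math.log(n_docs / doc_freq[term])
                 for term, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }


# Hypothetical sample collection.
documents = {
    "d1": "storage storage disk",
    "d2": "disk retrieval",
    "d3": "disk query",
}
weights = tf_idf(documents)
print(weights["d1"])
```

In this collection "storage" occurs twice in d1 and nowhere else, so it receives a high weight (2 × ln 3), while "disk" occurs in every document and receives a weight of zero, which matches the idea that a good index term is frequent in one document but rare in the collection.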

The query is also treated as a vector in N-dimensional space, and the closeness of a document vector to the query vector indicates their similarity, or degree of match. This closeness is commonly quantified with a similarity function such as the cosine measure, which is based on the angle between the two vectors. The results are sorted by similarity value and displayed in order, best match first.
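
Ranking by the cosine measure can be sketched as follows. For brevity the vectors here hold raw term counts rather than tf×idf weights, which is a simplification; the sample query, the documents, and the function names are likewise assumptions made for the example.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a text as a sparse vector of raw term counts."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Return (score, doc_id) pairs sorted by similarity, best match first."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(text)), doc_id)
              for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical sample collection and query.
documents = {
    "d1": "whitewater rafting grand canyon",
    "d2": "grand canyon hiking trails",
    "d3": "city bus timetable",
}
ranking = rank("whitewater rafting canyon", documents)
for score, doc_id in ranking:
    print(f"{doc_id} {score:.3f}")
```

Unlike a Boolean match, every document receives a score, so d2 is still listed (it shares one term with the query) but below d1, and d3 scores zero.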

The relevance feedback feature allows the user to examine documents and make some judgments about their relevance. This information is used to recalculate the weights and rerank the documents, improving the usefulness of the document display.
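
The text does not say how the weights are recalculated. One widely known approach is Rocchio's formula, shown here purely as an illustrative assumption: the query vector is moved toward the documents judged relevant and away from those judged non-relevant. The parameter values (alpha, beta, gamma) and the sample vectors are also assumptions.

```python
from collections import defaultdict


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return alpha*query + beta*mean(relevant) - gamma*mean(nonrelevant).

    All vectors are dicts mapping term -> weight.
    """
    updated = defaultdict(float)
    for term, weight in query.items():
        updated[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            updated[term] += beta * weight / len(relevant)
    for doc in nonrelevant:
        for term, weight in doc.items():
            updated[term] -= gamma * weight / len(nonrelevant)
    # Terms whose weight is not positive are dropped, a common simplification.
    return {term: w for term, w in updated.items() if w > 0}


# Hypothetical feedback: the user marks one document relevant, one not.
query = {"rafting": 1.0}
relevant = [{"rafting": 1.0, "canyon": 1.0}]
nonrelevant = [{"rafting": 1.0, "furniture": 1.0}]
new_query = rocchio(query, relevant, nonrelevant)
print(new_query)
```

After feedback the query gains the term "canyon" from the relevant document, and "furniture" is excluded, so a rerun of the search would favor documents resembling the one the user approved.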

These systems allow the user to state an information need in natural language, rather than constructing a formal query as required by Boolean systems. The ranked output also imposes an order on the documents retrieved, so that the first documents to be viewed are most likely to be relevant. The search is modified automatically based on the user's feedback to the system.

More recently, information retrieval systems have been developed to search the World Wide Web. These search engines use software programs called crawlers that locate pages on the web, which are then indexed on a centralized server. The index is used to answer queries submitted to the web search engine. The matching algorithms used to match queries with web pages are based on the Boolean or vector space model.

Individual search engines vary in terms of the information on the web page that they index, the factors used in assigning term weights, and the ranking algorithm used. Some search engines index information extracted from hyperlinks as well as from the text itself. Because information on the search engine is usually proprietary, details of the algorithms are not readily available. Comparisons of retrieval performance are also difficult because the systems index different parts of the web and because they undergo constant change. Recall is impossible to measure because the potential number of pages relevant to a query is so large.

The Future of Information Retrieval

Researchers continue to improve the performance of information retrieval systems. An ongoing series of experiments called TREC (Text REtrieval Conference) is conducted annually by the National Institute of Standards and Technology to encourage research in information retrieval and its use in real-world systems.

One long-term goal is to develop systems that do more than simply identify useful documents. By considering a database as a knowledge base rather than simply a collection of documents, it may be possible to design retrieval systems that can interpret documents and use the knowledge they contain to answer questions. This will require developments in artificial intelligence (AI), natural language processing, expert systems, and related fields. Research so far has concentrated primarily on relatively narrow subject areas, but the goal is to create systems that can understand and respond to questions in broad subject areas.

"Information Retrieval." Computer Sciences. Retrieved October 03, 2016 from Encyclopedia.com: http://www.encyclopedia.com/computing/news-wires-white-papers-and-books/information-retrieval
