
MODULE - I

DATA GATHERING AND PREPARATION

BIG DATA ANALYTICS:

The volume of data that one has to deal with has exploded to unimaginable levels in the past decade, and at the same time the price of data storage has systematically decreased. Private companies and research institutions capture terabytes of data about their users' interactions, business, social media, and also sensors on devices such as mobile phones and automobiles. The challenge of this era is to make sense of this sea of data. This is where big data analytics comes into the picture. Big data analytics largely involves collecting data from different sources, managing it in a way that makes it available for consumption by analysts, and finally delivering data products that are useful to the organization's business.

Big Data Analytics - Data Life Cycle:

Traditional Data Mining Life Cycle: In order to provide a framework for organizing the work needed by an organization and delivering clear insights from big data, it is useful to think of it as a cycle with different stages. It is by no means linear: all the stages are related to each other. This cycle has superficial similarities with the more traditional data mining cycle described by the CRISP methodology.

CRISP-DM Methodology: The CRISP-DM methodology, which stands for Cross Industry Standard Process for Data Mining, is a cycle that describes the approaches commonly used by data mining experts to tackle problems in traditional BI data mining, and it is still used by traditional BI data mining teams. Its major stages are interrelated rather than strictly sequential. CRISP-DM was conceived in 1996, and the next year it got underway as a European Union project under the ESPRIT funding initiative. The project was led by five companies: SPSS, Teradata, Daimler AG, NCR Corporation, and OHRA (an insurance company), and was finally incorporated into SPSS. The methodology is extremely detailed about how a data mining project should be specified. The stages involved in the CRISP-DM life cycle are:
 Business Understanding − This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition. A preliminary plan is designed to achieve the objectives.

SEMMA Methodology:

SEMMA is another methodology, developed by SAS for data mining modeling. It stands for Sample, Explore, Modify, Model, and Assess. Here is a brief description of its stages:
 Sample − The process starts with data sampling, e.g., selecting the dataset for modeling. The dataset should be large enough to contain sufficient information to retrieve, yet small enough to be used efficiently. This phase also deals with data partitioning.
 Explore − This phase covers understanding the data by discovering anticipated and unanticipated relationships between the variables, as well as abnormalities, with the help of data visualization.
 Modify − The Modify phase contains methods to select, create, and transform variables in preparation for data modeling.
 Model − In the Model phase, the focus is on applying various modeling (data mining) techniques to the prepared variables in order to create models that provide the desired outcome.
 Assess − The evaluation of the modeling results shows the reliability and usefulness of the created models.

The main difference between CRISP-DM and SEMMA is that SEMMA focuses on the modeling aspect, whereas CRISP-DM gives more importance to the stages of the cycle prior to modeling, such as understanding the business problem to be solved and understanding and preprocessing the data to be used as input to, for example, machine learning algorithms.

Big Data Life Cycle:

In today's big data context, the previous approaches are either incomplete or suboptimal. For example, the SEMMA methodology completely disregards the collection and preprocessing of data from different sources, yet these stages normally constitute most of the work in a successful big data project. A big data analytics cycle can be described by the following stages:
 Business Problem Definition
 Research
 Human Resources Assessment
 Data Acquisition
 Data Munging
 Data Storage
 Exploratory Data Analysis
 Data Preparation for Modeling and Assessment
 Modeling
 Implementation

In this section, we will throw some light on each of these stages of the big data life cycle.

Business Problem Definition
This is a point common to the traditional BI and big data analytics life cycles. Normally it is a non-trivial stage of a big data project to define the problem and evaluate correctly how much potential gain it may have for an organization. It seems obvious to mention this, but the expected gains and costs of the project have to be evaluated.

Research
Analyze what other companies have done in the same situation. This involves looking for solutions that are reasonable for your company, even if it means adapting other solutions to the resources and requirements your company has. In this stage, a methodology for the future stages should be defined.

Human Resources Assessment
Once the problem is defined, it is reasonable to continue by analyzing whether the current staff is able to complete the project successfully. Traditional BI teams might not be capable of delivering an optimal solution to all the stages, so it should be considered before starting the project whether there is a need to outsource a part of the project or hire more people.

Data Acquisition
This section is key in a big data life cycle; it defines which types of profiles would be needed to deliver the resultant data product. Data gathering is a non-trivial step of the process; it normally involves gathering unstructured data from different sources. To give an example, it could involve writing a crawler to retrieve reviews from a website. This involves dealing with text, perhaps in different languages, and normally requires a significant amount of time to complete.

Data Munging
Once the data is retrieved, for example from the web, it needs to be stored in an easy-to-use format. To continue with the reviews example, let's assume the data is retrieved from different sites, each of which displays the data differently. Suppose one data source gives reviews as a rating in stars; this can be read as a mapping for the response variable y ∈ {1, 2, 3, 4, 5}. Another data source gives reviews in a different format, so before modeling, the different representations must be mapped onto a single common scale.
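To make the munging step concrete, here is a minimal sketch that maps two hypothetical review formats onto the single response scale y ∈ {1, 2, 3, 4, 5} described above. The field names and the two records are assumptions made purely for illustration; they are not taken from any real review site.

```python
# Hypothetical example: normalizing reviews from two sources onto one 1-5 scale.

def from_stars(record):
    """Source A already reports a star rating in the range 1-5."""
    return int(record["stars"])

def from_votes(record):
    """Source B reports up/down votes; map the up-vote share onto 1-5."""
    total = record["up"] + record["down"]
    if total == 0:
        return None                       # no signal; drop or impute later
    share = record["up"] / total          # value in [0, 1]
    return 1 + round(share * 4)           # value in {1, 2, 3, 4, 5}

raw = [
    {"source": "A", "stars": 4, "text": "Great pasta"},
    {"source": "B", "up": 9, "down": 1, "text": "Loved it"},
]

cleaned = []
for r in raw:
    y = from_stars(r) if r["source"] == "A" else from_votes(r)
    if y is not None:
        cleaned.append({"text": r["text"], "y": y})

print(cleaned)  # [{'text': 'Great pasta', 'y': 4}, {'text': 'Loved it', 'y': 5}]
```

Once every source is expressed on the same scale, the combined records can be stored together and treated as a single dataset in the later stages of the cycle.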

Modeling
The prior stage should have produced several datasets for training and testing, for example, for a predictive model. This stage involves trying different models with the aim of solving the business problem at hand. In practice, it is normally desired that the model give some insight into the business. Finally, the best model or combination of models is selected by evaluating its performance on a left-out dataset.

Implementation
In this stage, the data product developed is implemented in the data pipeline of the company. This involves setting up a validation scheme while the data product is working, in order to track its performance. For example, in the case of implementing a predictive model, this stage would involve applying the model to new data and, once the response is available, evaluating the model.

Big Data Analytics - Methodology:

In terms of methodology, big data analytics differs significantly from the traditional statistical approach of experimental design. Analytics starts with data; normally we model the data in a way that explains a response. The objectives of this approach are to predict the response behavior or to understand how the input variables relate to a response. In statistical experimental designs, an experiment is normally developed and data is retrieved as a result. This allows data to be generated in a way that can be used by a statistical model, where certain assumptions hold, such as independence, normality, and randomization.

In big data analytics, we are presented with the data; we cannot design an experiment that fulfills our favorite statistical model. In large-scale applications of analytics, a large amount of work (normally 80% of the effort) is needed just for cleaning the data so it can be used by a machine learning model. There is no unique methodology to follow in real large-scale applications; normally, once the business problem is defined, a research stage is needed to design the methodology to be used. However, some general guidelines are relevant and apply to almost all problems.

One of the most important tasks in big data analytics is statistical modeling, meaning supervised and unsupervised classification or regression problems. Once the data is cleaned, preprocessed, and available for modeling, care should be taken to evaluate different models with reasonable loss metrics; then, once a model is implemented, further evaluation and results should be reported. A common pitfall in predictive modeling is to just implement the model and never measure its performance.
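As a hedged illustration of this model-evaluation guideline, the sketch below compares two candidate models on a left-out test set using mean squared error as the loss metric. scikit-learn and the synthetic data are assumptions made for the example; the notes themselves do not prescribe a specific library.

```python
# Illustrative model selection on a left-out dataset (synthetic data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                       # 500 rows, 10 features
y = X @ rng.normal(size=10) + rng.normal(size=500)   # noisy linear response

# Hold out 20% of the data; the winner is chosen on this left-out set only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    loss = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: test MSE = {loss:.3f}")
```

Reporting the left-out loss for every candidate, rather than only for the deployed model, is what guards against the pitfall mentioned above.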

Big Data Analytics - Core Deliverables:

As mentioned in the big data life cycle, the data products that result from developing a big data product are in most cases some of the following:
 Machine learning implementation − This could be a classification algorithm, a regression model, or a segmentation model.
 Recommender system − The objective is to develop a system that recommends choices based on user behavior. Netflix is the characteristic example of this data product: based on the ratings of users, other movies are recommended.
 Dashboard − Businesses normally need tools to visualize aggregated data. A dashboard is a graphical mechanism to make this data accessible.
 Ad-hoc analysis − Business areas normally have questions, hypotheses, or myths that can be answered by doing ad-hoc analysis of the data.

Big Data Analytics - Key Stakeholders:

In large organizations, in order to successfully develop a big data project, it is necessary to have management backing the project. This normally involves finding a way to show the business advantages of the project. There is no unique solution to the problem of finding sponsors for a project, but a few guidelines are given below:
 Check who the sponsors of other projects similar to the one that interests you are, and where they sit.
 Having personal contacts in key management positions helps, so any contact can be triggered if the project is promising.
 Who would benefit from your project? Who would be your client once the project is on track?
 Develop a simple, clear, and exciting proposal and share it with the key players in your organization.

The best way to find sponsors for a project is to understand the problem and what the resulting data product would be once it has been implemented. This understanding will give an edge in convincing management of the importance of the big data project.

Big Data Analytics - Data Analyst:

A data analyst has a reporting-oriented profile, with experience in extracting and analyzing data from traditional data warehouses using SQL. Their tasks are normally either on the side of data

Big Data Analytics - Problem Definition:

Through this tutorial, we will develop a project. Each subsequent chapter of this tutorial deals with a part of the larger project in its mini-project section. This is intended to be an applied tutorial section that provides exposure to a real-world problem. In this case, we start with the problem definition of the project.

Project Description
The objective of this project is to develop a machine learning model to predict the hourly salary of people using their curriculum vitae (CV) text as input. Using the framework defined above, it is simple to define the problem. We can define X = {x1, x2, …, xn} as the CVs of users, where each feature can be, in the simplest way possible, the number of times a given word appears. The response is real-valued: we are trying to predict the hourly salary of individuals in dollars. These two considerations are enough to conclude that the presented problem can be solved with a supervised regression algorithm.

Problem Definition
Problem definition is probably one of the most complex and most heavily neglected stages in the big data analytics pipeline. In order to define the problem a data product would solve, experience is mandatory. Most aspiring data scientists have little or no experience in this stage. Most big data problems can be categorized in the following ways:
 Supervised classification
 Supervised regression
 Unsupervised learning
 Learning to rank

Let us now learn more about these four concepts.

Supervised Classification
Given a matrix of features X = {x1, x2, ..., xn}, we develop a model M to predict different classes defined as y = {c1, c2, ..., cn}. For example, given transactional data of customers in an insurance company, it is possible to develop a model that will predict whether a client will churn or not. This is a binary classification problem, where there are two classes or target values: churn and no churn.
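A minimal sketch of the churn example, assuming scikit-learn and a small synthetic table of transactional features (the feature names and the churn rule are invented for illustration, not taken from a real insurer):

```python
# Illustrative binary churn classifier on synthetic transactional features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 1000
# columns: number of claims, months as a customer, monthly premium
X = np.column_stack([
    rng.poisson(1.5, n),
    rng.integers(1, 120, n),
    rng.normal(80, 20, n),
])
# invented rule: short-tenure customers with many claims tend to churn
y = ((X[:, 0] > 2) & (X[:, 1] < 24)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```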

Other problems involve predicting more than two classes. For example, we could be interested in digit recognition, where the response vector would be defined as y = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; a state-of-the-art model would be a convolutional neural network, and the matrix of features would be the pixels of the image.

Supervised Regression
In this case, the problem definition is rather similar to the previous example; the difference lies in the response. In a regression problem the response y ∈ ℝ, meaning the response is real-valued. For example, we can develop a model to predict the hourly salary of individuals given the text of their CVs.

Unsupervised Learning
Management is often thirsty for new insights. Segmentation models can provide these insights so that the marketing department can develop products for different segments. A good approach for developing a segmentation model, rather than thinking of algorithms, is to select the features that are relevant to the desired segmentation. For example, in a telecommunications company it is interesting to segment clients by their cell phone usage. This involves disregarding features that have nothing to do with the segmentation objective and including only those that do: in this case, features such as the number of SMS messages used in a month, the number of inbound and outbound minutes, etc.

Big Data Analytics - Data Collection:

Data collection plays the most important role in the big data cycle. The Internet provides almost unlimited sources of data on a variety of topics. The importance of this area depends on the type of business, but traditional industries can acquire diverse sources of external data and combine them with their transactional data. For example, let's assume we would like to build a system that recommends restaurants. The first step would be to gather data, in this case reviews of restaurants from different websites, and store them in a database. As we are interested in raw text and will use it for analytics, it is not that relevant where the data for developing the model is stored. This may sound contradictory to the main big data technologies, but in order to implement a big data application, we simply need to make it work in real time.
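As a small, hedged illustration of this collection step, the sketch below extracts review text from an HTML snippet and stores the raw text in a database. The HTML structure, the CSS selector, and the table schema are assumptions for the example; a real crawler would fetch many pages from the actual review sites rather than using a hard-coded string.

```python
# Illustrative extraction of raw review text and storage in SQLite.
import sqlite3
from bs4 import BeautifulSoup   # third-party HTML parser, assumed installed

html = """
<div class="review"><p>Great pasta, friendly staff.</p></div>
<div class="review"><p>Too noisy, but the pizza was good.</p></div>
"""

soup = BeautifulSoup(html, "html.parser")
reviews = [p.get_text(strip=True) for p in soup.select("div.review p")]

conn = sqlite3.connect(":memory:")          # stand-in for the real review store
conn.execute("CREATE TABLE reviews (site TEXT, text TEXT)")
conn.executemany("INSERT INTO reviews VALUES (?, ?)",
                 [("example-site", text) for text in reviews])
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0], "reviews stored")
```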

Data format in information technology may refer to:
 Data type − a constraint placed upon the interpretation of data in a type system.
 Signal (electrical engineering) − a format for signal data used in signal processing.
 Recording format − a format for encoding data for storage on a storage medium.
 File format − a format for encoding data for storage in a computer file.
 Container format (digital) − a format for encoding data for storage by means of a standardized audio/video codec file format.
 Content format − a format for representing media content as data.
 Audio format − a format for encoded sound data.
 Video format − a format for encoded video data.

Recommended Digital Data Formats:
 Text, documentation, scripts: XML, PDF/A, HTML, plain text.
 Still image: TIFF, JPEG 2000, PNG, JPEG/JFIF, DNG (digital negative), BMP, GIF.
 Geospatial: Shapefile (SHP, DBF, SHX), GeoTIFF, NetCDF.
 Graphic image:
   raster formats: TIFF, JPEG 2000, PNG, JPEG/JFIF, DNG, BMP, GIF.
   vector formats: Scalable Vector Graphics, AutoCAD Drawing Interchange Format, Encapsulated PostScript, Shapefiles.
   cartographic: most complete data, GeoTIFF, GeoPDF, GeoJPEG 2000, Shapefile.
 Audio: WAVE, AIFF, MP3, MXF, FLAC.
 Video: MOV, MPEG-4, AVI, MXF.
 Database: XML, CSV, TAB.

Parsing and Transformation:

In the data transformation process, data are transformed from one format into another format that is more appropriate for data mining. Some data transformation strategies are:
 Smoothing − Smoothing is the process of removing noise from the data.
 Aggregation − Aggregation is a process where summary or aggregation operations are applied to the data.
 Generalization − In generalization, low-level data are replaced with high-level data by climbing concept hierarchies.
 Normalization − Normalization scales attribute data so that it falls within a small specified range, such as 0.0 to 1.0 (a minimal sketch follows this list).
 Attribute construction − In attribute construction, new attributes are constructed from the given set of attributes.
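The sketch below illustrates the normalization strategy from the list above, scaling an attribute into the range 0.0 to 1.0 with min-max normalization (the income values are invented for illustration):

```python
# Min-max normalization of one attribute into [0.0, 1.0].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]   # constant attribute: map everything to the lower bound
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

incomes = [12000, 35000, 54000, 98000]
print([round(v, 3) for v in min_max_normalize(incomes)])  # [0.0, 0.267, 0.488, 1.0]
```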

Scalability:

Scalability is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth. For example, a system is considered scalable if it is capable of increasing its total output under an increased load when resources (typically hardware) are added. An analogous meaning is implied when the word is used in an economic context, where a company's scalability implies that the underlying business model offers the potential for economic growth within the company.

Scalability, as a property of systems, is generally difficult to define, and in any particular case it is necessary to define the specific requirements for scalability on those dimensions that are deemed important. It is a highly significant issue in electronic systems, databases, routers, and networking. A system whose performance improves after adding hardware, proportionally to the capacity added, is said to be a scalable system. Scalability can be measured in various dimensions, such as:
 Administrative scalability − The ability for an increasing number of organizations or users to easily share a single distributed system.
 Functional scalability − The ability to enhance the system by adding new functionality at minimal effort.
 Geographic scalability − The ability to maintain performance, usefulness, or usability regardless of expansion from concentration in a local area to a more distributed geographic pattern.
 Load scalability − The ability for a distributed system to easily expand and contract its resource pool to accommodate heavier or lighter loads or numbers of inputs. Alternatively, the ease with which a system or component can be modified, added, or removed to accommodate a changing load.
 Generation scalability − The ability of a system to scale up by using new generations of components. Heterogeneous scalability, correspondingly, is the ability to use components from different vendors.

Scalability issues:
 A routing protocol is considered scalable with respect to network size if the size of the necessary routing table on each node grows as O(log N), where N is the number of nodes in the network.

MODULE - II

DATA CLEANING

Data cleaning is a technique applied to remove noisy data and correct inconsistencies in the data. It involves transformations to correct wrong data, and it is performed as a data preprocessing step while preparing the data for a data warehouse. The Knowledge Discovery Process (KDP) is the process of finding knowledge in data; it does this by using data mining methods (algorithms) to extract the demanded knowledge from large amounts of data.

Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. Data cleansing may be performed interactively with data wrangling tools, or as batch processing through scripting.

After cleansing, a dataset should be consistent with other similar datasets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores. Data cleaning differs from data validation in that validation almost invariably means data is rejected from the system at entry, and is performed at the time of entry rather than on batches of data.

The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records). Some data cleansing solutions clean data by cross-checking it against a validated data set. A common data cleansing practice is data enhancement, where data is made more complete by adding related information; for example, appending addresses with any phone numbers related to that address. Data cleansing may also involve activities such as harmonization of data and standardization of data. For example, harmonization of short codes (st, rd, etc.) to actual words (street, road, etc.). Standardization of data is a means of changing a reference data set to a new standard, e.g., the use of standard codes.

Data cleansing is a valuable process that can help companies save time and increase their efficiency. Data cleansing software tools are used by various organizations to remove duplicate data and to fix badly formatted, incorrect, and incomplete data from marketing lists, databases, and CRMs. They can achieve in a short period of time what could take days or weeks for an administrator working manually to fix, which means that companies can save not only time but money by acquiring data cleaning tools. Data cleansing is of particular value to organizations that have vast swathes of data to deal with. These organizations can include banks or government organizations, but small to medium enterprises can also find good use for the programs. In fact, it is suggested by many sources that any firm that works with and holds data should invest in cleansing tools. The tools should also be used on a regular basis, as inaccurate data levels can grow quickly, compromising databases and decreasing business efficiency.
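The harmonization and strict-validation steps mentioned above can be sketched as follows. The short-code table and the five-digit postal-code rule are assumptions made for the example, not a standard drawn from the notes.

```python
# Illustrative cleansing: harmonize short codes and strictly validate a postal code.
import re

SHORT_CODES = {"st": "street", "rd": "road", "ave": "avenue"}
POSTAL_CODE = re.compile(r"^\d{5}$")   # assumed simple 5-digit format

def harmonize_address(address):
    words = [SHORT_CODES.get(w.lower().rstrip("."), w) for w in address.split()]
    return " ".join(words)

def clean_record(record):
    record = dict(record)
    record["address"] = harmonize_address(record["address"])
    # strict validation: flag (or reject) records whose postal code does not match
    record["valid"] = bool(POSTAL_CODE.match(record.get("postal_code", "")))
    return record

print(clean_record({"address": "42 Baker St.", "postal_code": "12345"}))
# {'address': '42 Baker street', 'postal_code': '12345', 'valid': True}
```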

Data Consistency: To see why point-in-time consistency matters, consider a sequential backup of Wikipedia's database that is 50% complete at the time an article edit is made. The new article is added to the article space (at the 75% mark) and a corresponding index entry is added (at the 20% mark). Because the backup is already halfway done and the index has already been copied, the backup will be written with the article data present but with the index reference missing. As a result of the inconsistency, this file is considered corrupted.

In real life, a real database such as Wikipedia's may be edited thousands of times per hour, and references are virtually always spread throughout the file and can number in the millions, billions, or more. A sequential "copy" backup would literally contain so many small corruptions that it would be completely unusable without a lengthy repair process, which could provide no guarantee as to the completeness of what has been recovered. A backup process which properly accounts for data consistency ensures that the backup is a snapshot of how the entire database looked at a single moment. In the given Wikipedia example, it would ensure that the backup was written without the added article at the 75% mark, so that the article data would be consistent with the index data previously written.

Disk caching systems:

Point-in-time consistency is also relevant to computer disk subsystems. Specifically, operating systems and file systems are designed with the expectation that the computer system they are running on could lose power, crash, fail, or otherwise cease operating at any time. When properly designed, they ensure that data will not be unrecoverably corrupted if the power is lost. Operating systems and file systems do this by ensuring that data is written to a hard disk in a certain order, and they rely on that order to detect and recover from unexpected shutdowns.

On the other hand, rigorously writing data to disk in the order that maximizes data integrity also impacts performance. A process of write caching is used to consolidate and re-sequence write operations so that they can be done faster by minimizing the time spent moving disk heads. Data consistency concerns arise when write caching changes the sequence in which writes are carried out, because there then exists the possibility of an unexpected shutdown that violates the operating system's expectation that all writes will be committed sequentially. For example, in order to save a typical document or picture file, an operating system might write the following records to a disk in the following order:

  1. Journal entry saying file XYZ is about to be saved into sector 123.
  2. The actual contents of the file XYZ are written into sector 123.
  3. Sector 123 is now flagged as occupied in the record of free/used space.
  4. Journal entry noting that the file is completely saved, its name is XYZ, and it is located in sector 123.

The operating system relies on the assumption that if it sees item #1 present (saying the file is about to be saved) but item #4 missing (confirming success), the save operation was unsuccessful, and so it should undo any incomplete steps already taken to save it (e.g., marking sector 123 free since it was never properly filled, and removing any record of XYZ from the file directory). It relies on these items being committed to disk in sequential order.

Suppose a caching algorithm determines it would be fastest to write these items to disk in the order 4-3-1-2 and starts doing so, but the power gets shut down after 4 is written, before 3, 1, and 2, so those writes never occur. When the computer is turned back on, the file system would then show that it contains a file named XYZ located in sector 123, but this sector really does not contain the file. (Instead, the sector will contain garbage, zeroes, or a random portion of some old file, and that is what will show if the file is opened.) Further, the file system's free space map will not contain any entry showing that sector 123 is occupied, so later it will likely assign that sector to the next file to be saved, believing it is available. The file system will then have two files both unexpectedly claiming the same sector (known as a cross-linked file). As a result, a write to one of the files will overwrite part of the other file, invisibly damaging it.

A disk caching subsystem that ensures point-in-time consistency guarantees that in the event of an unexpected shutdown, the four elements would be written in one of only five possible ways: completely (1-2-3-4), partially (1, 1-2, or 1-2-3), or not at all.

High-end hardware disk controllers of the type found in servers include a small battery back-up unit on their cache memory so that they may offer the performance gains of write caching while mitigating the risk of unintended shutdowns. The battery back-up unit keeps the memory powered even during a shutdown, so that when the computer is powered back up, it can quickly complete any writes it has previously committed. With such a controller, the operating system may request four writes (1-2-3-4) in that order, but the controller may decide the quickest way to write them is 4-3-1-2. The controller essentially lies to the operating system and reports that the writes have been completed in order (a lie that improves performance at the expense of data