Data Mining Time And Data Transformation, Study notes of Data Mining

University of Kerala Data Mining

notes include data mining time and data transformation

Typology: Study notes

2020/2021

Available from 04/27/2022

anupama-gireesh 🇮🇳

5 documents

Mining Data Streams

Characteristics of Data Streams

Data Streams

Data streams—continuous, ordered, changing, fast, huge amount

Traditional DBMS—data stored in finite, persistent data sets

Characteristics

Huge volumes of continuous data, possibly infinite

Fast changing and requires fast, real-time response

Data streams capture many of today's data processing needs

Random access is expensive—single scan algorithm (can only have one look)

Store only the summary of the data seen thus far

Most stream data are low-level or multi-dimensional in nature and need multi-level and multi-dimensional processing
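The single-scan, summary-only idea above can be illustrated with a small sketch. It assumes a numeric stream and keeps only a count, a running mean and a running variance (Welford's update); the class name RunningSummary is hypothetical and not from the notes.

```python
class RunningSummary:
    """Constant-size summary (count, mean, variance) of a numeric stream."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        # Welford's update: each element is looked at once, then discarded.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of the elements seen so far.
        return self._m2 / self.count if self.count else 0.0


summary = RunningSummary()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    summary.update(reading)
```

Memory use stays constant however long the stream runs, which is the point of storing a summary rather than the data.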

Stream Data Applications

Telecommunication calling records

Business: credit card transaction flows

Network monitoring and traffic engineering

Financial market: stock exchange

Engineering & industrial processes: power supply & manufacturing

Sensor, monitoring & surveillance: video streams, RFIDs

Security monitoring

Web logs and Web page click streams

Massive data sets (even when stored, random access is too expensive)

Challenges of Stream Data Processing

Multiple, continuous, rapid, time-varying, ordered streams

Main memory computations

Queries are often continuous: evaluated continuously as stream data arrives, with the answer updated over time

Queries are often complex: beyond element-at-a-time processing, beyond stream-at-a-time processing, and beyond relational queries (scientific, data mining, OLAP)

Multi-level/multi-dimensional processing and data mining: most stream data are low-level or multi-dimensional in nature
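A continuous query, as described above, can be sketched as a generator that refreshes its answer after every arriving element. The function name and the threshold query are illustrative assumptions, not part of the notes.

```python
def continuous_count_above(stream, threshold):
    """Yield the running count of elements greater than threshold."""
    count = 0
    for value in stream:
        if value > threshold:
            count += 1
        yield count  # the answer is updated as each element arrives


answers = list(continuous_count_above([3, 12, 7, 15, 20], threshold=10))
```

A one-time query would instead return a single answer after the input ends, which is not possible on an unbounded stream.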

Processing Stream Queries

Query types

One-time query vs. continuous query (evaluated continuously as the stream continues to arrive)

Predefined query vs. ad-hoc query (issued on-line)

Unbounded memory requirements

For real-time response, main memory algorithms should be used

Memory requirement is unbounded if one will join future tuples

Approximate query answering

With bounded memory, it is not always possible to produce exact answers

High-quality approximate answers are desired

Data reduction and synopsis construction methods: sketches, random sampling, histograms, wavelets, etc.
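Random sampling is one of the data reduction methods listed above. A common way to do it in one pass with bounded memory is reservoir sampling; the sketch below is a minimal version of that technique (the notes do not say which sampling scheme is meant, so this choice is an assumption).

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randrange(n)  # uniform integer in [0, n)
            if j < k:  # happens with probability k / n
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(10_000), k=5, rng=random.Random(0))
```

Only k items are held at any time, so the memory requirement is bounded regardless of stream length.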

Histograms

Approximate the frequency distribution of element values in a stream

Partition data into a set of contiguous buckets

Equal-width (equal value range for buckets) vs. V-optimal (minimizing frequency variance within each bucket)

Multi-resolution models

Popular models: balanced binary trees, micro-clusters, and wavelets
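The equal-width bucketing mentioned above can be sketched as follows. The sketch assumes the value range [low, high] is known in advance and clamps out-of-range values into the edge buckets; the class name is hypothetical. V-optimal bucketing is not shown.

```python
class EqualWidthHistogram:
    """Equal-width histogram updated one stream element at a time."""

    def __init__(self, low, high, buckets):
        self.low = low
        self.width = (high - low) / buckets  # same value range for every bucket
        self.counts = [0] * buckets

    def add(self, x):
        idx = int((x - self.low) / self.width)
        # Assumption: values outside [low, high) go to the nearest edge bucket.
        idx = min(max(idx, 0), len(self.counts) - 1)
        self.counts[idx] += 1


hist = EqualWidthHistogram(low=0, high=100, buckets=4)
for value in [5, 20, 30, 55, 60, 99, 100]:
    hist.add(value)
```

The bucket counts approximate the frequency distribution using space proportional to the number of buckets, not the number of elements.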

Stream Data Processing Methods (2)

Sketches

Histograms and wavelets require multiple passes over the data, but sketches can operate in a single pass

Frequency moments of a stream A = {a_1, …, a_N}: $F_k = \sum_{i=1}^{v} m_i^k$, where v is the universe or domain size and m_i is the frequency of i in the sequence

Given N elements and v values, sketches can approximate F_0, F_1, F_2 in O(log v + log N) space

Randomized algorithms

Monte Carlo algorithm: bound on running time but may not return the correct result

Chebyshev’s inequality: let X be a random variable with mean μ and standard deviation σ; then for any k > 0, $P(|X - \mu| > k) \le \sigma^2 / k^2$

Chernoff bound: let X be the sum of independent Poisson trials X_1, …, X_n with mean μ, and δ in (0, 1]; then $P[X < (1 - \delta)\mu] < e^{-\mu\delta^2/4}$

The probability decreases exponentially as we move away from the mean
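To make the frequency moments concrete, the sketch below computes F_k exactly and then estimates F_2 in one pass with a basic AMS-style sketch. It rests on several assumptions not stated in the notes: stream items are non-negative integers, signs come from a random degree-3 polynomial hash (a standard 4-wise independent construction), and the estimate is a plain average of the squared counters rather than a median of means. All function names are hypothetical.

```python
import random
from collections import Counter

P = (1 << 61) - 1  # a Mersenne prime, used as the hash field size


def frequency_moment(stream, k):
    """Exact F_k = sum over distinct values i of m_i**k (needs O(v) space)."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())


def make_sign_hash(rng):
    """Random degree-3 polynomial over Z_P; its low bit gives a +1/-1 sign."""
    a, b, c, d = (rng.randrange(P) for _ in range(4))

    def sign(x):
        h = (((a * x + b) * x + c) * x + d) % P
        return 1 if h & 1 else -1

    return sign


def ams_f2_estimate(stream, num_estimators=400, seed=0):
    """Estimate F_2 in one pass, keeping one counter per estimator."""
    rng = random.Random(seed)
    hashes = [make_sign_hash(rng) for _ in range(num_estimators)]
    totals = [0] * num_estimators
    for item in stream:
        for j, sign in enumerate(hashes):
            totals[j] += sign(item)
    # E[Z^2] = F_2 for each counter Z; averaging reduces the variance.
    return sum(z * z for z in totals) / num_estimators


stream = [1, 2, 1, 3, 1, 2] * 50
exact_f2 = frequency_moment(stream, 2)
approx_f2 = ams_f2_estimate(stream)
```

Here F_0 is the number of distinct values, F_1 the stream length, and F_2 the sum of squared frequencies. The estimate is randomized, so it should be close to the exact value but will generally not equal it.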

Approximate Query Answering in Streams

Sliding windows

Only over sliding windows of recent stream data

Approximation but often more desirable in applications

Batched processing, sampling and synopses

Batched if update is fast but computing is slow: compute periodically, not very timely

Sampling if update is slow but computing is fast: compute using sample data, but not good for joins, etc.

Synopsis data structures: maintain a small synopsis or sketch of data; good for querying historical data

Blocking operators, e.g., sorting, avg, min, etc.: an operator is blocking if it is unable to produce the first output until seeing the entire input
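The sliding-window approach above can be sketched with a fixed-size window over the most recent elements. This is a minimal count-based window (a time-based window would evict by timestamp instead); the class name is hypothetical.

```python
from collections import deque


class SlidingWindowAverage:
    """Average over the most recent `size` stream elements."""

    def __init__(self, size):
        self.window = deque(maxlen=size)
        self.total = 0.0

    def add(self, x):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]  # value about to be evicted
        self.window.append(x)  # deque drops the oldest value when full
        self.total += x
        return self.total / len(self.window)


avg = SlidingWindowAverage(size=3)
results = [avg.add(x) for x in [1, 2, 3, 4, 5]]
```

Because the window is bounded, an operator such as avg that would block on an unbounded stream can produce an answer after every element.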

Projects on DSMS (Data Stream Management System)

Research projects and system prototypes

STREAM (Stanford): A general-purpose DSMS

Cougar (Cornell): sensors

Aurora (Brown/MIT): sensor monitoring, dataflow

Hancock (AT&T): telecom streams

Niagara (OGI/Wisconsin): Internet XML databases

OpenCQ (Georgia Tech): triggers, incr. view maintenance

Tapestry (Xerox): pub/sub content-based filtering

Telegraph (Berkeley): adaptive engine for sensors

Tradebot (www.tradebot.com): stock tickers & streams

Tribeca (Bellcore): network monitoring

MAIDS (UIUC/NCSA): Mining Alarming Incidents in Data Streams

Analysis of Web click streams

Raw data at low levels: seconds, web page addresses, user IP addresses, …

Analysts want: changes, trends, unusual patterns, at reasonable levels of detail

E.g., “Average clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours.”

Analysis of power consumption streams

Raw data: power consumption flow for every household, every minute

Patterns one may find: average hourly power consumption of manufacturing companies in Chicago over the last 2 hours today is 30% higher than on the same day a week ago

A Stream Cube Architecture

A tilted time frame

Different time granularities: second, minute, quarter, hour, day, week, …

Critical layers

Minimum interest layer (m-layer)

Observation layer (o-layer)

User: watches at the o-layer and occasionally needs to drill down to the m-layer

Partial materialization of stream cubes

Full materialization: too space- and time-consuming

No materialization: slow response at query time

Partial materialization: what do we mean by “partial”?
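A tilted time frame keeps recent data at fine granularity and older data at coarser granularity. The sketch below is a simplified illustration under stated assumptions: it uses only three levels (quarter, hour, day), stores one additive aggregate such as a count per time unit, and clears the finer units once they are rolled up. The class name and the level capacities are hypothetical choices, not taken from the notes.

```python
class TiltedTimeFrame:
    """Recent data at fine granularity, older data rolled up to coarser units."""

    # (level name, number of units that roll up into one unit of the next level)
    LEVELS = [("quarter", 4), ("hour", 24), ("day", 31)]

    def __init__(self):
        self.slots = {name: [] for name, _ in self.LEVELS}

    def add_quarter(self, value):
        """Register the aggregate (e.g. a count) of one finished quarter-hour."""
        self._add(0, value)

    def _add(self, level, value):
        name, capacity = self.LEVELS[level]
        units = self.slots[name]
        units.append(value)
        if level == len(self.LEVELS) - 1:
            if len(units) > capacity:
                units.pop(0)  # the oldest unit at the coarsest level falls off
        elif len(units) == capacity:
            rolled = sum(units)  # e.g. 4 quarters become 1 hour
            units.clear()
            self._add(level + 1, rolled)


frame = TiltedTimeFrame()
for _ in range(4 * 24 * 2 + 5):  # two full days plus five quarters
    frame.add_quarter(1)
```

After 197 quarter-hour updates the frame holds only four numbers, which is how the tilted time frame keeps incremental materialization affordable.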

On-Line Partial Materialization vs. OLAP Processing

On-line materialization

Materialization takes precious space and time

Only incremental materialization (with tilted time frame)

Only materialize “cuboids” of the critical layers?

Online computation may take too much time

Preferred solution

Popular-path approach: materialize the cuboids along the popular drilling paths

H-tree structure: such cuboids can be computed and stored efficiently using the H-tree structure

Online aggregation vs. query-based computation

Online computing while streaming: aggregating stream cubes

Query-based computation: using computed cuboids

Frequent Patterns for Stream Data

Frequent pattern mining is valuable in stream applications, e.g., network intrusion mining (Dokas, et al’02)

Mining precise frequent patterns in stream data: unrealistic, even to store them in a compressed form such as an FP-tree

How to mine frequent patterns with good approximation?

Approximate frequent patterns (Manku & Motwani VLDB’02)

Keep only current frequent patterns? No changes can be detected

Mining evolution of frequent patterns (C. Giannella, J. Han, X. Yan, P.S. Yu, 2003)

Use tilted time window frame

Mining evolution and dramatic changes of frequent patterns

Space-saving computation of frequent and top-k elements (Metwally, Agrawal, and El Abbadi, ICDT'05)
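As an illustration of approximate frequent-element counting, here is a minimal sketch in the spirit of the Space-Saving idea cited above. It counts single items rather than itemsets and finds the minimum counter with a linear scan, which is a simplification assumed here for brevity; the function name is hypothetical.

```python
def space_saving(stream, m):
    """Approximate counts of frequent items using at most m counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < m:
            counters[item] = 1
        else:
            # Replace the item with the smallest count; the newcomer
            # inherits that count plus one, so counts never underestimate.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters


stream = ["a", "b", "a", "c", "a", "d", "a", "e", "a", "f"] * 10
top = space_saving(stream, m=3)
```

Any item occurring more than N/m times in a stream of N items is guaranteed to be among the m counters, and reported counts may overestimate but never underestimate the true frequency.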

Mining Approximate Frequent Patterns

Mining precise frequent patterns in stream data: unrealistic

Maintain at most m level-i medians

On seeing m of them, generate O(k) level-(i+1) medians of weight equal to the sum of the weights of the intermediate medians assigned to them

Drawbacks

Low quality for evolving data streams (register only k centers)

Limited functionality in discovering and exploring clusters over different portions of the stream over time

Clustering for Mining Stream Dynamics

Network intrusion detection: one example

Detect bursts of activities or abrupt changes in real time by on-line clustering

Our methodology (C. Aggarwal, J. Han, J. Wang, P.S. Yu, VLDB’03)

Tilted time framework: otherwise, dynamic changes cannot be found

Micro-clustering: better quality than k-means/k-median (incremental, online processing and maintenance)

Two stages: micro-clustering and macro-clustering

With limited “overhead” to achieve high efficiency, scalability, quality of results and power of evolution/change detection

CluStream: A Framework for Clustering Evolving Data Streams

Design goal

High quality for clustering evolving data streams with greater functionality, while keeping the stream mining requirements in mind

One pass over the original stream data

Limited space usage and high efficiency

CluStream: A framework for clustering evolving data streams

Divide the clustering process into online and offline components

Online component: periodically stores summary statistics about the stream data

Offline component: answers various user questions based on the stored summary statistics
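The summary statistics stored by the online component can be pictured as additive micro-cluster vectors. The sketch below assumes each micro-cluster keeps the number of points plus linear and squared sums of the data values and of the timestamps; the class and attribute names are hypothetical, and assigning points to clusters and taking snapshots are not shown.

```python
import math


class MicroCluster:
    """Additive summary of the points absorbed by one micro-cluster."""

    def __init__(self, dims):
        self.n = 0
        self.ls = [0.0] * dims  # linear sum of values, per dimension
        self.ss = [0.0] * dims  # squared sum of values, per dimension
        self.lt = 0.0  # linear sum of timestamps
        self.st = 0.0  # squared sum of timestamps

    def add(self, point, timestamp):
        self.n += 1
        for d, x in enumerate(point):
            self.ls[d] += x
            self.ss[d] += x * x
        self.lt += timestamp
        self.st += timestamp * timestamp

    def merge(self, other):
        # Additivity: merging two clusters just adds their statistics.
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]
        self.lt += other.lt
        self.st += other.st

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # Root-mean-square deviation of the points from the centroid.
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))


a = MicroCluster(dims=2)
a.add((0.0, 0.0), timestamp=1)
a.add((2.0, 0.0), timestamp=2)
centroid_before = a.centroid()
radius_before = a.radius()

b = MicroCluster(dims=2)
b.add((4.0, 0.0), timestamp=3)
a.merge(b)
```

Because the statistics are additive, the offline component can combine or subtract stored summaries to answer questions about different portions of the stream without revisiting the raw data.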
