Data Mining Time And Data Transformation, Study notes of Data Mining

Notes include data mining time and data transformation.

Typology: Study notes
Academic year: 2020/2021
Available from: 04/27/2022
Uploaded by: anupama-gireesh
Mining Data Streams

Characteristics of Data Streams

- Data streams
  - Data streams: continuous, ordered, changing, fast, huge in amount
  - Traditional DBMS: data stored in finite, persistent data sets
- Characteristics
  - Huge volumes of continuous data, possibly infinite
  - Fast changing and requires fast, real-time response
  - Data streams capture nicely our data processing needs of today
  - Random access is expensive: single-scan algorithms (can only have one look)
  - Store only a summary of the data seen thus far
  - Most stream data are at a pretty low level or multi-dimensional in nature, and need multi-level and multi-dimensional processing
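The "single scan, store only a summary" idea can be made concrete with a one-pass summary structure. The following is a minimal sketch (not from the notes) using Welford's online algorithm, which maintains count, mean, and variance of a stream without retaining any of its elements:

```python
# One-pass summary of a stream: keeps count, mean, and variance
# (Welford's algorithm) in O(1) space, never storing the elements.

class RunningSummary:
    def __init__(self):
        self.n = 0        # elements seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / self.n if self.n else 0.0

s = RunningSummary()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    s.add(x)
print(s.n, s.mean, s.variance())  # 8 5.0 4.0
```

Each arriving element updates the summary in constant time, so the algorithm satisfies the single-scan constraint above.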
Stream Data Applications

- Telecommunication calling records
- Business: credit card transaction flows
- Network monitoring and traffic engineering
- Financial market: stock exchange
- Engineering & industrial processes: power supply & manufacturing
- Sensor, monitoring & surveillance: video streams, RFIDs
- Security monitoring
- Web logs and Web page click streams
- Massive data sets (even if saved, random access is too expensive)


Challenges of Stream Data Processing

- Multiple, continuous, rapid, time-varying, ordered streams
- Main-memory computations
- Queries are often continuous
  - Evaluated continuously as stream data arrives
  - Answer updated over time
- Queries are often complex
  - Beyond element-at-a-time processing
  - Beyond stream-at-a-time processing
  - Beyond relational queries (scientific, data mining, OLAP)
- Multi-level/multi-dimensional processing and data mining
  - Most stream data are at a low level or multi-dimensional in nature
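A continuous query, as opposed to a one-shot query, re-emits its answer as each tuple arrives. The sketch below (illustrative names, not from any DSMS) evaluates the query "how many elements exceed a threshold?" incrementally rather than recomputing it from scratch:

```python
# A continuous query evaluated incrementally: the answer is updated
# per arrival instead of being recomputed over the full history.

def continuous_count_over(stream, threshold):
    """Yield the running answer to 'how many elements exceed threshold?'"""
    count = 0
    for x in stream:
        if x > threshold:
            count += 1
        yield count  # answer re-emitted on every arrival

answers = list(continuous_count_over([1, 5, 3, 8, 2, 9], threshold=4))
print(answers)  # [0, 1, 1, 2, 2, 3]
```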

Processing Stream Queries

- Query types
  - One-time query vs. continuous query (evaluated continuously as the stream continues to arrive)
  - Predefined query vs. ad-hoc query (issued online)
- Unbounded memory requirements
  - For real-time response, main-memory algorithms should be used
  - Memory requirements are unbounded if one must join against future tuples
- Approximate query answering
  - With bounded memory, it is not always possible to produce exact answers
  - High-quality approximate answers are desired
  - Data reduction and synopsis construction methods: sketches, random sampling, histograms, wavelets, etc.
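Random sampling, one of the data-reduction methods listed above, can run in bounded memory over an unbounded stream. This is a minimal sketch of reservoir sampling, which keeps a uniform random sample of fixed size k no matter how long the stream runs:

```python
import random

# Reservoir sampling: a bounded-memory uniform sample over an
# unbounded stream. After i+1 elements, each one is in the
# reservoir with probability k/(i+1).

def reservoir_sample(stream, k, rng=None):
    rng = rng or random.Random(0)  # seeded here for reproducibility
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # replace a slot with prob k/(i+1)
            if j < k:
                reservoir[j] = x
    return reservoir

sample = reservoir_sample(range(10_000), k=100)
print(len(sample))  # 100
```

Approximate answers (averages, quantiles, etc.) can then be computed over the sample alone, though, as noted below, sampling is a poor fit for joins.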

Stream Data Processing Methods (1)

- Histograms
  - Approximate the frequency distribution of element values in a stream
  - Partition data into a set of contiguous buckets
  - Equal-width (equal value range per bucket) vs. V-optimal (minimizing frequency variance within each bucket)
- Multi-resolution models
  - Popular models: balanced binary trees, micro-clusters, and wavelets
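An equal-width histogram, the simpler of the two bucketing schemes above, can be maintained in one pass: only per-bucket counts are stored, approximating the frequency distribution. A minimal sketch:

```python
# Equal-width histogram as a stream synopsis: the value range [lo, hi)
# is split into n_buckets buckets of equal width; only counts are kept.

def equal_width_histogram(stream, lo, hi, n_buckets):
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for x in stream:
        i = min(int((x - lo) / width), n_buckets - 1)  # clamp x == hi into last bucket
        counts[i] += 1
    return counts

print(equal_width_histogram([0.5, 1.5, 1.7, 3.2, 9.9], lo=0, hi=10, n_buckets=5))
# [3, 1, 0, 0, 1]
```

A V-optimal histogram would instead choose bucket boundaries to minimize the frequency variance within each bucket, which requires more work than this single counting pass.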

Stream Data Processing Methods (2)

- Sketches
  - Histograms and wavelets require multiple passes over the data, but sketches can operate in a single pass
  - Frequency moments of a stream A = {a1, …, aN}: F_k = Σ_{i=1}^{v} (m_i)^k, where v is the universe (domain) size and m_i is the frequency of value i in the sequence
  - Given N elements and v values, sketches can approximate F0, F1, F2 in O(log v + log N) space
- Randomized algorithms
  - Monte Carlo algorithm: bounded running time, but may not return a correct result
  - Chebyshev's inequality: let X be a random variable with mean μ and standard deviation σ; then P(|X − μ| > k) ≤ σ²/k²
  - Chernoff bound: let X be the sum of independent Poisson trials X1, …, Xn, with μ = E[X] and δ in (0, 1]; then P(X < (1 − δ)μ) < e^{−μδ²/2}
  - The probability decreases exponentially as we move away from the mean
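The classic single-pass sketch for the second frequency moment F2 is the AMS estimator. The sketch below is a simplified illustration, not the space-efficient version: each copy keeps Z = Σ_i m_i·s_i for random signs s_i ∈ {−1, +1}, and E[Z²] = F2; averaging Z² over independent copies reduces the variance. (A real implementation derives the signs from a 4-wise independent hash function rather than a stored dictionary, which is what yields the O(log v + log N) space bound.)

```python
import random

# Simplified AMS estimator for F2 = sum_i m_i^2. Each copy maintains
# Z = sum over the stream of a fixed random sign per distinct value;
# E[Z^2] = F2, so we average Z^2 across independent copies.

def ams_f2(stream, copies=200, seed=0):
    rng = random.Random(seed)
    signs = [dict() for _ in range(copies)]   # per-copy sign per value
    z = [0] * copies
    for x in stream:
        for c in range(copies):
            s = signs[c].setdefault(x, rng.choice((-1, 1)))
            z[c] += s
    return sum(v * v for v in z) / copies     # average of Z^2 over copies

stream = [1] * 4 + [2] * 3 + [3] * 2 + [4]   # frequencies m = (4, 3, 2, 1)
print(ams_f2(stream))  # close to the true F2 = 16 + 9 + 4 + 1 = 30
```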

Approximate Query Answering in Streams

- Sliding windows
  - Compute only over sliding windows of recent stream data
  - An approximation, but often more desirable in applications
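A count-based sliding window keeps only the last w elements, so a windowed aggregate needs O(w) memory regardless of stream length. A minimal sketch for "average of the last w elements":

```python
from collections import deque

# Sliding-window average: the query is answered over the most recent
# w elements only; older elements are expired from the window.

def sliding_window_avg(stream, w):
    window, total, out = deque(), 0.0, []
    for x in stream:
        window.append(x)
        total += x
        if len(window) > w:
            total -= window.popleft()  # expire the oldest element
        out.append(total / len(window))
    return out

print(sliding_window_avg([1, 2, 3, 4, 5], w=3))  # [1.0, 1.5, 2.0, 3.0, 4.0]
```

Time-based windows work the same way, except elements are expired by timestamp rather than by count.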

- Batched processing, sampling and synopses
  - Batched processing if updates are fast but computing is slow
    - Compute periodically; not very timely
  - Sampling if updates are slow but computing is fast
    - Compute using sample data, but not good for joins, etc.
  - Synopsis data structures
    - Maintain a small synopsis or sketch of the data
    - Good for querying historical data
- Blocking operators, e.g., sorting, avg, min, etc.
  - Blocking if unable to produce the first output until seeing the entire input
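The blocking distinction can be made concrete (function names here are mine, for illustration): sorting can emit nothing before the end of input, while a running minimum can emit an updated provisional answer on every arrival, even though its exact final answer is likewise only known once the input ends:

```python
# Contrast between a non-blocking and a blocking operator on a stream.

def running_min(stream):
    m = None
    for x in stream:
        m = x if m is None else min(m, x)
        yield m             # non-blocking: provisional answer per arrival

def blocking_sort(stream):
    return sorted(stream)   # blocking: must consume the entire input first

data = [3, 1, 4, 1, 5]
print(list(running_min(data)))  # [3, 1, 1, 1, 1]
print(blocking_sort(data))      # [1, 1, 3, 4, 5]
```

On an unbounded stream, `blocking_sort` would never return, which is why blocking operators are problematic for stream query processing.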

Projects on DSMS (Data Stream Management System)

- Research projects and system prototypes
  - STREAM (Stanford): a general-purpose DSMS
  - Cougar (Cornell): sensors
  - Aurora (Brown/MIT): sensor monitoring, dataflow
  - Hancock (AT&T): telecom streams
  - Niagara (OGI/Wisconsin): Internet XML databases
  - OpenCQ (Georgia Tech): triggers, incremental view maintenance
  - Tapestry (Xerox): pub/sub content-based filtering
  - Telegraph (Berkeley): adaptive engine for sensors
  - Tradebot (www.tradebot.com): stock tickers & streams
  - Tribeca (Bellcore): network monitoring
  - MAIDS (UIUC/NCSA): Mining Alarming Incidents in Data Streams

- Analysis of Web click streams
  - Raw data at low levels: seconds, web page addresses, user IP addresses, …
  - Analysts want: changes, trends, unusual patterns, at reasonable levels of detail
  - E.g., "Average clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours."
- Analysis of power consumption streams
  - Raw data: power consumption flow for every household, every minute
  - Patterns one may find: average hourly power consumption for manufacturing companies in Chicago surged 30% higher in the last 2 hours today than on the same day a week ago

A Stream Cube Architecture

- A tilted time frame
  - Different time granularities: second, minute, quarter, hour, day, week, …
- Critical layers
  - Minimum interest layer (m-layer)
  - Observation layer (o-layer)
  - User: watches at the o-layer and occasionally needs to drill down to the m-layer
- Partial materialization of stream cubes
  - Full materialization: too space- and time-consuming
  - No materialization: slow response at query time
  - Partial materialization: what do we mean by "partial"?
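A tilted time frame keeps recent data at fine granularity and older data at coarser granularity. This is a hedged sketch of a logarithmic tilted time window (one of several variants; the class and its capacity of 2 snapshots per level are illustrative choices, not from the notes): when a level overflows, its two oldest snapshots are merged by summation and pushed to the next, coarser level.

```python
# Logarithmic tilted time window: level 0 is finest-grained; each
# level holds at most 2 snapshots, and overflowing snapshots are
# merged (summed) into the next, coarser level.

class TiltedTimeWindow:
    def __init__(self):
        self.levels = []  # levels[0] = finest granularity, newest first

    def insert(self, count, level=0):
        while len(self.levels) <= level:
            self.levels.append([])
        self.levels[level].insert(0, count)  # newest snapshot at the front
        if len(self.levels[level]) > 2:      # overflow: merge the two oldest
            b = self.levels[level].pop()
            a = self.levels[level].pop()
            self.insert(a + b, level + 1)    # cascades recursively if needed

    def total(self):
        return sum(sum(lvl) for lvl in self.levels)

w = TiltedTimeWindow()
for t in range(8):
    w.insert(1)                  # one unit count per time tick
print(w.total(), len(w.levels))  # 8 3
```

After 8 ticks the window holds [[1, 1], [2], [4]]: the total is preserved, but only O(log n) snapshots are stored, with detail decaying as data ages.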

On-Line Partial Materialization vs. OLAP Processing

- On-line materialization
  - Materialization takes precious space and time
    - Only incremental materialization (with a tilted time frame)
    - Only materialize "cuboids" of the critical layers?
  - Online computation may take too much time
  - Preferred solution:
    - Popular-path approach: materialize the cuboids along the popular drilling paths
    - H-tree structure: such cuboids can be computed and stored efficiently using the H-tree structure
- Online aggregation vs. query-based computation
  - Online computing while streaming: aggregating stream cubes
  - Query-based computation: using computed cuboids

Frequent Patterns for Stream Data

- Frequent pattern mining is valuable in stream applications
  - e.g., network intrusion mining (Dokas, et al. '02)
- Mining precise frequent patterns in stream data is unrealistic
  - Even storing them in a compressed form, such as an FP-tree, is infeasible
- How to mine frequent patterns with good approximation?
  - Approximate frequent patterns (Manku & Motwani, VLDB'02)
  - Keep only the currently frequent patterns? Then no changes can be detected
- Mining the evolution of frequent patterns (C. Giannella, J. Han, X. Yan, P. S. Yu, 2003)
  - Use a tilted time window frame
  - Mine the evolution and dramatic changes of frequent patterns
- Space-saving computation of frequent and top-k elements (Metwally, Agrawal, and El Abbadi, ICDT'05)
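The Manku & Motwani approach for single items is Lossy Counting. This is a hedged sketch of that algorithm (for frequent items, not full itemsets): counts are pruned at bucket boundaries, and any surviving item's count undercounts its true frequency by at most ε·N.

```python
# Lossy Counting (Manku & Motwani, VLDB'02) for approximate frequent
# items: the stream is split into buckets of width 1/eps; at each
# bucket boundary, items whose count plus error bound is small are
# pruned. Guarantee: true_count - eps*N <= kept_count <= true_count.

def lossy_counting(stream, eps):
    width = int(1 / eps)                # bucket width
    counts, deltas = {}, {}             # running count and max undercount
    n = 0
    for x in stream:
        n += 1
        bucket = (n - 1) // width + 1   # current bucket id (1-based)
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
            deltas[x] = bucket - 1      # x may have been pruned before
        if n % width == 0:              # bucket boundary: prune
            for item in list(counts):
                if counts[item] + deltas[item] <= bucket:
                    del counts[item], deltas[item]
    return counts, n

stream = ['a'] * 50 + ['b'] * 30 + list('cdefghij') * 2 + ['a'] * 4
counts, n = lossy_counting(stream, eps=0.1)
print(counts.get('a'), counts.get('b'))  # 54 30
```

The rare items c–j are pruned away, while every item with true frequency at least s·N (for a support threshold s > ε) is guaranteed to survive.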

Mining Approximate Frequent Patterns

- Mining precise frequent patterns in stream data is unrealistic

- Maintain at most m level-i medians
- On seeing m of them, generate O(k) level-(i+1) medians of weight equal to the sum of the weights of the intermediate medians assigned to them
- Drawbacks:
  - Low quality for evolving data streams (registers only k centers)
  - Limited functionality in discovering and exploring clusters over different portions of the stream over time

Clustering for Mining Stream Dynamics

- Network intrusion detection: one example
  - Detect bursts of activity or abrupt changes in real time, by online clustering
- Our methodology (C. Aggarwal, J. Han, J. Wang, P. S. Yu, VLDB'03)
  - Tilted time framework: otherwise dynamic changes cannot be found
  - Micro-clustering: better quality than k-means/k-medians
    - Incremental, online processing and maintenance
  - Two stages: micro-clustering and macro-clustering
  - With limited "overhead", achieves high efficiency, scalability, quality of results, and power of evolution/change detection

CluStream: A Framework for Clustering Evolving Data Streams

- Design goal
  - High quality for clustering evolving data streams with greater functionality

- While keeping the stream mining requirements in mind:
  - One pass over the original stream data
  - Limited space usage and high efficiency
- CluStream: a framework for clustering evolving data streams
  - Divides the clustering process into online and offline components
    - Online component: periodically stores summary statistics about the stream data
    - Offline component: answers various user questions based on the stored summary statistics
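The summary statistics kept by the online component are micro-clusters. This is a simplified sketch of such a summary (CluStream's full cluster feature also stores temporal sums of timestamps, omitted here): per dimension it keeps the count N, linear sum LS, and squared sum SS. Because these are additive, points can be absorbed and micro-clusters merged incrementally in a single pass, and centroid and spread can be recovered at any time.

```python
# Micro-cluster summary (N, LS, SS) per dimension. Additivity makes
# one-pass absorption and merging possible; the offline component can
# later cluster these summaries instead of the raw stream.

class MicroCluster:
    def __init__(self, dim):
        self.n = 0             # number of points absorbed
        self.ls = [0.0] * dim  # linear sum per dimension
        self.ss = [0.0] * dim  # squared sum per dimension

    def absorb(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.ls]

    def merge(self, other):
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]

mc = MicroCluster(dim=2)
for p in [(1.0, 2.0), (3.0, 2.0), (2.0, 5.0)]:
    mc.absorb(p)
print(mc.n, mc.centroid())  # 3 [2.0, 3.0]
```

The offline macro-clustering stage then runs a conventional algorithm (e.g., k-means) over the stored micro-clusters, weighted by their counts, to answer user queries over chosen time horizons.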