





data warehouse and architecture
Typology: Lecture notes
Data Mining Lecture 2

What is Data Warehouse?

Data Warehouse: Subject-Oriented

Data Warehouse: Integrated
• E.g., hotel price: currency, tax, breakfast covered, etc.
  - When data is moved to the warehouse, it is converted.
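The conversion step mentioned above can be sketched as follows. This is only a minimal illustration: the source records, the exchange rates, the tax rates, and the target convention (price in EUR, tax included) are all hypothetical and do not come from the lecture.

```python
# Hypothetical exchange rates to the warehouse currency (EUR); illustrative values only.
RATES_TO_EUR = {"EUR": 1.0, "USD": 0.5}

def to_warehouse_price(record):
    """Convert one source record to an assumed warehouse convention:
    price in EUR, tax included, breakfast flag stored as a boolean."""
    price = record["price"] * RATES_TO_EUR[record["currency"]]
    if not record["tax_included"]:
        price *= 1 + record["tax_rate"]
    return {
        "hotel": record["hotel"],
        "price_eur_incl_tax": round(price, 2),
        "breakfast_covered": bool(record["breakfast"]),
    }

# Two made-up operational sources that encode the same attribute differently.
sources = [
    {"hotel": "A", "price": 100.0, "currency": "EUR",
     "tax_included": True, "tax_rate": 0.0, "breakfast": 1},
    {"hotel": "B", "price": 200.0, "currency": "USD",
     "tax_included": False, "tax_rate": 0.25, "breakfast": 0},
]
warehouse_rows = [to_warehouse_price(r) for r in sources]
```

After conversion, both rows use one currency and one tax convention, so they can be compared directly.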

Data Warehouse: Time Variant
... element"

Data Warehouse: Non-Volatile
• initial loading of data and access of data

• When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
• Complex information filtering, compete for resources

• OLTP (on-line transaction processing)
  - Major task of traditional relational DB
  - Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
  - Major task of data warehouse system
  - Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
  - User and system orientation: customer vs. market
  - Data contents: current, detailed vs. historical, consolidated
  - Database design: ER + application vs. star + subject
  - View: current, local vs. evolutionary, integrated
  - Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP

                    OLTP                                   OLAP
users               clerk, IT professional                 knowledge worker
function            day-to-day operations                  decision support
DB design           application-oriented                   subject-oriented
data                current, up-to-date, detailed,         historical, summarized, multidimensional,
                    flat relational, isolated              integrated, consolidated
usage               repetitive                             ad hoc
access              read/write, index/hash on prim. key    lots of scans
unit of work        short, simple transaction              complex query
# records accessed  tens                                   millions
# users             thousands                              hundreds
DB size             100MB-GB                               100GB-TB
metric              transaction throughput                 query throughput, response

Why Separate Data Warehouse?

A Data Mining Query Language, DMQL: Language Primitives
  - First time as "cube definition"
  - define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>

Defining a Star Schema in DMQL

Defining a Snowflake Schema in DMQL

Defining a Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
    dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
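The fact constellation above can also be pictured as relational tables: two fact tables (sales and shipping) that share the time, item, and location dimension tables. The sketch below uses Python's built-in sqlite3 module. The table names (`dim_*`, `fact_*`), the column types, and the sample rows are assumptions made for illustration; DMQL itself does not prescribe this mapping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, day INTEGER, day_of_week TEXT,
                       month INTEGER, quarter INTEGER, year INTEGER);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE dim_branch (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           province_or_state TEXT, country TEXT);
CREATE TABLE dim_shipper (shipper_key INTEGER PRIMARY KEY, shipper_name TEXT,
                          location_key INTEGER REFERENCES dim_location(location_key),
                          shipper_type TEXT);
-- Two fact tables share the time, item and location dimensions.
CREATE TABLE fact_sales (time_key INTEGER, item_key INTEGER, branch_key INTEGER,
                         location_key INTEGER, sales_in_dollars REAL);
CREATE TABLE fact_shipping (time_key INTEGER, item_key INTEGER, shipper_key INTEGER,
                            from_location INTEGER, to_location INTEGER,
                            cost_in_dollars REAL);
-- Made-up sample rows.
INSERT INTO dim_item VALUES (1, 'TV', 'BrandX', 'electronics', 'wholesale');
INSERT INTO fact_sales VALUES (1, 1, 1, 1, 100.0), (1, 1, 1, 1, 50.0);
""")

# The three measures of the sales cube, grouped by item name.
measures = conn.execute("""
    SELECT i.item_name,
           SUM(f.sales_in_dollars) AS dollars_sold,
           AVG(f.sales_in_dollars) AS avg_sales,
           COUNT(*)                AS units_sold
    FROM fact_sales AS f JOIN dim_item AS i ON i.item_key = f.item_key
    GROUP BY i.item_name
""").fetchall()
```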

• E.g., count(), sum(), min(), max()
• E.g., avg(), min_N(), standard_deviation()
• E.g., median(), mode(), rank()
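The three groups of example functions above match the usual classification of measures as distributive, algebraic, and holistic. That reading is an assumption here, since the category labels did not survive on the slide. The sketch below, with made-up partitions, shows the practical difference: sums can be combined from partial results, an average can be derived from a fixed number of distributive results, and a median cannot be derived from per-partition medians.

```python
from statistics import median

# Hypothetical data split into partitions, e.g. one per source or per cuboid cell.
partitions = [[1, 2, 3], [4, 5], [100]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of the partial sums equals the sum over all values.
total = sum(sum(part) for part in partitions)

# Algebraic: avg is computed from two distributive results, sum and count.
avg = sum(sum(part) for part in partitions) / sum(len(part) for part in partitions)

# Holistic: combining per-partition medians does not give the true median.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)
```

Here `median_of_medians` and `true_median` differ, which is why holistic measures are harder to maintain in a precomputed cube.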

Multidimensional Data
• Sales volume as a function of product, month, and region
[Figure: cube with axes Product, Month, Region]
Dimensions: Product, Location, Time
Hierarchical summarization paths:
  Product:  Industry > Category > Product
  Location: Region > Country > City > Office
  Time:     Year > Quarter > Month (or Week) > Day

A Sample Data Cube
[Figure: sample data cube. Date axis: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum. Country axis: U.S.A, Canada, Mexico, sum. Callout: total annual sales.]

Cuboids Corresponding to the Cube
  all                                              0-D (apex) cuboid
  product    date    country                       1-D cuboids
  product,date    product,country    date,country  2-D cuboids
  product,date,country                             3-D (base) cuboid
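The lattice of cuboids for the three dimensions product, date, and country can be enumerated mechanically: a k-D cuboid is any choice of k of the dimensions. A minimal sketch using only the standard library (the variable names are illustrative):

```python
from itertools import combinations

dimensions = ("product", "date", "country")

# lattice[k] lists every k-D cuboid, from the 0-D apex to the 3-D base cuboid.
lattice = {k: list(combinations(dimensions, k)) for k in range(len(dimensions) + 1)}

for k, cuboids in lattice.items():
    print(f"{k}-D:", cuboids)
```

With no concept hierarchies, n dimensions give 2^n cuboids, so this three-dimensional cube has 8.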

• Roll up (drill-up): summarize data
  - by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
  - from higher level summary to lower level summary or detailed data, or introducing new dimensions
• Slice and dice:
  - project and select
• Pivot (rotate):
  - reorient the cube, visualization, 3D to series of 2D planes
• Other operations
  - drill across: involving (across) more than one fact table
  - drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
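The roll-up, slice/dice, and pivot operations listed above can be sketched on a tiny in-memory base cuboid. The fact rows and the helper names (`roll_up`, `dice`, `pivot`) are hypothetical. A real OLAP server would do this over precomputed cuboids rather than by scanning rows.

```python
from collections import defaultdict

# Hypothetical base cuboid over (product, quarter, country) with a sales measure.
facts = [
    {"product": "TV", "quarter": "1Qtr", "country": "U.S.A", "sales": 10},
    {"product": "TV", "quarter": "2Qtr", "country": "U.S.A", "sales": 20},
    {"product": "TV", "quarter": "1Qtr", "country": "Canada", "sales": 5},
    {"product": "PC", "quarter": "1Qtr", "country": "U.S.A", "sales": 7},
]

def roll_up(rows, dims):
    """Roll up by dimension reduction: keep only dims and sum sales over the rest."""
    out = defaultdict(int)
    for r in rows:
        out[tuple(r[d] for d in dims)] += r["sales"]
    return dict(out)

def dice(rows, **allowed):
    """Select rows whose values lie in the allowed sets (a slice fixes one dimension)."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]

def pivot(rows, row_dim, col_dim):
    """Lay the selected cells out as a 2-D table: row_dim down, col_dim across."""
    table = defaultdict(lambda: defaultdict(int))
    for r in rows:
        table[r[row_dim]][r[col_dim]] += r["sales"]
    return {k: dict(v) for k, v in table.items()}

by_country = roll_up(facts, ["country"])             # roll up to country only
by_country_quarter = roll_up(facts, ["country", "quarter"])  # drill down again
tv_slice = dice(facts, product={"TV"})               # slice on product = TV
tv_table = pivot(tv_slice, "quarter", "country")     # 2-D view of the slice
```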

Efficient Data Cube Computation
• Based on size, sharing, access frequency, etc.
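$T = \prod_{i=1}^{n} (L_i + 1)$, the number of cuboids in an $n$-dimensional cube whose $i$-th dimension has $L_i$ hierarchy levels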
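For a cube with n dimensions, where dimension i has L_i hierarchy levels, the number of cuboids is the product of (L_i + 1) over all dimensions; the extra 1 stands for the virtual top level "all". A small sketch of that arithmetic follows (the function name and the example level counts are illustrative assumptions):

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """Number of cuboids when dimension i has levels_per_dimension[i] hierarchy levels."""
    return prod(levels + 1 for levels in levels_per_dimension)

# Three dimensions with a single level each: the plain 2^3 lattice.
flat_cube = total_cuboids([1, 1, 1])

# Hypothetical hierarchies: product (3 levels), location (4 levels), time (4 levels).
hierarchical_cube = total_cuboids([3, 4, 4])
```

The count grows quickly with the number of dimensions and levels, which is why the choice of cuboids to materialize matters.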

Cube Operation
• Cube definition and computation in DMQL
    define cube sales [item, city, year]: sum(sales_in_dollars)
    compute cube sales
• Transform it into a SQL-like language (with a new operator, cube by, introduced by Gray et al. '96)
    SELECT item, city, year, SUM(amount)
    FROM SALES
    CUBE BY item, city, year
• Need to compute the following group-bys:
    (date, product, customer),
    (date, product), (date, customer), (product, customer),
    (date), (product), (customer),
    ()
  ()
  (city)    (item)    (year)
  (city, item)    (city, year)    (item, year)
  (city, item, year)
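What CUBE BY item, city, year asks for can be sketched directly: one group-by per subset of the three dimensions, from the empty grouping () up to (item, city, year). The sample SALES rows and the function name `cube_by` are hypothetical, and this naive version rescans the rows for every group-by, which is what the more efficient methods try to avoid.

```python
from collections import defaultdict
from itertools import combinations

# Made-up SALES rows.
sales = [
    {"item": "TV", "city": "Vancouver", "year": 1997, "amount": 100},
    {"item": "TV", "city": "Toronto", "year": 1997, "amount": 50},
    {"item": "PC", "city": "Vancouver", "year": 1998, "amount": 70},
]

def cube_by(rows, dims, measure):
    """Return {group-by dimensions: {group key: SUM(measure)}} for every subset of dims."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            agg = defaultdict(int)
            for r in rows:
                agg[tuple(r[d] for d in group)] += r[measure]
            cube[group] = dict(agg)
    return cube

cube = cube_by(sales, ("item", "city", "year"), "amount")
```

`cube[()]` holds the grand total and `cube[("item", "city", "year")]` is the base cuboid.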

Cube Computation: ROLAP-Based Method
... grouping step"
  - Aggregates may be computed from previously computed aggregates, rather than from the base fact table
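The last point, computing aggregates from previously computed aggregates, can be sketched as follows. The (item, city) cuboid is built once from a made-up base fact table, and the coarser cuboids are then derived from it without touching the base rows again. This works here because SUM is distributive; the helper name `reaggregate` is an assumption for illustration.

```python
from collections import defaultdict

# Hypothetical base fact table: (item, city, amount).
base = [
    ("TV", "Vancouver", 100),
    ("TV", "Toronto", 50),
    ("PC", "Vancouver", 70),
    ("TV", "Vancouver", 30),
]

# Computed once from the base fact table.
item_city = defaultdict(int)
for item, city, amount in base:
    item_city[(item, city)] += amount
item_city = dict(item_city)

def reaggregate(cuboid, positions):
    """Derive a coarser cuboid by keeping only the given key positions and summing."""
    out = defaultdict(int)
    for key, value in cuboid.items():
        out[tuple(key[p] for p in positions)] += value
    return dict(out)

by_item = reaggregate(item_city, [0])   # from (item, city), not from base
by_city = reaggregate(item_city, [1])
apex = reaggregate(by_item, [])         # from (item), smaller still
```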

Indexing OLAP Data: Bitmap Index
• Index on a particular column
• Each value in the column has a bit vector: bit-op is fast
• The length of the bit vector: # of records in the base table
• The i-th bit is set if the i-th row of the base table has the value for the indexed column
• Not suitable for high cardinality domains
Base table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
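A bitmap index over the base table above can be sketched with plain Python integers used as bit vectors, where bit i is set when record i+1 has the indexed value. The function names are illustrative. The query "Region = Asia AND Type = Dealer" then reduces to a single bitwise AND.

```python
# Base table from the slide: (Cust, Region, Type).
rows = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]

def build_bitmap_index(table, column):
    """Map each distinct value of the column to an int whose bit i marks row i."""
    index = {}
    for rec_id, row in enumerate(table):
        index[row[column]] = index.get(row[column], 0) | (1 << rec_id)
    return index

def matching_rows(table, bitmap):
    """Return the rows whose bit is set in the bitmap."""
    return [row for rec_id, row in enumerate(table) if (bitmap >> rec_id) & 1]

region_idx = build_bitmap_index(rows, 1)
type_idx = build_bitmap_index(rows, 2)

# Combine two predicates with one bit operation.
asia_dealers = matching_rows(rows, region_idx["Asia"] & type_idx["Dealer"])
```

Each distinct value needs its own bit vector, so the index grows with the column's cardinality, which is the reason for the last bullet above.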

Indexing OLAP Data: Join Indices
• Join index: JI(R-id, S-id) where R(R-id, ...) ⋈ S(S-id, ...)
• Traditional indices map the values to a list of record ids
  - It materializes relational join in JI file and speeds up relational join, a rather costly operation
• In data warehouses, join index relates the values of the dimensions of a star schema to rows in the fact table
  - E.g., fact table: Sales and two dimensions city and product
    • A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city
  - Join indices can span multiple dimensions
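The city example above can be sketched as a dictionary from dimension values to lists of fact-table row ids. The Sales rows, the row ids, and the helper name `build_join_index` are hypothetical. The same helper also builds an index that spans two dimensions.

```python
from collections import defaultdict

# Hypothetical Sales fact table: row id -> (city, product, amount).
sales = {
    "R1": ("Vancouver", "TV", 100),
    "R2": ("Toronto", "TV", 50),
    "R3": ("Vancouver", "PC", 70),
}

def build_join_index(fact, *positions):
    """Map each combination of dimension values to the ids of the matching fact rows."""
    index = defaultdict(list)
    for rid, row in fact.items():
        index[tuple(row[p] for p in positions)].append(rid)
    return dict(index)

city_ji = build_join_index(sales, 0)             # join index on city
city_product_ji = build_join_index(sales, 0, 1)  # spans two dimensions

# Answer "total sales in Vancouver" by following the index instead of scanning.
vancouver_total = sum(sales[rid][2] for rid in city_ji[("Vancouver",)])
```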

Efficient Processing of OLAP Queries

Metadata Repository
• Metadata is the data defining warehouse objects. It has the following kinds:
  - Description of the structure of the warehouse
    • schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents
  - Operational metadata
    • data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
  - The algorithms used for summarization
  - The mapping from operational environment to the data warehouse
  - Data related to system performance
    • warehouse schema, view and derived data definitions
  - Business data
    • business terms and definitions, ownership of data, charging policies

Data Warehouse Back-End Tools and Utilities
• Data extraction:
  - get data from multiple, heterogeneous, and external sources
• Data cleaning:
  - detect errors in the data and rectify them when possible
• Data transformation:
  - convert data from legacy or host format to warehouse format
• Load:
  - sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
• Refresh:
  - propagate the updates from the data sources to the warehouse
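The five back-end steps can be sketched as a small pipeline. Everything here is a simplifying assumption: the two made-up sources, the legacy field names in `FIELD_MAP`, the cleaning rule (drop records with missing values), and the in-memory "warehouse" dictionary stand in for what real tools do at much larger scale.

```python
from collections import defaultdict

FIELD_MAP = {"town": "city", "total": "amount"}  # hypothetical legacy field names

def extract(*sources):
    """Gather records from several heterogeneous sources."""
    return [rec for src in sources for rec in src]

def clean(records):
    """Trim stray whitespace; drop records with missing values that cannot be rectified."""
    cleaned = []
    for rec in records:
        rec = {k: v.strip() if isinstance(v, str) else v for k, v in rec.items()}
        if all(v not in (None, "") for v in rec.values()):
            cleaned.append(rec)
    return cleaned

def transform(records):
    """Convert source formats to the warehouse format (field names, types, casing)."""
    out = []
    for rec in records:
        rec = {FIELD_MAP.get(k, k): v for k, v in rec.items()}
        out.append({"city": rec["city"].title(), "amount": float(rec["amount"])})
    return out

def load(warehouse, records):
    """Sort, store, and maintain a summary (total sales per city)."""
    for rec in sorted(records, key=lambda r: r["city"]):
        warehouse["rows"].append(rec)
        warehouse["sales_by_city"][rec["city"]] += rec["amount"]

warehouse = {"rows": [], "sales_by_city": defaultdict(float)}
source_a = [{"city": " vancouver ", "amount": "100"}, {"city": "Toronto", "amount": None}]
source_b = [{"town": "TORONTO", "total": 50.0}]

load(warehouse, transform(clean(extract(source_a, source_b))))

# Refresh: propagate a later update from a source into the warehouse.
load(warehouse, transform(clean([{"city": "vancouver", "amount": 25}])))
```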

Discovery-Driven Exploration of Data Cubes
• Hypothesis-driven: exploration by user, huge search space
• Discovery-driven (Sarawagi et al. '98)
  - pre-compute measures indicating exceptions, guide user in the data analysis, at all levels of aggregation
  - Exception: significantly different from the value anticipated, based on a statistical model
  - Visual cues such as background color are used to reflect the degree of exception of each cell
  - Computation of exception indicator (model fitting and computing SelfExp, InExp, and PathExp values) can be overlapped with cube construction

Examples: Discovery-Driven Data Cubes

• Multi-feature cubes (Ross et al. 1998): compute complex queries involving multiple dependent aggregates at multiple granularities
• Ex. Grouping by all subsets of {item, region, month}, find the maximum price in 1997 for each group, and the total sales among all maximum price tuples
    select item, region, month, max(price), sum(R.sales)
    from purchases
    where year = 1997
    cube by item, region, month: R
    such that R.price = max(price)
• Continuing the last example, among the max price tuples, find the min and max shelf life, and find the fraction of the total sales due to tuples that have min shelf life within the set of all max price tuples
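The first multi-feature query can be sketched as below: for every subset of {item, region, month}, each group gets its maximum price, and then the total sales restricted to the tuples that reach that maximum. The purchase rows, the function name, and the year value passed as a parameter are illustrative assumptions; the second aggregate depends on the first, which is what makes the query "multi-feature".

```python
from itertools import combinations

# Made-up purchases; the last row is filtered out by the year condition.
purchases = [
    {"item": "TV", "region": "West", "month": 1, "year": 1997, "price": 500, "sales": 10},
    {"item": "TV", "region": "West", "month": 2, "year": 1997, "price": 500, "sales": 4},
    {"item": "TV", "region": "East", "month": 1, "year": 1997, "price": 450, "sales": 7},
    {"item": "PC", "region": "West", "month": 1, "year": 1997, "price": 900, "sales": 3},
    {"item": "PC", "region": "West", "month": 1, "year": 1996, "price": 999, "sales": 1},
]

def multi_feature_cube(rows, dims, year):
    """Return {(group-by dims, group key): (max price, sales among max-price tuples)}."""
    rows = [r for r in rows if r["year"] == year]
    result = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            cells = {}
            for r in rows:
                cells.setdefault(tuple(r[d] for d in group), []).append(r)
            for key, members in cells.items():
                max_price = max(m["price"] for m in members)
                # Dependent aggregate: only tuples R with R.price = max(price).
                max_sales = sum(m["sales"] for m in members if m["price"] == max_price)
                result[(group, key)] = (max_price, max_sales)
    return result

cube = multi_feature_cube(purchases, ("item", "region", "month"), 1997)
```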