Inference in Graphical Models
and Intro to Structure Learning
Lecture 33 of 41
Lecture Outline
- More Bayesian Belief Networks (BBNs)
- Inference: applying CPTs
- Learning: CPTs from data, elicitation
- In-class exercises
- Hugin, BKD demos
- CPT elicitation, application
- Learning BBN Structure
- K2 algorithm
- Other probabilistic scores and search algorithms
- Causal Discovery: Learning Causality from Observations
- Incomplete Data: Learning and Inference (Expectation-Maximization)
- Next Week: BBNs Concluded; Review for Midterm (11 October 2001)
- After Midterm: EM Algorithm, Unsupervised Learning, Clustering
Bayesian Networks:
Quick Review
[Figure: "Sprinkler" BBN with edges X1 (Season) → X2 (Sprinkler), X1 → X3 (Rain), {X2, X3} → X4 (Ground-Moisture), X4 → X5 (Ground-Slipperiness)]
- X1 Season: Spring, Summer, Fall, Winter
- X2 Sprinkler: On, Off
- X3 Rain: None, Drizzle, Steady, Downpour
- X4 Ground-Moisture: Wet, Dry
- X5 Ground-Slipperiness: Slippery, Not-Slippery
P ( Summer , Off , Drizzle , Wet , Not-Slippery ) = P ( S ) · P ( O | S ) · P ( D | S ) · P ( W | O , D ) · P ( N | W )
- Recall: Conditional Independence (CI) Assumptions
- Bayesian Network: Digraph Model
- Vertices (nodes): denote events (each a random variable)
- Edges (arcs, links): denote conditional dependencies
- Chain Rule for (Exact) Inference in BBNs
- Arbitrary Bayesian networks: NP-complete
- Polytrees: linear time
- Example (“Sprinkler” BBN)
- MAP, ML Estimation over BBNs
P(X1, X2, …, Xn) = ∏i=1..n P(Xi | parents(Xi))
hML ≡ argmaxh∈H P(D | h)
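To make the factorization concrete, here is a minimal Python sketch (not from the lecture) that evaluates the joint probability of the "Sprinkler" assignment above by multiplying CPT entries along the chain rule; all numeric probabilities are made-up placeholders, not elicited values.

```python
# Minimal sketch: evaluate P(Season, Sprinkler, Rain, Moisture, Slipperiness)
# via the factorization P(S) * P(O|S) * P(D|S) * P(W|O,D) * P(N|W).
# Every probability below is an illustrative placeholder.

P_season = {"Summer": 0.25, "Spring": 0.25, "Fall": 0.25, "Winter": 0.25}
P_sprinkler_given_season = {("Off", "Summer"): 0.4}            # P(Sprinkler | Season)
P_rain_given_season = {("Drizzle", "Summer"): 0.1}             # P(Rain | Season)
P_moisture_given_sprinkler_rain = {("Wet", "Off", "Drizzle"): 0.8}
P_slipperiness_given_moisture = {("Not-Slippery", "Wet"): 0.3}

def joint(season, sprinkler, rain, moisture, slipperiness):
    """Chain rule for the sprinkler network under its CI assumptions."""
    return (P_season[season]
            * P_sprinkler_given_season[(sprinkler, season)]
            * P_rain_given_season[(rain, season)]
            * P_moisture_given_sprinkler_rain[(moisture, sprinkler, rain)]
            * P_slipperiness_given_moisture[(slipperiness, moisture)])

print(joint("Summer", "Off", "Drizzle", "Wet", "Not-Slippery"))  # 0.25*0.4*0.1*0.8*0.3
```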
Learning Distributions in BBNs:
Quick Review
- Learning Distributions
- Shortcomings of Naïve Bayes
- Making judicious CI assumptions
- Scaling up to BBNs: need to learn a CPT for all parent sets
- Goal: generalization
- Given D (e.g., {1011, 1001, 0100})
- Would like to know P ( schema ): e.g., P (11**) ≡ P ( x 1 = 1, x 2 = 1)
- Variants
- Known or unknown structure
- Training examples may have missing values
- Gradient Learning Algorithm
- Weight update rule
- Learns CPTs given data points D
wijk ← wijk + r · Σx∈D Ph(yij, uik | x) / wijk
(wijk ≡ P(Yi = yij | Ui = uik), the CPT entry; r is the learning rate)
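A rough Python sketch of this update, assuming a user-supplied `posterior(i, j, k, x)` routine that performs inference in the current network h to return P(yij, uik | x); the learning rate `r`, the dictionary layout, and the renormalization pass are our illustrative choices, not code from the lecture.

```python
def gradient_step(w, data, posterior, r=0.05):
    """One gradient-ascent step on CPT entries w[i][(j, k)] = P(Y_i=y_ij | U_i=u_ik).

    posterior(i, j, k, x) must return P(y_ij, u_ik | x) under the current network,
    obtained by ordinary BBN inference; here it is an assumed callback.
    """
    for i in w:                                   # each variable Y_i
        for (j, k), w_ijk in list(w[i].items()):
            grad = sum(posterior(i, j, k, x) / w_ijk for x in data)
            w[i][(j, k)] = w_ijk + r * grad
    # Renormalize so that, for each parent configuration u_ik, sum_j w_ijk = 1.
    for i in w:
        for k in {k for (_, k) in w[i]}:
            z = sum(v for (j2, k2), v in w[i].items() if k2 == k)
            for (j2, k2) in list(w[i]):
                if k2 == k:
                    w[i][(j2, k2)] /= z
    return w
```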
Learning Structure:
Constraints Versus Scores
- Constraint-Based
- Perform tests of conditional independence
- Search for network consistent with observed dependencies (or lack thereof)
- Intuitive; closely follows definition of BBNs
- Separates construction from form of CI tests
- Sensitive to errors in individual tests
- Score-Based
- Define scoring function ( aka score) that evaluates how well (in)dependencies in a structure match observations
- Search for structure that maximizes score
- Statistically and information theoretically motivated
- Can make compromises
- Common Properties
- Soundness: with sufficient data and computation, both learn correct structure
- Both learn structure from observations and can incorporate knowledge
Learning Structure:
Maximum Weight Spanning Tree (Chow-Liu)
- Algorithm Learn-Tree-Structure-I ( D )
- Estimate P ( x ) and P ( x , y ) for all single RVs, pairs; I( X ; Y ) = D( P ( X , Y ) || P ( X ) · P ( Y ))
- Build complete undirected graph: variables as vertices, I( X ; Y ) as edge weights
- T ← Build-MWST ( V × V , Weights ) // Chow-Liu algorithm: weight function I
- Set directional flow on T and place the CPTs on its edges (gradient learning)
- RETURN: tree-structured BBN with CPT values
- Algorithm Build-MWST-Kruskal ( E ⊆ V × V , Weights : E → R+)
  - H ← Build-Heap ( E , Weights ) // aka priority queue, Θ(| E |)
  - E' ← Ø; Forest ← {{ v } | v ∈ V } // E' : set; Forest : union-find, Θ(| V |)
  - WHILE Forest.Size > 1 DO // Θ(| E |) iterations
    - e ← H. Delete-Max () // e ← new edge from H , Θ(lg | E |)
    - IF (( TS ← Forest.Find ( e. Start )) ≠ ( TE ← Forest.Find ( e. End ))) THEN // Θ(lg* | E |)
      - E'.Union ( e ) // append edge e ; E'. Size ++, Θ(1)
      - Forest.Union ( TS , TE ) // Forest.Size--, Θ(1)
  - RETURN E' // Θ(1)
- Running Time: Θ(| E | lg | E |) = Θ(| V |² lg | V |²) = Θ(| V |² lg | V |) = Θ( n ² lg n )
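The following Python sketch puts the two steps together for discrete, fully observed data: pairwise mutual information estimated from counts, then a Kruskal-style maximum-weight spanning tree with union-find. Function names and the toy sample set are ours; directing the tree and fitting CPTs are omitted.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(data, i, j):
    """Estimate I(X_i; X_j) from complete samples (tuples of discrete values)."""
    n = len(data)
    pi, pj, pij = Counter(), Counter(), Counter()
    for x in data:
        pi[x[i]] += 1; pj[x[j]] += 1; pij[(x[i], x[j])] += 1
    return sum((c / n) * math.log((c / n) / ((pi[a] / n) * (pj[b] / n)))
               for (a, b), c in pij.items())

def chow_liu_tree(data, n_vars):
    """Return the edge set of a maximum-weight spanning tree under I(X_i; X_j)."""
    edges = sorted(((mutual_information(data, i, j), i, j)
                    for i, j in combinations(range(n_vars), 2)), reverse=True)
    parent = list(range(n_vars))                 # union-find forest
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]        # path halving
            v = parent[v]
        return v
    tree = []
    for w, i, j in edges:                        # Kruskal: heaviest edges first
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Toy example: three binary variables, X2 roughly copies X1.
samples = [(0, 0, 1), (1, 1, 0), (0, 0, 0), (1, 1, 1), (0, 0, 1), (1, 1, 0)]
print(chow_liu_tree(samples, 3))
```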
Scores for Learning Structure:
The Role of Inference
- General-Case BBN Structure Learning: Use Inference to Compute Scores
- Recall: Bayesian Inference aka Bayesian Reasoning
- Assumption: h ∈ H are mutually exclusive and exhaustive
- Optimal strategy: combine predictions of hypotheses in proportion to likelihood
- Compute conditional probability of hypothesis h given observed data D
- i.e., compute expectation over unknown h for unseen cases
- Let h ≡ structure, parameters Θ ≡ CPTs
P(xm+1 | D) = P(xm+1 | x1, x2, …, xm) = Σh∈H P(xm+1 | D, h) · P(h | D)
P(h | D) ∝ P(D | h) · P(h)    [posterior score ∝ marginal likelihood × prior over structures]
P(D | h) = ∫ P(D | h, Θ) · P(Θ | h) dΘ    [likelihood × prior over parameters]
Scores for Learning Structure:
Prior over Parameters
- Likelihood L ( Θ : D )
- Definition: L ( Θ : D ) ≡ P ( D | Θ ) = ∏x∈D P ( x | Θ )
- General BBN (i.i.d. data x ): L ( Θ : D ) = ∏x∈D ∏i P ( xi | Parents ( xi ) : Θi ) = ∏i L ( Θi : D )
- NB: Θi specifies the CPT entries for xi given Parents ( xi )
- Likelihood decomposes according to the structure of the BBN
- Estimating Prior over Parameters: P ( Θ | D ) ∝ P ( Θ ) · P ( D | Θ ) = P ( Θ ) · L ( Θ : D )
- Example: Sprinkler
- Scenarios D = {( Season ( i ), Sprinkler ( i ), Rain ( i ), Moisture ( i ), Slipperiness ( i ))}
- P ( Su , Off , Dr , Wet , NS ) = P ( S ) · P ( O | S ) · P ( D | S ) · P ( W | O , D ) · P ( N | W )
- MLE for multinomial distribution (e.g., {Spring, Summer, Fall, Winter}):
- Likelihood for multinomials
- Binomial case: N 1 = # heads, N 2 = # tails (“frequency is ML estimator”)
Θ̂k = Nk / Σl Nl
L(Θ : D) = ∏k=1..K Θk^Nk
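A tiny Python sketch of the multinomial MLE and its log-likelihood, using an invented set of season observations; it just restates the "frequency is the ML estimator" point above.

```python
import math
from collections import Counter

observations = ["Summer", "Summer", "Fall", "Winter", "Spring", "Summer", "Fall"]
counts = Counter(observations)                            # N_k for each value k
n = sum(counts.values())

theta_hat = {k: nk / n for k, nk in counts.items()}       # MLE: relative frequencies
log_likelihood = sum(nk * math.log(theta_hat[k]) for k, nk in counts.items())

print(theta_hat)        # e.g. Summer -> 3/7
print(log_likelihood)   # lg L(theta : D) = sum_k N_k * lg theta_k
```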
Learning Structure:
K2 Algorithm and ALARM
- Algorithm Learn-BBN-Structure-K2 ( D, Max-Parents )
FOR i ← 1 to n DO // arbitrary ordering of variables { x 1 , x 2 , …, xn }
  WHILE ( Parents [ xi ]. Size < Max-Parents ) DO // find best candidate parent
    Best ← argmaxj>i ( P ( D | xj ∪ Parents [ xi ])) // max Dirichlet score
    IF (( Parents [ xi ] + Best ). Score > Parents [ xi ]. Score ) THEN
      Parents [ xi ] += Best
    ELSE BREAK // no remaining candidate improves the score
RETURN ({ Parents [ xi ] | i ∈ {1, 2, …, n }})
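Below is a greedy Python sketch in the spirit of K2, assuming a caller-supplied `score(xi, parent_set, data)` routine (e.g., the Bayesian Dirichlet score of Cooper and Herskovits). One difference to note: this sketch draws candidate parents from variables earlier in the ordering, which is the usual K2 convention; the pseudocode above indexes candidates the other way, which is equivalent under a reversed ordering.

```python
def k2_structure(variables, data, score, max_parents=2):
    """Greedy K2-style search: for each variable (in a fixed ordering), add the
    single candidate parent that most improves score(x_i, parents, data), and
    stop when no addition helps or max_parents is reached.

    `score` is an assumed callback returning the (log) Bayesian score of x_i
    given a candidate parent set; candidates are earlier variables in the order.
    """
    parents = {x: [] for x in variables}
    for idx, xi in enumerate(variables):
        best = score(xi, parents[xi], data)
        while len(parents[xi]) < max_parents:
            candidates = [xj for xj in variables[:idx] if xj not in parents[xi]]
            if not candidates:
                break
            scored = [(score(xi, parents[xi] + [xj], data), xj) for xj in candidates]
            new_score, best_parent = max(scored)
            if new_score > best:                 # keep only improving additions
                parents[xi].append(best_parent)
                best = new_score
            else:
                break
    return parents
```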
- A Logical Alarm Reduction Mechanism [Beinlich et al , 1989]
- BBN model for patient monitoring in surgical anesthesia
- Vertices (37): findings (e.g., esophageal intubation ), intermediates, observables
- K2 : found BBN different in only 1 edge from gold standard (elicited from expert)
[Figure: ALARM network structure (37 vertices); node index labels omitted]
Learning Structure:
(Score-Based) Hypothesis Space Search
- Learning Structure: Beyond Trees
- Problem not as easy for more complex networks
- Example
- Allow two parents (even singly-connected case, aka polytree)
- Greedy algorithms no longer guaranteed to find optimal network
- In fact, no efficient algorithm exists
- Theorem: finding network structure with maximal score, where H restricted to BBNs with at most k parents for each variable, is NP-hard for k > 1
- Heuristic Search of Search Space H
- Define H : elements denote possible structures, adjacency relation denotes transformation (e.g., arc addition, deletion, reversal)
- Traverse this space looking for high-scoring structures
- Algorithms
- Greedy hill-climbing
- Best-first search
- Simulated annealing
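As a sketch of the simplest of these, greedy hill-climbing over structures can be written as below, assuming caller-supplied `score(structure, data)` and `neighbors(structure)` routines (the latter generating acyclic arc additions, deletions, and reversals); restarts or simulated annealing would be layered on top to escape local maxima.

```python
def hill_climb_structure(initial, data, score, neighbors, max_steps=1000):
    """Greedy search over BBN structures: repeatedly move to the best-scoring
    neighbor (one arc added, deleted, or reversed) until no neighbor improves.

    `score` and `neighbors` are assumed callbacks; this illustrates only the
    traversal of the hypothesis space H, not any particular scoring function.
    """
    current, current_score = initial, score(initial, data)
    for _ in range(max_steps):
        scored = [(score(g, data), g) for g in neighbors(current)]
        if not scored:
            break
        best_score, best = max(scored, key=lambda t: t[0])
        if best_score <= current_score:          # local maximum reached
            break
        current, current_score = best, best_score
    return current, current_score
```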
In-Class Exercise:
Hugin Demo
- Hugin
- Commercial product for BBN inference: http://www.hugin.com
- First developed at University of Aalborg, Denmark
- Applications
- Popular research tool for inference and learning
- Used for real-world decision support applications
- Safety and risk evaluation: http://www.hugin.com/serene/
- Diagnosis and control in unmanned subs: http://advocate.e-motive.com
- Customer support automation: http://www.cs.auc.dk/research/DSS/SACSO/
- Capabilities
- Lauritzen-Spiegelhalter algorithm for inference (clustering aka clique reduction)
- Object Oriented Bayesian Networks (OOBNs): structured learning and inference
- Influence diagrams for decision-theoretic inference (utility + probability)
- See: http://www.hugin.com/doc.html
In-Class Exercise:
Hugin and CPT Elicitation
- Hugin Tutorials
- Introduction: causal reasoning for diagnosis in decision support (toy problem)
- http://www.hugin.com/hugintro/bbn_pane.html
- Example domain: explaining low yield (drought versus disease)
- Tutorial 1: constructing a simple BBN in Hugin
- http://www.hugin.com/hugintro/bbn_tu_pane.html
- Eliciting CPTs (or collecting from data) and entering them
- Tutorial 2: constructing a simple influence diagram (decision network) in Hugin
- http://www.hugin.com/hugintro/id_tu_pane.html
- Eliciting utilities (or collecting from data) and entering them
- Other Important BBN Resources
- Microsoft Bayesian Networks: http://www.research.microsoft.com/dtas/msbn/
- XML BN (Interchange Format): http://www.research.microsoft.com/dtas/bnformat/
- BBN Repository (more data sets) http://www-nt.cs.berkeley.edu/home/nir/public_html/Repository/index.htm
Bayesian Network Learning:
Related Fields and References
- ANNs: BBNs as Connectionist Models
- GAs: BBN Inference, Learning as Genetic Optimization, Programming
- Hybrid Systems (Symbolic / Numerical AI)
- Conferences
- General (with respect to machine learning)
- International Conference on Machine Learning (ICML)
- American Association for Artificial Intelligence (AAAI)
- International Joint Conference on Artificial Intelligence (IJCAI, biennial)
- Specialty
- International Joint Conference on Neural Networks (IJCNN)
- Genetic and Evolutionary Computation Conference (GECCO)
- Neural Information Processing Systems (NIPS)
- Uncertainty in Artificial Intelligence (UAI)
- Computational Learning Theory (COLT)
- Journals
- General: Artificial Intelligence , Machine Learning , Journal of AI Research
- Specialty: Neural Networks , Evolutionary Computation , etc.
Learning Bayesian Networks:
Missing Observations
- Problem Definition
- Given: data ( n -tuples) with missing values, aka partially observable (PO) data
- Kinds of missing values
- Undefined, unknown (possibly new values)
- Missing, corrupted (not properly collected)
- Second case (“truly missing”): want to fill in “?” with expected value
- Solution Approaches
- Expected = distribution over possible values
- Use “best guess” BBN to estimate distribution
- Expectation-Maximization (EM) algorithm can be used here
- Intuitive Idea
- Want to find hML in PO case ( D ≡ unobserved variables ∪ observed variables )
- Estimation step: calculate E [ unobserved variables | h ], assuming current h
- Maximization step: update wijk to maximize E [lg P ( D | h )], D all variables
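A schematic EM loop for this setting, assuming caller-supplied `expected_counts(params, data)` (the estimation step, done by inference over the unobserved variables) and `maximize(counts)` (the maximization step, refitting the CPT entries wijk and returning the expected log-likelihood); only the control flow is shown.

```python
def em_for_bbn(initial_params, data, expected_counts, maximize,
               max_iters=100, tol=1e-6):
    """Expectation-Maximization for a BBN with missing values (control flow only).

    E step: expected_counts(params, data) fills in each missing value with a
            distribution over its possible values, via inference in the current
            network, and returns expected sufficient statistics.
    M step: maximize(counts) returns new CPT parameters and the expected
            log-likelihood E[lg P(D | h)] under those counts.
    Both routines are assumed callbacks, not defined here.
    """
    params, prev_ll = initial_params, float("-inf")
    for _ in range(max_iters):
        counts = expected_counts(params, data)         # E step
        params, ll = maximize(counts)                   # M step
        if ll - prev_ll < tol:                          # converged (LL non-decreasing)
            break
        prev_ll = ll
    return params
```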