Thursday 20 August 2015

Final Notebook

readwrite module pgmpy

pgmpy is a Python library for the creation, manipulation and implementation of probabilistic graphical models (PGMs). There are various standard file formats for representing PGM data. PGM data basically consists of a graph, a table corresponding to each node and a few other attributes of the graph.
pgmpy can read networks from and write networks to these standard file formats. Currently pgmpy supports 5 file formats: ProbModelXML, PomDPX, XMLBIF, XMLBeliefNetwork and UAI. Using these modules, models can be specified in a uniform file format and readily converted to Bayesian or Markov model objects.
Now, let's read a ProbModelXML file and get the corresponding model instance.
In [1]:
from pgmpy.readwrite import ProbModelXMLReader
In [2]:
reader_string = ProbModelXMLReader('example.pgmx')
To get the corresponding model instance, we call get_model().
In [3]:
model = reader_string.get_model()
Now we can query this model according to our requirements. It is an instance of BayesianModel or MarkovModel, depending on the type of model stored in the file.
To list all the nodes in the model, we can use
In [4]:
print(model.nodes())
['Smoker', 'X-ray', 'VisitToAsia', 'Tuberculosis', 'TuberculosisOrCancer', 'LungCancer', 'Dyspnea', 'Bronchitis']
To get all the edges, use the model.edges() method.
In [5]:
model.edges()
Out[5]:
[('Smoker', 'LungCancer'),
 ('Smoker', 'Bronchitis'),
 ('VisitToAsia', 'Tuberculosis'),
 ('Tuberculosis', 'TuberculosisOrCancer'),
 ('TuberculosisOrCancer', 'Dyspnea'),
 ('TuberculosisOrCancer', 'X-ray'),
 ('LungCancer', 'TuberculosisOrCancer'),
 ('Bronchitis', 'Dyspnea')]
To get all the CPDs of the model we can use model.get_cpds(). To get the corresponding values we can iterate over the CPDs and call get_cpd() on each one.
In [6]:
cpds = model.get_cpds()
for cpd in cpds:
    print(cpd.get_cpd())
[[ 0.95  0.05]
 [ 0.02  0.98]]
[[ 0.7  0.3]
 [ 0.4  0.6]]
[[ 0.9  0.1  0.3  0.7]
 [ 0.2  0.8  0.1  0.9]]
[[ 0.99]
 [ 0.01]]
[[ 0.5]
 [ 0.5]]
[[ 0.99  0.01]
 [ 0.9   0.1 ]]
[[ 0.99  0.01]
 [ 0.95  0.05]]
[[ 1.  0.  0.  1.]
 [ 0.  1.  0.  1.]]
pgmpy not only lets us read a model from a given file format but also write a model to it. Let's write a sample model to a ProbModelXML file.
First, define the data for the model.
In [7]:
import numpy as np

edges_list = [('VisitToAsia', 'Tuberculosis'),
              ('LungCancer', 'TuberculosisOrCancer'),
              ('Smoker', 'LungCancer'),
              ('Smoker', 'Bronchitis'),
              ('Tuberculosis', 'TuberculosisOrCancer'),
              ('Bronchitis', 'Dyspnea'),
              ('TuberculosisOrCancer', 'Dyspnea'),
              ('TuberculosisOrCancer', 'X-ray')]
nodes = {'Smoker': {'States': {'no': {}, 'yes': {}},
                    'role': 'chance',
                    'type': 'finiteStates',
                    'Coordinates': {'y': '52', 'x': '568'},
                    'AdditionalProperties': {'Title': 'S', 'Relevance': '7.0'}},
         'Bronchitis': {'States': {'no': {}, 'yes': {}},
                        'role': 'chance',
                        'type': 'finiteStates',
                        'Coordinates': {'y': '181', 'x': '698'},
                        'AdditionalProperties': {'Title': 'B', 'Relevance': '7.0'}},
         'VisitToAsia': {'States': {'no': {}, 'yes': {}},
                         'role': 'chance',
                         'type': 'finiteStates',
                         'Coordinates': {'y': '58', 'x': '290'},
                         'AdditionalProperties': {'Title': 'A', 'Relevance': '7.0'}},
         'Tuberculosis': {'States': {'no': {}, 'yes': {}},
                          'role': 'chance',
                          'type': 'finiteStates',
                          'Coordinates': {'y': '150', 'x': '201'},
                          'AdditionalProperties': {'Title': 'T', 'Relevance': '7.0'}},
         'X-ray': {'States': {'no': {}, 'yes': {}},
                   'role': 'chance',
                   'AdditionalProperties': {'Title': 'X', 'Relevance': '7.0'},
                   'Coordinates': {'y': '322', 'x': '252'},
                   'Comment': 'Indica si el test de rayos X ha sido positivo',
                   'type': 'finiteStates'},
         'Dyspnea': {'States': {'no': {}, 'yes': {}},
                     'role': 'chance',
                     'type': 'finiteStates',
                     'Coordinates': {'y': '321', 'x': '533'},
                     'AdditionalProperties': {'Title': 'D', 'Relevance': '7.0'}},
         'TuberculosisOrCancer': {'States': {'no': {}, 'yes': {}},
                                  'role': 'chance',
                                  'type': 'finiteStates',
                                  'Coordinates': {'y': '238', 'x': '336'},
                                  'AdditionalProperties': {'Title': 'E', 'Relevance': '7.0'}},
         'LungCancer': {'States': {'no': {}, 'yes': {}},
                        'role': 'chance',
                        'type': 'finiteStates',
                        'Coordinates': {'y': '152', 'x': '421'},
                        'AdditionalProperties': {'Title': 'L', 'Relevance': '7.0'}}}
edges = {'LungCancer': {'TuberculosisOrCancer': {'directed': 'true'}},
         'Smoker': {'LungCancer': {'directed': 'true'},
                    'Bronchitis': {'directed': 'true'}},
         'Dyspnea': {},
         'X-ray': {},
         'VisitToAsia': {'Tuberculosis': {'directed': 'true'}},
         'TuberculosisOrCancer': {'X-ray': {'directed': 'true'},
                                  'Dyspnea': {'directed': 'true'}},
         'Bronchitis': {'Dyspnea': {'directed': 'true'}},
         'Tuberculosis': {'TuberculosisOrCancer': {'directed': 'true'}}}

cpds = [{'Values': np.array([[0.95, 0.05], [0.02, 0.98]]),
         'Variables': {'X-ray': ['TuberculosisOrCancer']}},
        {'Values': np.array([[0.7, 0.3], [0.4,  0.6]]),
         'Variables': {'Bronchitis': ['Smoker']}},
        {'Values':  np.array([[0.9, 0.1,  0.3,  0.7], [0.2,  0.8,  0.1,  0.9]]),
         'Variables': {'Dyspnea': ['TuberculosisOrCancer', 'Bronchitis']}},
        {'Values': np.array([[0.99], [0.01]]),
         'Variables': {'VisitToAsia': []}},
        {'Values': np.array([[0.5], [0.5]]),
         'Variables': {'Smoker': []}},
        {'Values': np.array([[0.99, 0.01], [0.9, 0.1]]),
         'Variables': {'LungCancer': ['Smoker']}},
        {'Values': np.array([[0.99, 0.01], [0.95, 0.05]]),
         'Variables': {'Tuberculosis': ['VisitToAsia']}},
        {'Values': np.array([[1, 0, 0, 1], [0, 1, 0, 1]]),
         'Variables': {'TuberculosisOrCancer': ['LungCancer', 'Tuberculosis']}}]
Now let's create a model from the given data.
In [8]:
from pgmpy.models import BayesianModel
from pgmpy.factors import TabularCPD

model = BayesianModel(edges_list)

for node in nodes:
    model.node[node] = nodes[node]
for edge in edges:
    model.edge[edge] = edges[edge]

tabular_cpds = []
for cpd in cpds:
    var = list(cpd['Variables'].keys())[0]
    evidence = cpd['Variables'][var]
    values = cpd['Values']
    states = len(nodes[var]['States'])
    evidence_card = [len(nodes[evidence_var]['States'])
                     for evidence_var in evidence]
    tabular_cpds.append(
        TabularCPD(var, states, values, evidence, evidence_card))

model.add_cpds(*tabular_cpds)
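In the data above, each 'Values' array appears to have one row per state of the variable and one column per combination of evidence states. The sketch below checks that layout for a few of the entries using plain lists and the standard library only. The helper names expected_shape and actual_shape are made up for this illustration and are not part of pgmpy, and the layout itself is an assumption read off the data rather than a statement about what TabularCPD requires.

```python
from math import prod

# A few entries copied from the data above, with the arrays written as nested lists.
# Each tuple is (variable, number of states, values, evidence cardinalities).
samples = [
    ('VisitToAsia', 2, [[0.99], [0.01]], []),
    ('Bronchitis', 2, [[0.7, 0.3], [0.4, 0.6]], [2]),
    ('Dyspnea', 2, [[0.9, 0.1, 0.3, 0.7], [0.2, 0.8, 0.1, 0.9]], [2, 2]),
]


def expected_shape(n_states, evidence_card):
    """Assumed layout: rows are states, columns are evidence combinations."""
    # prod([]) is 1, so a variable without evidence gets a single column.
    return (n_states, prod(evidence_card))


def actual_shape(values):
    """Shape of a rectangular nested list as (rows, columns)."""
    return (len(values), len(values[0]))


mismatches = [var for var, n_states, values, card in samples
              if actual_shape(values) != expected_shape(n_states, card)]
print(mismatches)  # prints []
```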
In [9]:
from pgmpy.readwrite import ProbModelXMLWriter, get_probmodel_data
To get the data that ProbModelXMLWriter needs, we use the function get_probmodel_data. This step is specific to the ProbModelXML format; for the other file formats we pass the model directly to the corresponding writer class.
In [10]:
model_data = get_probmodel_data(model)
writer = ProbModelXMLWriter(model_data=model_data)
print(writer)

To write the XML data to a file, we can use the write_file method of the writer class.
In [ ]:
writer.write_file('probmodelxml.pgmx')

General WorkFlow of the readwrite module

pgmpy.readwrite.[fileformat]reader is the base class for reading a given file format. Replace [fileformat] with the format you want to read. This class defines different methods to parse the file. For example, for XMLBeliefNetwork the following methods are defined.
In [4]:
from pgmpy.readwrite.XMLBeliefNetwork import XBNReader
reader = XBNReader('xmlbelief.xml')
get_analysisnotebook_values: It returns a dictionary of the attributes of the ANALYSISNOTEBOOK tag.
In [5]:
reader.get_analysisnotebook_values()
Out[5]:
{'NAME': 'Notebook.Cancer Example From Neapolitan', 'ROOT': 'Cancer'}
get_bnmodel_name: It returns the name of the bnmodel.
In [6]:
reader.get_bnmodel_name()
Out[6]:
'Cancer'
get_static_properties: It returns the dictionary of staticproperties.
In [7]:
reader.get_static_properties()
Out[7]:
{'CREATOR': 'Microsoft Research DTAS',
 'FORMAT': 'MSR DTAS XML',
 'VERSION': '0.2'}
get_variables: It returns a dictionary mapping each variable name to its properties.
In [8]:
reader.get_variables()
Out[8]:
{'a': {'DESCRIPTION': '(a) Metastatic Cancer',
  'STATES': ['Present', 'Absent'],
  'TYPE': 'discrete',
  'XPOS': '13495',
  'YPOS': '10465'},
 'b': {'DESCRIPTION': '(b) Serum Calcium Increase',
  'STATES': ['Present', 'Absent'],
  'TYPE': 'discrete',
  'XPOS': '11290',
  'YPOS': '11965'},
 'c': {'DESCRIPTION': '(c) Brain Tumor',
  'STATES': ['Present', 'Absent'],
  'TYPE': 'discrete',
  'XPOS': '15250',
  'YPOS': '11935'},
 'd': {'DESCRIPTION': '(d) Coma',
  'STATES': ['Present', 'Absent'],
  'TYPE': 'discrete',
  'XPOS': '13960',
  'YPOS': '12985'},
 'e': {'DESCRIPTION': '(e) Papilledema',
  'STATES': ['Present', 'Absent'],
  'TYPE': 'discrete',
  'XPOS': '17305',
  'YPOS': '13240'}}
get_edges: It returns a list of tuples. Each tuple contains two elements (parent, child) for one edge.
In [9]:
reader.get_edges()
Out[9]:
[('a', 'b'), ('a', 'c'), ('b', 'd'), ('c', 'd'), ('c', 'e')]
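As a small illustration of how the (parent, child) tuples can be used, the sketch below groups the edges returned above by child. It uses only the standard library; the helper name parents_by_child is made up for this example and is not part of pgmpy. The resulting parent lists should line up with the CONDSET entries that get_distributions reports for the same network.

```python
from collections import defaultdict

# Edge list copied from the get_edges() output above: (parent, child) pairs.
edges = [('a', 'b'), ('a', 'c'), ('b', 'd'), ('c', 'd'), ('c', 'e')]


def parents_by_child(edge_list):
    """Group the parents under each child, preserving edge order."""
    parents = defaultdict(list)
    for parent, child in edge_list:
        parents[child].append(parent)
    return dict(parents)


parent_map = parents_by_child(edges)
print(parent_map)
# prints {'b': ['a'], 'c': ['a'], 'd': ['b', 'c'], 'e': ['c']}
```

The root node 'a' has no incoming edge, so it does not appear as a key.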
get_distributions: It returns a dictionary mapping each variable name to its distribution.
In [10]:
reader.get_distributions()
Out[10]:
{'a': {'DPIS': array([[ 0.2,  0.8]]), 'TYPE': 'discrete'},
 'b': {'CARDINALITY': array([2]),
  'CONDSET': ['a'],
  'DPIS': array([[ 0.8,  0.2],
         [ 0.2,  0.8]]),
  'TYPE': 'discrete'},
 'c': {'CARDINALITY': array([2]),
  'CONDSET': ['a'],
  'DPIS': array([[ 0.2 ,  0.8 ],
         [ 0.05,  0.95]]),
  'TYPE': 'discrete'},
 'd': {'CARDINALITY': array([2, 2]),
  'CONDSET': ['b', 'c'],
  'DPIS': array([[ 0.8 ,  0.2 ],
         [ 0.9 ,  0.1 ],
         [ 0.7 ,  0.3 ],
         [ 0.05,  0.95]]),
  'TYPE': 'discrete'},
 'e': {'CARDINALITY': array([2]),
  'CONDSET': ['c'],
  'DPIS': array([[ 0.8,  0.2],
         [ 0.6,  0.4]]),
  'TYPE': 'discrete'}}
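The output above can be sanity-checked with a few lines of plain Python. The sketch below assumes that each row of DPIS is a probability distribution for one configuration of the CONDSET variables, so the number of rows should equal the product of CARDINALITY and every row should sum to 1. That reading is inferred from the values shown here, and check_distribution is a hypothetical helper, not a pgmpy function.

```python
from math import isclose, prod

# Values copied from the get_distributions() output above, as plain lists.
distributions = {
    'a': {'DPIS': [[0.2, 0.8]]},
    'b': {'CARDINALITY': [2], 'CONDSET': ['a'],
          'DPIS': [[0.8, 0.2], [0.2, 0.8]]},
    'c': {'CARDINALITY': [2], 'CONDSET': ['a'],
          'DPIS': [[0.2, 0.8], [0.05, 0.95]]},
    'd': {'CARDINALITY': [2, 2], 'CONDSET': ['b', 'c'],
          'DPIS': [[0.8, 0.2], [0.9, 0.1], [0.7, 0.3], [0.05, 0.95]]},
    'e': {'CARDINALITY': [2], 'CONDSET': ['c'],
          'DPIS': [[0.8, 0.2], [0.6, 0.4]]},
}


def check_distribution(name, entry):
    """Return a list of human-readable problems found in one entry."""
    problems = []
    rows = entry['DPIS']
    # prod([]) is 1, so a root variable is expected to have a single row.
    expected_rows = prod(entry.get('CARDINALITY', []))
    if len(rows) != expected_rows:
        problems.append('%s: expected %d rows, found %d'
                        % (name, expected_rows, len(rows)))
    for index, row in enumerate(rows):
        if not isclose(sum(row), 1.0, abs_tol=1e-9):
            problems.append('%s: row %d sums to %r' % (name, index, sum(row)))
    return problems


all_problems = [problem
                for name, entry in distributions.items()
                for problem in check_distribution(name, entry)]
print(all_problems)  # prints []
```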
get_model: It returns an instance of the corresponding model class, for example BayesianModel in the case of the XMLBeliefNetwork format.
In [11]:
model = reader.get_model()
print(model.nodes())
print(model.edges())
['c', 'b', 'e', 'a', 'd']
[('c', 'e'), ('c', 'd'), ('b', 'd'), ('a', 'c'), ('a', 'b')]
pgmpy.readwrite.[fileformat]writer is the base class for writing a model to a given file format. It takes a model as an argument, which can be an instance of BayesianModel or MarkovModel. Replace [fileformat] with the format you want to write to. This class defines different methods to set the contents of the new file from the given model. For example, for XMLBeliefNetwork the following methods are defined.
In [7]:
from pgmpy.models import BayesianModel
from pgmpy.factors import TabularCPD
import numpy as np
nodes = {'c': {'STATES': ['Present', 'Absent'],
               'DESCRIPTION': '(c) Brain Tumor',
               'YPOS': '11935',
               'XPOS': '15250',
               'TYPE': 'discrete'},
         'a': {'STATES': ['Present', 'Absent'],
               'DESCRIPTION': '(a) Metastatic Cancer',
               'YPOS': '10465',
               'XPOS': '13495',
               'TYPE': 'discrete'},
         'b': {'STATES': ['Present', 'Absent'],
               'DESCRIPTION': '(b) Serum Calcium Increase',
               'YPOS': '11965',
               'XPOS': '11290',
               'TYPE': 'discrete'},
         'e': {'STATES': ['Present', 'Absent'],
               'DESCRIPTION': '(e) Papilledema',
               'YPOS': '13240',
               'XPOS': '17305',
               'TYPE': 'discrete'},
         'd': {'STATES': ['Present', 'Absent'],
               'DESCRIPTION': '(d) Coma',
               'YPOS': '12985',
               'XPOS': '13960',
               'TYPE': 'discrete'}}
model = BayesianModel([('b', 'd'), ('a', 'b'), ('a', 'c'), ('c', 'd'), ('c', 'e')])
cpd_distribution = {'a': {'TYPE': 'discrete', 'DPIS': np.array([[0.2, 0.8]])},
                    'e': {'TYPE': 'discrete', 'DPIS': np.array([[0.8, 0.2],
                                                                [0.6, 0.4]]), 'CONDSET': ['c'], 'CARDINALITY': [2]},
                    'b': {'TYPE': 'discrete', 'DPIS': np.array([[0.8, 0.2],
                                                                [0.2, 0.8]]), 'CONDSET': ['a'], 'CARDINALITY': [2]},
                    'c': {'TYPE': 'discrete', 'DPIS': np.array([[0.2, 0.8],
                                                                [0.05, 0.95]]), 'CONDSET': ['a'], 'CARDINALITY': [2]},
                    'd': {'TYPE': 'discrete', 'DPIS': np.array([[0.8, 0.2],
                                                                [0.9, 0.1],
                                                                [0.7, 0.3],
                                                                [0.05, 0.95]]), 'CONDSET': ['b', 'c'], 'CARDINALITY': [2, 2]}}

tabular_cpds = []
for var, values in cpd_distribution.items():
    evidence = values['CONDSET'] if 'CONDSET' in values else []
    cpd = values['DPIS']
    evidence_card = values['CARDINALITY'] if 'CARDINALITY' in values else []
    states = nodes[var]['STATES']
    cpd = TabularCPD(var, len(states), cpd,
                     evidence=evidence,
                     evidence_card=evidence_card)
    tabular_cpds.append(cpd)
model.add_cpds(*tabular_cpds)

for var, properties in nodes.items():
    model.node[var] = properties
In [8]:
from pgmpy.readwrite.XMLBeliefNetwork import XBNWriter
writer = XBNWriter(model = model)
set_analysisnotebook: It sets the attributes for ANALYSISNOTEBOOK tag.
set_bnmodel_name: It sets the name of the BNMODEL.
set_static_properties: It sets the STATICPROPERTIES tag for the network.
set_variables: It sets the VARIABLES tag for the network.
set_edges: It sets edges/arcs in the network.
set_distributions: It sets distributions in the network.
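To give a feel for what such set_* methods do, here is a rough sketch that builds a small XML skeleton with the standard library's xml.etree.ElementTree. It is not the XBNWriter implementation. The tag names ANALYSISNOTEBOOK, BNMODEL and STATICPROPERTIES are taken from the descriptions above, while the STRUCTURE and ARC tags, the attribute layout and the function name build_xbn_skeleton are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def build_xbn_skeleton(notebook_name, model_name, static_properties, edges):
    """Build an XML tree loosely shaped like an XMLBeliefNetwork file."""
    # Root tag with its attributes (compare set_analysisnotebook).
    root = ET.Element('ANALYSISNOTEBOOK',
                      {'NAME': notebook_name, 'ROOT': model_name})
    # Model tag with its name (compare set_bnmodel_name).
    bnmodel = ET.SubElement(root, 'BNMODEL', {'NAME': model_name})
    # One child element per static property (compare set_static_properties).
    static = ET.SubElement(bnmodel, 'STATICPROPERTIES')
    for key, value in sorted(static_properties.items()):
        ET.SubElement(static, key, {'VALUE': value})
    # One element per edge (compare set_edges); tag names are assumed.
    structure = ET.SubElement(bnmodel, 'STRUCTURE')
    for parent, child in edges:
        ET.SubElement(structure, 'ARC', {'PARENT': parent, 'CHILD': child})
    return root


root = build_xbn_skeleton(
    'Notebook.Cancer Example From Neapolitan',
    'Cancer',
    {'CREATOR': 'Microsoft Research DTAS',
     'FORMAT': 'MSR DTAS XML',
     'VERSION': '0.2'},
    [('a', 'b'), ('a', 'c'), ('b', 'd'), ('c', 'd'), ('c', 'e')])
xml_text = ET.tostring(root, encoding='unicode')
print(xml_text)
```

Keeping one small function per tag, as the writer does, makes it easy to fill in only the parts of the file that a given model actually has.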

Wednesday 12 August 2015

UAI Reader And Writer

After the midterm, I worked on the UAI reader and writer module. It has now been successfully merged into the main repository.

UAI Format Brief Description

The UAI format is a simple text file format, specified below, for describing problem instances.
Link to the format: UAI

A file in the UAI format consists of the following two parts, in that order:
  1. Preamble
  2. Function 
Preamble: It starts with a line of text denoting the type of the network. This is followed by a line containing the number of variables. The next line specifies each variable's domain size, one at a time, separated by whitespace. The fourth line contains only one integer, denoting the number of functions in the problem (conditional probability tables for Bayesian networks, general factors for Markov networks). Then, one function per line, the scope of each function is given as follows: the first integer in each line specifies the size of the function's scope, followed by the actual indexes of the variables in the scope. The order of this list is not restricted, except when specifying a conditional probability table (CPT) in a Bayesian network, where the child variable has to come last. Also note that variables are indexed starting with 0.

Example of Preamble

MARKOV
3
2 2 3
2
2 0 1
3 0 1 2
In the above example the network type is MARKOV, the number of variables is 3, and the domain sizes of the variables are 2, 2 and 3 respectively. The network has 2 functions, with scopes (0, 1) and (0, 1, 2).

To read the preamble we use the pyparsing module. To get the variables and their domain sizes, the reader provides the methods get_variables and get_domain, which return the list of variables and a dictionary mapping each variable name to its domain size.

For example, for the above preamble the method get_variables will return [var_0, var_1, var_2]
and the method get_domain will return
{var_0: 2, var_1: 2, var_2: 3}
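The reader itself relies on pyparsing. As a rough, dependency-free illustration of the same preamble layout, the sketch below simply splits the text on whitespace instead. The function name parse_uai_preamble is hypothetical, and the var_N naming only mirrors the output described above.

```python
def parse_uai_preamble(text):
    """Parse a UAI preamble into (type, variables, domain, scopes)."""
    tokens = text.split()
    network_type = tokens[0]
    n_vars = int(tokens[1])
    domain_sizes = [int(token) for token in tokens[2:2 + n_vars]]

    position = 2 + n_vars
    n_functions = int(tokens[position])
    position += 1

    scopes = []
    for _ in range(n_functions):
        # Each scope is its size followed by that many variable indexes.
        size = int(tokens[position])
        indexes = tokens[position + 1:position + 1 + size]
        scopes.append(['var_%d' % int(index) for index in indexes])
        position += 1 + size

    variables = ['var_%d' % index for index in range(n_vars)]
    domain = dict(zip(variables, domain_sizes))
    return network_type, variables, domain, scopes


preamble = """MARKOV
3
2 2 3
2
2 0 1
3 0 1 2
"""

network_type, variables, domain, scopes = parse_uai_preamble(preamble)
print(network_type)  # prints MARKOV
print(variables)     # prints ['var_0', 'var_1', 'var_2']
print(domain)        # prints {'var_0': 2, 'var_1': 2, 'var_2': 3}
print(scopes)        # prints [['var_0', 'var_1'], ['var_0', 'var_1', 'var_2']]
```

Splitting on whitespace ignores the line structure, which is enough for a well-formed file but gives less precise error reporting than a real grammar.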

Function: In this section each function is specified by giving its full table (i.e., specifying the function value for each tuple). The order of the functions is identical to the one in which they were introduced in the preamble.


For each function table, first the number of entries is given (this should be equal to the product of the domain sizes of the variables in the scope). Then, one by one, separated by whitespace, the values for each assignment to the variables in the function's scope are enumerated. Tuples are implicitly assumed in ascending order, with the last variable in the scope as the 'least significant'.
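The ordering rule can be made concrete with a short sketch. With the last variable as the least significant, the position of an assignment in the table is a mixed-radix number whose digits are the variable values. The helper names below are made up for this illustration.

```python
from itertools import product


def enumerate_assignments(domain_sizes):
    """List all assignments in table order (last variable changes fastest)."""
    return list(product(*(range(size) for size in domain_sizes)))


def table_index(assignment, domain_sizes):
    """Position of an assignment in the flat table of values."""
    index = 0
    for value, size in zip(assignment, domain_sizes):
        index = index * size + value
    return index


sizes = [2, 3]  # a scope of two variables with domain sizes 2 and 3
assignments = enumerate_assignments(sizes)
print(assignments)
# prints [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
print(table_index((1, 2), sizes))  # prints 5
```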

Example of Function
2
0.436 0.564

4
0.128 0.872
0.920 0.080

6
0.210 0.333 0.457
0.811 0.000 0.189
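A minimal sketch of reading such a function section, again using plain whitespace splitting rather than the pyparsing grammar used by the actual reader. The function name parse_uai_functions is hypothetical. The example does not state the scope of each table, so the pairing at the end assumes that the second table belongs to two binary variables.

```python
from itertools import product


def parse_uai_functions(text):
    """Parse a UAI function section into a list of flat value tables."""
    tokens = text.split()
    tables = []
    position = 0
    while position < len(tokens):
        # Each table starts with its number of entries.
        n_entries = int(tokens[position])
        raw = tokens[position + 1:position + 1 + n_entries]
        if len(raw) != n_entries:
            raise ValueError('table %d is truncated' % len(tables))
        tables.append([float(token) for token in raw])
        position += 1 + n_entries
    return tables


function_section = """2
0.436 0.564

4
0.128 0.872
0.920 0.080

6
0.210 0.333 0.457
0.811 0.000 0.189
"""

tables = parse_uai_functions(function_section)
print([len(table) for table in tables])  # prints [2, 4, 6]

# Assumed scope for the second table: two variables with domain size 2 each.
second = dict(zip(product(range(2), range(2)), tables[1]))
print(second)
# prints {(0, 0): 0.128, (0, 1): 0.872, (1, 0): 0.92, (1, 1): 0.08}
```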