CURRENT MEETING REPORT


Minutes of the Inter-Domain Multicast Routing Working Group (idmr)

Reported by Bill Fenner, Xerox PARC and Tony Ballardie, University 
College London


First Session

Deborah Estrin spoke first about interoperability mechanisms between 
PIM sparse mode and a multicast backbone, whether it be level two or 
current DVMRP.  All groups use local RPs, and the border routers join 
towards the local RPs in order to inject traffic into the backbone.  A 
single designated border router registers externally-sourced packets to 
the internal RP.  Internal RPs advertise themselves on a special 
bootstrap group whose RP is elected via PIM Query messages.  Open 
issues here include whether to explicitly notify the designated border 
router of group memberships inside the domain or to do "flood and 
prune," and how to use a domain of this type for transit traffic.

Deborah then spoke about how to connect such domains with a PIM-SM 
backbone.  The changes in this case are limited to the local RPs.  When 
there is a PIM-SM backbone, there is a hierarchy of RPs; it is not yet 
clear if two levels of hierarchy are enough or if we need more.  Join 
messages are modified to include a hierarchy level and crossing tree 
branches are handled by having the highest-level branch "win."  If a 
level-1 Join message meets a level-2 tree branch, it does not change the 
incoming interface.  The local RPs join to the next-level RPs.  Candidate 
RP lists are distributed using the same bootstrap mechanism as in the 
local domain.
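
As a rough illustration (the function and parameter names here are ours, 
not from the proposal), the "highest level wins" rule might look like 
this:

    def merge_branch(existing_level, existing_iif, join_level, join_iif):
        # Sketch of the rule described above: when tree branches cross,
        # the highest hierarchy level wins.  A lower-level Join meeting
        # a higher-level branch does not change the incoming interface.
        if join_level > existing_level:
            return join_level, join_iif
        return existing_level, existing_iif

    # A level-1 Join meeting a level-2 branch leaves state unchanged:
    print(merge_branch(2, "if0", 1, "if1"))   # -> (2, 'if0')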

Mark Handley then spoke about Hierarchical PIM, or HPIM.  HPIM is 
similar to Deborah's scheme (and Deborah's scheme was derived from 
Mark's proposal), but Mark expects HPIM to have five to six levels of 
hierarchy.  Each RP knows its own level and knows the candidate RP 
list for the next level up.  The group address can be hashed into the list 
of RPs to determine the RP for that group.  This gets rid of the 
requirement of storing RP/group mappings.  The candidate-RP list is not 
meant to change often, but when it does, the old RP joins to the new RP 
to keep the tree intact and initiates the transfer of its receivers to the new 
RP.
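
As a hedged sketch of the idea (MD5 and all names below are our 
placeholders; the minutes do not specify a hash function), every router 
holding the same candidate-RP list can derive the same RP for a group 
without storing any RP/group mappings:

    import hashlib

    def rp_for_group(group_addr, candidate_rps):
        # Hash the group address into the candidate-RP list; all
        # routers with the same list compute the same answer.
        digest = hashlib.md5(group_addr.encode()).digest()
        index = int.from_bytes(digest[:4], "big") % len(candidate_rps)
        return candidate_rps[index]

    rps = ["rp-a.example.net", "rp-b.example.net", "rp-c.example.net"]
    print(rp_for_group("224.2.127.254", rps))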

HPIM also makes the following "gratuitous" changes to PIM, which are 
not directly related to the hierarchy of RPs.  Join messages establish bi-
directional forwarding state, not unidirectional, and they get ACK'd.  
RP failure is determined by timeouts and is handled by using an 
alternate hash function to determine a fallback RP.  If a domain has 
local receivers, a sender in the same domain uses their (bi-directional) 
state; if not, it sends a "sender join" hop-by-hop 
towards the RP until it hits the tree; "sender join" state is 
unidirectional.  The loop avoidance mechanism can result in 
unnecessary traffic flow but is relatively simple.
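
A similar sketch, again with an invented salt standing in for whatever 
alternate hash function HPIM actually uses, shows how a fallback RP can 
be derived deterministically once the primary RP times out:

    import hashlib

    def fallback_rp(group_addr, candidate_rps, failed_rp):
        # On RP timeout, pick a fallback RP with an alternate hash.
        # Salting the input is our invention, used only to obtain a
        # second, independent hash function.
        survivors = [rp for rp in candidate_rps if rp != failed_rp]
        digest = hashlib.md5((group_addr + "#fallback").encode()).digest()
        return survivors[int.from_bytes(digest[:4], "big") % len(survivors)]

    rps = ["rp-a.example.net", "rp-b.example.net", "rp-c.example.net"]
    print(fallback_rp("224.2.127.254", rps, failed_rp="rp-a.example.net"))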

Outstanding issues include the potentially excessive configuration of 
the hierarchy and RPs; however, there may be a way to configure such 
things automatically.  The hash function can lead to a 
sub-optimal top-level RP, but since tree links are bidirectional you can 
potentially prune off the top-level RP.  More thinking needs to be done 
on the topics of changing RPs dynamically based on traffic load, and on 
RP selection algorithms.

The second half of the meeting was on providers' experiences with 
deploying PIM in their networks.  First, Matt Crawford from Fermilab 
spoke about his experiences with PIM and High Energy Physics.  HEP 
generally has large collaborations, and these collaborations meet 
several times per week using multicast conferencing tools.  In addition, 
there are accelerator controllers which can multicast their readings, 
allowing multiple users to see results at once.

Matt said that he liked PIM because all he had to do was turn it on in 
his routers and it "just worked," and because he owns many routers but 
few workstations.  Steve Deering stood up and commented that Matt 
would probably have the same experience with any multicast routing 
protocol implemented in a router, and that the choice of PIM was 
simply dictated by the brand of router.  Matt more or less agreed.

Petri Helenius from Santa Monica Software spoke about MBONE 
deployment in Finland.  PIM Dense Mode is deployed throughout 
Finland, and DVMRP pruning is implemented at DVMRP 
interoperability points, meaning that multicast should prune all the 
way back to Sweden.  The emphasis was again that they had country-
wide multicast reachability without requiring "kernel hackers" at 
each site.

The issues that Petri brought up included PIM-DM's wasteful state for 
groups with sparse membership, unresolved NBMA issues, and the need 
for configuration guidelines for DVMRP interoperation and for suggested 
topologies.

David Meyer from University of Oregon then spoke about having a 
multi-homed campus.  He wanted to have ubiquitous multicast 
integrated with the existing infrastructure, as well as redundant 
external connectivity and integration with Internet service providers.  
His network uses sparse mode PIM, but has the problem that his unicast 
topology is too rich, and PIM can't (yet?) RPF over a multipath link, so 
even if he has multiple T1s between two sites he can only use one.  He 
glues sparse mode domains together with dense mode domains to work 
around policy and RP placement issues.  He is providing multicast to 200 
subnets, and it is nearly ubiquitous in his network.


Second Session

Day two began with a presentation on multicast traceroute ('mtrace'), 
recently released as an IDMR Internet-Draft.  All MBONE users were 
strongly encouraged to implement mtrace since it makes debugging 
problems so much easier.  Currently, an estimated 58% of the MBONE 
implements mtrace.

'mtrace' has been designed to operate with the assumption that the  
underlying multicast routing protocol is based on RPF.  Hence, a call was 
made to the designers of shared tree protocols (CBT, PIM-SM) for 
feedback on whether/how 'mtrace' can work with shared trees.
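
For readers unfamiliar with RPF, a minimal sketch of the check that 
mtrace's design assumes (the toy routing table and names are ours):

    from collections import namedtuple

    Route = namedtuple("Route", ["interface", "next_hop"])

    def rpf_check(source, arrival_iface, routes):
        # Accept a multicast packet only if it arrived on the interface
        # this router uses to reach the packet's source via unicast.
        # mtrace walks this reverse path hop by hop from receiver
        # towards source, which is why shared trees need extra thought.
        return routes[source].interface == arrival_iface

    routes = {"192.0.2.1": Route("eth0", "10.1.1.1")}   # toy routing table
    print(rpf_check("192.0.2.1", "eth0", routes))       # True: on RPF path
    print(rpf_check("192.0.2.1", "eth1", routes))       # False: off-path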

A call was made requesting that IGMPv2 be advanced to Proposed 
Standard.  There were no objections.  However, it is currently unclear 
whether it will actually advance to Proposed Standard or be published 
as an Informational RFC.  The Routing Area Director should decide on 
the best course of action, given that the group voiced no objections 
to its standardization.

A CBT protocol update was given. The protocol has been considerably 
streamlined since the previous draft release (June 1995).  A new draft 
indicating the very latest proposed changes should be submitted within 
the next week. At this stage, the CBT designers consider the protocol to 
remain stable (as far as the functional spec is concerned). However, it 
remains to be fully specified how CBT interoperates with DVMRP (and 
other protocols for that matter).  The CBT designers are working on this 
and expect to announce an interoperability document shortly.

The following is a summary of the CBT protocol updates:

o  Multi-access LAN designated router (DR) election has been 
redesigned.  The CBT default DR is the same router as the IGMP 
querier; hence, no separate protocol is required for CBT DR election.

This assumes that any CBT-capable subnetwork has only the CBT 
multicast protocol running over it. If this assumption is not made, then 
the IGMP querier could be a multicast router of another scheme. For this 
scenario, interoperability between CBT and all other protocols needs to 
be defined. The working group is currently trying to establish a 
protocol-independent mechanism for interoperability, to avoid each 
protocol having to define interoperability mechanisms with every 
other protocol.  For the moment, then, it is safe to assume that any one 
subnetwork is running only a single multicast routing protocol.

o  The core tree (the part of the CBT tree linking all cores together) is 
now built "on-demand."  This requires that all group members and 
potential new members agree on the identity of only the 
primary core router.  The primary core, together with a list of 
alternate (secondary) cores, is distributed throughout the network 
by some T.B.D. mechanism (e.g., HPIM, <core,grp> advertisements, 
etc.).

The on-demand core tree building works as follows: any secondary core 
that receives a join first acks the join, then sends a join-request 
(code REJOIN) to the primary core so the tree becomes fully connected.  
The primary core only ever listens for incoming joins; it never needs 
to join any entity itself.
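
A minimal sketch of that rule, with invented names and a toy transport 
(this is not the CBT spec's wire format):

    def handle_join_at_secondary_core(join_sender, on_core_tree,
                                      primary_core, send):
        # A secondary core first acks the join it received, then, if
        # not already on the core tree, sends a join-request (code
        # REJOIN) towards the primary core to connect the tree.
        send(join_sender, "JOIN-ACK")
        if not on_core_tree:
            send(primary_core, "JOIN-REQUEST code=REJOIN")
            on_core_tree = True
        return on_core_tree

    send = lambda dst, msg: print(f"-> {dst}: {msg}")
    handle_join_at_secondary_core("downstream-router", False,
                                  "primary-core", send)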

o  Native mode.

This assumes CBT routers operate in a CBT-only "cloud", i.e. 
multicast routers of other schemes are not active within the same 
cloud. This allows for much faster packet switching times, also 
helped by the fact that no RPF check is necessary.

o  Maintenance message "aggregates."

Rather than have a CBT-ECHO-REQUEST/REPLY pair sent for each 
child/parent on a per-group basis, the protocol now aggregates 
these messages so that only a single request/response pair is sent 
for any child-parent pair. This is especially attractive in those 
parts of the network where links are likely to be shared across 
groups. Considerable bandwidth savings are possible with this 
mechanism.
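
A toy illustration of the saving (the data structure and names are 
ours): aggregation keys the keepalive on the parent link rather than on 
the (group, parent) pair:

    def echo_peers(group_parents):
        # group_parents maps each group to this router's parent on that
        # group's tree.  Aggregation sends one CBT-ECHO-REQUEST per
        # distinct parent rather than one per (group, parent).
        return set(group_parents.values())

    parents = {"224.1.1.1": "routerA",
               "224.2.2.2": "routerA",
               "224.3.3.3": "routerB"}
    print(echo_peers(parents))   # 2 echo requests instead of 3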

o  Rejoins and loop-detection.

The dual mechanisms of rejoining and loop detection have been 
made simpler and more straightforward.  The new scheme means 
that a rejoining node first receives an ack before it (rather than 
the node sending the ack) generates a rejoin-nactive (loop 
detection) packet.  This new technique avoids the router sending 
the ack having to perform any packet 'translation'.

o  Proxy-ack.

Although proxy-acks were described in the draft (spec) released 
immediately before IDMR Dallas, the CBT designers decided, at a 
meeting held just before the IDMR sessions, that there is a case 
where proxy-acks should not be used.  They are no longer present 
in the protocol.

Overall, the CBT protocol has been streamlined considerably.  There 
are now far fewer control messages, resulting in a further simplified 
protocol engine.  An aggregated join mechanism is currently being 
worked through such that, subsequent to a router/link failure, groups 
with overlapping core(s) can send a single 'aggregated' join to re-
establish connectivity, rather than have each group generate a join 
individually.

One approach to how multi-protocol interoperability can be achieved 
at the L1/L2 hierarchical boundary was presented. The technique 
involves an L2 encapsulation, but does not require any exchange of 
routing information between L1 and L2.  This contributes significantly 
to the simplicity of the approach.  However, an L1/L2 interface that 
requires no encapsulation is considered very desirable, and the group 
is therefore rethinking its (previously announced) approach to 
hierarchical multicast.  Nevertheless, the encapsulation approach will 
be released as an IDMR Internet-Draft.

Finally, a pragmatic approach to bi-level multicast in the DIS 
environment was presented. This work has been conducted at BBN under 
ARPA's "Real-time Information Transfer and Networking" program in 
support of distributed simulation. 

DIS applications are not like teleconferencing applications; DIS 
assumes very large numbers of groups (10^4 or more) and requires very 
low join latency (under 0.5 seconds).

The bi-level environment consists of a "constructed multicast service 
(CMS)" built on top of an "underlying multicast service (UMS)."  The 
CMS provides control traffic, which is carried as ordinary data by 
the UMS.  Bi-level routers (BLRs) peer directly with each other and 
use IGMP to determine CMS group membership.  This state is sent to all 
other BLRs so each BLR knows where group members are located.  UMS 
groups are used to distribute data for CMS groups; BLRs join both UMS 
groups and CMS groups.
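
A hypothetical sketch of a BLR's bookkeeping under these assumptions 
(the CMS-to-UMS mapping and all names here are ours, not BBN's):

    def ums_group_for(cms_group):
        # The CMS-to-UMS mapping is not described in the minutes; the
        # identity mapping here is purely a placeholder.
        return cms_group

    class BilevelRouter:
        """Sketch of a BLR tracking CMS membership learned via IGMP."""

        def __init__(self, join_ums):
            self.cms_members = {}     # CMS group -> set of local members
            self.join_ums = join_ums  # callback into the underlying UMS

        def igmp_report(self, cms_group, host):
            # A local host joined a CMS group: record the membership and
            # join the UMS group that carries that CMS group's data.
            self.cms_members.setdefault(cms_group, set()).add(host)
            self.join_ums(ums_group_for(cms_group))

    blr = BilevelRouter(join_ums=lambda g: print(f"UMS join: {g}"))
    blr.igmp_report("239.1.2.3", "host-1")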

The communication between BLRs about group memberships needs 
to have a high degree of reliability.  Joins/leaves use sequence 
numbering and timestamps.  Each BLR sends MD5 hash messages to its 
upstream neighbour summarizing its current state.  Rather than "hard 
state" or "soft state," the state is said to be "firm": it is like 
soft state in that it is sent periodically, and like hard state in 
that deltas are sent.  However, unlike soft state, the information 
sent periodically is a cryptographically strong checksum of the 
desired state, rather than the entire state.
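
A hedged sketch of such a "firm state" digest (the message layout is 
our assumption; the minutes say only that MD5 hashes summarizing state 
are sent):

    import hashlib

    def firm_state_digest(memberships, seqno):
        # Periodically send a cryptographically strong checksum of the
        # desired state plus a sequence number, while deltas carry the
        # actual changes.
        canonical = ",".join(sorted(memberships)) + f"#{seqno}"
        return hashlib.md5(canonical.encode()).hexdigest()

    mine   = firm_state_digest({"224.1.1.1", "224.5.5.5"}, seqno=42)
    theirs = firm_state_digest({"224.1.1.1"}, seqno=42)
    print(mine == theirs)   # False: states diverged, resend deltas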

A bi-level multicast prototype has been implemented and tested.  The 
UMS was provided by Bay Networks routers using DVMRP connected by 
ATM, linking six sites in the DC area. A simulation exercise was 
carried out using ca. 700 multicast groups. Some sites had 2000 join 
events over ca. 30 mins, which averages out at about one join per second.

Official documentation from BBN (gdt@bbn.com) should be forthcoming 
shortly, describing the concepts and protocol of bi-level 
multicasting in detail. 

Once again, the DIS work demonstrated the need for a next-generation 
multicast protocol that can better support its requirements.