In Learning of Routing Revisited we broadened our view to get an overview of the various approaches that have been developed to improve routing in networks, with a focus on learning techniques. Here, we want to revisit the Q-Routing approach[1] that was introduced more than 20 years ago. Neural Networks turned out to be “inconclusive” in the work of Boyan and Littman back then – something we want to keep in mind against the backdrop of Neural Networks being so prominent in the current Machine Learning discourse. While revisiting the approach, we also want to focus on the performance metrics applied and the environmental factors considered.
The “Q-Routing” algorithm is a version of Bellman-Ford’s algorithm (a distance vector algorithm, cf. link state algorithms) for finding the shortest paths from a starting node in a graph with weighted edges. What was Bellman-Ford’s algorithm again? The core idea is to start at a source node, move to its neighboring nodes in the graph and record the accumulated weights along the way. This is then extended to all the remaining nodes in the graph by reusing the knowledge of how best to reach the predecessor nodes through which a given node can be reached. Repeating this relaxation and updating the weight sums eventually yields the shortest paths between the starting node and all other destination nodes.
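To make this concrete, here is a minimal Bellman-Ford sketch in Python. The node names, the (directed) edges and their weights are made up purely for illustration.

```python
# Minimal Bellman-Ford sketch: single source, preset edge weights,
# directed edges for simplicity. All values are illustrative.

def bellman_ford(edges, nodes, source):
    """edges: list of (u, v, weight); returns the shortest known distance per node."""
    dist = {n: float("inf") for n in nodes}
    dist[source] = 0.0
    # Relax every edge |V| - 1 times: each pass extends the knowledge of
    # how best to reach a node via its already-reached predecessors.
    for _ in range(len(nodes) - 1):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    return dist

nodes = ["A", "B", "C", "D"]
edges = [("A", "B", 1.0), ("B", "C", 2.0), ("A", "C", 5.0), ("C", "D", 1.0)]
print(bellman_ford(edges, nodes, "A"))
# {'A': 0.0, 'B': 1.0, 'C': 3.0, 'D': 4.0} -- the direct A-C edge (weight 5)
# loses against the path via B (total weight 3).
```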
Important here: the weights (or distance metrics) on the edges are preset. They have been declared by someone so that some property of the edge is reflected (e.g. very high weights for low-capacity links, or, in the Routing Information Protocol (RIP), which implements Bellman-Ford, simply the hop count). The algorithm will find the shortest paths as long as the weights remain unchanged. Once they change, the algorithm needs to start over to come up with the new optimal paths. But this change needs to come about somehow: either 1) the “declaration agent” senses that changes in the edge properties demand an update of the weights, or 2) the network system itself has some inherent mechanism at its disposal that senses actual edge property changes and can also update the weights (or some technique in between the two).
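As a small illustration of such preset weights, here are two hypothetical declaration schemes: pure hop count (as in RIP) and a capacity-derived metric. The link capacities and the reference value are assumptions made up for this example.

```python
# Two hypothetical ways a "declaration agent" might preset edge weights.
# Capacities and the reference bandwidth are illustrative assumptions.

REFERENCE_BW_MBPS = 100_000  # assumed reference bandwidth

link_capacity_mbps = {
    ("A", "B"): 10_000,
    ("B", "C"): 1_000,
    ("A", "C"): 100_000,
}

# RIP-style: every edge costs one hop, regardless of capacity.
hop_count_weights = {link: 1 for link in link_capacity_mbps}

# Capacity-derived: low-capacity links get high weights.
capacity_weights = {link: REFERENCE_BW_MBPS / cap
                    for link, cap in link_capacity_mbps.items()}

print(hop_count_weights)  # {('A', 'B'): 1, ('B', 'C'): 1, ('A', 'C'): 1}
print(capacity_weights)   # {('A', 'B'): 10.0, ('B', 'C'): 100.0, ('A', 'C'): 1.0}
# Feeding either map into the bellman_ford sketch above can yield different
# shortest paths; once a capacity changes, the weights have to be re-declared
# and the algorithm re-run.
```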
In both cases observation is required, as well as realizing/detecting the need for weight updates, and lastly implementing the update. In case 1) a human takes in all the available information about the network system and its changes (e.g. major link upgrades, topological changes, etc.), then realizes and decides on new weights (according to some preferred overall system behavior), and then plays the updated weights out into the network system. This is how Traffic Engineering (TE) works in most major telco networks today. The learning bit in it, i.e. realizing the need for change and updating the weights (the re-description of the network itself), is for the most part processed in the human system. The delays here are on the order of 2–4 weeks: only then does a human traffic engineer check whether weight updates are needed and, if so, play around with the options (the routing algorithm can accommodate many ad-hoc changes in the network – it senses them and re-runs for a new optimum – but its adaptability has limits), simulate scenarios, and then select a robust heuristic that hopefully performs well enough for the next four weeks.
In case 2), instead of a human system observing the network state, the network itself takes care of this (self-observation and self-description). It also evaluates the observations (this evaluation is linked to the “reward” concept in reinforcement learning (RL)) and then updates the weights by itself (this is related to “actions” in RL). How do these observation signals, these measurements, come about? This is an interesting topic in itself, as it involves delay and can have different semantics. Whether ping is used[2] or the signals are received in-band[3] – it plays an important role. Let’s keep in mind that the various delays introduced here – depending on the measurement method applied – are crucial when analyzing the overall approach.
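To sketch what such a self-observing loop could look like (without anticipating the actual Q-Routing update, which we turn to in Part 2), here is a deliberately simple, hypothetical example: a node periodically measures per-link delay (abstracted as measure_delay_ms, standing in for ping-style probes or in-band telemetry) and nudges its routing weights toward the observations via exponential smoothing. The smoothing factor and all delay values are assumptions.

```python
import random

# Hypothetical self-observation loop: measure link delay, evaluate the
# observation, and act on it by updating the routing weight.
# ALPHA and all delay values are illustrative assumptions.

ALPHA = 0.3  # smoothing factor (assumption)

BASE_DELAY_MS = {"A-B": 5.0, "B-C": 12.0, "A-C": 30.0}

def measure_delay_ms(link):
    """Stand-in for a real measurement (ping RTT or in-band telemetry)."""
    return BASE_DELAY_MS[link] + random.uniform(0.0, 4.0)  # jitter mimics load changes

def update_weight(weights, link):
    """Evaluate the observation (the 'reward' analogue) and act on it
    (the 'action' analogue) by nudging the weight toward the measured delay."""
    observed = measure_delay_ms(link)
    weights[link] = (1 - ALPHA) * weights[link] + ALPHA * observed
    return weights[link]

weights = dict(BASE_DELAY_MS)       # start from the declared values
for _ in range(10):                 # periodic self-observation rounds
    for link in weights:
        update_weight(weights, link)
print(weights)  # a shortest-path run would now use these updated weights
```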
This was a little detour into how the metrics/weights on edges may change – however, this is not the main idea of Bellman-Ford. When it comes to the propagation of updates, though, Bellman-Ford has been shown to be comparatively slow. But let’s turn back to what the main idea behind Q-Routing is and in which aspects it adapted Bellman-Ford – in Part 2.
References
[1] Boyan, J. and Littman, M. (1994). Packet Routing in Dynamically Changing Networks: A Reinforcement Learning Approach.
[2] Mansour, Y. (1999). Course material “Packet Routing using Reinforcement Learning” for Computational Learning Theory at Tel Aviv University, Fall Semester 1999/2000.
[3] Brockners, F. (2017). Next-gen Network Telemetry is Within Your Packets: In-band OAM. Talk at Open Networking Summit 2017.