COMP413 Project 3: Distributed Agreement with Aglets

Daniel Ballinger 300041839

The code is all stored in DBAglet.java and can be complied in the standard Java fashion (%>javac DBAglet.java).

Assumptions

  1. No aglets will crash, helping remove some cases of problems like a single point of failure and locks not being released.
  2. Messaging should be reliable. That is, once a message is sent it will arrive and be processed by the recipient. It is possible that the message will not arrive due to a broken proxy in this system.
  3. The first aglet created will be the co-ordinator and spawn off the required number of mobile aglets. This helps avoid the need to hold an election initially.
  4. The initial list of hosts only contains unique host-port pairs. Each aglet itinerary may later be expanded if a more complex path is required.
  5. The co-ordinator will always start at the first host in the list. This stems from locking reasons, but could probably be corrected relatively simply.
  6. When the co-ordinator has completed it's itinerary it must remain active until all other aglets have completed there itineraries. Again, this helps avoid the need for an election.
  7. No aglet shall move without the permission of the co-ordinator. This helps ensure that the co-ordinator _always_ has a valid proxy to all mobile aglets.

General Design

The overall design is centralised, with remote aglets requiring permission from the co-ordinator before moving. The co-ordinator is responsible for maintaining a list of the locations of each aglet and ensuring that no two aglets occupy the same location (enforcing mutual exclusion).

Mutual Exclusion Requirements

Safety: At most one process may execute in the critical section at a time.
Liveness: A process requesting entry to the critical section is eventually granted it.
Ordering: entry to the critical section should be granted in happened-before order.

Safety is addressed by using a semaphore for each context that allocates a lock to a specific aglet id. The lock can subsequently only be released by that aglet id.

To satisfy liveness and ordering, requests to move to a particular context are queued and processed in a FIFO ordering.

Deadlock

For the particular design there are two major issues that can result in deadlock.

1. A circular loop exists between two or more active aglets. In this case no aglet may make progress until another moves, resulting in deadlock. This case is detected by maintain a map from aglet ids to waiting contexts on the co-ordinator. Whenever a request to move is queued the deadlock check is run. Using the id of aglet currently holding the lock for the recently queued to context the context that its waiting for is looked up. If found a recursive call is made, passing the id of the waiting aglet in a list. The recursive call proceeds to trace out the dependency list for waiting aglets until either a free context is found or an id in the list is seen again (on each call the currently locking id is added after this check).

If deadlock is confirmed a list of effected aglets is returned. From this list a random (non-co-ordinator) aglet is chosen. The list of possible hosts is then shuffled and a search begins for a free context. When one is found the randomly picked aglet is dispatched there, the old locks it held released, and first waiting aglet dispatched. The randomisation is important to prevent situations where a single aglet is toggled between two locations.

2. The co-ordinator has finished but is still blocking the path for other aglets. This is detected by checking if the blocking aglet is the co-ordinator and if it has finished it’s itinerary. If this is true the co-ordinator is shunted to a free host.

An unpredicted form of deadlock can occur due to lost messages. Messages sometimes don't arrive (or get processed) at the co-ordinator. This can happen for requests to move and disposal messages at the server and also for co-ordinator proxy updates at the slaves. Some redundant messaging has been added to help resolve this issue, but ideally the aglets would use timeouts to retransmit messages. In the majority of cases this is not the result of invalid proxies. A hack solution I implemented to help with this issue is the retransmission of important messages when the “dialog” message arrives at an aglet. The user can issue this message using the dialog button from within Tahiti.

Response of major Aglet events

OnCreation

If an aglet has a null Object for it's input parameters then it is the co-ordinator, otherwise it is a slave aglet.

The co-ordinator will proceed to create slave aglets and dispatch them to free contexts. During this process it will store a proxy to the aglets and lock each context for the new aglets identifier.

Once all the slave aglets are created the co-ordinator will lock the host it resides on for itself. Following this all aglets (including the co-ordinator) are sent a message informing them that it is appropriate to start moving.

Slave aglets will set the identifier, co-ordinator proxy, and itinerary attributes on creation (which they are passed as arguments). They will then wait for further instructions – the message to activate.

OnArrival

If an aglet arrives at a new location and it has no record of its previous location it has just been created and dispatched for the first time. It will set its current location to the URL of the host it has just arrived at. No further action will be required in this case.

When arriving at a new location with a known previous location the aglet will check its progress in its itinerary. If the new host is the expected destination then the itinerary is advanced and a request to move sent to the co-ordinator. Otherwise the aglet must have moved to relieve deadlock. When moving to relieve deadlock the request for transport is resent on arrival.

If the arriving aglet is the co-ordinator it will also release any previous context locks held and advise all aglets of it new location (along with a new proxy).

OnDispatch

Just dump a message to the console.

OnDisposing

If the aglet isn't the co-ordinator, send a message to the co-ordinator informing it to release any locks or resources held.

Ideally, for the case of the co-ordinator disposing, it will not attempt to dispose of itself until all other aglets have finished and disposed of themselves.

A better design may have been to have the co-ordinator call dispose via the proxy, hence ensuring that the message arrived and the resources were released.

Normal Messages

Activate: Send a request to the co-ordinator for transport to the first host in the itinerary.

CoProxy(newproxy): Received a new proxy for the co-ordinator. Update and requestTransport.

wakeup: A request from the co-ordinator to request transport again.

dialog: Special message that the user can fire from tahiti to get the aglet to request transport again. (Ideally this would actually happen due to a timeout)

Co-ordinator messages:

move(id,fromlocation,tolocation): The co-ordinator will add that aglet to a waiting queue for the destination context. It will then attempt to dispatch the first waiting aglet for that location. If it can get the lock, the aglet is dispatched, the new proxy stored in the slave list, and the old context lock released. The use of a queuing list helps ensure bounded waiting for the mutual exclusion requirement.

finished(id,fromlocation): an aglet has finished its itinerary and is disposing of itself. The co-ordinator will remove its proxy from the slave list and release any locks it held. After freeing the locks, the first waiting aglet (if there is one) for that location will be dispatched.

dialog: In addition to the behaviour described for standard aglets, the co-ordinator will inform all known aglets of its valid proxy.

Known Bugs/Issues

InterruptedException immediately after dispatching. Probable cause is the waitMessage method being cut short by a dispatch.

InvalidAgletException when the co-ordinator is moving, or when an aglet has disposed of itself and not managed to message the co-ordinator correctly.

Tahiti can often leave messages displayed on the host after the aglet has left; this can leave the impression that the aglet is still there. A check of the console trace will demonstrate that there was no overlap between the previous aglet leaving and the new one arriving. The screen glitch can be corrected by shrinking then expanding the host display with the >< button.

The standard aglet code and co-ordinator code both reside in the same Java class. Initially I deemed this to be a good design decision, as it would simplify the election of a new co-ordinator. On reflection after implementing the code I believe completely separating the co-ordinator would have been a wiser decision, mainly because it complicates the code for the standard slave aglets. Unfortunately, due to time constraints, a complete rebuild is implausible.

Additional Design Notes

Submitted after the due date, so please ignore for marks. They’re just here for completeness.

Firstly, two classes exist in the DBAglet.java file. I suspect this causes some problems when migrating the code to separate hosts. It would have been better to put each in a separate file and compile them separately.

Secondly, I realised there was a design flaw when it came to handling messages coming off the wire. As only the heap structures for each object are moved on transport, any running threads are lost. Hence outstanding messages that haven’t been processed are lost on dispatch.

To address this I should have either checked outstanding threads before dispatch or put each message into a buffer as soon as it came off the wire. This would prevent the coordinator from losing messages by dropping active threads. It would also help ensure fairness, as the messages could be processed in order of arrival (preventing the coordinator dominating the moves). Also, it could have removed the need for some of the hack solutions I created to fix the problems of lost messages.

I’m sure I discussed this problem/solution with a few people, but unfortunately I didn’t have time to implement it myself.