Orleans v3.4.0-rc1 Release Notes

Release Date: 2020-12-10
  • Improved resiliency during severe performance degradation

    This release includes improvements to the cluster membership algorithm which are opt-in in this initial release. These changes are aimed at improving the accuracy of cluster membership when some or all nodes are in a degraded state. Details follow.

    Perform self-health checks before suspecting other nodes (#6745)

    This PR implements some of the ideas from Lifeguard (paper, talk, blog) which can help during times of catastrophe, where a large portion of a cluster is in a state of partial failure. One cause of these kinds of partial failures is large-scale thread pool starvation, which can cause a node to run slowly enough that it does not process messages in a timely manner. Slow nodes can therefore suspect healthy nodes simply because the slow node is not able to process the healthy node's timely response. If a sufficiently large proportion of nodes in a cluster are slow (e.g., due to an application bug), then healthy nodes may have trouble joining and remaining in the cluster, since the slow nodes can evict them. In this scenario, slow nodes will also be evicting each other. The intention is to improve cluster stability in these scenarios.

    This PR introduces LocalSiloHealthMonitor which uses heuristics to score the local silo's health. A low score (0) represents a healthy node and a high score (1 to 8) represents an unhealthy node.

    LocalSiloHealthMonitor implements the following heuristics:

    • Check that this silo is marked as Active in membership
    • Check that no other silo suspects this silo
    • Check for recently received successful ping responses
    • Check for recently received ping requests
    • Check that the .NET Thread Pool is able to execute work items within 1 second from enqueue time
    • Check that local async timers have been firing on-time (within 3 seconds of their due time)

    Failing heuristics contribute to increased probe timeouts, which has two effects:

    • Improves the chance of a successful probe to a healthy node
    • Increases the time taken for an unhealthy node to vote a healthy node dead, giving the cluster a larger chance of voting the unhealthy node dead first (nodes marked as dead are pacified and cannot vote against others)

    The effects of this feature are disabled by default in this release, with only passive background monitoring being enabled. The extended probe timeouts feature can be enabled by setting ClusterMembershipOptions.ExtendProbeTimeoutDuringDegradation to true. The passive background monitoring period can be configured by changing ClusterMembershipOptions.LocalHealthDegradationMonitoringPeriod from its default value of 10 seconds.
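
    As a minimal sketch, the two options named above might be set through the standard Orleans options pattern. The helper class and method names are hypothetical, the 5-second period is an arbitrary illustrative value, and the TimeSpan type of the period option is an assumption:

    ```csharp
    using System;
    using Orleans.Configuration;
    using Orleans.Hosting;

    public static class DegradationConfiguration
    {
        // Hypothetical helper; call it on the ISiloBuilder passed to UseOrleans(...).
        public static ISiloBuilder EnableExtendedProbeTimeouts(this ISiloBuilder siloBuilder)
        {
            return siloBuilder.Configure<ClusterMembershipOptions>(options =>
            {
                // Opt in: failing self-health checks lengthen this silo's probe timeouts.
                options.ExtendProbeTimeoutDuringDegradation = true;

                // Illustrative value only; the release notes give the default as 10 seconds.
                options.LocalHealthDegradationMonitoringPeriod = TimeSpan.FromSeconds(5);
            });
        }
    }
    ```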

    Probe silos indirectly before submitting a vote (#6800)

    This PR adds support for indirectly pinging silos before suspecting/declaring them dead.
    When a silo is one missed probe away from being voted dead, the monitoring silo switches to indirect pings. In this mode, the silo picks a random other silo and sends it a request to probe the target silo. If that silo responds promptly with a negative acknowledgement (after waiting for a specified timeout), then the target silo will be suspected/declared dead.

    Additionally, when the vote limit to declare a silo dead is 2 silos, a negative acknowledgement counts for both required votes and the silo is unilaterally declared dead.

    The feature is disabled by default in this release (only direct probes are used), but it could be enabled by default in a later release. Users can enable it now by setting ClusterMembershipOptions.EnableIndirectProbes to true.
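
    A minimal sketch of opting in to indirect probes, assuming the usual Configure<TOptions> extension on ISiloBuilder (the helper class and method names are hypothetical):

    ```csharp
    using Orleans.Configuration;
    using Orleans.Hosting;

    public static class IndirectProbeConfiguration
    {
        // Hypothetical helper; call it on the ISiloBuilder passed to UseOrleans(...).
        public static ISiloBuilder EnableIndirectProbing(this ISiloBuilder siloBuilder)
        {
            return siloBuilder.Configure<ClusterMembershipOptions>(options =>
            {
                // Ask an intermediary silo to probe the target before casting a vote.
                options.EnableIndirectProbes = true;
            });
        }
    }
    ```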

    Improvements and bug fixes since 3.3.0

    • Non-breaking improvements
      • Probe silos indirectly before submitting a vote (#6800) (#6839)
      • Perform self-health checks before suspecting other nodes (#6745) (#6836)
      • Add IManagementGrain.GetActivationAddress() (#6816) (#6827)
      • In GrainId.ToString(), display the grain type name and format the key properly (#6774)
    • Non-breaking bug fixes
      • Avoid race for stateless worker grains with activation limit #6795 (#6796) (#6803)
      • Fix bad merge of GrainInterfaceMap (#6767)
      • Make Activation Data AsyncDisposable (#6761)

Previous changes from v3.3.0

  • Improved diagnostics for long running, delayed, and blocked requests

    This release includes improvements to give developers additional context when a request does not return promptly. PR #6672 added these improvements. Orleans will periodically probe active grains to inspect their message queues and send status updates for certain requests which have been enqueued or executing for too long. These status messages will appear as warnings in the logs and will also be included in exceptions when a request timeout occurs. The information included can help a developer identify what the grain is doing at the time of the request. For example, it shows which messages are enqueued ahead of this message, which messages are executing and for how long, how long this message has been enqueued, and the status of the grain's TaskScheduler.

    Microsoft.Orleans.Hosting.Kubernetes NuGet package (3.3.0-beta1) for tighter integration with Kubernetes

    This release adds a new pre-release package, Microsoft.Orleans.Hosting.Kubernetes, which adds richer integration for users hosting on Kubernetes. The package assists users by monitoring Kubernetes for silo pods and reflecting changes in cluster membership. For example, when a Pod is deleted, it is immediately removed from Orleans' membership. In addition, the package configures EndpointOptions and ClusterOptions to match the Pod's environment. Documentation and a sample project are expected in the coming weeks, and in the meantime, please see the original PR for more information: #6707.

    Improvements and bug fixes since 3.2.0

    Potentially breaking change

    • Added 'RecordExists' flag to persistent store so that grains can det… (#6580)
      (Implementations of IStorage<TState> and IGrainState need to be updated to add a RecordExists property.)

    Non-breaking improvements

    • Use "static" client observer to notify from the gateway when the silo is shutting down (#6613)
    • More graceful termination of network connections (#6557) (#6625)
    • Use TaskCompletionSource.RunContinuationsAsynchronously (#6573)
    • Observe discarded ping task results (#6577)
    • Constrain work done under a lock in BatchWorker (#6586)
    • Support deterministic builds with CodeGenerator (#6592)
    • Fix some xUnit test discovery issues (#6584)
    • Delete old Joining records as part of cleanup of defunct entries (#6601, #6624)
    • Propagate transaction exceptions in more cases (#6615)
    • SocketConnectionListener: allow address reuse (#6653)
    • Improve ClusterClient disposal (#6583)
    • AAD authentication for Azure providers (blob, queue & table) (#6648)
    • Make delay after gw shutdown notification configurable (#6679)
    • Tweak shutdown completion signalling (#6685) (#6696)
    • Close some kinds of misbehaving connections during shutdown (#6684) (#6695)
    • Send status messages for long-running and blocked requests (#6672) (#6694)
    • Kubernetes hosting integration (#6707) (#6721)
    • Reduce log noise (#6705)
    • Upgrade AWS dependencies to their latest versions (#6723)

    Non-breaking bug fixes

    • Fix SequenceNumber for MemoryStream (#6622) (#6623)
    • When activation is stuck, make sure to unregister from the directory before forwarding messages (#6593)
    • Fix call pattern that throws. (#6626)
    • Avoid NullReferenceException in Message.TargetAddress (#6635)
    • Fix unobserved ArgumentOutOfRangeException from Task.Delay (#6640)
    • Fix bad merge (#6656)
    • Avoid race in GatewaySender.Send (#6655)
    • Ensure that only one instance of IncomingRequestMonitor is created (#6714)