Wednesday, November 16, 2011

Darwinian_Coding: ( App_Services )


Introduction: why read this blog

Welcome:  Have you ever had (or made) enough time to thoroughly re-evaluate past project coding choices?  It can be a humbling although entertaining activity.  After nearly 30 years of video game programming, i realize that time-scales really affect how you see things, whether it's an 'immediate moment' choice of a variable name, a 'short-term' decision on a comment's level of detail, medium-term class diagramming/system planning, a long-term design document, or the after-shipping post-mortem analysis.  Since reviewing past projects revealed how much i couldn't see at those shorter time scales, i'm aiming this blog at the patterns that are easily missed.  The goal is to caution your design-considerations against my mistakes and to share some pros & cons of past solutions.  While veterans may find little that is new, i hope novices, or those currently developing similar systems, might at least find a spark for dialog on their choices.
Hopefully the phrase 'Darwinian Coding' evokes the notion that the most successful coding choices, particularly for long-term, large projects, are not always the most powerful/optimal but instead the most adaptable.  In my journey, the most important metric of code has been how it survives change.  Whether drifting design docs, shifting QA feedback, blame-ridden profiling results, OS/Bios/Driver/SDK/Compiler updates, or even new hardware/net-services, change can cost time & innovation.  Code-survival means less time re-writing, debugging, and integrating to adapt to those changes.
The bulk of common programming advice, such as Knuth's 'premature optimization is the root of all evil', 'profile-early/profile-often', or 'metrics-metrics-metrics', seems to boil down to the childhood warning against relying on 'assumptions.'  Ironically, as coders we are often forced to make assumptions in the interest of schedule-realities, whether from needing to delegate tasks to others, relying on the trusted advice of experts, the ease of cut'n'paste code, or sometimes a lack of other options.  Code survival relies on identifying those assumptions and finding the patterns in them that will pop up elsewhere.
Each blog-article will cover these 'assumption-patterns' given a problem's context (where we use this code), goals (what we need from the code), solutions (how we tried to solve it in the past), and a survivor (who has proven best over time).
The biggest survival sagas (aka changes in coding approaches) in the past involved:
  • App_Services ( Providing asynchronous processing capabilities ) ---entry below---
  • Reversible_Timelines ( How to build a rich-AI & physics world with rewind/fast-forward & network propagation )
  • RPG_Improvising_Rules ( Using Pen and Paper RPG flexibility in rigid Digital Simulations )
  • Knowledge_Representation ( Signs, symbols, relationships, inferences, remembering, forgetting, etc )
  • Energy_Propagator ( Propagating energy through matter, managing error metrics and logarithmic scales, suited for invisible RF/sound/thermal or imaginary 'energies' since we are too sensitive to aliasing/roughness/inaccuracies in the visible spectrum )
  • Behavior_Creation ( Using feedback loops to adjust behavior trees and mixing actions to generate complex behavior recipes )
  • Security_Islands ( Methods to protect Application-services, User-choices, and Simulation-events/thinking; means to secure/limit data changes )
  • Shape_Synthesis ( Constructing destructible and animatable shapes for rendering, physics, and general AI queries )
  • Possibility_Mapping ( How to synthesize new animations through constrained mixing of local animation areas )
  • Bit_Shipping ( Methods to combine transforming, compressing, encrypting, and transferring raw bits of known data stream types )
  • Cultural_Text ( How we manage dynamic paragraph layout, resolutions of mobile vs. many screens, caching of rendered words/sentences, rich font formatting, aliasing, and writing orientations )
  • Coding_Productivity ( What choices/habits have universally been of benefit, code-generators using scripts, including naming conventions, file/make organization )
  • Project_Productivity ( What hurt/helped projects to get finished on time, auto-generated documentation & in-game bug-tracking, expectations vs humility vs drive )
  • Self_Balancing_Metrics ( How to dynamically adjust analog sensory subsystems ( graphics, audio ) and discrete subsystems ( physics, AI simulating ) to balance quality vs interactivity (frame-rate) )

A visualizer that relied on App_Services to balance network streams (as i can't find a thread-specific image)
Application Services

Context: where we use this

Application services is a name for how your software engine provides services, such as rendering an image or loading a file, to the actual product.  App-services are usually library or system calls, often directly spawning threads or adding a task to a job-stealing pool.  Sometimes they use networking to contact another computer or network domain to make requests.  In all cases, this is how we harness all available CPUs and other processors, as well as network access, to deliver the best performance (at least the best under a given battery-budget/power-settings).
App Services provide background tasks such as filling or mixing audio buffers, file I/O, monitoring network messages, and decompressing assets.  In this context they can also provide immediate foreground services, such as balancing rendering loads, where there would be app services to mix geometry, search a list of text, count valid elements in an array, or walk a scene graph.

Goals: what we need

  • An API to schedule 'work', which is a computing 'service' plus some context parameters ( this means a list of what services are provided and what data they need to run; a sketch of such an API follows this list )
  • Work is scheduled with a time to start and an expected duration, which provide prioritization hints to help balance the processing power
  • Supports profiling individual work-items, and the overall idle vs active efficiency of the App-Services system
  • Can balance work load across available resources ( 1 to N CPUs or special processors )
  • Can route work to particular processors ( such as GPUs, PS3 SPUs, or a particular thread, as is needed for Direct3D calls or thread-unsafe libraries )
  • Changes to shared data are atomic ( no side effects if several services are all accessing the same variable )
  • Supports an event graph using triggers ( finishing or starting a work item can launch dependent work with appropriate timing )
  • Application can be frozen and stored to disk to be resumed later ( for anywhere application-saves, reproducing bugs,  and power-management issues )
  • Cross-platform & OS-friendly (can manage Dll / Dyld challenges)
  • Responds to the standard 'life-cycle' of software systems ( Create, Destroy, Pause, Resume, and Run/update modes ) to properly handle exiting when a fatal error occurs or recovering when a manageable error requires it.
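
To make the goals concrete, here is a minimal sketch of what such a scheduling API surface could look like.  Every name below ( Work_Request_t, Schedule_Work, On_Finish ) is illustrative, not the blog's actual interface:

//A hedged sketch of the API the goals imply, not the original code
typedef unsigned Msec_t;
typedef void (*Work_Func_t)( void *A, void *B, void *C, void *D );

struct Work_Request_t
{
  Work_Func_t service;              //which registered service to run
  void *A, *B, *C, *D;              //context parameters for that service
  Msec_t Desired__Start_msec;       //when the work should begin
  Msec_t Expected__Duration_msec;   //prioritization hint for load balancing
};

//Returns a handle usable for profiling queries, triggers, or cancellation
int  Schedule_Work( const Work_Request_t &request );
//Event graph: queue 'next' automatically when the work behind 'handle' finishes
void On_Finish( int handle, const Work_Request_t &next );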

Solutions: how we tried

Technique: Mega-Loop Cooperative-Yielding aka 'Melcoy'
In the 90s we started with a simple application-managed 'cooperative-multitasking system' that had a single 'Run' function which iterated over a linked-list of current work-items to provide the product the services it needed.  Each service function was written with a series of calls to Fn_yield() which pushed all the local variables to a pre-allocated stack that each work-item kept for the service it was performing.  Then the 'Run' function would switch to another function in the work-item list.  There it would restore those local variables and 'goto' to the previous Fn_yield label location to resume executing that code.  We used a lot of macros to declare local variables as part of a struct that was mem-copied quickly and let us have type-information for debugging.  It was a complex system but it did provide asynchronous activity at a time when all consumer machines had a single processor and stalls were common.
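
A minimal sketch of the yielding idea, using the well-known switch/__LINE__ coroutine trick; the real Melcoy macros mem-copied locals to a per-work-item stack instead, so treat this as an illustration rather than the original code:

//Locals live in a state struct, as Melcoy's macros also arranged
typedef struct { int resume_line; int i; } Service_State_t;

#define FN_BEGIN( st )  switch( (st)->resume_line ) { case 0:
#define FN_YIELD( st )  do { (st)->resume_line = __LINE__; return 0; \
                             case __LINE__: ; } while( 0 )
#define FN_END( st )    } (st)->resume_line = 0; return 1

//Returns 1 when finished, 0 each time it yields
int Service__Count_Valid( Service_State_t *st )
{
  FN_BEGIN( st );
  for( st->i = 0; st->i < 1000; ++st->i )
  {
    //...examine element st->i, one slice of work per pass...
    FN_YIELD( st );  //let the 'Run' loop switch to another work-item
  }
  FN_END( st );
}
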
Pros: Didn't have the task-switching overhead that other systems had at the time, and it gave us loading with audio and interactivity, which was novel at the time ( and later a design choice we dropped, given that audio and interactivity slowed down loading ).
Cons: Required 'Fn_yield()' calls throughout all the service code, which made it hard for other coders to write services that played well together.  Although it was easy to profile each function, it was very hard to maintain even performance, as Fn_yields could give radically different timings depending on the data they were using or whether I/O was involved.  Scaling to many simultaneous services required a lot of memory for the era in which this technique was used.  Frame-rates stuttered until all the yields were tuned.

Technique: Preemptive Service Manager aka 'Presm'
Inspired by how most operating systems handle pre-emptive multi-tasking, we used one low-priority thread to monitor one or more 'worker' threads, using 80x86 interrupts to halt execution and custom push/pop assembly to save and restore the existing registers on the stack.
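
That hand-rolled register save/restore is essentially what the POSIX ucontext calls provide today.  A rough stand-in for the idea ( cooperative here, where Presm used timer interrupts to force the switch ):

//Sketch of the register push/pop idea via POSIX ucontext ( Linux/Unix )
#include <ucontext.h>

static ucontext_t Main_ctx, Worker_ctx;
static char Worker_stack[ 64 * 1024 ];

static void Worker_Service( void )
{
  //...first slice of work...
  swapcontext( &Worker_ctx, &Main_ctx );  //save registers & stack, switch out
  //...resumes here later with registers/stack restored...
}

int main( void )
{
  getcontext( &Worker_ctx );
  Worker_ctx.uc_stack.ss_sp   = Worker_stack;
  Worker_ctx.uc_stack.ss_size = sizeof Worker_stack;
  Worker_ctx.uc_link          = &Main_ctx;  //return here when the worker ends
  makecontext( &Worker_ctx, Worker_Service, 0 );

  swapcontext( &Main_ctx, &Worker_ctx );  //run worker until it yields
  swapcontext( &Main_ctx, &Worker_ctx );  //resume it to completion
  return 0;
}
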
Pros: Learned a lot while coding this.
Cons: Complexity in coding.  Bugs resulting from the register push/pop choices.  Weird cases that were hard to reproduce.  No ability to save conditions to disk, as the implementation was address/register-specific and those didn't stay constant between executions.

Technique: Micro-Functions (hierarchy to scale) aka 'Mif'
Next we tried to break our services into 'micro-functions' that could be streamed together to form the larger 'service-oriented' functionality as needed.  Initially this gave us the ability to pause and resume any given service, since it was represented as a series of instructions, which effectively made the system a cluster of Virtual Machines.  To provide asynchronous behavior we would simply execute a series of micro-functions from one work-item's service, then switch to another, continuously round-robin.  We used a priority ring that kept work-items like mix-audio and update-user-input high up and file-I/O 'read-next-chunk' low down.
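
A bare-bones sketch of the round-robin dispatch, with invented names; the real system streamed VM-style instructions rather than raw function pointers:

//One step per work-item per pass, so no service starves the others
typedef int (*Micro_Func_t)( void *ctx );  //returns 0 while steps remain

struct Ring_Item_t
{
  Micro_Func_t step;    //current micro-function of this service
  void *ctx;            //its context data
  Ring_Item_t *Next_p;  //circular list forming the priority ring
};

void Ring__Run_Once( Ring_Item_t *ring )
{
  Ring_Item_t *it = ring;
  do
  {
    if( it->step )
      it->step( it->ctx );  //execute one small slice of a larger service
    it = it->Next_p;
  } while( it != ring );
}
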
Pros: Easy to debug in realtime (can edit these instructions) and simple to understand. Although it still used a single thread of execution, it was very responsive even with heavy I/O loads.
Cons: Although it was responsive in terms of consistent user-input affecting the world and screen, the background-task performance was awful for complex activities.  Iterating over 1,000s of entities in a scene, animating geometry, image-uploads, and polynomial solvers all took a long time to finish, which required fallbacks such as blurry visuals or stuttered animations.  More importantly, any atomic synchronizing of shared data caused frequent long-duration stalls due to contention.  Having such a small granularity of functionality prevented us from acquiring a shared variable just once: we had to acquire it ( for either read or write ), use it, and release it for each micro-function.  This cost added up quickly and even had the unfortunate result ( in 2008 ) of running faster on a dual-core than on a four-core Xeon of the same clock speed.

Technique: Specialized-Threads aka 'Spet'
In an attempt to honor the 'simple over clever' principle, we finally built a traditional 'specialized threads' approach.  This technique simply launched threads for services when they were needed and used a message-passing system to flag when each task was done.  Audio mixing, animation blending, collision detection, etc. each ran as its own simple thread executing one dedicated function.
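
In today's terms the pattern is only a few lines with std::thread; a hedged sketch, since the original used raw OS threads and a custom message-passing system rather than an atomic flag:

#include <thread>
#include <atomic>

std::atomic< bool > Audio_done( false );

void Audio_Mix_Service( void )
{
  //...fill & mix audio buffers until the app signals shutdown...
  Audio_done = true;  //crude stand-in for the 'task finished' message
}

int main( void )
{
  std::thread audio( Audio_Mix_Service );  //one dedicated thread per service
  std::thread collide( []{ /*...collision detection loop...*/ } );
  audio.join();
  collide.join();
  return 0;
}
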
Pros: Let the OS handle the schedule.  Code is simple and easy for external coders to understand and modify.  Simpler to graph profiling data and think about most services.  Integrated better with existing debugging tools than other approaches.
Cons: Given the specialized nature of each thread function, profiling and balancing became a lot of work.  Moving to a different hardware configuration could wreak havoc on timings.  OS thread calls didn't perform consistently, and timings between thread reschedulings could vary wildly.  Easy to understand but very hard to balance without rewriting the specialized functions.

Survivor: who proved best & why

Technique: Work-Scheduler aka 'Wos'
Motivated to avoid the costly tuning required to balance dedicated thread-functions and to scale performance up to the hyper-threaded P4 ( in 2003 ) and the inevitable dual-core ( 2005 ) machines, we tried running generic threads whose main 'run' function pulled work out of a queue to provide dedicated services.  This was an evolution of the Mif approach described above, made for 3 or more threads.  We used a red-black tree to sort the timings of tasks and an 'association' hash to characterize each work-item.  Each thread had its own local work-queue which it would pull work-items from and potentially add back into.  After reaching a threshold number of work-items completed or added, the per-thread work-queue is merged back into a system-wide common 'work-queue', eliminating contentious stalls over the common queue.  There is a singleton Work_Scheduler that owns the common queue as well as the per-thread queues and is responsible for signaling threads to be notified of application-life-cycle changes such as pause, resume, shutdown, or error.  Wos works on 'service-functions' that can be large and take a lot of time, or very small and fast to execute.  Key differences from the Mif micro-function-only approach are the profiling and estimating of a finishing time for each service, to better manage hiccups in the schedule, and a minimum of 3 threads, with some services only executing on a prime 'user-input/OS-message' thread while the other two allow long-term background services to run uninterrupted.
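
The heart of the contention fix is the threshold-merge between each per-thread queue and the common queue.  A simplified sketch with invented names, using std::multimap as the red-black tree keyed by desired start time ( the real Wos also tracks the 'association' hash and profiling data ):

#include <functional>
#include <map>
#include <mutex>

typedef unsigned Msec_t;
struct Work_Item_t { std::function< void() > func; };

//Red-black tree keyed by desired start time, as in Wos
typedef std::multimap< Msec_t, Work_Item_t > Work_Queue_t;

std::mutex   Common_lock;
Work_Queue_t Common_queue;

//Each worker thread drains its private queue and only touches the common
//queue after 'merge_threshold' items, keeping contention over the lock rare.
void Worker__Run( Work_Queue_t &local, unsigned merge_threshold )
{
  unsigned handled = 0;
  while( !local.empty() && handled < merge_threshold )
  {
    Work_Item_t item = local.begin()->second;
    local.erase( local.begin() );
    item.func();  //may insert follow-up work back into 'local'
    ++handled;
  }
  std::lock_guard< std::mutex > hold( Common_lock );
  Common_queue.insert( local.begin(), local.end() );  //merge leftovers back
  local.clear();
  //...then pull a fresh batch from Common_queue before returning...
}
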
Pros: Has enough information at run-time to self-balance performance.  Simple to use and easy to view profiling information.  Although untested on more than 64 cores at this time, it has shown steady performance increases with more cores, and it can scale down to use less processing to save mobile battery or play nicer with other apps.
Cons: Overhead from translating the service-function data into actual local variables to use.  Atomic access to shared memory impacts the performance of 'small-services.'

Future: For many years this approach has served us well, mapping nicely to mobile ( iOS & Android ) and allowing us to take advantage of the many cores now found in workstations.  Future cooperation with different processors such as GPUs, especially ones able to issue their own tasks from within OpenCL-type code, may require big changes, or it may fit in with the existing Work_Scheduler as a new block of Work_Queues.  If hardware progresses to allow even cheaper profiling and a safe capability to generate CPU instructions tuned to the current task, we could see a new iteration of this Work_Scheduler pattern that builds the actual code for the service implementations based on its context parameters and its past performance.  Given faster local network connections, perhaps the Work_Scheduler could be extended to local clusters, the way super-computers operate.  In all cases, it will be interesting to see how this survivor-solution fares in the next 8 years.

Survivor_Datastructs

//===
//STRUCTS
//Using C & pseudo-Container-templates
//where
// Usec_t is microseconds stored as a 'double'
// Msec_t is milliseconds stored as an unsigned 32-bit int
// Array_Ref is an array container class that provides fast access and linear memory
// List_Ref is a singly-linked list class used for building 'trigger' lists
// Dict_Entry__Ref is a dictionary entry class that connects to a searchable dictionary container class
// Context_Param_t is a union of various data types and a member that indicates what type is stored & basic constraints.

//===
struct Work_Item_t
{
  //---
  //External Ref
  Dict_Entry__Ref< Profile_t > Profile_p;
  List_Ref< Work_Item_t > On_Start__Worktm__Next_p;
  List_Ref< Work_Item_t > On_Finish__Worktm__Head_p;

  //---
  //Locally Aligned Members
  Work_Category_t category;
  Work_Func_t func;

  Context_Param_t A_param;
  Context_Param_t B_param;
  Context_Param_t C_param;
  Context_Param_t D_param;

  Msec_t Desired__Start_msec;
  Msec_t Expected__Duration_msec;
  Msec_t Required__Finish_msec;
};

//===
struct Work_Queue_t
{
  //---
  //External Ref
  Dict_Entry__Ref< Profile_t > Profile_p;

  //---
  //Locally Aligned Members
  AtomicToken64_t Item_access;
  //used to atomically grant exclusive write or shared read access to this struct

};

//===
struct Work_Schedule_t
{
  //---
  //External Ref
  Dict_Entry__Ref< Profile_t > Profile_p;
  Array_Ref< Work_Queue_t > Local__Work_Queues__array;
  Work_Queue_t Common__Work_queue;

  //---
  //Locally Aligned Members
  uint32_t Work_Items__Total_u;

  //Simple Schedule Assessment of work activity vs idle time
  Usec_t Idle_Work__usec;
  Usec_t Active_Work__usec;
  float_t Efficiency_Ratio_f;

  AtomicToken64_t Array_access;
  //used to atomically grant exclusive write or shared read access to this struct
};
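
A hypothetical usage of these structs, to show the intended flow.  The helper calls ( Context_Param__From_Ptr, List__Push, Work_Queue__Add, Clock__Now_msec ) and enum/function values are assumptions for illustration, not part of the original listing:

//===
//Queue a decompress task that triggers an upload task when it finishes
Work_Item_t decompress;
decompress.category = Work_Category__Background;   //assumed enum value
decompress.func     = Service__Decompress_Asset;   //assumed service function
decompress.A_param  = Context_Param__From_Ptr( asset_blob );
decompress.Desired__Start_msec     = Clock__Now_msec();
decompress.Expected__Duration_msec = 12;   //hint taken from past profiling
decompress.Required__Finish_msec   = decompress.Desired__Start_msec + 100;

Work_Item_t upload;  //...filled in the same way...
//Trigger graph: queue 'upload' when 'decompress' finishes
List__Push( decompress.On_Finish__Worktm__Head_p, &upload );

Work_Queue__Add( My_Schedule.Common__Work_queue, &decompress );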
