- Inlined Array::operator[] and removed bounds checking

- Changed compilation settings to -O3 (maximal optimisation) and added the
  compiler flags -march=native (gcc 4.4+), which enables optimisations
  specific to the architecture of the machine the code is compiled on,
  including vector/SIMD instructions, and -ffast-math, which rewrites
  expressions such as x/y as x*(1/y) when they appear repeatedly. Note that
  this changes the results (x/y != x*(1/y) in general), but the differences
  are tiny.
  
+++ So far, 3.5-4x speed up on test data

- Replaced some loops setting data to 0 and copying data with memset/memcpy
  operations (small improvement)
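As a sketch of the kind of replacement made (the real loops live in the project code; these helper names are invented), zeroing and copying float buffers with memset/memcpy:

```cpp
#include <cstring>

// Hypothetical helpers illustrating the change; names are not from the code.
// Note: memset with 0 is only correct for zeroing floats because IEEE-754
// defines the all-zero bit pattern as 0.0f.
void zero_buffer(float *p, int n) { std::memset(p, 0, n * sizeof(float)); }
void copy_buffer(float *dst, const float *src, int n) {
    std::memcpy(dst, src, n * sizeof(float));
}
```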
  
- Implemented a caching system using either an ordered std::map or the less
  standard std::tr1::unordered_map (available in gcc 4.0+). The class
  CacheKeyType is used as the key for this associative container: it
  computes a reduced mask vector (packing bools into bits rather than ints),
  which has size nDims/32+1 on 32 bit machines (nDims/64+1 on 64 bit), and
  also a hash of this mask vector, using std::tr1::hash. The reduced mask
  vector is used to define operator< and operator==, which are needed by map
  and unordered_map respectively; unordered_map additionally requires a
  hashing class, which is wrapped by hash_CacheKeyType. Lookup in map is
  O(log n); in unordered_map it is O(1) in the usual case and O(n) in the
  worst case, if the hash function has many collisions (very unlikely). The
  class CachedSubData stores the data to be cached, namely SubCov and its
  Cholesky decomposition SubChol. The map type is typedef'd to the name
  CacheType.
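A minimal sketch of the key idea (using the C++11 std::unordered_map rather than the tr1 version, and with invented names; the real CacheKeyType also precomputes and stores the hash alongside the reduced mask):

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

// Sketch only: pack an int mask vector (0/1 per dimension) into bits,
// giving a key of nDims/32+1 words instead of nDims ints.
struct MaskKey {
    std::vector<uint32_t> bits;
    explicit MaskKey(const std::vector<int> &mask)
        : bits(mask.size() / 32 + 1, 0) {
        for (size_t i = 0; i < mask.size(); ++i)
            if (mask[i]) bits[i / 32] |= uint32_t(1) << (i % 32);
    }
    bool operator==(const MaskKey &o) const { return bits == o.bits; }
};

// unordered_map needs a hashing class; here word hashes are combined crudely.
struct MaskKeyHash {
    size_t operator()(const MaskKey &k) const {
        size_t h = 0;
        for (size_t i = 0; i < k.bits.size(); ++i)
            h = h * 31 + std::hash<uint32_t>()(k.bits[i]);
        return h;
    }
};
```

The cache would then be something like std::unordered_map<MaskKey, CachedSubData, MaskKeyHash>.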
  
+++ 1.3x speed up on test data, total of about 5x. However, 95% of the time we
    are using cached values rather than computing them, so I feel there is
    more to be gained from this optimisation. We should probably address this
    by reordering the data so we don't need to cache.
    
- Removed the Array class entirely, replacing it with std::vector. No
  performance improvement, but the code is cleaner and simpler, especially as
  vector and Array were more or less identical (and vector does more).
  
- Restructured the code into several files; the structure may change again
  when we think about adding GPU support, but for the moment it is more
  manageable.

- Put a mechanism in place to use float or double as the underlying data
  type. At the moment double doesn't work, I think because the data files
  store float rather than double, so the two cases need to be handled
  differently.
  
- Added a new SafeArray class which can have bounds checking enabled or
  disabled: it adds some safety without slowing the program down when bounds
  checking is disabled (and without slowing it down too much when it's
  enabled). Unfortunately, the code is not quite as clean as it was.

- Added precomputation of unmasked indices in a sparse vector structure
  (Unmasked, UnmaskedInd), and a function ComputeUnmasked() that generates
  this from the Masks array. Updated FindSubs to use it, and FindSubs now
  runs a little faster (not much, because comparatively little time was
  spent there, but the improvement will matter more once the other parts
  take less time).
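The sparse structure can be sketched as a CSR-style pair of vectors; the exact layout of Unmasked/UnmaskedInd is an assumption here. Unmasked concatenates, for each point, the indices of its unmasked dimensions, and UnmaskedInd gives each point's starting offset.

```cpp
#include <vector>

// Sketch of the precomputation, assuming Masks is stored flat as
// nPoints x nDims with nonzero meaning "unmasked". UnmaskedInd has
// nPoints+1 entries, so point p's unmasked dimensions are
// Unmasked[UnmaskedInd[p] .. UnmaskedInd[p+1]-1].
void ComputeUnmaskedSketch(const std::vector<int> &Masks, int nPoints,
                           int nDims, std::vector<int> &Unmasked,
                           std::vector<int> &UnmaskedInd) {
    Unmasked.clear();
    UnmaskedInd.assign(1, 0);
    for (int p = 0; p < nPoints; ++p) {
        for (int d = 0; d < nDims; ++d)
            if (Masks[p * nDims + d]) Unmasked.push_back(d);
        UnmaskedInd.push_back((int)Unmasked.size());
    }
}
```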
  
- Added sorting of points into mask order, so that matrices are only
  recomputed when necessary; the old cache has been removed. This requires
  almost no changes to the rest of the program, because the data itself is
  not sorted: instead we loop through the indices in the SortedIndices
  vector, and the rest of the code (which doesn't do this) is unchanged.
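A sketch of the sorting step (the details are assumed, not taken from the code): sort point indices lexicographically by mask row, so points with identical masks become contiguous, without moving the data itself.

```cpp
#include <algorithm>
#include <vector>

// Sketch: produce a SortedIndices-style vector such that iterating points
// in this order visits identical masks consecutively. The data array itself
// is untouched; only the iteration order changes.
std::vector<int> ComputeSortIndicesSketch(const std::vector<int> &Masks,
                                          int nPoints, int nDims) {
    std::vector<int> idx(nPoints);
    for (int p = 0; p < nPoints; ++p) idx[p] = p;
    std::sort(idx.begin(), idx.end(), [&](int a, int b) {
        return std::lexicographical_compare(
            Masks.begin() + a * nDims, Masks.begin() + (a + 1) * nDims,
            Masks.begin() + b * nDims, Masks.begin() + (b + 1) * nDims);
    });
    return idx;
}
```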

+++ Speed up is now 5.5x over original program. Profiling now shows 65% spent
    on MStep, 25% inside EStep, 9% on Cholesky, 4% on TriSolve and 2% on
    FindSubs. So, optimising MStep is definitely the way to go now, as well
    as implementing masks (which will itself optimise it considerably).

- Added precomputation of mask change points in sorted order. Note that we
  need to be quite careful about the order in which ComputeUnmasked(),
  ComputeSortIndices() and ComputeSortedUnmaskedChangePoints() are called
  when we initialise a KK structure.
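The change-point precomputation can be sketched as follows (names and data layout are assumptions): walk the points in sorted order and record each position where the mask differs from the previous one, so matrices need recomputing only at those positions.

```cpp
#include <vector>

// Sketch: given Masks (flat nPoints x nDims) and a mask-sorted index
// vector, return the positions in the sorted order at which the mask
// changes; position 0 is always a change point.
std::vector<int> ChangePointsSketch(const std::vector<int> &Masks,
                                    const std::vector<int> &SortedIndices,
                                    int nDims) {
    std::vector<int> cp;
    for (int i = 0; i < (int)SortedIndices.size(); ++i) {
        if (i == 0) { cp.push_back(0); continue; }
        const int *cur = &Masks[SortedIndices[i] * nDims];
        const int *prev = &Masks[SortedIndices[i - 1] * nDims];
        for (int d = 0; d < nDims; ++d)
            if (cur[d] != prev[d]) { cp.push_back(i); break; }
    }
    return cp;
}
```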

+++ Speed up is now 6x over original program. Profiling shows 65% spent on
    MStep, 17% on EStep, 10% on Cholesky, 4% on TriSolve and 2% on FindSubs.
    (This looks weirdly inconsistent with previous result, but it's what comes
    out of the profiler...)

- Refactored KK, not adding a constructor because it would involve too much
  rewriting, but adding a DoPrecomputations() which should be called after
  the Data or Masks have changed. I will also add some sanity checks on the
  data there to catch programming errors caused by forgetting to call this
  function.
  
- Reordered the covariance matrix computation in the M step to use blocks
  which fit into the L1d cache. The exact block size is processor dependent
  and needs to be tested on a range of machines, as it can change the speed
  dramatically. It might be even faster to recast the computation as a
  series of matrix multiply-and-add operations and use an optimised BLAS to
  do it (although this adds heavy-duty compiler dependencies and makes
  installation more difficult for the user). Note that the results are no
  longer exactly the same, because blocking slightly changes the order of
  computations; they are very close, though (and it's not that the previous
  results were right and these are wrong: both are slightly wrong in
  different ways).
  
+++ Speed up is now 7x over original program. Profiling now shows 54% on MStep,
    24% on EStep, 11% on Cholesky, 5% on TriSolve, 2% on FindSubs and 1% on
    SafeArray<int>::SafeArray.
    
- Commented code to explain it a bit more, moved FindSubs into EStep, which
  simplifies the code a bit.
  
- Do not compute all of Vec2Mean in EStep, only SubVec2Mean, which saves a lot
  of work.

+++ Speed up is now 8x over original program. Profiling shows:
    59% MStep
    16% EStep
    15% Cholesky
     7% TriSolve

- Added constructors (from filename or subset of existing object), and these
  are now obligatory.

- Removed MaskFileBase, MaskElecNo from parameters, the values of FileBase and
  ElecNo are now used.

- Parameters and default values are only defined once now in parameters.h, and
  thanks to some preprocessor magic the other references to them are filled in
  automatically (see parameters.h for explanation). Now when adding params we
  only need to use this one file.
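One standard way to achieve this single-definition behaviour is the X-macro pattern; whether parameters.h uses exactly this mechanism is an assumption (and these parameter names are invented), but the sketch shows the principle: the parameter list is written once and expanded several times.

```cpp
// Sketch of the X-macro idea with hypothetical parameters: each expansion
// of PARAMETER_LIST fills in a different piece of boilerplate.
#define PARAMETER_LIST(X) \
    X(int, MaxIter, 100)  \
    X(double, PenaltyMix, 0.0)

// Expansion 1: declare the variables with their default values.
#define DECLARE_PARAM(type, name, deflt) type name = deflt;
struct Params { PARAMETER_LIST(DECLARE_PARAM) };

// Expansion 2: count the parameters (further expansions could generate
// help strings, command-line parsing, etc.).
#define COUNT_PARAM(type, name, deflt) +1
const int nParams = 0 PARAMETER_LIST(COUNT_PARAM);
```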

- Double precision now works (enable flag USE_DOUBLE_PRECISION in
  globalswitches.h).
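The underlying mechanism is presumably a conditional typedef along these lines (the name `scalar` is an assumption; only the USE_DOUBLE_PRECISION flag name comes from these notes):

```cpp
// Sketch: select the underlying data type once, globalswitches.h style.
// #define USE_DOUBLE_PRECISION  // uncomment to enable double precision
#ifdef USE_DOUBLE_PRECISION
typedef double scalar;
#else
typedef float scalar;
#endif
```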

- Output .sorted.* files.

- Updated parameters so that they have a one line documentation string that
  is shown to the user (and also documents the parameter in the source
  file).
  
- Bug in CatonM fixed.

- First version of using masks in M steps implemented, but no testing has
  been done yet to see if it works (Shabnam will work on that). The new
  behaviour is activated by defining USE_MASKED_MSTEP in globalswitches.h
  (currently defined by default).
  
PLANNED CHANGES

- Cut CatonM down to just the parts we need, implement a better interpolation
  method and fix the various other issues, including weighted mean of peak
  channels.
- Optimise writing of files?
- Remove Reindex() by computing it when necessary (it's a short computation)
- Backwards compatibility, if no mask file present use all 1s
- GPU implementation

NOTES

- New MKK produced the error "split should only produce 2 clusters", which
  only happens if points are moved to the noise cluster by CEM in TrySplits;
  added a flag to prevent this from happening.