1. 04 Mar, 2015 1 commit
  2. 03 Mar, 2015 1 commit
  3. 14 Jan, 2015 1 commit
  4. 01 Oct, 2014 1 commit
  5. 31 Jul, 2013 1 commit
    • Major rename detection changes · d730d3f4
      After doing further profiling, I found that a lot of time was
      being spent attempting to insert hashes into the file hash
      signature when using the rolling hash because the rolling hash
      approach generates a hash per byte of the file instead of one
      per run/line of data.
      
      To optimize this, I decided to convert back to a run-based file
      signature algorithm which would be more like core Git.
      
      After changing this, a number of the existing tests started to
      fail.  In some cases, this appears to have been because the test
      was coded to be too specific to the particular results of the file
      similarity metric and in some cases there appear to have been bugs
      in the core rename detection code where only by the coincidence
      of the file similarity scoring were the expected results being
      generated.
      
      This renames all the variables in the core rename detection code
      to be more consistent and hopefully easier to follow which made it
      a bit easier to reason about the behavior of that code and fix the
      problems that I was seeing.  I think it's in better shape now.
      
      There are a couple of tests now that attempt to stress test the
      rename detection code and they are quite slow.  Most of the time
      is spent setting up the test data on disk and in the index.  When
      we roll out performance improvements for index insertion, it
      should also speed up these tests I hope.
      Russell Belfer committed
  6. 24 Jul, 2013 1 commit
  7. 15 May, 2013 1 commit
  8. 09 Mar, 2013 1 commit
    • Make tree iterator handle icase equivalence · e40f1c2d
      There is a serious bug in the previous tree iterator implementation.
      If case insensitivity resulted in member elements being equivalent
      to one another, and those member elements were trees, then the
      children of the colliding elements would be processed in sequence
      instead of in a single flattened list.  This meant that the tree
      iterator was not truly acting like a case-insensitive list.
      
      This completely reworks the tree iterator to manage lists with
      case insensitive equivalence classes and advance through the items
      in a unified manner in a single sorted frame.
      
      It is possible that at a future date we might want to update this
      to separate the case insensitive and case sensitive tree iterators
      so that the case sensitive one could be a minimal amount of code
      and the insensitive one would always know what it needed to do
      without checking flags.
      
      But there would be so much shared code between the two, that I'm
      not sure it that's a win.  For now, this gets what we need.
      
      More tests are needed, though.
      Russell Belfer committed
  9. 28 Feb, 2013 1 commit
  10. 27 Feb, 2013 2 commits
  11. 20 Feb, 2013 2 commits
    • Refine pluggable similarity API · 9bc8be3d
      This plugs in the three basic similarity strategies for handling
      whitespace via internal use of the pluggable API.  In so doing, I
      realized that the use of git_buf in the hashsig API was not needed
      and actually just made it harder to use, so I tweaked that API as
      well.
      
      Note that the similarity metric is still not hooked up in the
      find_similarity code - this is just setting out the function that
      will be used.
      Russell Belfer committed
    • Change similarity metric to sampled hashes · 5e5848eb
      This moves the similarity metric code out of buf_text and into a
      new file.  Also, this implements a different approach to similarity
      measurement based on a Rabin-Karp rolling hash where we only keep
      the top 100 and bottom 100 hashes.  In theory, that should be
      sufficient samples to given a fairly accurate measurement while
      limiting the amount of data we keep for file signatures no matter
      how large the file is.
      Russell Belfer committed