I am going to give you my quick and inefficient implementation of an algorithm that measures the repetitiveness of a text: a text with many repeated phrases receives a higher score than a text with no repeated phrases.
Method
This method was described in two papers by the same author:
- Cache-based Online Adaptation for Machine Translation Enhanced Computer Assisted Translation, section 4.2
- Online adaptation to post-edits for phrase-based statistical machine translation, section 3.2
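In short (my paraphrase, using notation that matches the implementation below rather than the papers' exact formulation): for each n-gram order n from 1 to 4, slide a fixed-size window over the text, and in each window count the distinct n-grams and how many of them occur only once. The repetition rate is the geometric mean, over the four orders, of the fraction of n-gram types that repeat:

```latex
RR = \left( \prod_{n=1}^{4} \frac{\sum_{w} \bigl(V_n(w) - V_{n,1}(w)\bigr)}{\sum_{w} V_n(w)} \right)^{1/4}
```

where V_n(w) is the number of distinct n-grams in window w and V_{n,1}(w) is the number of those that occur exactly once in w.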
Implementation
Here is a quick and hopefully correct, but very inefficient, implementation of the algorithm in Python.
    import sys
    from collections import defaultdict


    def ngrams(tokens, n):
        # Yield all n-grams of `tokens` as a sliding list of length n.
        # Note: the same list object is reused between yields, so each
        # n-gram must be consumed before the generator is advanced.
        ngram = []
        for token in tokens:
            if len(ngram) < n:
                ngram.append(token)
            else:
                yield ngram
                ngram.pop(0)
                ngram.append(token)
        if len(ngram) == n:
            yield ngram


    def count_ngrams(tokens, n):
        # Map each n-gram (joined into a single string) to its frequency.
        counts = defaultdict(int)
        for ngram in ngrams(tokens, n):
            counts[' '.join(ngram)] += 1
        return counts


    def rr(tokens, max_n=4, window_size=1000):
        # Repetition rate: the geometric mean, over n-gram orders
        # 1..max_n, of the fraction of non-singleton n-gram types,
        # accumulated over all sliding windows of window_size tokens.
        if len(tokens) < max_n or len(tokens) < window_size:
            raise ValueError('Too few tokens, change window_size or max_n')
        result = 1.0
        for n in range(1, max_n + 1):
            numerator = 0.0
            denominator = 0.0
            for window in ngrams(tokens, window_size):
                ngram_counts = count_ngrams(window, n)
                singletons = [g for g, count in ngram_counts.items() if count == 1]
                numerator += len(ngram_counts) - len(singletons)
                denominator += len(ngram_counts)
            result *= numerator / denominator
        return pow(result, 1.0 / max_n)


    if __name__ == '__main__':
        tokens = sys.stdin.read().lower().split()
        print('tokens:', len(tokens), 'types:', len(set(tokens)))
        print('RR:', rr(tokens))
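To see the measure behave at its extremes, here is a small sanity check. The toy inputs and the reduced window size are mine, and the functions are repeated from the script above (in slightly simplified form) so the snippet runs on its own:

```python
from collections import defaultdict

def ngrams(tokens, n):
    # All n-grams of `tokens` as overlapping slices.
    return (tokens[i:i + n] for i in range(len(tokens) - n + 1))

def count_ngrams(tokens, n):
    counts = defaultdict(int)
    for ngram in ngrams(tokens, n):
        counts[' '.join(ngram)] += 1
    return counts

def rr(tokens, max_n=4, window_size=1000):
    result = 1.0
    for n in range(1, max_n + 1):
        numerator = denominator = 0.0
        for window in ngrams(tokens, window_size):
            counts = count_ngrams(window, n)
            singletons = sum(1 for c in counts.values() if c == 1)
            numerator += len(counts) - singletons
            denominator += len(counts)
        result *= numerator / denominator
    return pow(result, 1.0 / max_n)

# A maximally repetitive text: every n-gram type recurs in every window.
print(rr(['the', 'cat', 'sat'] * 100, window_size=50))   # -> 1.0
# A text with no repetition at all: every n-gram is a singleton.
print(rr([str(i) for i in range(300)], window_size=50))  # -> 0.0
```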
Usage
The input document should be tokenized and passed on standard input; the script lowercases it for you.
The recommended sliding window size is 1000,
so the input document should contain at least that many tokens.
$ pypy textrr.py < document.tok
tokens: 8957 types: 867
RR: 0.236746159266
A small test document of about 500 lines, 9000 tokens, and 50 KB is processed in about 10 seconds with PyPy and 30 seconds with Python 2.7.
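Most of that time goes into recounting every window from scratch. If speed matters, the counts can instead be updated incrementally as the window slides, adjusting the number of n-gram types and singletons whenever an n-gram enters or leaves. This optimization is my own idea, not from the papers, but it computes the same quantity as the rr function above:

```python
from collections import defaultdict

def rr_fast(tokens, max_n=4, window_size=1000):
    # Same repetition rate as rr(), but each window is derived from the
    # previous one with two count updates instead of a full recount.
    if len(tokens) < max_n or len(tokens) < window_size:
        raise ValueError('Too few tokens, change window_size or max_n')
    result = 1.0
    for n in range(1, max_n + 1):
        # Pre-join all n-grams once; grams[i] covers tokens[i:i+n].
        grams = [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        per_window = window_size - n + 1  # n-grams per window
        counts = defaultdict(int)
        types = singletons = 0

        def add(g):
            nonlocal types, singletons
            counts[g] += 1
            if counts[g] == 1:
                types += 1
                singletons += 1
            elif counts[g] == 2:
                singletons -= 1

        def drop(g):
            nonlocal types, singletons
            counts[g] -= 1
            if counts[g] == 0:
                types -= 1
                singletons -= 1
            elif counts[g] == 1:
                singletons += 1

        for j in range(per_window):  # count the first window in full
            add(grams[j])
        numerator = types - singletons
        denominator = types
        for i in range(1, len(tokens) - window_size + 1):  # slide by one token
            drop(grams[i - 1])
            add(grams[i - 1 + per_window])
            numerator += types - singletons
            denominator += types
        result *= numerator / denominator
    return pow(result, 1.0 / max_n)
```

On the same endpoint cases as before, a fully repetitive token stream scores 1.0 and a stream of all-distinct tokens scores 0.0.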
Final remarks
Hopefully my understanding and implementation of the algorithm are correct. If not, please let me know.