In this post I share my quick and inefficient implementation of an algorithm that measures the repetitiveness of a text. A text with many repeating phrases will score higher than a text with no repeating phrases.

## Method

This method is described in two papers by the same authors:

- Cache-based Online Adaptation for Machine Translation Enhanced Computer Assisted Translation, section 4.2
- Online adaptation to post-edits for phrase-based statistical machine translation, section 3.2
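
As I read the papers, the repetition rate is the geometric mean, over n-gram orders $n = 1 \dots 4$, of the fraction of n-gram types that occur more than once, aggregated over all sliding windows $S$ of the text:

$$
\mathrm{RR} = \left( \prod_{n=1}^{4} \frac{\sum_{S} \bigl( V(n) - V(n,1) \bigr)}{\sum_{S} V(n)} \right)^{1/4}
$$

where $V(n)$ is the number of distinct n-grams in window $S$ and $V(n,1)$ is the number of those that occur exactly once (singletons).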

## Implementation

Here is a quick and hopefully correct, but very inefficient, implementation of the algorithm in Python.

```python
import sys
from collections import defaultdict


def ngrams(tokens, n):
    """Yield every n-gram of `tokens` as a list, via a sliding buffer.

    Note: the same list object is yielded (and then mutated) each time,
    so each n-gram must be consumed before the generator is advanced."""
    ngram = []
    for token in tokens:
        if len(ngram) < n:
            ngram.append(token)
        else:
            yield ngram
            ngram.pop(0)
            ngram.append(token)
    if len(ngram) == n:
        yield ngram


def count_ngrams(tokens, n):
    """Count the occurrences of each distinct n-gram in `tokens`."""
    counts = defaultdict(int)
    for ngram in ngrams(tokens, n):
        counts[' '.join(ngram)] += 1
    return counts


def rr(tokens, max_n=4, window_size=1000):
    """Repetition rate: the geometric mean, over n = 1..max_n, of the
    fraction of non-singleton n-gram types, aggregated over all sliding
    windows of `window_size` tokens."""
    if len(tokens) < max_n or len(tokens) < window_size:
        raise ValueError('Too few tokens, change window_size or max_n')
    result = 1.0
    for n in range(1, max_n + 1):
        numerator = 0.0
        denominator = 0.0
        # Every overlapping window is recounted from scratch -- simple,
        # but this is what makes the implementation slow.
        for window in ngrams(tokens, window_size):
            ngram_counts = count_ngrams(window, n)
            singletons = [g for g, count in ngram_counts.iteritems() if count == 1]
            numerator += len(ngram_counts) - len(singletons)
            denominator += len(ngram_counts)
        result *= numerator / denominator
    return pow(result, 1.0 / max_n)


if __name__ == '__main__':
    tokens = sys.stdin.read().lower().split()
    print 'tokens:', len(tokens), 'types:', len(set(tokens))
    print 'RR:', rr(tokens)
```

## Usage

The input document should be tokenized and passed to the standard input; the script performs lowercasing for you. The recommended sliding window size is `1000`, so your input document should have at least that many tokens.

```
$ pypy textrr.py < document.tok
tokens: 8957 types: 867
RR: 0.236746159266
```
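
To sanity-check the behavior on synthetic input, here is a small self-contained rewrite of the algorithm (Python 3 compatible; n-grams are tuples rather than space-joined strings, but the computation is otherwise the same). A perfectly periodic text scores 1.0, and a text with no repetition at all scores 0.0:

```python
from collections import defaultdict


def rr(tokens, max_n=4, window_size=1000):
    # Compact rewrite of the algorithm above for illustration.
    if len(tokens) < max(max_n, window_size):
        raise ValueError('too few tokens')
    result = 1.0
    for n in range(1, max_n + 1):
        numerator = denominator = 0.0
        for start in range(len(tokens) - window_size + 1):
            window = tokens[start:start + window_size]
            counts = defaultdict(int)
            for i in range(len(window) - n + 1):
                counts[tuple(window[i:i + n])] += 1
            singletons = sum(1 for c in counts.values() if c == 1)
            numerator += len(counts) - singletons
            denominator += len(counts)
        result *= numerator / denominator
    return result ** (1.0 / max_n)


# Maximally repetitive: the same 3-token phrase over and over.
periodic = ['the', 'cat', 'sat'] * 40      # 120 tokens, 3 types
# No repetition at all: every token is unique.
unique = [str(i) for i in range(120)]      # 120 tokens, 120 types

print(rr(periodic, window_size=100))  # 1.0
print(rr(unique, window_size=100))    # 0.0
```

Real documents fall between these two extremes, as in the example run above.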

A small test document (500 lines, roughly 9,000 tokens, about 50 KB) is processed in about 10 seconds with PyPy and 30 seconds with CPython 2.7.
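
Most of that time goes into recounting each overlapping window from scratch. Since consecutive windows differ by a single token, the counts can instead be updated incrementally. Here is a sketch of such a variant (my own optimization, not taken from the papers): for each n-gram order it counts the first window once, then slides one token at a time, adding the entering n-gram and removing the leaving one while keeping running tallies of distinct and singleton types.

```python
from collections import defaultdict


def rr_fast(tokens, max_n=4, window_size=1000):
    # Incremental sliding-window variant: O(1) dictionary updates per
    # window step instead of recounting window_size n-grams each time.
    if len(tokens) < max(max_n, window_size):
        raise ValueError('too few tokens')
    result = 1.0
    for n in range(1, max_n + 1):
        # All n-grams of the text, in order; the window starting at
        # token w covers grams[w : w + per_window].
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        per_window = window_size - n + 1
        counts = defaultdict(int)
        singletons = 0
        # Count the first window from scratch.
        for g in grams[:per_window]:
            counts[g] += 1
            if counts[g] == 1:
                singletons += 1
            elif counts[g] == 2:
                singletons -= 1
        numerator = len(counts) - singletons
        denominator = len(counts)
        # Slide one token at a time: one n-gram enters, one leaves.
        for i in range(per_window, len(grams)):
            entering, leaving = grams[i], grams[i - per_window]
            counts[entering] += 1
            if counts[entering] == 1:
                singletons += 1
            elif counts[entering] == 2:
                singletons -= 1
            counts[leaving] -= 1
            if counts[leaving] == 1:
                singletons += 1
            elif counts[leaving] == 0:
                singletons -= 1
                del counts[leaving]
            numerator += len(counts) - singletons
            denominator += len(counts)
        result *= float(numerator) / denominator
    return result ** (1.0 / max_n)
```

This visits each n-gram a constant number of times per order n, so the whole computation is linear in the number of tokens rather than quadratic in the window overlap.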

## Final remarks

Hopefully my understanding and implementation of the algorithm are correct. If not, please let me know.