I am going to give you my quick & inefficient implementation of an algorithm that measures the repetitiveness of a text. A text with many repeating phrases will score higher than a text with no repeating phrases.
This method was described in two papers by the same author:
- Cache-based Online Adaptation for Machine Translation Enhanced Computer Assisted Translation, section 4.2
- Online adaptation to post-edits for phrase-based statistical machine translation, section 3.2
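If I read the papers correctly, the repetition rate is the geometric mean, over n-gram orders 1 through 4, of the fraction of n-gram types that occur more than once, with type counts summed over sliding windows $S$ of the text (my notation below, not necessarily the papers'):

```latex
RR = \left( \prod_{n=1}^{4}
     \frac{\sum_{S} \big( V(n) - V(n,1) \big)}{\sum_{S} V(n)}
     \right)^{1/4}
```

Here $V(n)$ is the number of distinct n-grams in a window $S$ and $V(n,1)$ is the number of those occurring exactly once (singletons).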
Here is a quick and hopefully correct but very inefficient implementation of the algorithm in Python.
```python
import sys
from collections import defaultdict


def ngrams(tokens, n):
    # Yield every n-gram of the token stream using a sliding buffer.
    # Note: the same list object is yielded each time, so callers must
    # consume each n-gram before advancing the generator.
    ngram = []
    for token in tokens:
        if len(ngram) < n:
            ngram.append(token)
        else:
            yield ngram
            ngram.pop(0)
            ngram.append(token)
    if len(ngram) == n:
        yield ngram


def count_ngrams(tokens, n):
    # Frequency of each n-gram type, keyed by the space-joined n-gram.
    counts = defaultdict(int)
    for ngram in ngrams(tokens, n):
        counts[' '.join(ngram)] += 1
    return counts


def rr(tokens, max_n=4, window_size=1000):
    if len(tokens) < max_n or len(tokens) < window_size:
        raise Exception('Too few tokens, change window_size or max_n')
    result = 1.0
    for n in range(1, max_n + 1):
        numerator = 0.0
        denominator = 0.0
        # Slide a window over the text and, per window, count how many
        # n-gram types are repeated (non-singletons) out of all types.
        for window in ngrams(tokens, window_size):
            ngram_counts = count_ngrams(window, n)
            singletons = [g for g, count in ngram_counts.iteritems()
                          if count == 1]
            numerator += len(ngram_counts) - len(singletons)
            denominator += len(ngram_counts)
        result *= numerator / denominator
    # Geometric mean of the per-order repetition rates.
    return pow(result, 1.0 / max_n)


if __name__ == '__main__':
    tokens = sys.stdin.read().lower().split()
    print 'tokens:', len(tokens), 'types:', len(set(tokens))
    print 'RR:', rr(tokens)
```
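Not part of the script above, but a sanity check I find useful: a maximally repetitive toy text should score exactly 1, and a text with no repeated n-grams at all should score 0. Below is a self-contained Python 3 re-sketch of the same computation (tuple-based n-grams instead of the mutating generator, and a smaller window so the toy inputs fit):

```python
from collections import defaultdict

def ngrams(tokens, n):
    # every overlapping n-gram of the token list, as tuples
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rr(tokens, max_n=4, window_size=1000):
    result = 1.0
    for n in range(1, max_n + 1):
        numerator = denominator = 0.0
        for window in ngrams(tokens, window_size):
            counts = defaultdict(int)
            for gram in ngrams(list(window), n):
                counts[gram] += 1
            singletons = sum(1 for c in counts.values() if c == 1)
            numerator += len(counts) - singletons
            denominator += len(counts)
        result *= numerator / denominator
    return result ** (1.0 / max_n)

# maximally repetitive: every n-gram (n <= 4) repeats in every window
repetitive = ('the cat sat on the mat ' * 20).split()   # 120 tokens
# no repetition: every token (and hence every n-gram) is unique
varied = ['w%d' % i for i in range(120)]

print('repetitive:', rr(repetitive, window_size=50))  # 1.0
print('varied:', rr(varied, window_size=50))          # 0.0
```

The two extremes bound the score, so real documents land somewhere in between, like the 0.24 of the test document below.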
The input document should be tokenized and passed to standard input. The script performs the lowercasing for you.
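"Tokenized" here just means whitespace-separated, since the script's entire input handling is a `lower()` followed by `split()`; for example:

```python
# the script's input handling in isolation: lowercase, then split on
# whitespace (assumes punctuation was already separated by a tokenizer)
text = 'The cat sat . The cat sat .'
tokens = text.lower().split()
print(tokens)  # ['the', 'cat', 'sat', '.', 'the', 'cat', 'sat', '.']
```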
The recommended sliding-window size is 1000, so your input document should have at least that many tokens.
```
$ pypy textrr.py < document.tok
tokens: 8957 types: 867
RR: 0.236746159266
```
A small test document (500 lines, ~9,000 tokens, about 50 KB) is processed in about 10 seconds with PyPy and 30 seconds with Python 2.7.
Hopefully my understanding and implementation of the algorithm are correct. If not, please let me know.