I am going to give you my quick and inefficient implementation of an algorithm that measures the repetitiveness of a text: a text with many repeated phrases receives a higher score than a text with no repeated phrases.
Method
This method was described in two papers by the same author:
- Cache-based Online Adaptation for Machine Translation Enhanced Computer Assisted Translation, section 4.2
- Online adaptation to post-edits for phrase-based statistical machine translation, section 3.2
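In short (my paraphrase, using notation that matches the implementation below rather than the papers' exact formulation): for each n-gram order n from 1 to 4, slide a fixed-size window over the text, and in each window count the distinct n-grams and how many of them occur only once. The repetition rate is the geometric mean, over the four orders, of the fraction of n-gram types that repeat:

```latex
RR = \left( \prod_{n=1}^{4} \frac{\sum_{w} \bigl(V_n(w) - V_{n,1}(w)\bigr)}{\sum_{w} V_n(w)} \right)^{1/4}
```

where V_n(w) is the number of distinct n-grams in window w and V_{n,1}(w) is the number of those that occur exactly once in w.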
Implementation
Here is a quick and hopefully correct, but very inefficient, implementation of the algorithm in Python.
    import sys
    from collections import defaultdict


    def ngrams(tokens, n):
        # Yield all n-grams of `tokens` as a sliding list of length n.
        # Note: the same list object is reused between yields, so each
        # n-gram must be consumed before the generator is advanced.
        ngram = []
        for token in tokens:
            if len(ngram) < n:
                ngram.append(token)
            else:
                yield ngram
                ngram.pop(0)
                ngram.append(token)
        if len(ngram) == n:
            yield ngram


    def count_ngrams(tokens, n):
        # Map each n-gram (joined into a single string) to its frequency.
        counts = defaultdict(int)
        for ngram in ngrams(tokens, n):
            counts[' '.join(ngram)] += 1
        return counts


    def rr(tokens, max_n=4, window_size=1000):
        # Repetition rate: the geometric mean, over n-gram orders
        # 1..max_n, of the fraction of non-singleton n-gram types,
        # accumulated over all sliding windows of window_size tokens.
        if len(tokens) < max_n or len(tokens) < window_size:
            raise ValueError('Too few tokens, change window_size or max_n')
        result = 1.0
        for n in range(1, max_n + 1):
            numerator = 0.0
            denominator = 0.0
            for window in ngrams(tokens, window_size):
                ngram_counts = count_ngrams(window, n)
                singletons = [g for g, count in ngram_counts.items() if count == 1]
                numerator += len(ngram_counts) - len(singletons)
                denominator += len(ngram_counts)
            result *= numerator / denominator
        return pow(result, 1.0 / max_n)


    if __name__ == '__main__':
        tokens = sys.stdin.read().lower().split()
        print('tokens:', len(tokens), 'types:', len(set(tokens)))
        print('RR:', rr(tokens))
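To see the measure behave at its extremes, here is a small sanity check. The toy inputs and the reduced window size are mine, and the functions are repeated from the script above (in slightly simplified form) so the snippet runs on its own:

```python
from collections import defaultdict

def ngrams(tokens, n):
    # All n-grams of `tokens` as overlapping slices.
    return (tokens[i:i + n] for i in range(len(tokens) - n + 1))

def count_ngrams(tokens, n):
    counts = defaultdict(int)
    for ngram in ngrams(tokens, n):
        counts[' '.join(ngram)] += 1
    return counts

def rr(tokens, max_n=4, window_size=1000):
    result = 1.0
    for n in range(1, max_n + 1):
        numerator = denominator = 0.0
        for window in ngrams(tokens, window_size):
            counts = count_ngrams(window, n)
            singletons = sum(1 for c in counts.values() if c == 1)
            numerator += len(counts) - singletons
            denominator += len(counts)
        result *= numerator / denominator
    return pow(result, 1.0 / max_n)

# A maximally repetitive text: every n-gram type recurs in every window.
print(rr(['the', 'cat', 'sat'] * 100, window_size=50))   # -> 1.0
# A text with no repetition at all: every n-gram is a singleton.
print(rr([str(i) for i in range(300)], window_size=50))  # -> 0.0
```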
Usage
The input document should be tokenized and passed on standard input; the script lowercases it for you.
The recommended sliding window size is 1000,
so the input document should contain at least that many tokens.
$ pypy textrr.py < document.tok
tokens: 8957 types: 867
RR: 0.236746159266
A small test document of about 500 lines, 9000 tokens, and 50 KB is processed in about 10 seconds with PyPy and 30 seconds with Python 2.7.
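Most of that time goes into recounting every window from scratch. If speed matters, the counts can instead be updated incrementally as the window slides, adjusting the number of n-gram types and singletons whenever an n-gram enters or leaves. This optimization is my own idea, not from the papers, but it computes the same quantity as the rr function above:

```python
from collections import defaultdict

def rr_fast(tokens, max_n=4, window_size=1000):
    # Same repetition rate as rr(), but each window is derived from the
    # previous one with two count updates instead of a full recount.
    if len(tokens) < max_n or len(tokens) < window_size:
        raise ValueError('Too few tokens, change window_size or max_n')
    result = 1.0
    for n in range(1, max_n + 1):
        # Pre-join all n-grams once; grams[i] covers tokens[i:i+n].
        grams = [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        per_window = window_size - n + 1  # n-grams per window
        counts = defaultdict(int)
        types = singletons = 0

        def add(g):
            nonlocal types, singletons
            counts[g] += 1
            if counts[g] == 1:
                types += 1
                singletons += 1
            elif counts[g] == 2:
                singletons -= 1

        def drop(g):
            nonlocal types, singletons
            counts[g] -= 1
            if counts[g] == 0:
                types -= 1
                singletons -= 1
            elif counts[g] == 1:
                singletons += 1

        for j in range(per_window):  # count the first window in full
            add(grams[j])
        numerator = types - singletons
        denominator = types
        for i in range(1, len(tokens) - window_size + 1):  # slide by one token
            drop(grams[i - 1])
            add(grams[i - 1 + per_window])
            numerator += types - singletons
            denominator += types
        result *= numerator / denominator
    return pow(result, 1.0 / max_n)
```

On the same endpoint cases as before, a fully repetitive token stream scores 1.0 and a stream of all-distinct tokens scores 0.0.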
Final remarks
Hopefully my understanding and implementation of the algorithm are correct. If not, please let me know.