# Measuring the repetitiveness of a text

How to measure how many phrases are repeated in a text

Last updated on

I am going to give you my quick & ineffective implementation of an algorithm that measures the repetitiveness of a text. A text that has a lot of repeating phrases will have a higher score than a text that has no repeating phrases.

## Method

This method was described in two papers by the same author:

## Implementation

Here is a quick and hopefully correct but very ineffective implementation of the algorithm in Python.

``````import sys
from collections import defaultdict

def ngrams(tokens, n):
ngram = []
for token in tokens:
if len(ngram) < n:
ngram.append(token)
else:
yield ngram
ngram.pop(0)
ngram.append(token)
if len(ngram) == n:
yield ngram

def count_ngrams(tokens, n):
counts = defaultdict(int)
for ngram in ngrams(tokens, n):
counts[' '.join(ngram)] += 1
return counts

def rr(tokens, max_n=4, window_size=1000):
if len(tokens) < max_n or len(tokens) < window_size:
raise Exception('Too few tokens, change window_size or max_n')

result = 1.0

for n in range(1, max_n+1):
numerator = 0.0
denominator = 0.0

for window in ngrams(tokens, window_size):
ngram_counts = count_ngrams(window, n)
singletons = [ngram for ngram, count in ngram_counts.iteritems() if count == 1]
numerator += len(ngram_counts) - len(singletons)
denominator += len(ngram_counts)

result *= numerator / denominator

return pow(result, 1.0/max_n)

if __name__ == '__main__':

print 'tokens:', len(tokens), 'types:', len(set(tokens))
print 'RR:', rr(tokens)
``````

## Usage

The input document should be tokenized and passed to the standard input. The script will perform lowercasing for you.

The recommended sliding window size is `1000` so your input document should have at least as many tokens as that.

``````\$ pypy textrr.py < document.tok
tokens: 8957 types: 867
RR: 0.236746159266
``````

A 500 line, 9000 token and about 50 KB small test document is processed in about 10 seconds with PyPy and 30 seconds with Python 2.7.

## Final remarks

Hopefully my understanding and implementation of the algorithm is correct. If not, please let me know.