Measuring the repetitiveness of a text

How to measure how many phrases are repeated in a text


I am going to give you my quick & inefficient implementation of an algorithm that measures the repetitiveness of a text. A text with many repeated phrases receives a higher score than a text with no repeated phrases.
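The core idea is to count, within a window of text, how many distinct n-gram types occur more than once. A minimal sketch of that building block (the function name and toy sentence are my own, not from the papers):

```python
from collections import Counter

def repeated_fraction(tokens, n):
  # fraction of distinct n-gram types that occur more than once
  counts = Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))
  repeated = sum(1 for c in counts.values() if c > 1)
  return repeated / float(len(counts))

# 'the cat' appears twice; the other six bigram types are singletons
print(repeated_fraction('the cat sat on the mat the cat ran'.split(), 2))
# → 0.14285714285714285 (1/7)
```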


This method was described in two papers by the same author.


Here is a quick and hopefully correct, but very inefficient, implementation of the algorithm in Python.

import sys
from collections import defaultdict

def ngrams(tokens, n):
  ngram = []
  for token in tokens:
    ngram.append(token)
    if len(ngram) > n:
      ngram.pop(0)
    if len(ngram) == n:
      yield ngram

def count_ngrams(tokens, n):
  counts = defaultdict(int)
  for ngram in ngrams(tokens, n):
    counts[' '.join(ngram)] += 1
  return counts

def rr(tokens, max_n=4, window_size=1000):
  if len(tokens) < max_n or len(tokens) < window_size:
    raise ValueError('too few tokens, lower window_size or max_n')

  result = 1.0

  for n in range(1, max_n+1):
    numerator = 0.0
    denominator = 0.0

    for window in ngrams(tokens, window_size):
      ngram_counts = count_ngrams(window, n)
      singletons = [ngram for ngram, count in ngram_counts.iteritems() if count == 1]
      numerator += len(ngram_counts) - len(singletons)
      denominator += len(ngram_counts)

    result *= numerator / denominator

  return pow(result, 1.0/max_n)

if __name__ == '__main__':

  tokens = sys.stdin.read().lower().split()
  print 'tokens:', len(tokens), 'types:', len(set(tokens))
  print 'RR:', rr(tokens)


The input document should be tokenized and passed on standard input. The script performs lowercasing for you.
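If you start from raw text, a naive regex tokenizer is enough for experimenting. This is a sketch with a made-up example sentence; for serious use you would want a proper tokenizer:

```python
import re

def tokenize(text):
  # split into words and individual punctuation marks
  return re.findall(r"\w+|[^\w\s]", text)

print(' '.join(tokenize("Hello, world! Hello again.")))
# → Hello , world ! Hello again .
```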

The recommended sliding window size is 1000, so your input document should have at least that many tokens.

$ pypy < document.tok
tokens: 8957 types: 867
RR: 0.236746159266
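As a sanity check, a compact re-implementation of the same formula should score a repetitive toy text above a non-repetitive one. The toy phrases are my own, and window_size is lowered because they are far shorter than the default window:

```python
from collections import Counter

def rr(tokens, max_n=4, window_size=1000):
  # geometric mean over n = 1..max_n of the fraction of n-gram types
  # occurring more than once, aggregated over all sliding windows
  result = 1.0
  for n in range(1, max_n + 1):
    num = den = 0.0
    for i in range(len(tokens) - window_size + 1):
      window = tokens[i:i + window_size]
      counts = Counter(tuple(window[j:j + n])
                       for j in range(len(window) - n + 1))
      num += sum(1 for c in counts.values() if c > 1)
      den += len(counts)
    result *= num / den
  return result ** (1.0 / max_n)

repetitive = ('the cat sat on the mat ' * 5).split()  # 30 tokens, period 6
varied = 'one two three four five six seven eight nine ten'.split()
print(rr(repetitive, window_size=10))  # clearly above zero
print(rr(varied, window_size=10))      # 0.0, every n-gram is unique
```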

A small test document of 500 lines, 9,000 tokens, and about 50 KB is processed in roughly 10 seconds with PyPy and 30 seconds with Python 2.7.

Final remarks

Hopefully my understanding and implementation of the algorithm are correct. If not, please let me know.