#!/usr/bin/python
# -*- coding: utf-8 -*-
"""Break up a text file on “natural” boundaries.

The idea, inspired by “bup”, is to split up a file into chunks at
consistent intervals, at points that will shift automatically along
with insertions and deletions in the text.

Side-by-side comparison before and after an edit:

    '\xef\xbb\xbfThe Project'               '\xef\xbb\xbfThe Project'
    ' Gutenberg'                            ' Gutenberg'
    ' EBook of The King'                    ' EBook of The King'
    ' James Bible'                          ' James Bible'
    '\n\nThis eBook is'                     '\n\nThis eBook is'
    ' for the use of'                       ' for the use of'
    ' anyone anywhere at no cost'           ' anyone anywhere at no cost'
    ' and with'                             ' and with'
    '\nalmost no restrictions'              '\nalmost no'
    ' whatsoever.'                          ' important restrictions'
    ' You may'                              ' whatsoever.'
    ' copy it, give it away or'             ' You may copy it, give it'
    '\nre-use'                              ' away or'
    ' it under the terms of the'            '\nre-use it under'
    ' Project'                              ' the terms of the'
    ' Gutenberg'                            ' Project'
    ' License included'                     ' Gutenberg License included'
    '\nwith this eBook or online'           '\nwith this'
    ' at www'                               ' eBook or online'
    '.gutenberg.org'                        ' at www.gutenberg.org'
    '\n\n\nTitle: The King James Bible\n'   '\n\n\nTitle: The King'
    '\nRelease'                             ' James Bible'
    ' Date: March'                          '\n\nRelease Date: March'
    ' 2, 2011 [EBook #10]'                  ' 2, 2011 [EBook'
    '\n[This King James'                    ' #10]'
    ' Bible was'                            '\n[This King James'
    ' orginally posted by Project'          ' Bible was orginally posted'
    ' Gutenberg'                            ' by Project'
    '\nin late 1989]'                       ' Gutenberg'
    '\n\nLanguage: English'                 '\nin late 1989]'
    '\n\n\n*** START'                       '\n\nLanguage: English'
    ' OF THIS PROJECT GUTENBERG'            '\n\n\n*** START OF THIS PROJECT'
    ' EBOOK THE KING'                       ' GUTENBERG EBOOK'
    ' JAMES BIBLE ***\n'                    ' THE KING JAMES'
    '\n\n'                                  ' BIBLE ***\n\n\n'
    '\n'                                    '\n'

You can see that it never quite resynchronizes, because the
underlying frame has moved, so even though some of the chunks are the
same, the sequence of chunks thereafter is not!  We need some kind of
resynchronizing state machine that still maintains a more or less
even spacing.
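
One possible direction, offered only as a rough sketch and not as a
worked-out design: instead of taking the smallest two-character string
inside each fixed 16-character frame, cut wherever the two-character
string is the smallest within `radius` positions on either side. That
decision depends only on nearby text, so cut points well past an edit
should line up again, and two cuts can never be `radius` or fewer
positions apart. The names `split_points`, `chunks` and `radius` are
made up for this illustration, and the sample strings are adapted from
the comparison above.

```python
def split_points(data, radius=8):
    # Cut before position i when data[i:i+2] is the smallest two-character
    # string within `radius` positions on either side (earliest wins ties).
    # Only nearby text is consulted, never a fixed frame offset.
    cuts = []
    for i in range(1, len(data)):
        here = data[i:i + 2]
        lo = max(0, i - radius)
        hi = min(len(data), i + radius + 1)
        if (all(here < data[j:j + 2] for j in range(lo, i)) and
                all(here <= data[j:j + 2] for j in range(i + 1, hi))):
            cuts.append(i)
    return cuts


def chunks(data, radius=8):
    # Slice data at the split points; the pieces always join back to data.
    edges = [0] + split_points(data, radius) + [len(data)]
    return [data[a:b] for a, b in zip(edges, edges[1:])]


if __name__ == '__main__':
    before = ('almost no restrictions whatsoever. You may copy it, '
              'give it away or re-use it under the terms of the Project')
    after = before.replace('no restrictions', 'no important restrictions')
    for piece in chunks(before):
        print(repr(piece))
    print('---')
    for piece in chunks(after):
        print(repr(piece))
```

This checks up to 2 * radius neighbours per position, which is fine for
a toy but slow for large inputs, and nothing here bounds the largest
chunk size.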
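"""

import sys

if __name__ == '__main__':
    buf = ''  # text carried over from the previous frame
    while True:
        inp = sys.stdin.read(16)  # fixed 16-character frame
        if inp == '':
            break
        # Split the frame at its smallest two-character substring
        # (the earliest position wins ties).
        n = min(range(len(inp)), key=lambda i: (inp[i:i+2], i))
        sys.stdout.write('%r\n' % (buf + inp[:n]))
        buf = inp[n:]
    # Flush whatever is left after the last frame.
    sys.stdout.write('%r\n' % buf)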