July 23, 2019

Cryptopals Challenges: #6

In this challenge we are tasked with breaking a repeating-key XOR ("Vigenere") cipher without knowing the key.

Cryptopals Challenges: #6

In this task we are tasked with breaking a repeating-key XOR ("Vigenere") cipher without knowing the key.  Cryptopals gives us a general overview of how to complete this task:

  1. Guess the keysize by taking keysizes between 2-40, separating the text into keysize sized blocks, and computing the average difference of bits between each block.
  2. Break the ciphertext into blocks of size keysize
  3. Transpose the blocks into n new strings (n being the number of blocks) where each string has the nth character of each block
  4. Solve each transposed string block as a single-character XOR problem
  5. Concatenate each single-character XOR keys into a final phrase which is the key

This approach is a very commonly known theoretical approach to breaking a Vigenere cipher.  Cryptopals mentions that many people know how to break the cipher, but few people can implement the crack.  In this post we will look at each step individually and explain the mechanics behind the step.  With that being said, let's get started.

Step One - Hamming Distance

Hamming distance is the difference in bits of two strings.  We will use this function to compare blocks for a measure of similarity.  Why does hamming distance allow us to determine the keysize?  This is easily thought of using a single-character xor cipher.  With a single-character xor, identical strings will be encrypted identically.  The phrase "and" under the single-character xor of "a" will always result to the same thing.  Now, we can assume that the ciphertext likely has characters which were encrypted under the same character in the key.  For example the phrase "quick and nimble" encrypted under the key "abc" will encrypt both "i" characters using the "c" character.  The idea with computing the hamming distance between blocks is a large scale of this idea.  If two blocks contain the phrase "and" and they happen to be encrypted with the same letters of the key, the hamming distance will decrease between the blocks.  The idea is the blocksize with the lowest hamming distance has the highest chance of being the keysize and repeating xor patterns.

Programming the hamming distance computation is simple.  We take two byte strings, zip them into iterables, take each corresponding digit in both binary strings, xor the digits, count the number of ones.

def Hamming(str1, str2):
    distance = 0

    #Create iterables using zip(), stops when shortest iterable is exhausted
    for b1, b2 in zip(str1, str2):
        #XOR b1 and b2
        diff = b1 ^ b2

        #Count the ones
        distance += sum([1 for bit in bin(diff) if bit == '1'])

    return(distance)

Counting the number of ones gives you the number of differing bits due to the property of XOR to return 1 if and only if one OR the other is TRUE but not both.  For example 10011 ⊕ 11111 = 01100, showing two differing bits as expected by the differing sequence of "00" and "11" in the 2nd and 3rd positions.  Testing this code with Hamming(b"this is a test" , b"wokka wokka!!!") returns 37 as expected!

Now to use the hamming distance we write a function which computes the hamming distance of neighboring blocks and returns the average distance.  Some solutions compare the first block to the rest, some compare all combos of blocks, and some only do the first couple blocks.  Regardless, I found that comparing neighboring blocks gave me a large enough sample to determine the keysize.

def Crack(ciphertext, keysize):
    distances = []

    #convert to bytes
    ciphertext = base64.b64decode(ciphertext).hex()

    #Separate into blocks of size KEYSIZE
    chunks = [ciphertext[i:i+keysize] for i in range(0,len(ciphertext),keysize)]

    while True:
        try:
            #Take the first two chunks and compute the hamming distance
            str1 = chunks[0]
            str2 = chunks[1]
            dist = Hamming(bytes(str1, 'utf-8'), bytes(str2, 'utf-8'))

            #Normalize the distance, add to list, and delete the chunks
            distances.append(dist/keysize)
            del chunks[0]
            del chunks[1]

        #Once there are no more blocks, return the normalized distances
        except Exception as e:
            return sum(distances)/len(distances)

Now that we have a method for determining the keysize, we just need to write code to try keysizes within the range of 2-40.  I chose to only take one keysize, but you could record the three lowest scored keysizes.

#Set initial values for the keysize and shortest distance
keysize = 0
shortest = 600

#For each keysize in the range of [2, 40]
for i in range(2,40):

	#Compute the average hamming distance for that KEYSIZE
    hamming_distance = Crack(ciphertext, i)

	#If the hamming distance is the lowest recorded, save it
    if(hamming_distance < shortest):
        shortest = hamming_distance
        keysize = i

print("KEYSIZE: {}".format(keysize))

Running the above code gives us that the likely keysize is 29 characters long.

Step 2: Breaking Ciphertext into Blocks

Since we now know the keysize, we can break the ciphertext into blocks which are of length keysize.  This is important because each of these blocks were encrypted with the full length of the key.  Python's one line for loops make breaking the ciphertext into blocks incredibly easy.

#Change to bytes
ciphertext = b64decode(ciphertext)

#Break into blocks
blocks = [ciphertext[keysize*i:keysize*(i+1)] for i in range(int(len(ciphertext)/keysize))]

Since I use this syntax a lot in Cryptopals challenges, I feel it is important to review what is going on here.  The following code will create an array populated with 0,1,2:

arr = [i for i in range(3)]

The brackets tell Python to create an array, and the loop inside tells Python what to populate the array with.  When breaking the ciphertext into blocks we are taking i from 0 to the length divided by keysize.  At each step we are taking blocks from i*keysize to (i+1)*keysize to get a block of size keysize.  To see how this works consider the keysize 16.  When we take the first block (i = 0) we get ciphertext[0*16:1*16] = ciphertext[0:16].

Step 3: Block Transposition

Now t hat we have broken the ciphertext into blocks, we have to transpose the blocks by taking the nth byte and placing it into the nth string.  Why are we doing this if we already have blocks which are encrypted with the same key?  By separating each block into n chunks of the nth bytes, we are separating out each byte which was encrypted with the same character in the key!  For example if we had "blockoftext!" encrypted with "abc" we would have blocks of "blo", "cko", "fte", "xt!".  Taking each blocks nth byte and separting into strings we would get "bcfx", "lktt", and "ooe!" as our chunks.  The first chunk was encrypted using "a", the second with "b", and the third with "c".  We can solve each of these with single-char xors just like challenge five!

There are many ways to do this in code, I chose to loop through each digit in the range of keysize and for each block take that digit and add it to a chunk.

#Create an array to store all chunks
chunks = []

#For each i in the range of keysize
for i in range(keysize):

	#Create a new chunk
    chunk = b''
    
    #For each block in the separated blocks
    for block in blocks:
    	
        #Take the ith byte and concatenate to the created chunk
        chunk = chunk + block[i:i+1]
        
    #Add the newly made chunk to the list of chunks
    chunks.append(chunk)

Step 4: Solving each Chunk

We now revisit our procedure for Problem #5 and solve each chunk as if it were a standalone repeating key xor.  After solving the chunks, we concatenate the single-character to a master key to obtain the full phrase.  Truthfully, I should have encapsulated the Problem 5 code into a function to make it more clean.  However, I did these problems early in the morning and in a rushed mood so I just repeated the hacked together code here.

#Starting variables for the loop
key = ""
best_character = 0
best_phrase = ""

for chunk in chunks:
    
    #Reset the max tracker for each chunk
    value_highest = 0
    
    #Find the frenquency value for each i, recording the i with the highest
    for i in range(256):
        phrase, value = SingleCharXOR(chunk, i)
        if(value > value_highest):
            value_highest = value
            best_character = i
    
    #Convert the character to ascii form and add to the key
    key += chr(best_character)
print("KEY: {}".format(key))

From this code we obtain a key which is not nonsense, meaning our previous code worked!  Again the key is a Vanilla Ice quote:

KEY: Terminator X: Bring the Noise

Step 5: Decrypting the Ciphertext

Since we know the key, we can easily decrypt the ciphertext using the code written in problem five.

print(RepeatingXOR(ciphertext, bytes(key, 'utf-8')).decode('ascii'))

Decrypting the ciphertext gives the complete lyrics to Play That Funky Music by Vanilla Ice.

I hope you enjoyed this guide to Cryptopals #5, it was a long one!  Recently I got my subscriber feature set up.  If you enjoy the content on my blog, feel free to enter your email in the form below this post to receive emails of future content!  In the future I hope to implement a feature to allow users to subscribe to specific tags.