Chapter 13. Data Encoding
Analyze the malware found in the file Lab13-01.exe.
Compare the strings in the malware (from the output of the strings command) with the information available via dynamic analysis. Based on this comparison, which elements might be encoded?
Running strings over Lab13-01.exe shows references to Mozilla/4.0 and http://%s/%s/, which look like a User-Agent and a format string used to build a URL. Running the executable with Fakenet active shows it beaconing to a URL that does not appear in the strings output, and making a GET request for a resource that looks to be Base64-encoded.
Based on this we can assume that both the URL and the resource requested in the GET are encoded.
Use IDA Pro to look for potential encoding by searching for the string xor. What type of encoding do you find?
Searching IDA Pro for the string ‘xor’ returns a number of entries. Entries that XOR a register with itself (ecx, eax, edx, al and so on) are the ‘zeroing’ idiom, which always results in 0 and is of no interest to us. This leaves 3 possible entries to examine, of which only one sits inside a defined function whose code we can examine.
Examining this entry shows it sits inside a loop in ‘sub_401190’. The loop performs no additional checks to skip bytes equal to ‘00’ or to the key itself, so this looks to be plain single-byte XOR encoding with the key 0x3B.
What is the key used for encoding and what content does it encode?
From the above we know that the key used for encoding is 0x3B. By examining cross-references to the function ‘sub_401190’, we can see that one exists. Viewing this reveals that it is part of a function ‘sub_401300’ which looks to be loading the value of resource index ‘101’ into a buffer before it is passed to this encoding routine.
By extracting this using Resource Hacker and then leveraging 010 Editor to perform an XOR operation using the key 0x3B, we can see that this neatly decodes to ‘www.practicalmalwareanalysis.com’.
Based on this we know the key is used for decoding the resource at index 101, which has been encoded using XOR. This decodes to ‘www.practicalmalwareanalysis.com’.
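The same single-byte XOR can be reproduced outside of a hex editor. The following is a minimal Python sketch of the technique; the function name is hypothetical, and the "resource" here is simply the known plaintext re-encoded with the key as a stand-in for the bytes extracted from resource 101.

```python
KEY = 0x3B  # single-byte key identified in sub_401190


def xor_bytes(data: bytes, key: int = KEY) -> bytes:
    """XOR every byte with the same key. Applying it twice restores the input."""
    return bytes(b ^ key for b in data)


# Stand-in for the extracted resource (assumption): the known plaintext, encoded.
encoded_resource = xor_bytes(b"www.practicalmalwareanalysis.com")

# Decoding uses exactly the same routine as encoding.
decoded = xor_bytes(encoded_resource).decode("ascii")
```

Because XOR is its own inverse, one routine serves for both directions, which is why the malware needs only a single function for this.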
Use the static tools FindCrypt2, Krypto ANALyzer (KANAL), and the IDA Entropy Plugin to identify any other encoding mechanisms. What do you find?
Using the Krypto ANALyzer (KANAL) plugin for PEiD, we can find Base64 encoding being used at ‘0x004050E8’.
At the time of writing, the IDA Entropy Plugin and FindCrypt plugins are quite dated and are maintained poorly, if at all. Both have compatibility issues with IDA Pro Free 5.0, which makes them challenging to use. Newer third-party versions of FindCrypt exist, but these generally rely on IDAPython, which is not supported in IDA Pro Free 5.0.
As a substitute for the IDA Entropy Plugin we can use the standalone ida-ent.exe binary, developed by ‘smoked chicken’, the same creator as the original IDA Entropy plugin. Note: The original download for this binary and plugin has broken over time and been abandoned. As a result a copy of this can be found here. This was originally sourced from a compiled repository by Blue Soul.
It’s important to note that entropy analysis by itself like this provides little context, and many pieces of compressed file formats will have high entropy.
There’s a level of dependency between chunk size and Max Entropy when trying to find sections of interest within a binary. In the below example we use a chunk size of ‘64’ and max entropy of ‘5.95’ to find areas of interest.
By using this we immediately find high entropy between 0x004050E4 and 0x004050EB which aligns perfectly with the Base64 encoding discovered using PEiD. Some useful tests for finding many types of cryptography constants are shown below:
- Chunk Size: 64 - Max Entropy: 5.95 (Used for locating Base64-encoding strings)
- Chunk Size: 256 - Max Entropy: 7.9 (Used for locating very random data)
Some key aspects of (Shannon) entropy often used in digital information analysis (and as a result malware analysis) are as follows:
- The max entropy possible is 8.
- The closer to 8, the more random the data is (byte values are spread evenly).
- The closer to 0, the less random the data is (a few byte values dominate).
- English text is generally between 3.5 and 5.
- Properly encrypted or compressed data is generally over 7.5.
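To make these figures concrete, here is a minimal Python sketch of byte-level Shannon entropy together with a simple chunk scan. The function names, the non-overlapping chunking, and the way the threshold is applied are assumptions for illustration; this is not how ida-ent is implemented internally.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 at most."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(data).values())


def high_entropy_chunks(data: bytes, chunk_size: int = 64,
                        threshold: float = 5.95) -> list:
    """Return offsets of non-overlapping chunks whose entropy meets the threshold."""
    return [off for off in range(0, len(data) - chunk_size + 1, chunk_size)
            if shannon_entropy(data[off:off + chunk_size]) >= threshold]
```

A 64-byte chunk can never exceed 6 bits of entropy (64 distinct values at most), which is one reason the threshold has to be chosen together with the chunk size.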
As a substitute for Findcrypt we can use free IDA alternative Ghidra which was developed by the National Security Agency (NSA), and a ported plugin FindCrypt-Ghidra; however, for this question we’ll skip this check as it’s not required given KANAL and ida-ent have already located what we’re looking for.
From our use of crypto analysing tools we have found Base64 encoding which appears to be used at ‘0x004050E8’. Looking at this reveals the below index string being used for Base64 encoding.
What type of encoding is used for a portion of the network traffic sent by the malware?
To find out whether this is related to the Base64 encoding we just found we’ll need to look for cross references to this to determine how it is being used.
Looking at the cross-references we can see that they’re all within the sub-routine ‘sub_401000’, so we’ll take a look at this closer to see if we can find any obvious constructs.
From this routine we can see that there are 2 branches which look to be moving the value ‘=’ into the variable passed to this sub-routine. Given the strings present, and that ‘=’ is used for Base64 padding, this leads us to believe this may be the Base64 encoding routine. To determine where this Base64 encoding function is used we can look at cross-references to ‘sub_401000’, which reveals one within ‘sub_4010B1’, so we examine this further.
This contains a number of loops and conditional jumps; however, three key constructs help us determine this is the actual routine performing the Base64 encoding:
- comparison of passed string length to 3 bytes
- sub_401000 routine with Base64 encoding index string
- comparison of passed string length to 4 bytes
This is key to identifying Base64 functions as for every 3 bytes of input chunks given you wind up with 4 bytes of encoded ASCII. From this we know sub_4010B1 is the Base64 encoding routine with sub_401000 being the Base64 index routine. To locate where the Base64 encoding routine is being used we once again look for cross references to sub_4010B1 and find one at ‘sub_4011C9’.
Looking into this function we can confirm our suspicions and infer that the encoding used for a portion of the network traffic sent by the malware is Base64.
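The 3-bytes-in, 4-characters-out relationship that identifies the routine can be illustrated with a short Python sketch. It assumes the standard Base64 alphabet and uses a hypothetical function name; it handles only full 3-byte chunks and is not a reconstruction of sub_401000.

```python
B64_ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                "abcdefghijklmnopqrstuvwxyz"
                "0123456789+/")


def encode_triplet(chunk: bytes, alphabet: str = B64_ALPHABET) -> str:
    """Regroup 3 bytes (24 bits) into four 6-bit indexes into the alphabet."""
    if len(chunk) != 3:
        raise ValueError("expected exactly 3 bytes")
    value = (chunk[0] << 16) | (chunk[1] << 8) | chunk[2]
    return "".join(alphabet[(value >> shift) & 0x3F]
                   for shift in (18, 12, 6, 0))
```

Swapping the alphabet argument for a different 64-character string is all it takes to produce a custom Base64 variant.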
Where is the Base64 function in the disassembly?
From the above analysis we know the Base64 function begins at ‘0x004010B1’.
What is the maximum length of the Base64-encoded data that is sent? What is encoded?
To determine the maximum length of Base64-encoded data that will be sent we need to look a bit closer at ‘sub_4011C9’ which is calling the encoding function.
From this we can see that a call is being made to ‘gethostname’, which retrieves the hostname of the local computer. The first 12 bytes of the hostname are then copied using ‘strncpy’ before being Base64-encoded and formatted as the resource being retrieved in our previously discovered GET request.
Because we know that only 12 bytes will be sent to this encoding function, we can calculate the maximum length of Base64-encoded data that will be sent. To do this we need to remember from question 5 that for every 3 bytes of input chunks you get 4 bytes of encoded ASCII. Because we know that we have 12 bytes of input, we can perform the below calculation:
- Length = (12/3)*4 = 16 bytes.
At this point we know that the maximum length of Base64-encoded data that is sent is 16, and that this comes from the hostname of the system running this executable.
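The same calculation generalises to any input length. A minimal Python sketch, with a hypothetical function name, rounds up to whole 3-byte groups because padded Base64 always emits complete 4-character blocks:

```python
def b64_encoded_length(input_len: int) -> int:
    """Length of padded Base64 output for input_len bytes of input."""
    groups = (input_len + 2) // 3  # number of 3-byte groups, rounded up
    return groups * 4


max_beacon_length = b64_encoded_length(12)
```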
In this malware, would you ever see the padding characters (= or ==) in the Base64-encoded data?
Base64 pads its output with ‘=’ to make up a full 4-character block when the input length isn’t divisible by 3. Because at most 12 bytes are encoded, we may see padding characters in the Base64-encoded data whenever this runs on a host whose hostname is shorter than 12 bytes and has a length not evenly divisible by 3.
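As a quick reference, the number of padding characters depends only on the input length modulo 3. The following Python sketch (hypothetical names) tabulates this for every hostname length the malware can send:

```python
def padding_chars(input_len: int) -> int:
    """Number of '=' characters appended for input_len bytes of input."""
    return (3 - input_len % 3) % 3


# Hostname lengths 1..12, since the malware encodes at most 12 bytes.
padding_by_length = {n: padding_chars(n) for n in range(1, 13)}
```

Lengths 3, 6, 9 and 12 produce no padding; every other length produces one or two ‘=’ characters.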
What does this malware do?
By looking at cross-references to ‘sub_4011C9’ we find that it is called from the malware’s main method.
This helps us identify that the malware sends the hostname of the machine running it, Base64-encoded, in a GET request to www.practicalmalwareanalysis.com, and repeats this approximately every 30 seconds.
As a side note, the code inside our beacon function checks whether the first character of the beacon response equals ‘6Fh’ (o).
If so, it returns ‘1’ from the function to indicate success, which breaks the loop that is beaconing every 30 seconds.
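The beacon behaviour described above can be summarised in a short Python sketch. It is a model of the observed behaviour rather than the malware's code: the function names are hypothetical, and it assumes the standard Base64 alphabet and an ASCII hostname.

```python
import base64

C2_HOST = "www.practicalmalwareanalysis.com"  # decoded from resource 101
URL_FORMAT = "http://%s/%s/"                  # format string seen in strings


def build_beacon_url(hostname: str, c2_host: str = C2_HOST) -> str:
    """Encode the first 12 bytes of the hostname and place them in the URL."""
    token = base64.b64encode(hostname.encode("ascii")[:12]).decode("ascii")
    return URL_FORMAT % (c2_host, token)


def beacon_succeeded(response_body: bytes) -> bool:
    """The loop stops once the response starts with 'o' (0x6F)."""
    return response_body[:1] == b"o"
```

A network signature could be built from the same model: a GET request for a path of at most 16 Base64 characters followed by a trailing slash.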
Analyze the malware found in the file Lab13-02.exe.
Using dynamic analysis, determine what this malware creates.
Running this malware with procmon active, we can see a number of file operations that create files with the naming convention ‘temp’ followed by what looks like random characters.
These files are 10 MB each in size, yet appear to contain only a small amount of seemingly random data, which we wouldn’t expect to take up 10 MB.
Use static techniques such as an xor search, FindCrypt2, KANAL, and the IDA Entropy Plugin to look for potential encoding. What do you find?
Leveraging KANAL and IDA-Ent we don’t find anything of interest in this binary. By using an XOR search we find 6 primary entries of interest to look into further (entries not used to clear a register).
Delving into these further we find that they’re located in the following subroutines:
- 0040171F (Undefined function)
Of note is that a number of these operations occur within sub_401739. Based on this there’s the potential for encoding functions to be used within sub_40128D, sub_401739 and code likely residing at 0040171F.
To clean this up and for ease of navigation, we create a function at 00401570 (given 0040171F is inside of this undefined function as shown below), rename sub_40128D to SingleXOR, and rename sub_401739 to MultiXOR.
By plotting out the XRefs to SingleXOR, we can see that this is called from a subroutine inside of MultiXOR.
If we plot XRefs from our newly created function at ‘00401570’ we see this makes similar calls to MultiXOR.
Based on your answer to question 1, which imported function would be a good prospect for finding the encoding functions?
Based on question 1, it’s likely that the imported function ‘WriteFile’ would be a good prospect for finding the encoding functions given the data we found inside of the written files looks to be encoded. By looking for calls to this we find it located at sub_401000.
Where is the encoding function in the disassembly?
If we pivot from xrefs to WriteFile, we find only one which is from the executable we’re examining rather than an imported library, and this is located at sub_401000. Looking at xrefs to sub_401000 we find there is only one call from sub_401851.
This makes a number of calls, and has a string reference which appears to be related to the creation of files we’ve seen; however, there’s no XOR operations or calls which would indicate this is an encoding function.
By examining sub_401070 which is called by the above function, we find this may be performing some type of screenshot involving an open window based on its imported APIs.
Although this is interesting, it doesn’t appear to be performing any kind of encoding, so we instead move to look into sub_40181F.
This function is what kicks off the MultiXOR routine we defined earlier. As it is the only remaining user-defined function called here, we begin to believe that the encoding function is found at sub_40181F in the disassembly.
Trace from the encoding function to the source of the encoded content. What is the content?
Working our way back to sub_401070, which runs before the encoding function, we can see that in addition to calling GetDesktopWindow to get a handle to the current screen, there are calls to BitBlt and GetDIBits, which are associated with retrieving the pixel colour and layout of the desktop. This helps us infer that the function is taking a screenshot.
Based on this we can assume that the content of files we’ve found are screenshots which have been encoded.
Can you find the algorithm used for encoding? If not, how can you decode the content?
Based on results of Ida-Ent, KANAL, and by looking at this in IDA, we can begin to assume that the algorithm used for encoding is custom, and not something easily fingerprintable. Looking into the flow of this program, we can’t see any obvious decoding function in place; however, if we examine the subroutine sub_4012DD, which is located in our encoding function, we get the feeling that this may be able to be used for both encoding and decoding.
To determine if this can be leveraged to decode the content, we can look at this more closely in a debugger such as OllyDbg. After opening this in OllyDbg, we can place a breakpoint right before the call to our identified encoding routine (sub_40181F) at address 00401880, and directly after the call to writing a file (sub_401851) at address 0040190A.
- Breakpoint at 00401880 (call to sub_40181F)
- Breakpoint at 0040190A (call to sub_401851)
After running the program for a brief period of time we hit our breakpoint. At this point we can right click the value stored in ESP, and click FOLLOW IN DUMP to view the data that will be encoded.
By opening one of the encoded files from this malware and opening it in a hex editor such as HxD, we can copy the hex contents and paste it in place of the data that was going to be encoded.
From this we see the data change.
After resuming the malware and allowing it to complete, a new file will be created. If we give this new file a .bmp extension, we’re able to view it and see it is a recovered screenshot as expected. From this we know we can use a debugger and the malware’s own encoding function to decode encoded data as well.
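Feeding encoded data back through the encoder works because of a general property of XOR-based stream encoding: applying the same keystream twice restores the original bytes. The Python sketch below demonstrates that property only. The keystream generator is a hypothetical stand-in (a simple linear congruential generator), not the algorithm used by Lab13-02.exe.

```python
def keystream(seed: int, length: int) -> bytes:
    """Hypothetical keystream: NOT the malware's algorithm, illustration only."""
    state = seed & 0xFFFFFFFF
    out = bytearray()
    for _ in range(length):
        state = (state * 1103515245 + 12345) & 0xFFFFFFFF
        out.append((state >> 16) & 0xFF)
    return bytes(out)


def xor_stream(data: bytes, seed: int = 0x1234) -> bytes:
    """XOR data with the keystream; the same call encodes and decodes."""
    return bytes(d ^ k for d, k in zip(data, keystream(seed, len(data))))


screenshot = b"BM example bitmap bytes"   # placeholder for captured pixels
encoded = xor_stream(screenshot)
recovered = xor_stream(encoded)            # second pass undoes the first
```

This only holds when the keystream is regenerated identically on each run, which is why replaying the malware from a fresh start with the encoded file substituted in memory produces the decoded bitmap.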
Using instrumentation, can you recover the original source of one of the encoded files?
Leveraging Immunity Debugger’s extended python scripting functions, we’re able to develop a python script used to decode the encoded file. Note: This is largely covered and taken from the PMA material.
Opening the program in Immunity Debugger, we can now run our script with the specified file we want to be decoded using the provided terminal inside Immunity.
By continuing to resume the program after it hits our breakpoint, we get a newly created file with the decoded content.
From here we’ve successfully used instrumentation to automate the recovery of the original source of an encoded file.
Analyze the malware found in the file Lab13-03.exe.
Compare the output of strings with the information available via dynamic analysis. Based on this comparison, which elements might be encoded?
By running strings over this malware we see reference to what looks to be a custom Base64 index string, an error associated with an API, references to listing directory contents and spawning cmd.exe, a possible C2 of www.practicalmalwareanalysis.com, and reference to key and block length which seems to indicate that Base64-encoding and possible encryption is used in this binary.
By executing this while procmon and fakenet are running, we can see a connection occurs back to www.practicalmalwareanalysis.com via encrypted traffic on port 8910, which is then followed by cmd.exe running without any further actions. This is a classic sign of a potential reverse shell, where commands sent through this connection are passed to cmd.exe for execution on the host.
At this point we still don’t know what elements may be encoded, but we have enough evidence to assume that this does use encoding of some kind. To get a bit more information, we can setup a netcat listener on port 8910 and leverage ApateDNS to notify the malware that www.practicalmalwareanalysis.com can be found at 127.0.0.1. By doing this we’re able to catch the malware connection which we suspect to be a reverse shell, and attempt to run a command such as ‘dir’.
Based on the response received, we can begin to assume that the response to commands run from the C2 of this malware is encoded.
Use static analysis to look for potential encoding by searching for the string xor. What type of encoding do you find?
By looking for the string ‘xor’ in IDA we find that there are 194 entries. To reduce the amount of noise generated by this search (given we want to exclude any register clearing functions as a start) we can search using regex to look for only operations which contain items inside of ‘’ given that register clearing will never contain this.
This drops us down to 68 entries instead of 194. By viewing the first entry at ‘00401F56’ we find this is located within a subroutine ‘sub_401AC2’.
By looking at when this subroutine finishes we find that it finishes at ‘00402237’. We now know that any entries for XOR between 00401AC2 and 00402237 fall under this subroutine, so we give it a name of XOR_OP_1.
If we then run the same search and look for the first xor operation that falls after 00402237 (004022DD), we find another subroutine (sub_40223A) which begins at 0040223A and ends at 004027EA, so we give it a name of XOR_OP_2. By repeating this process we find that the following six functions perform xor operations that look like they may be encoding, although the type of encoding is not yet clear.
- sub_401AC2: XOR_OP_1 (00401AC2 to 00402237)
- sub_40223A: XOR_OP_2 (0040223A to 004027EA)
- sub_4027ED: XOR_OP_3 (004027ED to 00402DA5)
- sub_402DA8: XOR_OP_4 (00402DA8 to 00403163)
- sub_403166: XOR_OP_5 (00403166 to 0040352A)
- sub_403990: XOR_OP_6 (00403990 to 00403A06)
There is one other XOR reference we haven’t accounted for; however, this falls within an OS library function so we’re not concerned about it.
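The noise reduction applied to the xor search can also be done outside of IDA on an exported text listing. The Python sketch below is a different filter from the IDA regex search used above: it assumes one instruction per line in IDA-style syntax and simply drops any xor whose two operands are identical. The function name and sample lines are hypothetical.

```python
import re

# Captures both operands of an xor instruction, ignoring a trailing comment.
XOR_RE = re.compile(r"\s*xor\s+(.+?)\s*,\s*(.+?)\s*(?:;.*)?$", re.IGNORECASE)


def is_interesting_xor(line: str) -> bool:
    """True for xor instructions that are not the register-clearing idiom."""
    match = XOR_RE.match(line)
    if not match:
        return False
    dst, src = match.group(1).lower(), match.group(2).lower()
    return dst != src


sample_listing = [
    "xor eax, eax",
    "xor ecx, 3Bh",
    "xor edx, [ebp+var_8]",
    "mov eax, ebx",
]
candidates = [line for line in sample_listing if is_interesting_xor(line)]
```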
Use static tools like FindCrypt2, KANAL, and the IDA Entropy Plugin to identify any other encoding mechanisms. How do these findings compare with the XOR findings?
Leveraging KANAL, we’re able to see references indicating this is likely using Rijndael (AES) encryption at addresses 0040C908 and 0040CA08.
If we examine the .rdata section using IDA-Ent and look for random data (Chunk Size: 256 - Max Entropy: 7.9), we find a number of results appear between 0040C8EC and 0040CA1A which aligns with what we found using KANAL.
Changing this to look for Base64-encoded strings (Chunk Size: 64 - Max Entropy: 5.95) reveals an area within the .data section that appears to be Base64-encoded between 0041209E and 004120A7.
As mentioned in Lab13-01, as a substitute for Findcrypt we can use Ghidra and a ported plugin FindCrypt-Ghidra to find evidence, once again of AES encryption being used.
- Td0: 0040db08
- Td1: 0040df08
- Td2: 0040e308
- Td3: 0040e708
- Te0: 0040cb08
- Te1: 0040cf08
- Te2: 0040d308
- Te3: 0040d708
Based on the above, these tools let us infer that AES is likely being used, which a simple XOR search alone would not have told us.
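Constant-scanning tools of this kind generally work by searching a binary for well-known cryptographic tables. The Python sketch below shows the idea using the first 16 bytes of the AES S-box and inverse S-box as signatures. The function name and the sample blob are hypothetical, and this is an illustration of the technique rather than how KANAL or FindCrypt are implemented.

```python
# First rows of the AES S-box and inverse S-box (FIPS 197).
AES_SIGNATURES = {
    "aes_sbox": bytes.fromhex("637c777bf26b6fc53001672bfed7ab76"),
    "aes_inv_sbox": bytes.fromhex("52096ad53036a538bf40a39e81f3d7fb"),
}


def find_aes_tables(blob: bytes) -> dict:
    """Return {signature name: file offset} for every signature found."""
    hits = {}
    for name, signature in AES_SIGNATURES.items():
        offset = blob.find(signature)
        if offset != -1:
            hits[name] = offset
    return hits


# Synthetic blob standing in for a section read from disk (assumption).
sample_blob = (b"\x00" * 32 + AES_SIGNATURES["aes_sbox"]
               + b"\x90" * 16 + AES_SIGNATURES["aes_inv_sbox"])
found = find_aes_tables(sample_blob)
```

Offsets returned this way are file offsets, so they would need converting to virtual addresses before being compared with what IDA or Ghidra reports.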
Which two encoding techniques are used in this malware?
From the analysis above we can make an informed assumption that the two techniques used in this malware are Base64-encoding using a custom index string and AES Encryption.
For each encoding technique, what is the key?
To determine the AES encryption key we go back to the earliest known address which appeared to be associated with encryption, and in this case it’s address 00401AC2 under XOR_OP_1 as we identified it looking for XOR operations. In this we can quickly see reference to a key being provided.
By examining cross-references to this function we find one reference within the main function. Slightly before the call we find a string being pushed to the stack which appears to be our key ‘ijklmnopqrstuvwx’.
At this point we can infer that the AES encryption key is likely ‘ijklmnopqrstuvwx’.
By looking at the start of the identified Base64-encoded data (0041209E) we can see reference to the Base64-encoding index string in use which we previously found using strings in question 1.
At this point we can infer that the custom index string used for this Base64-encoding is ‘CDEFGHIJKLMNOPQRSTUVWXYZABcdefghijklmnopqrstuvwxyzab0123456789+/’.
For the cryptographic encryption algorithm, is the key sufficient? What else must be known?
When it comes to AES encryption a key in itself isn’t always sufficient to decrypt the content encrypted using it. If we take a look at the AES decryption routines built into CyberChef, we can see that based on the AES standard implemented you will also likely need to know the key size and algorithm, any Initialisation Vector, and the mode used to encrypt the content.
What does this malware do?
Based on the dynamic and static analysis above we can begin to feel confident that this acts as a reverse command shell which beacons responses to commands run as encrypted output to the IP resolved from www.practicalmalwareanalysis.com. Exactly how this relates to the identified AES encryption and Base64-encoding is unknown so we explore the binary further.
Starting with the custom Base64-encoded index string, we find it is referenced in a subroutine at ‘sub_40103F’.
If we look at cross-references to this subroutine, we find it is called in another large routine at ‘sub_401082’.
Of note is that this appears to have 2 large loops, which at a high level may indicate a decoding or encoding function is implemented here, or that it is checking for user input. We’re not looking to examine every operation, but rather to identify the key aspects of this malware based on observed characteristics. As such, if we examine cross-references to ‘sub_401082’ we find it is called inside the routine labeled ‘StartAddress’, which is used to start its own thread. Examining cross-references once again, we find this is called within ‘sub_4015B7’.
What we can glean from this is that the Base64 decoding function looks to occur after cmd.exe is run, and if we continue looking through these cross-references we can see evidence that a socket connection to www.practicalmalwareanalysis.com occurs slightly before a pipe is created to pass the input and output from this socket directly to cmd.exe.
Based on this and our dynamic analysis, we begin to suspect that any input or commands sent from the C2 to cmd.exe will need to come in as a Base64-encoded string using the previously identified custom index string. To identify if this is related to the AES encryption we identified, we first need to know where AES operations are hiding. Based on our Findcrypt results in Ghidra, we know that anything labeled Td0 to Td3 is related to an AES decryption operation, and anything labeled Te0 to Te3 is related to an AES encryption operation.
In this instance, due to the addresses falling within XOR_OP_3 and XOR_OP_2 functions it means that:
- XOR_OP_3 is related to decryption
- XOR_OP_2 is related to encryption
By graphing the user defined cross-references to both XOR_OP_3 and XOR_OP_2, we find in addition to the above that XOR_OP_5 is related to decryption, and XOR_OP_4 is related to encryption.
From here we want to understand XOR_OP_1 and XOR_OP_6 and how they fit into this piece of malware. By graphing user-defined cross-references from the main function of the malware, we can see how XOR_OP_1 and XOR_OP_6 fit into the picture, and that the AES operations aren’t related to the Base64 functions we identified previously.
Of interest is that this flow doesn’t show either of the routines XOR_OP_3 or XOR_OP_5, indicating that the decryption routine may never be used in this binary. Looking at cross-references to XOR_OP_5, we find that it looks to be called by part of the program that is marked as code but has no clearly defined function. By moving to the start of this section (00403745) and creating a function, we see what looks to be an uncovered decryption routine.
At this point we can rename this ‘Decrypt_AES’, and take a very similar looking function we previously identified (0040352D) which calls XOR_OP_4, and rename this ‘Encrypt_AES’. After graphing user defined xrefs, we can see the relation between these 2 functions.
Finally we take a look at how this encryption function relates to the rest of the malware.
In this case we find that it is called within ‘sub_40132B’ which is started in a thread after cmd.exe has been run and our previous StartAddress call has run. It performs the AES encryption, and then writes the output back to the console (which in this case is redirecting the output back through the established socket).
Through all of this analysis we can conclude that this malware creates a reverse shell back to its C2 at www.practicalmalwareanalysis.com. Any incoming commands to the reverse shell are decoded using Base64 and a custom index string of ‘CDEFGHIJKLMNOPQRSTUVWXYZABcdefghijklmnopqrstuvwxyzab0123456789+/’, and any responses to those commands are encrypted using AES with a key of ijklmnopqrstuvwx before being sent back to the C2.
Create code to decrypt some of the content produced during dynamic analysis. What is this content?
To do this we first need to get some content. First off we fire up ApateDNS and netcat like we did back in question 1 of this lab exercise; however, this time we’re going to setup the netcat listener on a remote system and run Wireshark to capture traffic on one of the systems.
- nc.exe -nlvp 8910
After running the malware we quickly see a connection, and if we kill the malware with CTRL + C we then see what it was sending back to the server, which in this case looks like gibberish as it’s encrypted.
Examining the related TCP stream in Wireshark as a hex dump reveals different data which has been sent, in this case it is the below hex.
37 f3 1f 04 51 20 e0 b5 86 ac b6 0f 65 20 89 92 4f af 98 a4 c8 76 98 a6 4d d5 51 8f a5 cb 51 c5 cf 86 11 0d c5 35 38 5c 9c c5 ab 66 78 40 1d df
By using CyberChef, we can confirm our analysis about encryption by checking the decoded output of this hex using the key we identified and ‘0’ padding as required given no IV is used. From this output we can see it is the starting header shown when cmd.exe is run.
If we were to know the exact response format of what is sent from the C2 server, we could encode our commands using the custom Base64 index string we’ve identified such as the below; however, during testing this failed to get a response when sent from netcat, likely due to extra headers typically being sent from the legitimate C2 and the malware being unstable.
To manually decode the Base64-encoded data, we can use the following python2 script which is heavily taken from the PMA material.
    import string
    import base64

    s = ""
    index_string = 'CDEFGHIJKLMNOPQRSTUVWXYZABcdefghijklmnopqrstuvwxyzab0123456789+/'
    b64 = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
    b64_command = 'f2jxAY1r'

    for ch in b64_command:
        if (ch in index_string):
            s += b64[string.find(index_string, str(ch))]
        elif (ch == '='):
            s += '='

    print base64.decodestring(s)
This will take a command which has been encoded using our custom index string and decode it.
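For readers without a Python 2 interpreter, the same alphabet substitution can be sketched in Python 3 using translation tables, which also makes it easy to go in the other direction and encode a command. The function names are hypothetical; the index string is the one recovered from the binary.

```python
import base64

STANDARD = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
CUSTOM = "CDEFGHIJKLMNOPQRSTUVWXYZABcdefghijklmnopqrstuvwxyzab0123456789+/"

# Character-for-character mappings between the two alphabets.
TO_STANDARD = str.maketrans(CUSTOM, STANDARD)
TO_CUSTOM = str.maketrans(STANDARD, CUSTOM)


def custom_b64decode(text: str) -> bytes:
    """Map custom-alphabet text back to standard Base64, then decode."""
    return base64.b64decode(text.translate(TO_STANDARD))


def custom_b64encode(data: bytes) -> str:
    """Encode with standard Base64, then map into the custom alphabet."""
    return base64.b64encode(data).decode("ascii").translate(TO_CUSTOM)
```

The ‘=’ padding character is not part of either alphabet, so it passes through the translation untouched.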
To manually decrypt the AES encrypted data, for a change we can use the following python3 script created for this purpose.
    from Crypto.Cipher import AES
    from binascii import unhexlify

    # Hex dump of the TCP stream captured in Wireshark
    raw = ('37 f3 1f 04 51 20 e0 b5 86 ac b6 0f 65 20 89 92 '
           '4f af 98 a4 c8 76 98 a6 4d d5 51 8f a5 cb 51 c5 '
           'cf 86 11 0d c5 35 38 5c 9c c5 ab 66 78 40 1d df')
    ciphertext = unhexlify(raw.replace(' ', ''))

    # No IV is used by the malware, so an all-zero IV is supplied
    IV = unhexlify('00000000000000000000000000000000')

    # The key must be bytes (not str) under Python 3
    key = b'ijklmnopqrstuvwx'

    obj = AES.new(key, AES.MODE_CBC, IV)
    print(obj.decrypt(ciphertext).decode('UTF-8'))
The end result is 2 custom python scripts, one using Python2 and one using Python3, designed to decode both the Base64-encoded string and the AES encrypted data.
    python2 ./B64Decode.py
    whoami
    python3 ./AESDecrypt.py
    Microsoft Windows XP [Version 5.1.2600]
This concludes chapter 13, proceed to the next chapter.