APNIC Pty Ltd.

03/23/2022 | Press release | Distributed by Public on 03/22/2022 16:53

*This series explores real-life examples of advanced Yara uses, seeking to generalize them to form an abstract problem. It also explores general approaches to solve such uses, while examining the pros and cons of possible solutions. The material provided here should be educational for those who are new to Yara but should also be suitable for very experienced Yara users as it uncovers some fundamental issues.*

In my second post, I looked at how we can improve the overall threshold usage, and also report every file that crossed the maximum threshold.

Another way to approach this problem is based on Yara's native features for hexadecimal pattern definitions. Yara allows you to skip some bytes, essentially providing some sort of wildcarding in binary. We can use this feature to define larger patterns, such as those that start with Y and end with Z and have acceptable length. Later we can check if there happens to be X in between. Of course, you need to define all possible beginnings and endings by hand, which is a bit tedious.

Note the prefix , which provides access to the matched pattern length. It is essential when you use patterns that skip some bytes because we don't know the matched pattern length in advance. More information about the match length is available.

Here's how such a rule may look:

rule three_body_problem_7{

meta:

description ="Rule to detect 3 patterns in arbitrary order,

while limiting the distance between them."

strings:

$p= { ((22 22 22 22[4-12]33 33 33 33) | (33 33 33 33

[4-12]22 22 22 22)) }

$q= { ((33 33 33 33[4-12]11 11 11 11) | (11 11 11 11

[4-12]33 33 33 33)) }

$r= { ((11 11 11 11[4-12]22 22 22 22) | (22 22 22 22

[4-12]11 11 11 11)) }

$x= {11 11 11 11}

$y= {22 22 22 22}

$z= {33 33 33 33}

condition:

($pand

forany jin(1..#p):

(

forany iin(1..#x):

(@x[i]>= @p[j]+4and@x[i]<= @p[j]+!p[j]-4)

)

)

or

($qand

forany jin(1..#q):

(

forany iin(1..#y):

(@y[i] >= @q[j]+4and@y[i] <= @q[j]+!q[j]-4)

)

)

or

($rand

forany jin(1..#r):

(

forany iin(1..#z):

(@z[i] >= @r[j]+4and@z[i] <= @r[j]+!r[j]-4)

)

)

}

**Pros:**

- This rule can detect all variations without missing any patterns.
- It contains a reduced number of nested iterations (two versus three in the past rules).

**Cons:**

- Despite a lower number of nested iterations and lack of greedy memory consumption, the rule still produced too many loops in our tests and could just get stuck on some files because of the virtual machine (VM). During testing, the rule kept running for over two hours on a specially crafted file, so we had to interrupt.

Wildcarded hexadecimal pattern combinations

Following the idea of wildcarding hex-patterns, we can describe all the X, Y, Z sequences as follows:

rule three_body_problem_8{

meta:

description = "Rule to detect 3 patterns in arbitrary order,

while limiting the distance between them."

strings:

$p1= {22 22 22 22[0-8]11 11 11 11[0-8]33 33 33 33}

$p2= { 33 33 33 33[0-8]11 11 11 11[0-8]22 22 22 22}

$p3= {33 33 33 33[0-8]22 22 22 22[0-8]11 11 11 11}

$p4= {11 11 11 11[0-8]22 22 22 22[0-8]33 33 33 33}

$p5= {11 11 11 11[0-8]33 33 33 33[0-8]22 22 22 22}

$p6= { 22 22 22 22[0-8]33 33 33 33[0-8]11 11 11 11}

condition:

any of them

}

**Pros:**

- This is a fast rule.
- This rule has a very simple condition section.

**Cons:**

- It requires creating all possible sequences of the patterns by hand.
- This rule may be imprecise in some cases, for example, when the distance between the middle patterns and the edge patterns reaches the maximum (8 bytes in this example). So far, it may detect a pattern with X, Y, Z, and length M = 12+8*2 = 28, which is 8 bytes larger than what we were looking for. However, given the simplicity of the rule it's probably an acceptable error.

Precise wildcarded hexadecimal pattern combinations

The previous rule can be improved if we add a check for overall pattern match length.

This requires using iterations through sets of patterns and anonymously using prefixes and . This is required by the iteration set (), which expands to () and is walked through inside the loop body. To be able to refer to the current pattern at the current step of iteration, Yara allows the use instead of being specific (like ). If that sounds new to you, read more about applying conditions to multiple strings.

The second loop iterates through all matches (with count ) of the pattern and validates its length ( is the length of the i-th pattern match for the ).

rule three_body_problem_9{

meta:

description ="Rule to detect 3 patterns in arbitrary order,

while limiting the distance between them."

strings:

$p1= {22 22 22 22[0-8]11 11 11 11[0-8]33 33 33 33}

$p2= { 33 33 33 33[0-8]11 11 11 11[0-8]22 22 22 22}

$p3= {33 33 33 33[0-8]22 22 22 22[0-8]11 11 11 11}

$p4= {11 11 11 11[0-8]22 22 22 22[0-8]33 33 33 33}

$p5= {11 11 11 11[0-8]33 33 33 33[0-8]22 22 22 22}

$p6= { 22 22 22 22[0-8]33 33 33 33[0-8]11 11 11 11}

condition:

any of themand

forany of ($p*): (

forany iin(1..#): (![i] <= 20)

)

}

**Pros:**

- This is a fast rule.
- It adds boundary precision checks to the previous rule, which eliminates false positives.

**Cons:**

- The rule still contains a nested loop, which opens the door to a possible performance impact.

Precise wildcarded hexadecimal pattern combinations without a nested loop

There is a way to implement the previous rule excluding the nested loop with the rule below, which defines a single pattern with multiple contents variations, including all invariants.

rule three_body_problem_10{

meta:

description ="Rule to detect 3 patterns in arbitrary order,

while limiting the distance between them."

strings:

$p= { (

(11 11 11 11[0-8] ( (22 22 22 22[0-8]33 33 33 33) | (

33 33 33 33[0-8]22 22 22 22) ) ) |

(22 22 22 22[0-8] ( (11 11 11 11[0-8]33 33 33 33 ) | (

33 33 33 33[0-8]11 11 11 11) ) ) |

(33 33 33 33[0-8] ( (11 11 11 11[0-8]22 22 22 22) | (

22 22 22 22[0-8] 11 11 11 11) ) )

) }

condition:

any of ($p*)andforany of iin(1..#p): (!p[i] <= 20)

}

**Pros:**

- This is a fast rule.
- It contains boundary precision checks like the previous rule, which eliminates false positives.
- It contains no alarming nested loops.

**Cons:**

- There is a length limit for a wildcarded hexadecimal Yara pattern, which is fixed at 200 bytes, so this method may not be applicable in some cases.
- The more wildcards there are the slower the rule becomes - this rule is about four times slower than the previous one according to our tests.
- It's hard to comprehend and create such a rule by hand.

The last con, indeed, hardly makes this approach useful. However, if you want to try, you can use a single-liner shell command to produce a combined pattern.

# First of all, define variables with hex patterns and bytes to skip

X="11 11 11 11" Y="22 22 22 22" Z="33 33 33 33" S="[0-8]"

# Produce the pattern combinations

echo-e "{(\n($X $S (($Y $S $Z)|($Z $S $Y)))|\n($Y $S (($X $S

$Z)|($Z $S $X)))|\n($Z $S(($X $S $Y)|($Y $S $Z)))\n)}"

This should produce the following output:

{(

(11 11 11 11 [0-8] ((22 22 22 22 [0-8] 33 33 33 33)|(33 33 33 33 [0-8] 22 22 22 22)))|

(22 22 22 22 [0-8] ((11 11 11 11 [0-8] 33 33 33 33)|(33 33 33 33 [0-8] 11 11 11 11)))|

(33 33 33 33 [0-8] ((11 11 11 11 [0-8] 22 22 22 22)|(22 22 22 22 [0-8] 33 33 33 33)))

)}

In my next post, I will look at how you might combine Yara with other tools and break free of the VM restraints to have full control over the condition validation process.

Appendix: Yara test set file generator

Should you like to try out the proposed solution or keep finding the best one, feel free to use the following python script that generates initial test files with some special cases mentioned in this series. Note, that good solutions **must** detect all sample_good* files and no sample_bad* files.

X = b"\x11\x11\x11\x11" Y = b"\x22\x22\x22\x22" Z = b"\x33\x33\x33\x33" J = b" " #Junk or blank space with open("sample_good1.dat","wb") as f: f.write( X + Y + Z ) and f.close() with open("sample_good2.dat","wb") as f: f.write( Y + X + J*100 + X + Y + Z ) and f.close() with open("sample_good3.dat","wb") as f: f.write( Z + J*4 + Y + X ) and f.close() with open("sample_good4.dat","wb") as f: f.write( (X+J)*90+ J*20 + X + J*4 + Y + J*4 + Z ) and f.close() with open("sample_good5.dat","wb") as f: f.write( X + J*4 + Y + J*4 + Z + \ J*100 + X + J*4 + Y + Z + \ J*100 + X + Y + Z ) f.close() with open("sample_good6.dat","wb") as f: f.write( (X+J)*100 + J*2 + Y + J*4 + Z ) and f.close() with open("sample_bad1.dat","wb") as f: f.write( X + 16*J + Y + 16*J + Z ) and f.close() with open("sample_bad2.dat","wb") as f: f.write( X + 8*J + Y ) and f.close() with open("sample_bad3.dat","wb") as f: f.write( X + 8*J + Y + 8*J + Z ) and f.close() with open("sample_bad4.dat","wb") as f: f.write( ( X + 8*J + Y + 8*J + Z+8*J) * 1000 ) and f.close() with open("sample_bad5.dat","wb") as f: f.write( ( X*100 + 8*J + Y + 8*J + Z*100) * 1278) and f.close()

*Vitaly Kamluk is Director of the Global Research and Analysis Team for Kaspersky Asia Pacific.*

The views expressed by the authors of this blog are their own and do not necessarily reflect the views of APNIC. Please note a Code of Conduct applies to this blog.