iocextract

Developed by InQuest

Advanced Indicator of Compromise (IOC) extractor.

Overview

This library extracts URLs, IP addresses, email addresses, MD5/SHA hashes, and YARA rules from text corpora. It includes obfuscated and “defanged” IOCs in the output, and can optionally deobfuscate them.

The Problem

It is common practice for malware analysts or endpoint software to “defang” IOCs such as URLs and IP addresses, in order to prevent accidental exposure to live malicious content. Being able to extract and aggregate these IOCs is often valuable for analysts. Unfortunately, existing “IOC extraction” tools often miss them, because standard regular expressions do not match the defanged forms.

For example, consider the simple defanging technique of surrounding periods with brackets:

127[.]0[.]0[.]1

Existing tools that use a simple IP address regex will ignore this IOC entirely.
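
To make the gap concrete, here is a minimal sketch using only Python's standard re module. The strict dotted-quad pattern is an illustrative assumption, not the regex of any particular tool.

```python
import re

# A typical strict IPv4 pattern (illustrative assumption, not taken from a specific tool).
STRICT_IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

plain = "beacon to 127.0.0.1 observed"
defanged = "beacon to 127[.]0[.]0[.]1 observed"

plain_hits = STRICT_IPV4.findall(plain)        # the ordinary address is found
defanged_hits = STRICT_IPV4.findall(defanged)  # the bracketed form is missed

print(plain_hits)
print(defanged_hits)
```

The second list comes back empty because every period in the defanged address is followed by a bracket rather than a digit.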

The Solution

By combining specially crafted regex with some custom postprocessing, we are able to both detect and deobfuscate “defanged” IOCs. This saves time and effort for the analyst, who might otherwise have to manually find and convert IOCs into machine-readable format.
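
The general idea can be sketched with the standard library alone. The pattern and the helper names find_defanged_ipv4s and refang_ipv4_sketch are hypothetical and far simpler than what iocextract ships; they only illustrate "tolerant regex plus postprocessing".

```python
import re

# Tolerant pattern: each dot may be wrapped in an optional bracket or parenthesis,
# or preceded by a backslash. Simplified for illustration.
DEFANGED_IPV4 = re.compile(r"\b\d{1,3}(?:[\[\(\\]?\.[\]\)]?\d{1,3}){3}\b")


def find_defanged_ipv4s(text):
    """Yield candidate IPv4 strings, defanged or not, in order of appearance."""
    for match in DEFANGED_IPV4.finditer(text):
        yield match.group(0)


def refang_ipv4_sketch(candidate):
    """Postprocessing step: drop the wrapper characters around the dots."""
    return re.sub(r"[\[\]\(\)\\]", "", candidate)


text = "C2 at 127[.]0[.]0[.]1, fallback 10(.)0(.)0(.)2, plain 192.168.0.1"
found = list(find_defanged_ipv4s(text))
refanged = [refang_ipv4_sketch(ip) for ip in found]

print(found)
print(refanged)
```

Keeping detection and cleanup as separate steps means the defanged form can still be reported as-is when refanging is not wanted.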

A Simple Use Case

Many Twitter users post C2s or other valuable IOC information with defanged URLs. For example, this tweet from @InQuest:

Recommended reading and great work from @unit42_intel:
https://researchcenter.paloaltonetworks.com/2018/02/unit42-sofacy-attacks-multiple-government-entities/ ...
InQuest customers have had detection for threats delivered from hotfixmsupload[.]com
since 6/3/2017 and cdnverify[.]net since 2/1/18.

If we run this through the extractor, we can easily pull out the URLs:

https://researchcenter.paloaltonetworks.com/2018/02/unit42-sofacy-attacks-multiple-government-entities/
hotfixmsupload[.]com
cdnverify[.]net

Passing in refang=True at extraction time would remove the obfuscation, but since these are real IOCs, let’s leave them defanged in our documentation. :)

Installation

You may need to install the Python development headers in order to install the regex dependency. On Ubuntu/Debian-based systems, try:

sudo apt-get install python-dev

Then install iocextract from pip:

pip install iocextract

If you have problems installing on Windows, try installing regex directly by downloading the appropriate wheel from PyPI and running e.g.:

pip install regex-2018.06.21-cp27-none-win_amd64.whl

Usage

Try extracting some defanged URLs:

>>> content = """
... I really love example[.]com!
... All the bots are on hxxp://example.com/bad/url these days.
... C2: tcp://example[.]com:8989/bad
... """
>>> import iocextract
>>> for url in iocextract.extract_urls(content):
...     print(url)
...
hxxp://example.com/bad/url
tcp://example[.]com:8989/bad
example[.]com
tcp://example[.]com:8989/bad

Note that some URLs may show up twice if they are caught by multiple regexes.
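
If duplicates are unwanted, one option is to filter the iterator while preserving first-seen order. The helper name unique_iocs is hypothetical, and the sample values stand in for the extractor output shown above so that the sketch runs without the library installed.

```python
def unique_iocs(iocs):
    """Yield each IOC once, in first-seen order, without building a full list first."""
    seen = set()
    for ioc in iocs:
        if ioc not in seen:
            seen.add(ioc)
            yield ioc


# Stand-in for iocextract.extract_urls(content); values copied from the output above.
sample = iter([
    "hxxp://example.com/bad/url",
    "tcp://example[.]com:8989/bad",
    "example[.]com",
    "tcp://example[.]com:8989/bad",
])

deduped = list(unique_iocs(sample))
print(deduped)
```

Because unique_iocs is itself a generator, it keeps the lazy behavior of the extractor and only stores the set of values already seen.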

If you want, you can also “refang”, or remove common obfuscation methods from IOCs:

>>> for url in iocextract.extract_urls(content, refang=True):
...     print(url)
...
http://example.com/bad/url
http://example.com:8989/bad
http://example.com
http://example.com:8989/bad

You can even extract and decode hex-encoded URLs:

>>> content = '612062756e6368206f6620776f72647320687474703a2f2f6578616d706c652e636f6d2f70617468206d6f726520776f726473'
>>> for url in iocextract.extract_urls(content):
...     print(url)
...
687474703a2f2f6578616d706c652e636f6d2f70617468
>>> for url in iocextract.extract_urls(content, refang=True):
...     print(url)
...
http://example.com/path
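
The refanged result can be checked by hand. This sketch (Python 3) only decodes the hex strings with the standard library; it does not reproduce how iocextract locates the encoded URL inside the larger input.

```python
# Hex string taken from the extractor output above.
encoded = "687474703a2f2f6578616d706c652e636f6d2f70617468"

# bytes.fromhex turns each pair of hex digits into one byte.
decoded = bytes.fromhex(encoded).decode("ascii")
print(decoded)

# The full input decodes to ordinary text with the URL embedded in it.
full = "612062756e6368206f6620776f72647320687474703a2f2f6578616d706c652e636f6d2f70617468206d6f726520776f726473"
full_decoded = bytes.fromhex(full).decode("ascii")
print(full_decoded)
```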

All extract_* functions in this library return iterators, not lists. The benefit is that iocextract can process extremely large inputs with very low memory overhead, because results are produced one at a time. However, an iterator can only be consumed once, so if you need to iterate over the IOCs more than once, save the results as a list:

>>> list(iocextract.extract_urls(content))
['hxxp://example.com/bad/url', 'tcp://example[.]com:8989/bad', 'example[.]com', 'tcp://example[.]com:8989/bad']
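
The single-pass behavior is ordinary Python generator semantics. The sketch below uses a hypothetical stand-in generator, fake_extract, rather than the library itself, to show what happens on a second pass.

```python
def fake_extract(data):
    """Hypothetical stand-in for an extract_* function: yields tokens lazily."""
    for token in data.split():
        if "[.]" in token:
            yield token


results = fake_extract("see example[.]com and cdn[.]example[.]net today")

first_pass = list(results)   # consumes the generator
second_pass = list(results)  # nothing is left to yield

print(first_pass)
print(second_pass)
```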

A command-line tool is also included:

$ iocextract -h
usage: iocextract [-h] [--input INPUT] [--output OUTPUT] [--extract-ips]
                  [--extract-urls] [--extract-yara-rules] [--extract-hashes]
                  [--refang] [--strip-urls]

Advanced Indicator of Compromise (IOC) extractor. If no arguments are
specified, the default behavior is to extract all IOCs.

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         default: stdin
  --output OUTPUT       default: stdout
  --extract-ips
  --extract-urls
  --extract-yara-rules
  --extract-hashes
  --refang              default: no
  --strip-urls          remove possible garbage from the end of urls. default:
                        no
  --wide                preprocess input to allow wide-encoded character
                        matches. default: no

Only URLs and IPv4 addresses can be “refanged”.

More Details

This library currently supports the following IOCs:

  • IP Addresses
    • IPv4 fully supported
    • IPv6 partially supported
  • URLs
    • With protocol specifier: http, https, tcp, udp, ftp, sftp, ftps
    • With [.] anchor, even with no protocol specifier
    • IPv4 and IPv6 (RFC2732) URLs are supported
    • Hex-encoded URLs with protocol specifier: http, https, ftp
    • URL-encoded URLs with protocol specifier: http, https, ftp, ftps, sftp
  • Emails
    • Partially supported, anchoring on @
  • YARA rules
  • Hashes
    • MD5
    • SHA1
    • SHA256
    • SHA512
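
For the hash types at the end of the list above, extraction largely comes down to matching hexadecimal strings of the right length: 32 characters for MD5, 40 for SHA1, 64 for SHA256, and 128 for SHA512. The following is a simplified sketch of that idea with hypothetical names; it is not the pattern set iocextract uses.

```python
import hashlib
import re

# Hex digest lengths for each supported hash type.
HASH_LENGTHS = {"md5": 32, "sha1": 40, "sha256": 64, "sha512": 128}

# Word boundaries keep a long digest from also matching as a shorter one.
HASH_PATTERNS = {
    name: re.compile(r"\b[a-fA-F0-9]{%d}\b" % length)
    for name, length in HASH_LENGTHS.items()
}


def find_hashes(text):
    """Return a dict mapping hash type to the digests of that length found in text."""
    return {name: pattern.findall(text) for name, pattern in HASH_PATTERNS.items()}


# Build sample input from real digests so the lengths are guaranteed correct.
md5_digest = hashlib.md5(b"sample").hexdigest()
sha256_digest = hashlib.sha256(b"sample").hexdigest()
text = "dropper md5 %s, payload sha256 %s" % (md5_digest, sha256_digest)

hashes = find_hashes(text)
print(hashes)
```

A length-only match cannot tell a real digest from any other hex string of the same length, so results from a sketch like this should be treated as candidates.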

For IPv4 addresses, the following defang techniques are supported:

Technique        Defanged       Refanged
. -> [.]         1[.]1[.]1[.]1  1.1.1.1
. -> (.)         1(.)1(.)1(.)1  1.1.1.1
. -> \.          1\.1\.1\.1     1.1.1.1
Partial          1[.1[.1.]1     1.1.1.1
Any combination  1.)1[.1.)1     1.1.1.1

For URLs, the following defang techniques are supported:

Technique        Defanged                                        Refanged
. -> [.]         example[.]com/path                              http://example.com/path
. -> (.)         example(.)com/path                              http://example.com/path
. -> \.          example\.com/path                               http://example.com/path
Partial          http://example[.com/path                        http://example.com/path
/ -> [/]         http://example.com[/]path                       http://example.com/path
Cisco ESA        http:// example .com /path                      http://example.com/path
:// -> __        http__example.com/path                          http://example.com/path
hxxp             hxxp://example.com/path                         http://example.com/path
Any combination  hxxp__ example( .com[/]path                     http://example.com/path
Hex encoded      687474703a2f2f6578616d706c652e636f6d2f70617468  http://example.com/path
URL encoded      http%3A%2F%2fexample%2Ecom%2Fpath               http://example.com/path

Note that the table above is not exhaustive, and other URL/defang patterns may also be extracted correctly. If you notice something missing or not working correctly, feel free to let us know via the GitHub Issues.
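
Several of the URL techniques listed above can be approximated with plain string handling. The function simple_refang_url below is a hypothetical, much-reduced sketch (Python 3) covering only the hxxp, __, bracketed-separator, and URL-encoded cases; the library's own refang_url handles more.

```python
import re
from urllib.parse import unquote


def simple_refang_url(url):
    """Undo a handful of common defang techniques (illustrative only)."""
    # URL-encoded input such as http%3A%2F%2F... is decoded first.
    if re.match(r"(?i)^[a-z]+%3a%2f%2f", url):
        url = unquote(url)
    # hxxp / hxxps -> http / https
    url = re.sub(r"(?i)^hxxp", "http", url)
    # scheme__host -> scheme://host
    url = re.sub(r"^([a-zA-Z]+)__", r"\1://", url)
    # [.] (.) \. -> .   and   [/] -> /
    url = re.sub(r"\[\.\]|\(\.\)|\\\.", ".", url)
    url = url.replace("[/]", "/")
    return url


examples = [
    "hxxp://example.com/path",
    "http__example.com/path",
    "http://example[.]com/path",
    "http://example.com[/]path",
    "http%3A%2F%2fexample%2Ecom%2Fpath",
]
refanged = [simple_refang_url(u) for u in examples]
print(refanged)
```

Each step is a narrow substitution, so unrelated parts of the URL such as the path are left alone. Partial brackets and the Cisco ESA spacing are deliberately out of scope here.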

Changelog

New features, improvements, and bugfixes for each release can be found in the GitHub releases.

Contributing

If you have a defang technique that doesn’t make it through the extractor, or if you find any bugs, PRs and Issues are always welcome. The library is released under a “BSD-New” (aka “BSD 3-Clause”) license.

Module Documentation

Extract and optionally refang Indicators of Compromise (IOCs) from text.

All methods return iterator objects, not lists. If for some reason you need a list, do e.g.: list(extract_iocs(my_data)).

Otherwise, you can iterate over the objects (e.g. in a for loop) normally. Each object yielded from the generators will be of type str.

iocextract.defang(ioc)

Defang a URL, domain, or IPv4 address.

Parameters: ioc – String URL, domain, or IPv4 address.
Return type: str

iocextract.extract_emails(data)

Extract email addresses.

Parameters: data – Input text
Return type: Iterator[str]

iocextract.extract_hashes(data)

Extract MD5/SHA hashes.

Results are returned as an itertools.chain iterable object which lazily provides the results of the other extract_*_hashes generators.

Parameters: data – Input text
Return type: itertools.chain()

iocextract.extract_iocs(data, refang=False, strip=False)

Extract all IOCs.

Results are returned as an itertools.chain iterable object which lazily provides the results of the other extract_* generators.

Parameters:
  • data – Input text
  • refang (bool) – Refang output?
  • strip (bool) – Strip possible garbage from the end of URLs
Return type:

itertools.chain()
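
The lazy chaining described for this function is standard itertools behavior. In this sketch the two generators are hypothetical stand-ins for the individual extract_* functions; it shows that chaining them does no work until the result is iterated.

```python
import itertools

calls = []  # records when each stand-in generator actually starts running


def fake_urls(data):
    calls.append("urls")
    yield "example[.]com"


def fake_ips(data):
    calls.append("ips")
    yield "127[.]0[.]0[.]1"


data = "example[.]com resolves to 127[.]0[.]0[.]1"
combined = itertools.chain(fake_urls(data), fake_ips(data))

calls_before = list(calls)  # still empty: no generator body has run yet
iocs = list(combined)       # iterating drives each generator in turn
calls_after = list(calls)

print(calls_before, iocs, calls_after)
```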

iocextract.extract_ips(data, refang=False)

Extract IP addresses.

Includes both IPv4 and IPv6 addresses.

Parameters:
  • data – Input text
  • refang (bool) – Refang output?
Return type:

itertools.chain()

iocextract.extract_ipv4s(data, refang=False)

Extract IPv4 addresses.

Parameters:
  • data – Input text
  • refang (bool) – Refang output?
Return type:

Iterator[str]

iocextract.extract_ipv6s(data)

Extract IPv6 addresses.

Not guaranteed to catch all valid IPv6 addresses.

Parameters: data – Input text
Return type: Iterator[str]

iocextract.extract_md5_hashes(data)

Extract MD5 hashes.

Parameters: data – Input text
Return type: Iterator[str]

iocextract.extract_sha1_hashes(data)

Extract SHA1 hashes.

Parameters: data – Input text
Return type: Iterator[str]

iocextract.extract_sha256_hashes(data)

Extract SHA256 hashes.

Parameters: data – Input text
Return type: Iterator[str]

iocextract.extract_sha512_hashes(data)

Extract SHA512 hashes.

Parameters: data – Input text
Return type: Iterator[str]

iocextract.extract_urls(data, refang=False, strip=False)

Extract URLs.

Parameters:
  • data – Input text
  • refang (bool) – Refang output?
  • strip (bool) – Strip possible garbage from the end of URLs
Return type:

Iterator[str]

iocextract.extract_yara_rules(data)

Extract YARA rules.

Parameters: data – Input text
Return type: Iterator[str]

iocextract.main()

Run as a command-line utility.

iocextract.refang_ipv4(ip_address)

Refang an IPv4 address.

Parameters: ip_address – String IPv4 address.
Return type: str

iocextract.refang_url(url)

Refang a URL.

Parameters: url – String URL
Return type: str
