The new Snort HTTP inspector (HI) is divided into two major parts. The HttpStreamSplitter
(splitter) accepts TCP payload data from Stream and subdivides it into message sections.
HttpInspect (inspector) processes individual message sections.

Splitter finish() is called by Stream when the TCP connection closes (including pruning).
It serves several specialized purposes in cases where the HTTP message is truncated (ends
unexpectedly).

The nature of splitting allows packets to be forwarded before they are aggregated into a message
section and inspected. This may lead to problems when the target consumes a partial message
body even though the end of the message body was never received because Snort blocked it.

Script detection is a feature developed to solve this problem for message bodies containing
Javascripts. The stream splitter scan() method searches its input for the end-of-script tag
"</script>". When necessary this requires scan() to unzip the data. This is an extra unzip as
storage limitations preclude saving the unzipped version of the data for subsequent reassembly.

When the end of a script is found and the normal flush point has not been found, the current TCP
segment and all previous segments for the current message section are flushed using a special
procedure known as partial inspection. From the perspective of Stream (or H2I) a partial inspection
is a regular flush in every respect.

scan() calls prep_partial_flush() to prepare for the partial inspection. Then it returns a normal
flush point to Stream at the end of the current TCP segment. Partial inspections perform all of the
functions of a regular inspection including forwarding data to file processing and detection.

The difference between a partial inspection and a regular inspection is reassemble() saves the
input data for future reuse. Eventually there will be a regular full inspection of the entire
message section. reassemble() will accomplish this by combining the input data for the partial
inspection with later data that complete the message section.

Correct and efficient execution of a full inspection following a partial inspection requires
special handling of certain functions. Unzipping is only done once in reassemble(). The stored
input in reassemble() has already been through dechunking and unzipping. Data is forwarded to file
processing during the partial inspection and duplicate data will not be forwarded again. Some
of the message body normalization steps are done once during partial inspection with work
products saved for reuse.

It is possible to do more than one partial inspection of a single message section. Each partial
inspection is cumulative, covering the new data and all previous data.

Compared to just doing a full inspection, a partial inspection followed by a partial inspection
will not miss anything. The benefits of partial inspection are in addition to the benefits of a
full inspection.

The http_inspect partial inspection mechanism is also used by http2_inspect to manage frame
boundaries. When inspecting HTTP/2, a partial inspection by http_inspect may occur because script
detection triggered it, because H2I wanted it, or both. 

Some applications may be affected by blocks too late scenarios related to seeing part of the
zero-length chunk. For example a TCP packet that ends with:
    8<CR><LF>abcdefgh<CR><LF>0
might be sufficient to forward the available data ("abcdefgh") to the application even though the
final <CR><LF> has not been received.
Note that the actual next bytes are uncertain here. The next packet might begin with <CR><LF>, but
    100000<CR><LF>ijklmnopq ...
is another perfectly legal possibility. There is no rule against starting a nonzero chunk length
with a zero character and some applications reputedly do this.
As a precaution partial inspections performed when 1) a TCP segment ends inside a possible
zero-length chunk or 2) chunk processing fails (broken chunk).

HttpFlowData is a data class representing all HI information relating to a flow. It serves as
persistent memory between invocations of HI by the framework. It also glues together the inspector,
the client-to-server splitter, and the server-to-client splitter which pass information through the
flow data.

Message section is a core concept of HI. A message section is a piece of an HTTP message that is
processed together. There are eight types of message section:

1. Request line (client-to-server start line)
2. Status line (server-to-client start line)
3. Headers (all headers after the start line as a group)
4. Content-Length message body (a block of message data usually not much larger than 16K from a
   body defined by the Content-Length header)
5. Chunked message body (same but from a chunked body)
6. Old message body (same but from a body with no Content-Length header that runs to connection
   close)
7. HTTP/2 message body (same but content taken from an HTTP/2 Data frame)
8. Trailers (all header lines following a chunked body as a group)

Message sections are represented by message section objects that contain and process them. There
are twelve message section classes that inherit as follows. An asterisk denotes a virtual class.

1. HttpMsgSection* - top level with all common elements
2. HttpMsgStart* : HttpMsgSection - common elements of request and status
3. HttpMsgRequest : HttpMsgStart
4. HttpMsgStatus : HttpMsgStart
5. HttpMsgHeadShared* : HttpMsgSection - common elements of header and trailer
6. HttpMsgHeader : HttpMsgHeadShared
7. HttpMsgTrailer : HttpMsgHeadShared
8. HttpMsgBody* : HttpMsgSection - common elements of message body processing
9. HttpMsgBodyCl : HttpMsgBody
10. HttpMsgBodyChunk : HttpMsgBody
11. HttpMsgBodyOld : HttpMsgBody
12. HttpMsgBodyH2 : HttpMsgBody

An HttpTransaction is a container that keeps all the sections of a message together and associates
the request message with the response message. Transactions may be organized into pipelines when an
HTTP pipeline is present. The current transaction and any pipeline live in the flow data. A
transaction may have only a request because the response is not (yet) received or only a response
because the corresponding request is unknown or unavailable.

The attach_my_transaction() factory method contains all the logic that makes this work. There are
many corner cases. Don't mess with it until you fully understand it.

Message sections implement the Just-In-Time (JIT) principle for work products. A minimum of
essential processing is done under process(). Other work products are derived and stored the first
time detection or some other customer asks for them.

HI also supports defining custom "x-forwarded-for" type headers. In a multi-vendor world, it is
quite possible that the header name carrying the original client IP could be vendor-specific. This
is due to the absence of standardization which would otherwise standardize the header name. In such
a scenario, it is important to provide a configuration with which such x-forwarded-for type headers
can be introduced to HI. The headers can be introduced with the xff_headers configuration. The
default value of this configuration is "x-forwarded-for true-client-ip". The default definition
introduces the two commonly known "x-forwarded-for" type headers and is preferred in the same order
by the inspector as they are defined, e.g "x-forwarded-for" will be preferred than "true-client-ip"
if both headers are present in the stream. Every HTTP Header is mapped to an ID internally. The
custom headers are mapped to a dynamically generated ID and the mapping is appended at the end
of the mapping of the known HTTP headers. Every HI instance can have its own list of custom
headers and thus an instance of HTTP header mapping list is also associated with an HI instance.

The Field class is an important tool for managing JIT. It consists of a pointer to a raw message
field or derived work product with a length field. Various negative length values specify the
status of the field. For instance STAT_NOTCOMPUTE means the item has not been computed yet,
STAT_NOTPRESENT means the item does not exist, and STAT_PROBLEMATIC means an attempt to compute the
item failed. Never dereference the pointer without first checking the length value.

All of these values and more are in http_enums.h which is a general repository for enumerated
values in HI.

A Field is intended to represent an immutable object. It is either part of the original message
section or it is a work product that has been derived from the original message section. In the
former case the original message is constant and there is no reason for a Field value to change. In
the latter case, once the value has been derived from the original message there is no reason to
derive it again.

Once Field is set to a non-null value it should never change. The set() functions will assert if
this rule is disregarded.

A Field may own the buffer containing the message or it may point to a buffer that belongs to
someone else. When a Field owning a buffer is deleted the buffer is deleted as well. Ownership is
determined with the Field is initially set. In general any dynamically allocated buffer should be
owned by a Field. If you follow this rule you won't need to keep track of allocated buffers or have
delete[]s all over the place.

HI implements flow depth using the request_depth and response_depth parameters. HI seeks to provide
a consistent experience to detection by making flow depth independent of factors that a sender
could easily manipulate, such as header length, chunking, compression, and encodings. The maximum
depth is computed against normalized message body data.

HttpUri is the class that represents a URI. HttpMsgRequest objects have an HttpUri that is created
during analyze().

URI normalization is performed during HttpUri construction in four steps.

Step 1: Identify the type of URI.

HI recognizes four types of URI:

1. Asterisk: a lone ‘*’ character signifying that the request does not refer to any resource in
particular. Often used with the OPTIONS method. This is not normalized.

2. Authority: any URI used with the CONNECT method. The entire URI is treated as an authority.

3. Origin: a URI that begins with a slash. Consists of only an absolute path with no scheme or
authority present.

4. Absolute: a URI which includes a scheme and a host as well as an absolute path. E.g.
http://example.com/path/to/resource.

In addition there are malformed URIs that don't meet any of the four types. These are protocol
errors and will trigger an alert. Because their format is unrecognized they are not normalized.

Step 2: Decompose the URI into its up to six constituent pieces: scheme, host, port, path, query,
and fragment.

Based on the URI type the overall URI is divided into scheme, authority, and absolute path. The
authority is subdivided into host and port. The absolute path is subdivided into path, query, and
fragment.

The raw URI pieces can be accessed via rules. For example: http_raw_uri: query; content: “foo”
will only match the query portion of the URI.

Step 3: Normalize the individual pieces.

The port is not normalized. The scheme is normalized to lower case. The other four pieces are
normalized in a fashion similar to 2.X with an important exception. Path-related normalizations
such as eliminating directory traversals and squeezing out extra slashes are only done for the
path.

The normalized URI pieces can be accessed via rules. For example: http_uri: path; content:
“foo/bar”.

Step 4: Stitch the normalized pieces back together into a complete normalized URI.

This allows rules to be written against a normalized whole URI as is done in 2.X.

The procedures for normalizing the individual pieces are mostly identical to 2.X. Some points
warrant mention:

1. HI considers it to be normal for reserved characters to be percent encoded and does not
generate an alert. The 119/1 alert is used only for unreserved characters that are found to be
percent encoded. The ignore_unreserved configuration option allows the user to specify a list of
unreserved characters that are exempt from this alert.

2. Plus to space substitution is a configuration option. It is not currently limited to the query
but that would not be a difficult feature to add.

3. The 2.X multi_slash and directory options are combined into a single option called
simplify_path.

Algorithm for reassembling chunked message bodies:

NHI parses chunked message bodies using an algorithm based on the HTTP RFC. Chunk headers are not
included in reassembled message sections and do not count against the message section length. The
attacker cannot affect split points by adjusting their chunks.

Built-in alerts for chunking are generated for protocol violations and suspicious usages. Many
irregularities can be compensated for but others cannot. Whenever a fatal problem occurs, NHI
generates 119:213 HTTP chunk misformatted and converts to a mode very similar to run to connection
close. The rest of the flow is sent to detection as-is. No further attempt is made to dechunk the
message body or look for the headers that begin the next message. The customer should block 119:213
unless they are willing to run the risk of continuing with no real security.

In addition to 119:213 there will often be a more specific alert based on what went wrong.

From the perspective of NHI, a chunked message body is a sequence of zero or more chunks followed
by a zero-length chunk. Following the zero-length chunk there will be trailers which may be empty
(CRLF only).

Each chunk begins with a header and is parsed as follows:

1. Zero or more unexpected CR or LF characters. If any are present 119:234 is generated and
processing continues.

2. Zero or more unexpected space and tab characters. If any are present 119:214 is generated. If
five or more are present that is a fatal error as described above and chunk processing stops.

3. Zero or more '0' characters. Leading zeros before other digits are meaningless and ignored. A
chunk length consisting solely of zeros is the zero-length chunk. Five or more leading zeros
generate 119:202 regardless of whether the chunk length eventually turns out to be zero or nonzero.

4. The chunk length in hexadecimal format. The chunk length may be zero (see above) but it must be
present. Both upper and lower case hex letters are acceptable. The 0x prefix for hex numbers is not
acceptable.

The goal here is a hexadecimal number followed by CRLF ending the chunk header. Many things may go
wrong:

* More than 8 hex digits other than the leading zeros. The number is limited by Snort to fit into
  32 bits and if it does not that is a fatal error.
* The CR may be missing, leaving a bare LF as the separator. That generates 119:235 after which
  processing continues normally.
* There may be one or more trailing spaces or tabs following the number. If any are present 119:214
  is generated after which processing continues normally.
* There may be chunk options. This is legal and parsing is supported but options are so unexpected
  that they are suspicious. 119:210 is generated.
* There may be a completely illegal character in the chunk length (other than those mentioned
  above). That is a fatal error.
* The character following the CR may not be LF. This is a fatal error. This is different from
  similar bare CR errors because it does not provide a transparent data channel. An "innocent"
  sender that implements this error has no way to transmit chunk data that begins with LF.

5. Following the chunk header should be a number of bytes of transparent user data equal to the
chunk length. This is the part of the chunked message body that is reassembled and inspected.
Everything else is discarded.

6. Following the chunk data should be CRLF which do not count against the chunk length. These are
not present for the zero length chunk. If one of the two separators is missing, 119:234 is
generated and processing continues normally. If there is no separator at all that is a fatal error.

Then we return to #1 as the next chunk begins. In particular extra separators beyond the two
expected are attributed to the beginning of the next chunk.

Test tool usage instructions:

The HI test tool consists of two features. test_output provides extensive information about the
inner workings of HI. It is strongly focused on showing work products (Fields) rather than being a
tracing feature. Given a problematic pcap, the developer can see what the input is, how HI
interprets it, and what the output to rule options will be. Several related configuration options
(see help) allow the developer to customize the output.

test_input is provided by the HttpTestInput class. It allows the developer to write tests that
simulate HTTP messages split into TCP segments at specified points. The tests cover all of splitter
and inspector and the impact on downstream customers such as detection and file processing. The
test_input option activates a modified form of test_output. It is not necessary to also specify
test_output.

The test input comes from the file http_test_msgs.txt in the current directory. Enter HTTP test
message text as you want it to be presented to the StreamSplitter.

The easiest way to format is to put a blank line between message sections so that each message
section is its own "paragraph". Within a paragraph the placement of single new lines does not have
any effect. Format a paragraph any way you are comfortable. Extra blank lines between paragraphs
also do not have any effect.

Each paragraph represents a TCP segment. The splitter can be tested by putting multiple sections in
the same paragraph (splitter must split) or continuing a section in the next paragraph (splitter
must search and reassemble).

Lines beginning with # are comments. Lines beginning with @ are commands. These do not apply to
lines in the middle of a paragraph. Lines that begin with $ are insert commands - a special class
of commands that may be used within a paragraph to insert data into the message buffer.

Commands:
  @break resets HTTP Inspect data structures and begins a new test. Use it liberally to prevent
     unrelated tests from interfering with each other.
  @tcpclose simulates a half-duplex TCP close.
  @request and @response set the message direction. Applies to subsequent paragraphs until changed.
     The initial direction is always request and the break command resets the direction to request.
  @fileset <pathname> specifies a file from which the tool will read data into the message buffer.
     This may be used to include a zipped or other binary file into a message body. Data is read
     beginning at the start of the file. The file is closed automatically whenever a new file is
     set or there is a break command.
  @fileskip <decimal number> skips over the specified number of bytes in the included file. This
     must be a positive number. To move backward do a new fileset and skip forward from the
     beginning.
  @<decimal number> sets the test number and hence the test output file name. Applies to subsequent
     sections until changed. Don't reuse numbers.

Insert commands:
  $fill <decimal number> create a paragraph consisting of <number> octets of auto-fill data
     ABCDEFGHIJABC ....
  $fileread <decimal number> read the specified number of bytes from the included file into the
     message buffer. Each read corresponds to one TCP section.
  $h2preface creates the HTTP/2 connection preface "PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"
  $h2frameheader <frame_type> <frame_length> <flags> <stream_id> generates an HTTP/2 frame header.
    The frame type may be the frame type name in all lowercase or the numeric frame type code:
      (data|headers|priority|rst_stream|settings|push_promise|ping|goaway|window_update|
      continuation|{0:9})
    The frame length is the length of the frame payload, may be in decimal or test tool hex value
      (\xnn, see below under escape sequence for more details)
    The frame flags are represented as a single test tool hex byte (\xnn)
    The stream id is optional. If provided it must be a decimal number. If not included it defaults
      to 0.

Escape sequences begin with '\'. They may be used within a paragraph or to begin a paragraph.
  \r - carriage return
  \n - linefeed
  \t - tab
  \\ - backslash
  \# - #
  \@ - @
  \$ - $
  \xnn or \Xnn - where nn is a two-digit hexadecimal number. Insert an arbitrary 8-bit number as
     the next character. a-f and A-F are both acceptable.

Data is separated into segments for presentation to the splitter whenever a paragraph ends (blank
line).

When the inspector aborts the connection (scan() returns StreamSplitter::ABORT) it does not expect
to receive any more input from stream on that connection in that direction. Accordingly the test
tool should not send it any more input. A paragraph of test input expected to result in an abort
should be the last paragraph. The developer should either start a new test (@break, etc.) or at
least reverse the direction and not send any more data in the original direction. Sending more data
after an abort is likely to lead to confusing output that has no bearing on the test.

This test tool does not implement the feature of being hardened against bad input. If you write a
badly formatted or improper test case the program may assert or crash. The responsibility is on the
developer to get it right.

The test tool is designed for single-threaded operation only.

The test tool is only available when compiled with REG_TEST.
