Technical Challenges to Electronic Discovery
BY RUSSELL SHUMWAY, CISSP, professor, University of Fairfax
Under the newly amended Federal Rules of Civil Procedure, electronic discovery has taken on new emphasis. This
article discusses some of the technical challenges of electronic discovery
and some strategies to aid in compliance.
In one sense, electronic
discovery is not a new issue. Parties in litigation have long been required to
produce all relevant information under the rules governing discovery.
However, the amendments to the Federal Rules of Civil Procedure that took effect in December 2006 formalize
the process. Technical staff have a pivotal role in
that process.
Many IT organizations may go for months without direct contact with the firm's General Counsel. IT staff
may, in fact, prefer it that way. Production of electronic discovery materials,
however, involves close cooperation between IT, users, the General Counsel (and
outside counsel), management, and perhaps external consultants. Even if outside
consultants are brought in to perform the production, IT will still be
intimately involved in identifying potential sources of data and providing
access to systems and repositories of data.
Organizations typically have defined backup processes. These may involve incremental or differential
backups or full backups of data and/or system files. Backup procedures
may also involve the destruction of older materials or the routine purging
of files or email messages after a certain period. Such processes are
acceptable and there are many good reasons for document retention and
destruction policies, enforced through automated processes. It is important,
however, to remember that the processes must be suspended and the information
safeguarded against intentional or inadvertent destruction once the litigation
is pending. Preserving this information, like its eventual production,
requires close coordination between the legal and IT departments.
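The interaction between a routine purge and a litigation hold can be sketched in a few lines. This is a minimal illustration only: the `StoredItem` record, the `items_to_purge` function, the custodian names, and the 365-day retention period are all hypothetical and do not describe any particular backup or archiving product.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class StoredItem:
    """Hypothetical record describing one retained file or mail archive."""
    path: str
    custodian: str
    last_modified: date


def items_to_purge(items, today, retention_days, custodians_on_hold):
    """Return items past the retention period, skipping custodians under hold."""
    cutoff = today - timedelta(days=retention_days)
    return [
        item for item in items
        if item.last_modified < cutoff
        and item.custodian not in custodians_on_hold
    ]


items = [
    StoredItem("mail/alice/2005.pst", "alice", date(2005, 6, 1)),
    StoredItem("mail/bob/2005.pst", "bob", date(2005, 6, 1)),
    StoredItem("mail/bob/2007.pst", "bob", date(2007, 1, 15)),
]

# Alice is under a litigation hold, so her expired archive is not purged.
purge = items_to_purge(
    items,
    today=date(2007, 3, 1),
    retention_days=365,
    custodians_on_hold={"alice"},
)
print([item.path for item in purge])
```

The point of the sketch is that the hold list is checked inside the automated process itself, so that suspending destruction does not depend on someone remembering to disable a scheduled job.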
Volume
Perhaps the biggest impact of electronic discovery on litigation has been the dramatic increase
in the volume of data. Electronically stored information (ESI) is easier both to maintain and to produce than
paper files. While organizations have long maintained paper files, the
sheer physical volume of those files makes it more difficult to maintain
large historical files or to maintain a history of a given document. Unless
there is a legal or regulatory reason to maintain versions, it is likely
that paper files contain only final versions of documents. If a document
was electronically prepared, however, multiple versions of the document
may exist in electronic storage. Documents circulated for comments through
email likely exist as email attachments in multiple versions that are
not retained in hard copy files.
Email is perhaps the single biggest source of data in many organizations. The specific location
of stored email is highly dependent on the organization's mail architecture,
but typically mail is stored both on the mail server and in local user
files or archives. The archives may be on the user's workstation,
on a shared network drive, or even on removable media. Both server stores
and user archives may also exist on backup media if those systems were
part of a backup process.
Multi-megabyte personal email stores have become the norm, so a reasonable discovery request for
email from 10 custodians can easily amount to 5 to 10 gigabytes of data.
A gigabyte of data roughly translates into a stack of paper 85 feet tall
or 255,000 pages. Producing that volume of data is trivial; reviewing
it is not.
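The arithmetic behind those figures is easy to work through. The sketch below uses only the conversion factors quoted above (255,000 pages and 85 feet of paper per gigabyte); the function name `estimate_paper_volume` is illustrative, and real page counts vary widely with file type.

```python
PAGES_PER_GB = 255_000  # figure quoted in the article
FEET_PER_GB = 85        # figure quoted in the article


def estimate_paper_volume(gigabytes):
    """Return (pages, stack height in feet) for a given volume of data."""
    return gigabytes * PAGES_PER_GB, gigabytes * FEET_PER_GB


for gb in (5, 10):
    pages, feet = estimate_paper_volume(gb)
    print(f"{gb} GB is roughly {pages:,} pages, a stack about {feet} feet tall")
```

By that estimate, a 10-custodian email request represents somewhere between 1.3 and 2.6 million pages for review.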
In addition to email, electronically stored information may include traditional documents, web
pages, spreadsheets, and presentations. These are normally in standard
formats and the content is typically searchable using either plain-text
search engines or readily available tools. Although the metadata associated
with a document may not be searchable by a text search engine, the content
is normally accessible. Other formats, however, may not be. Examples might
include databases (both small desktop database programs and large transactional
databases) and image files. Many organizations have fax servers that forward
received faxes as email attachments. These attachments are typically images
of the fax and the content is not searchable. If these images are potentially
responsive, they must be manually reviewed.
Relevant data may also exist in many physical or logical locations. The obvious ones include
networked file servers and user workstations, but data may also be contained
on removable media, including thumb drives or CDs. It may exist on personal
devices or BlackBerry-type devices. It may even exist on home computers
if they were used to telecommute. While the legal authority for procuring
or producing that data is clearly outside the scope of this discussion,
the technical challenges of acquiring and producing the data should be
well understood by all involved parties.
Keyword and Automated Searching
One of the major advantages
of electronic data versus traditional paper discovery is that it lends
itself to automated search tools. Because of the problems (and costs)
associated with manual review of documents, initial production in large
cases is often conducted via a mutually-agreed-upon list of keywords.
Keywords can dramatically reduce the universe of documents from the entire
data storage of the company down to a subset that can be assumed to at
least contain responsive documents. Once produced, that reduced set can
be manually reviewed to further reduce the list to those documents that
are responsive. Automated searches can also be used to find privileged
documents that are not required to be produced. For example, a search
might exclude any documents containing correspondence from outside counsel.
While valuable, keyword
searches are not a panacea. All too often, attorneys arrive at the initial
meet-and-confer session with a list of keywords that is overly
broad or that contains words that look good but produce a large number
of non-responsive documents.
Email. Word searches can be structured to include or exclude certain
parties. The example discussed earlier was the exclusion of correspondence
with outside counsel. Discovery requests may include all correspondence
to or from a particular party. It may seem obvious, but if that party
is also a custodian whose mail store is being searched, their entire store
will show up in the search. Email addresses, especially in certain environments,
may include display names (John Doe), fully qualified SMTP addresses
(john.doe@company.com), or both. Where only display names
are present, a search on the domain name may not return any results.
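The display-name problem is easy to demonstrate. The following sketch uses Python's standard `email.utils.getaddresses` to split a header value into display names and addresses; the function name `header_mentions_domain` and the sample header values are illustrative assumptions, not part of any litigation support tool.

```python
from email.utils import getaddresses


def header_mentions_domain(header_value, domain):
    """True if any address in a From/To/Cc header value belongs to the domain."""
    suffix = "@" + domain.lower()
    for _display_name, address in getaddresses([header_value]):
        if address.lower().endswith(suffix):
            return True
    return False


samples = [
    "John Doe <john.doe@company.com>",  # display name plus SMTP address
    "john.doe@company.com",             # SMTP address only
    "John Doe",                         # display name only: the domain is absent
]
for value in samples:
    print(repr(value), header_mentions_domain(value, "company.com"))
```

The third sample is the case described above: the message really was sent by someone at the company, but a domain search has nothing to match against. In such environments the search terms need to include the display names as well.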
Poorly chosen terms. Until a term is tested against live data, it is incredibly difficult
to predict how many documents it might produce. It is surprising how
many documents may contain false positive keyword hits. A classic example
is the word "sex," often requested in cases of employee misconduct.
In a typical Windows installation, the string "sex" is found
well over 10,000 times inside system files in such names as ProcessExit
or AccessException. The number of false positives can be reduced
by either excluding system and executable files from the search or
searching for whole words only. Either choice, however, should be clearly stated
in the initial agreement.
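The difference between a substring search and a whole-word search can be shown with a regular expression. This is a small sketch using Python's standard `re` module; the three sample documents are invented for illustration, and commercial search tools expose whole-word matching through their own options.

```python
import re

SUBSTRING = re.compile(r"sex", re.IGNORECASE)
WHOLE_WORD = re.compile(r"\bsex\b", re.IGNORECASE)  # \b marks a word boundary


def count_hits(pattern, documents):
    """Count the documents in which the pattern appears at least once."""
    return sum(1 for text in documents if pattern.search(text))


documents = [
    "call ProcessExit on shutdown",
    "raised AccessException in handler",
    "complaint alleges sex discrimination",
]

print("substring hits: ", count_hits(SUBSTRING, documents))
print("whole-word hits:", count_hits(WHOLE_WORD, documents))
```

Running a candidate term both ways against a sample of live data before the list is finalized gives the parties a realistic idea of how many documents each form of the term will return.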
Overly broad terms. While any search term has the potential to be overly broad and produce
non-responsive material, identifying such terms in advance is a non-trivial task.
Terms that seem perfectly logical when viewed in the context of the
case may not be appropriate within the context of the environment. If
the case involves, for example, price fixing, it might seem appropriate
to look for words such as "price" or "cost" from certain
custodians. However, if those persons were in positions where they could
influence pricing, it is likely that virtually all of their documents
will contain those keywords, even if they are not relevant to the case.
Impossible (or impractical) terms. Search terms may be overly specified in the order and may be impossible
to implement in certain search tools. For example, a search term designed
to capture "John Doe" and "John L. Doe" might be specified
as "John" within 4 characters of "Doe." This could
be constructed as a regular expression within some search engines, but
there is no way to do that particular search in others. The persons responsible
for conducting the searches and familiar with the tools should be part
of the initial discussions in order to construct terms that are feasible
and practical and which still capture the relevant data.1 They can also
advise on data formats that cannot be electronically searched or which
may require either additional tools or outside consultants.
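For tools that do support regular expressions, the "John within 4 characters of Doe" term might be written as follows. This is one possible reading of the term, sketched with Python's `re` module; the pattern name and sample strings are illustrative, and other engines may use different proximity syntax or none at all.

```python
import re

# "John", then at most 4 characters of any kind, then "Doe".
# The \b anchors keep "Johnson" or "Doering" from matching.
JOHN_NEAR_DOE = re.compile(r"\bJohn\b.{0,4}\bDoe\b")


def matches(text):
    """True if the proximity term is found anywhere in the text."""
    return JOHN_NEAR_DOE.search(text) is not None


samples = [
    "Forwarded by John Doe",           # 1 character between the names
    "Signed, John L. Doe",             # 4 characters between the names
    "Signed, John Lawrence Doe",       # too far apart
    "John Q. Public called Jane Doe",  # too far apart
]
for text in samples:
    print(matches(text), text)
```

Note that the character count has to allow for the spaces and the period, which is exactly the kind of detail the person running the search is best placed to catch.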
Production Format
Principle 12 of the Sedona Principles for Electronic Document Production states: "Absent
party agreement or court order specifying the form or forms of production,
production should be made in the form or forms in which the information
is ordinarily maintained or in a reasonably useful form, taking into account
the need to produce reasonably accessible metadata."2
There are essentially
two major forms of production: native format (such as word processing
documents or email archives) and images or text files extracted from the
data. Many litigation support tools used for review of electronic data
are geared towards images of paper documents with database load files
associated with the image that contain additional information. Word processing
documents could be extracted to TIFF images and the associated text and
metadata included in the litigation database. The document would still
be searchable, but the reviewer could see an image of the document during
the search.
This is perfectly
appropriate from a technical standpoint when dealing with paper documents.
It may not be as appropriate with electronic data, depending on the nature
of the data. At one extreme, a word processing file can generally be
converted to TIFF and will resemble a printed document. However, the actual appearance
of a printed document depends on the printer and paper selected.
In this case, the document may never have actually existed in printed
form, so the image is only a representation of the document. For common
office automation documents such as word processing files or presentations,
this distinction is probably irrelevant.
It may not be so for
other forms of data. It may not be possible to produce all of the data in a large
transactional database, and much of it may not be relevant. In this case,
a report or custom query could be written to extract responsive data.
Arguably, this is not the form "in which the information is ordinarily maintained,"
but it is "reasonably useful." Again, in this case, the distinction may
be academic.
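A custom query of this kind might look like the sketch below, which uses Python's built-in `sqlite3` and `csv` modules. The `quotes` table, its columns, and the sample rows are entirely hypothetical; a production transactional database would have its own schema and would normally be queried by its administrators.

```python
import csv
import io
import sqlite3

# Build a tiny stand-in for a transactional database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quotes (id INTEGER PRIMARY KEY, sales_rep TEXT, "
    "customer TEXT, quoted_on TEXT, unit_price REAL)"
)
conn.executemany(
    "INSERT INTO quotes (sales_rep, customer, quoted_on, unit_price) "
    "VALUES (?, ?, ?, ?)",
    [
        ("jdoe", "Acme", "2006-01-10", 19.5),
        ("jdoe", "Globex", "2006-08-02", 21.0),
        ("asmith", "Acme", "2006-03-05", 18.75),
    ],
)


def extract_responsive(conn, sales_rep, start, end):
    """Return a CSV report of one custodian's rows within a date range."""
    cursor = conn.execute(
        "SELECT id, sales_rep, customer, quoted_on, unit_price FROM quotes "
        "WHERE sales_rep = ? AND quoted_on BETWEEN ? AND ? ORDER BY quoted_on",
        (sales_rep, start, end),
    )
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return out.getvalue()


report = extract_responsive(conn, "jdoe", "2006-01-01", "2006-06-30")
print(report)
```

The report is a derived form of the data, which is why the query text itself (custodian, date range, columns selected) is worth recording alongside the production.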
Email is somewhere
in the middle. Often attorneys may specify delivery of email in native
format as part of the request for discovery. The issue may become the
definition of native format. Microsoft Exchange email is stored in Exchange
Databases when it is stored on the server. These databases are extremely
difficult to maintain or search outside of the Exchange environment. Some
search tools convert the Exchange databases to Personal Folders (PST files),
which can be searched and indexed by some (but not all) desktop and litigation
support tools. Other requests for discovery ask for emails as Message
files (.MSG). Arguably the data never existed as message files, and is
only produced in that format at the time of production. Metadata associated
with the message may be lost in the extraction and production.
Internet email, on
the other hand, exists as plain text. On the mail server it is stored
in some format set by the server, and the mail client likewise stores it in
whatever format it chooses. So long as the Internet headers are maintained
by the client and server, the storage format is irrelevant to the mail
transport function. Email produced in a proprietary or non-standard
format may be difficult to search or view, and the format may introduce information
that did not exist in the original email if it has fields that
were not present in the message.
The alternative to
native format is to produce data in some sort of processed format. This
may be in TIFF images with associated metadata, or may be in some intermediate
form. Email, for example, may be exported from storage into text or HTML
files. These are not the native format, but they are readily searchable
so long as the appropriate header fields are also extracted. Images are
not readily searchable, and, if the text is not provided, will have to
be processed through Optical Character Recognition before automated tools
can be used. The accuracy of OCR raises other issues, and it is
likely that errors in the OCR conversion will cause search hits to be
missed.
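Exporting an Internet email message to searchable text while keeping its header fields can be sketched with Python's standard `email` package. The sample message, the `HEADERS` list, and the `export_as_text` function are illustrative assumptions; which header fields must be carried over is something the parties would need to agree on.

```python
from email import message_from_string
from email.policy import default

# A made-up message in standard Internet (RFC 822 style) format.
RAW = """From: John Doe <john.doe@company.com>
To: jane.roe@example.com
Subject: Q3 pricing
Date: Mon, 02 Oct 2006 09:15:00 -0400
Message-ID: <1234@company.com>

Draft attached for comment.
"""

# Header fields to carry into the exported text (an assumed selection).
HEADERS = ["From", "To", "Cc", "Date", "Subject", "Message-ID"]


def export_as_text(raw_message):
    """Return the selected headers and the plain-text body as one string."""
    msg = message_from_string(raw_message, policy=default)
    lines = [f"{name}: {msg[name]}" for name in HEADERS if msg[name] is not None]
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    return "\n".join(lines) + "\n\n" + text


exported = export_as_text(RAW)
print(exported)
```

Headers that are absent from the original (Cc in this sample) are simply omitted, so the export does not introduce fields the message never had.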
Role of Forensics
We are increasingly
seeing an intersection between computer forensics and electronic discovery.
The tools and techniques of traditional forensics provide a great deal
of synergy with e-discovery, but they are not without drawbacks.
First, the advantage
of using forensics techniques (and tools) is that the process is well known
and repeatable. Stating that "I used the EnCase tool to obtain forensically
sound images of the custodians' computers" demonstrates that
the organization took clear steps to preserve all relevant information
and that the chance of inadvertent spoliation was dramatically reduced.
The image files are
in a format that can be searched using accepted tools and will not be
modified or altered by the process of searching. Where metadata about
files or email messages is important to the litigation, the ability to
examine files without modification can be vital.
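One common way to demonstrate that a search left the evidence untouched, assumed here rather than taken from any specific tool, is to compare a cryptographic hash of the file before and after the search. The sketch below uses Python's standard `hashlib`; the function names and the temporary stand-in file are illustrative only.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def search_read_only(path, keyword):
    """Search a file for a byte string, opening it for reading only."""
    with open(path, "rb") as handle:
        return keyword in handle.read()


# Create a small stand-in for an evidence file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Memo from John Doe regarding Q3 pricing")
    evidence_path = tmp.name

before = sha256_of(evidence_path)
found = search_read_only(evidence_path, b"pricing")
after = sha256_of(evidence_path)
os.remove(evidence_path)

print("keyword found:", found)
print("unchanged by search:", before == after)
```

Matching digests show that the content was not altered, although they say nothing about file system metadata such as last-accessed times, which is one reason forensic tools work from images rather than from the live files.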
Where data has been
inadvertently deleted, forensics tools may be used to recover the data.
While the Federal Rules allow for good faith errors or omissions due to
routine business operations, the burden of preserving potentially responsive
evidence is still on the respondents and the rules allow for sanctions.
Forensics techniques to recover data can be used to demonstrate good faith
in preserving and producing the relevant information.
However, forensics
techniques may have the potential to preserve or produce too much information.
Principle 9 of the Sedona Principles states: "Absent a showing of
special need and relevance, a responding party should not be required
to preserve, review, or produce deleted, shadowed, fragmented, or residual
electronically stored information."3 This, however, is exactly what
computer forensics is designed to do. When using forensics tools for production,
the technicians should take steps to ensure that only the appropriate
data is searched and produced. For example, deleted files and file slack
may be excluded from the search.
Conclusions
Technicians have a
special role to play in the entire e-discovery process and should be involved
throughout. The firm's General Counsel cannot be expected to be an
expert in all aspects of the company's IT infrastructure (any more
than IT can be expected to know obscure legal details). It is the IT staff
and users who have the resources and obligation to preserve ESI and who
know where it may be located (both within the organization and with outside
providers). They also know the details of the data and can advise the
legal staff on technical matters such as backup procedures, storage information,
and data formats. That advice allows the legal experts to negotiate a good,
responsive, and reasonable discovery protocol that complies with the
requirement to produce relevant information with minimum disruption
to normal operations and without excessive cost.
Note: The author
is not an attorney. The opinions expressed in this article are his alone.
Nothing in this article should be construed as providing legal advice
of any kind.