Technical Challenges to Electronic Discovery
BY RUSSELL SHUMWAY,
CISSP, professor, University of Fairfax

Under the newly amended Federal Rules of Civil Procedure, electronic discovery has taken on new emphasis. Technical staff now have a pivotal role in the process.

This article discusses some of the technical challenges of electronic discovery and some strategies to aid in compliance.

In one sense, electronic discovery is not a new issue. Parties in litigation have long been required to produce all relevant information. However, the amendments to the Federal Rules of Civil Procedure that took effect in December 2006 formalize the process for electronic data. Technical staff have a pivotal role in that process.

Many IT organizations may go for months without direct contact with the firm's General Counsel. IT staff may, in fact, prefer it that way. Production of electronic discovery materials, however, requires close cooperation between IT, users, the General Counsel (and outside counsel), management, and perhaps external consultants. Even if outside consultants are brought in to perform the production, IT will still be intimately involved in identifying potential sources of data and providing access to systems and repositories of data.

Organizations typically have defined backup processes. These may involve incremental or differential backups or full backups of data and/or system files. Backup procedures may also involve the destruction of older materials or the routine purging of files or email messages after a certain period. Such processes are acceptable and there are many good reasons for document retention and destruction policies, enforced through automated processes. It is important, however, to remember that the processes must be suspended and the information safeguarded against intentional or inadvertent destruction once the litigation is pending. Preserving this information, like its eventual production, requires close coordination between the legal and IT departments.
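
To illustrate the point about suspending routine destruction, here is a minimal sketch of a purge job that skips anything belonging to a custodian under a litigation hold. The record layout, the function name select_for_purge, and the custodian names are hypothetical; a real retention system would be driven by the organization's own policy and tooling.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class StoredItem:
    path: str            # hypothetical location of the file or mail archive
    custodian: str       # person whose data this is
    last_modified: date


def select_for_purge(items, today, retention_days, held_custodians):
    """Return items past the retention period that are NOT under a hold."""
    cutoff = today - timedelta(days=retention_days)
    return [
        item for item in items
        if item.last_modified < cutoff
        and item.custodian not in held_custodians
    ]


items = [
    StoredItem("mail/jdoe/2005.pst", "jdoe", date(2005, 3, 1)),
    StoredItem("mail/asmith/2005.pst", "asmith", date(2005, 3, 1)),
    StoredItem("mail/asmith/2007.pst", "asmith", date(2007, 1, 15)),
]

# jdoe is a custodian in pending litigation, so nothing of theirs is purged.
purge = select_for_purge(items, date(2007, 6, 1), 365, held_custodians={"jdoe"})
print([item.path for item in purge])
```

The design choice worth noting is that the hold is checked inside the purge job itself, so an operator does not have to remember to disable the job by hand.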

Volume

Perhaps the biggest impact of electronic discovery on litigation has been the dramatic increase in the volume of data. Electronically stored information (ESI) is easier both to maintain and to produce than paper files. While organizations have long maintained paper files, the sheer physical volume of those files makes it difficult to keep large historical files or a history of a given document. Unless there is a legal or regulatory reason to maintain versions, it is likely that paper files contain only final versions of documents. If a document was prepared electronically, however, multiple versions of the document may exist in electronic storage. Documents circulated for comment through email likely exist as email attachments in multiple versions that are not retained in hard copy files.

Email is perhaps the single biggest source of data in many organizations. The specific location of stored email is highly dependent on the organization’s mail architecture, but typically mail is stored both on the mail server and in local user files or archives. The archives may be on the user’s workstation, on a shared network drive, or even on removable media. Both server stores and user archives may also exist on backup media if those systems were part of a backup process.

Multi-megabyte personal email stores have become the norm, so a reasonable discovery request for email from 10 custodians can easily run into 5 to 10 gigabytes of data. A gigabyte of data roughly translates into a stack of paper 85 feet tall or 255,000 pages. Producing that volume of data is trivial; reviewing it is not.
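
The arithmetic behind those figures can be worked through in a few lines. This sketch simply scales the article's rule of thumb (255,000 pages and 85 feet of paper per gigabyte); the constant and function names are illustrative, and the ratio itself is only a rough estimate that varies with file type.

```python
PAGES_PER_GB = 255_000   # rule of thumb quoted in the text
FEET_PER_GB = 85         # height of the equivalent paper stack


def paper_equivalent(gigabytes):
    """Return (pages, stack height in feet) for a given volume of data."""
    return gigabytes * PAGES_PER_GB, gigabytes * FEET_PER_GB


for gb in (5, 10):
    pages, feet = paper_equivalent(gb)
    print(f"{gb} GB is roughly {pages:,} pages, a stack about {feet} feet tall")
```

For the 5 to 10 gigabyte request described above, that works out to between 1,275,000 and 2,550,000 pages.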

In addition to email, electronically stored information may include traditional documents, web pages, spreadsheets, and presentations. These are normally in standard formats and the content is typically searchable using either plain-text search engines or readily available tools. Although the metadata associated with a document may not be searchable by a text search engine, the content is normally accessible. Other formats, however, may not be. Examples include databases (both small desktop database programs and large transactional databases) and image files. Many organizations have fax servers that forward received faxes as email attachments. These attachments are typically images of the fax and the content is not searchable. If these images are potentially responsive, they must be manually reviewed.

Relevant data may also exist in many physical or logical locations. The obvious ones include networked file servers and user workstations, but data may also be contained on removable media such as thumb drives or CDs. It may exist on personal devices or BlackBerry-type devices. It may even exist on home computers if they were used to telecommute. While the legal authority for procuring or producing that data is clearly outside the scope of this discussion, the technical challenges of acquiring and producing the data should be well understood by all involved parties.

Keyword and Automated Searching

One of the major advantages of electronic data versus traditional paper discovery is that it lends itself to automated search tools. Because of the problems (and costs) associated with manual review of documents, initial production in large cases is often conducted via a mutually-agreed-upon list of keywords. Keywords can dramatically reduce the universe of documents from the entire data storage of the company down to a subset that can be assumed to at least contain responsive documents. Once produced, that reduced set can be manually reviewed to further reduce the list to those documents that are responsive. Automated searches can also be used to find privileged documents that are not required to be produced. For example, a search might exclude any documents containing correspondence from outside counsel.
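
The two-stage process described above can be sketched as follows: a keyword pass narrows the universe of documents, and a second check sets aside anything sent by outside counsel for privilege review. The document layout, the function name cull, and the sample addresses are hypothetical; real litigation support tools index far richer data than this.

```python
def cull(documents, keywords, privileged_senders):
    """Split keyword hits into (for_review, set_aside_as_possibly_privileged)."""
    lowered = [keyword.lower() for keyword in keywords]
    for_review, set_aside = [], []
    for doc in documents:
        text = doc["text"].lower()
        if not any(keyword in text for keyword in lowered):
            continue  # no keyword hit, so the document is outside the search
        if doc["sender"].lower() in privileged_senders:
            set_aside.append(doc)   # correspondence from outside counsel
        else:
            for_review.append(doc)  # goes on to manual review
    return for_review, set_aside


documents = [
    {"sender": "jdoe@company.com", "text": "Revised price list attached."},
    {"sender": "counsel@lawfirm.example", "text": "Advice on the price schedule."},
    {"sender": "asmith@company.com", "text": "Lunch on Friday?"},
]

for_review, set_aside = cull(
    documents, ["price", "cost"], {"counsel@lawfirm.example"}
)
print(len(for_review), "for review;", len(set_aside), "set aside")
```

Note that this sketch uses plain substring matching, which is exactly the kind of search that produces the false positives discussed below.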

While valuable, keyword searches are not a panacea. All too often attorneys produce a list of keywords in the initial “meet and confer” meeting that is overly broad or contains words that may look good but may produce a large number of non-responsive documents.

Email. Word searches can be structured to include or exclude certain parties. The example discussed previously was to exclude correspondence with outside counsel. Discovery requests may include all correspondence to or from a particular party. It may seem obvious, but if that party is also a custodian whose mail store is being searched, their entire store will show up in the search. Email addresses, especially in certain environments, may include display names (John Doe), fully qualified SMTP addresses (john.doe@company.com), or both. Where only display names are present, a search on the domain name may not return any results.
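
A short sketch shows why a domain-only search can miss mail addressed by display name, and one way to cover both forms. The function name and the sample header values are assumptions made for illustration; how addresses are actually stored depends on the mail system.

```python
def header_mentions_party(header_value, domain, display_names):
    """True if a From/To header refers to the party by domain or display name."""
    value = header_value.lower()
    if "@" + domain.lower() in value:
        return True
    return any(name.lower() in value for name in display_names)


headers = [
    "John Doe <john.doe@company.com>",  # display name and SMTP address
    "John Doe",                         # display name only
    "jane.roe@example.org",             # unrelated party
]

# Searching on the domain alone misses the second header.
domain_only = [h for h in headers if "@company.com" in h.lower()]

# Searching on the domain or a known display name catches both.
combined = [
    h for h in headers
    if header_mentions_party(h, "company.com", ["John Doe"])
]
print(len(domain_only), len(combined))
```

The trade-off is that display names are less precise than addresses, so a name list needs the same testing against live data as any other search term.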

Poorly chosen terms. Until a term is tested against live data, it is incredibly difficult to predict how many documents it might return. It is surprising how many documents may contain false positive keyword hits. A classic example is the word “sex,” often requested in cases of employee misconduct. In a typical Windows installation, the string “sex” is found well over 10,000 times inside system files in such identifiers as “ProcessExit” or “AccessException.” The number of false positives can be reduced either by excluding system and executable files from the search or by searching for whole words only. However, this should be clearly stated in the initial agreement.
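
The difference between a substring search and a whole-word search is easy to demonstrate. The sketch below uses Python's re module and its \b word-boundary anchor; the sample strings are taken from the example above plus one invented responsive sentence, and other search tools may express whole-word matching differently.

```python
import re

samples = [
    "ProcessExit",
    "AccessException",
    "complaint alleging sex discrimination",
]

# Substring search: matches the letters anywhere, ignoring case.
substring_hits = [s for s in samples if "sex" in s.lower()]

# Whole-word search: \b requires a word boundary on each side of the term.
whole_word = re.compile(r"\bsex\b", re.IGNORECASE)
whole_word_hits = [s for s in samples if whole_word.search(s)]

print(len(substring_hits), "substring hits")
print(len(whole_word_hits), "whole-word hits")
```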

Overly broad terms. While any search term has the potential to be overly broad and produce non-responsive material, identifying which terms will do so is a non-trivial task. Terms that seem perfectly logical when viewed in the context of the case may not be appropriate within the context of the environment. If the case involves price fixing, for example, it might seem appropriate to look for words such as “price” or “cost” from certain custodians. However, if those persons were in positions where they could influence pricing, it is likely that virtually all of their documents will contain those keywords, even if they are not relevant to the case.

Impossible (or impractical) terms. Search terms may be overly specified in the order and may be impossible to implement in certain search tools. For example, a search term designed to capture “John Doe” and “John L. Doe” might be specified as “John” within 4 characters of “Doe.” This could be constructed as a regular expression within some search engines, but there is no way to do that particular search in others. The persons responsible for conducting the searches and familiar with the tools should be part of the initial discussions in order to construct terms that are feasible and practical and which still capture the relevant data.1 They can also advise on data formats that cannot be electronically searched or which may require either additional tools or outside consultants.
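
For tools that do support regular expressions, the “John Doe” example can be written two ways: a specific pattern that allows only the middle initial “L.”, and a proximity pattern that allows any four or fewer characters between the two names. The sketch below uses Python's re syntax; syntax and feature support vary between search tools, so treat the patterns as illustrations.

```python
import re

# Specific: "john doe" or "john l. doe" only.
specific = re.compile(r"john (l\. )?doe", re.IGNORECASE)

# Proximity: "john" followed within 4 characters by "doe".
# .{0,4} is equivalent to writing .?.?.?.? (four optional characters).
proximity = re.compile(r"john.{0,4}doe", re.IGNORECASE)

names = [
    "John Doe",
    "John L. Doe",
    "John H. Doe",
    "John or Doe",
    "John Lewis Doe",
]

specific_hits = [n for n in names if specific.search(n)]
proximity_hits = [n for n in names if proximity.search(n)]

print("specific: ", specific_hits)
print("proximity:", proximity_hits)
```

The proximity pattern also matches “John or Doe,” which shows how loosening a term to catch variants brings in false positives as well.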

Production Format

Principle 12 of the Sedona Principles for Electronic Document Production states: “Absent party agreement or court order specifying the form or forms of production, production should be made in the form or forms in which the information is ordinarily maintained or in a reasonably useful form, taking into account the need to produce reasonably accessible metadata.”2

There are essentially two major forms of production: Native format (such as word processing documents or email archives) or images or text files extracted from the data. Many litigation support tools used for review of electronic data are geared towards images of paper documents with database load files associated with the image that contain additional information. Word processing documents could be extracted to TIFF images and the associated text and metadata included in the litigation database. The document would still be searchable, but the reviewer could see an image of the document during the search.

This is perfectly appropriate from a technical standpoint when dealing with physical data. It may not be as appropriate with electronic data, depending on the nature of the data. At one extreme, a word processing file can generally be converted to TIFF and will resemble a printed document. However, the actual appearance of a printed document depends on the printer and paper selected. In this case, the document may never have actually existed in printed form, so the image is only a representation of the document. For common office automation documents such as word processing files or presentations, this distinction is probably irrelevant.

It may not be so for other forms of data. It may not be possible to produce all the data in a large transactional database, and much of it may not be relevant. In this case, a report or custom query could be written to extract responsive data. Arguably, this is not “the form in which it is ordinarily maintained,” but it is reasonably useful. Again, in this case, the distinction may be academic.

Email is somewhere in the middle. Often attorneys may specify delivery of email in native format as part of the request for discovery. The issue may become the definition of native format. Microsoft Exchange email is stored in Exchange Databases when it is stored on the server. These databases are extremely difficult to maintain or search outside of the Exchange environment. Some search tools convert the Exchange databases to Personal Folders (PST files), which can be searched and indexed by some (but not all) desktop and litigation support tools. Other requests for discovery ask for emails as Message files (.MSG). Arguably the data never existed as message files, and is only produced in that format at the time of production. Metadata associated with the message may be lost in the extraction and production.

Internet email, on the other hand, exists as plain text. On the mail server it is stored in whatever format the server uses, and the mail client likewise stores it in whatever format it chooses. So long as the Internet headers are maintained by the client and server, the storage format is irrelevant to the mail transport function. Email produced in a proprietary or non-standard format may be difficult to search or view, and the format may introduce information that did not exist in the original email if it has fields that were not present in the message.

The alternative to native format is to produce data in some sort of processed format. This may be TIFF images with associated metadata, or some intermediate form. Email, for example, may be exported from storage into text or HTML files. These are not the native format, but they are readily searchable so long as the appropriate header fields are also extracted. Images are not readily searchable and, if the text is not provided, will have to be processed through Optical Character Recognition (OCR) before automated tools can be used. OCR accuracy raises further issues, and it is likely that errors in the OCR conversion will result in search hits being missed.
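
As a rough illustration of exporting Internet email to searchable text while keeping the header fields, the sketch below uses Python's standard email package. The sample message, the list of header fields, and the function name export_as_text are assumptions for the example; which fields must be preserved is a matter for the discovery agreement.

```python
from email import message_from_string
from email.policy import default

RAW_MESSAGE = """\
From: John Doe <john.doe@company.com>
To: jane.roe@example.org
Subject: Draft pricing memo
Date: Mon, 05 Mar 2007 09:30:00 -0500

Please review the attached draft.
"""

# Header fields to carry into the exported text (an illustrative choice).
HEADER_FIELDS = ("From", "To", "Cc", "Date", "Subject")


def export_as_text(raw_message):
    """Return the selected headers and the plain-text body as one string."""
    msg = message_from_string(raw_message, policy=default)
    lines = [
        f"{name}: {msg[name]}"
        for name in HEADER_FIELDS
        if msg[name] is not None
    ]
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    return "\n".join(lines) + "\n\n" + text


exported = export_as_text(RAW_MESSAGE)
print(exported)
```

Any header not listed in HEADER_FIELDS is dropped by this export, which is a small-scale version of the metadata loss that can occur when email is converted out of its stored form.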

Role of Forensics

We are increasingly seeing an intersection between computer forensics and electronic discovery. The tools and techniques of traditional forensics bring a great deal to e-discovery, but they are not without drawbacks.

First, the advantage to using forensics techniques (and tools) is that the process is well-known and repeatable. Stating that “I used the EnCase tool to obtain forensically sound images of the custodian’s computers” demonstrates that the organization took clear steps to preserve all relevant information and that the chance of inadvertent spoliation was dramatically reduced.

The image files are in a format that can be searched using accepted tools and will not be modified or altered by the process of searching. Where metadata about files or email messages is important to the litigation, the ability to examine files without modification can be vital.
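
One common way to show that an image file was not altered by later searching is to record a cryptographic hash when the image is acquired and compare it again afterward. The article does not describe this step, and the sketch below is not how any particular forensic product works internally; it is a generic illustration using Python's hashlib, with hypothetical function names.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large image files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_hash):
    """Compare the file's current hash with the one recorded at acquisition."""
    return sha256_of_file(path) == recorded_hash
```

In use, the hash would be computed once at acquisition and stored with the case notes, then checked with is_unchanged after each round of searching.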

Where data has been inadvertently deleted, forensics tools may be used to recover the data. While the Federal Rules allow for good faith errors or omissions due to routine business operations, the burden of preserving potentially responsive evidence is still on the respondents and the rules allow for sanctions. Forensics techniques to recover data can be used to demonstrate good faith in preserving and producing the relevant information.

However, forensics techniques have the potential to preserve or produce too much information. Principle 9 of the Sedona Principles states: “Absent a showing of special need and relevance, a responding party should not be required to preserve, review, or produce deleted, shadowed, fragmented, or residual electronically stored information.”3 This, however, is exactly what computer forensics is designed to do. When using forensics tools for production, technicians should take steps to ensure that only the appropriate data is searched and produced. For example, deleted files and file slack may be excluded from the search.

Conclusions

Technicians have a special role to play in the entire e-discovery process and should be involved throughout. The firm’s General Counsel cannot be expected to be an expert in all aspects of the company’s IT infrastructure (any more than IT can be expected to know obscure legal details). It is the IT staff and users who have the resources and obligation to preserve ESI and who know where it may be located (both within the organization and with outside providers). They also know the details of the data and can advise the legal staff on technical matters such as backup procedures, storage information, and data formats. That advice allows the legal experts to negotiate a good, responsive, and reasonable discovery protocol, one that complies with the requirement to produce relevant information while minimizing disruption to normal operations and avoiding excessive cost.

Note: The author is not an attorney. The opinions expressed in this article are his alone. Nothing in this article should be construed as providing legal advice of any kind.



1. As an exercise, the regular expression for such a search would be john (l\. )?doe, which would find john doe or john l. doe but not john h. doe. This is valid in any search engine that supports regular expressions, including grep and EnCase, but you could not use it in OnTrack’s Power Controls program for searching Exchange stores.

If the intent is to capture every instance of john within 4 characters of doe, the appropriate expression is john.?.?.?.?doe, which captures john doe, john l. doe, john h. doe, and john or doe, but not john lewis doe.

2. The Sedona Principles, Second Edition (2007), Principle 12.

3. The Sedona Principles, Second Edition (2007), Principle 9.

Copyright ©2008 The Compliance Authority, Inc.