November 6, 2024

Times Insider: How We Sped Through 900 Pages of Cohen Documents in Under 10 Minutes

But OCR technology is found in all kinds of day-to-day tasks, like online banking and toll-road license plate scanning, as well as in website security Captchas and even mobile language-translation apps that “translate” photos taken by travelers. It turns printed notation of all shapes and styles back into digital content, so text can be copy/pasted or saved in digital form.

OCR works by isolating each individual letter, then comparing its extracted shape against mappings of letterforms across dozens of written notation systems, like languages or music. It does this for every letterform on a page, as well as for punctuation, formatting like italics and even meaningful white space. By preserving the order of matches, it creates a digital edition.
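The matching step described above can be sketched in a few lines of Python. This is a toy illustration only, not how any production OCR engine (or the tool used in this article) is implemented: the 3x3 bitmaps in `TEMPLATES` and the helper names `pixel_distance`, `recognize_glyph` and `recognize_line` are all hypothetical, and the example assumes each letter has already been isolated into its own small bitmap.

```python
# Toy 3x3 "letterforms" (1 = ink, 0 = blank). Hypothetical data for illustration;
# real OCR engines use far richer shape models than this.
TEMPLATES = {
    "I": ("010", "010", "010"),
    "L": ("100", "100", "111"),
    "T": ("111", "010", "010"),
}


def pixel_distance(a, b):
    """Count the pixels at which two same-sized glyph bitmaps differ."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def recognize_glyph(glyph):
    """Return the template letter whose shape is closest to the glyph."""
    return min(TEMPLATES, key=lambda ch: pixel_distance(glyph, TEMPLATES[ch]))


def recognize_line(glyphs):
    """Match each isolated glyph in order and join the results into text."""
    return "".join(recognize_glyph(g) for g in glyphs)


# A slightly distorted "L" (one pixel missing) is still closest to "L".
smudged_l = ("100", "100", "110")
text = recognize_line([TEMPLATES["T"], TEMPLATES["I"], smudged_l])
```

Because the closest template always wins, a badly distorted glyph can land on the wrong letter, which is the kind of imperfect match discussed next.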

But the process is not foolproof. Distorted letterforms (whether from skewed pages, aged paper, old-fashioned typefaces or even the vagaries of handwriting) sometimes cause the software to make imperfect matches, such as mistaking “d” for “ol” to get “olog” instead of “dog,” or reading an archaic letter like the long s as “f.” So we need to be open-minded about how to search within documents that have been OCR’d.

(The long-s error provides amusing reading within certain eras of digitized works. Late-18th-century authors weren’t particularly foul-mouthed — modern software just struggles to read them right.)
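One way to be open-minded when searching OCR'd text is to let the search term stand in for its likely misreadings. The sketch below is a minimal illustration under stated assumptions: the `CONFUSIONS` table contains only the two mix-ups mentioned in this article, and the function names `ocr_tolerant_pattern` and `find_matches` are hypothetical, not part of any tool described here.

```python
import re

# Hypothetical confusion table: each letter maps to the strings an OCR pass
# might have produced in its place. Only the two examples from the article.
CONFUSIONS = {
    "d": ["d", "ol"],  # "dog" misread as "olog"
    "s": ["s", "f"],   # a long s misread as "f"
}


def ocr_tolerant_pattern(query):
    """Build a regex that matches the query or its known OCR misreadings."""
    parts = []
    for ch in query.lower():
        variants = CONFUSIONS.get(ch, [ch])
        parts.append("(?:" + "|".join(re.escape(v) for v in variants) + ")")
    return re.compile("".join(parts), re.IGNORECASE)


def find_matches(query, text):
    """Return every substring of text that matches the tolerant pattern."""
    return [m.group(0) for m in ocr_tolerant_pattern(query).finditer(text)]


dog_hits = find_matches("dog", "The olog barked at the dog.")
long_s_hits = find_matches("possess", "to poffefs the house")
```

A wider table catches more misreadings but also returns more false positives, so in practice the list of substitutions would need tuning to the documents at hand.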

When the Cohen search warrant affidavits were unsealed last week, DocumentHelper came to the rescue.

“As Ben and I crashed our way through nearly 900 pages with little time to spare,” Willy Rashbaum recounted, “it served as equal parts power drill, spotlight, microscope and jackhammer.

“Almost like finding the proverbial needle in a haystack, it helped us locate useful and potentially newsworthy nuggets of information in a vast collection of court documents, which would have been an otherwise daunting task in the limited time we had to review it.”

Article source: https://www.nytimes.com/2019/03/26/reader-center/times-documents-reporters-cohen.html?partner=rss&emc=rss
