Yandex taught neural networks to decipher archival records with complex spelling
April 03, 2023
Artificial intelligence converts historical manuscripts that are hard for a person to read into printed text almost instantly.
Yandex has launched a new service called Archive Search, which uses neural networks to decipher archival records with complex pre-revolutionary spelling.
The service provides access to more than 2.5 million pages of historical documents with text transcripts. Its algorithm, built on an optical character recognition system, accounts for the peculiarities of handwriting, recognizes letters that are no longer in use, and understands the special structure of archival documents.
The company's specialists trained the neural network on a dataset of hundreds of thousands of handwritten lines from real 18th- and 19th-century texts, plus tens of millions of generated examples.
Yandex technology turns manuscripts that are difficult for an untrained reader to parse into printed text almost instantly. As a result, you can quickly search the service's database for documents that mention a surname, a locality, or any other word.
Archive Search is expected to make the work of historians, sociologists, demographers, and genealogists more efficient, and to help people who are looking for information about their families.
The first collection available in the service came from the Main Archive of Moscow; the developers trained the neural network on its materials. The database also contains documents from the archives of the Orenburg and Novgorod regions. Over time, the number of participating archives and available scans will grow.
You can search materials from the 18th to the early 20th century, the period most in demand among users. These are parish registers, confession lists, and revision lists recording the results of population censuses. Documents can be found in the catalog or through the search bar. There are filters by year, archive, fonds, and inventory.
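The article does not describe how the search is implemented. As a rough illustration only, word search over page transcripts combined with filters by archive and year could be sketched with a small inverted index. Everything here is an assumption made for the example: the `Page` fields, the function names, and the sample records are invented and are not Yandex's data or API.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """Hypothetical record for one scanned page and its transcript."""
    page_id: int
    archive: str
    year: int
    text: str  # machine-made transcript of the page


def tokenize(text):
    # Lowercase and split into word tokens; \w is Unicode-aware in Python 3.
    return re.findall(r"\w+", text.lower())


def build_index(pages):
    # Inverted index: token -> set of ids of pages containing that token.
    index = defaultdict(set)
    for page in pages:
        for token in tokenize(page.text):
            index[token].add(page.page_id)
    return index


def search(pages, index, query, archive=None, year_from=None, year_to=None):
    terms = tokenize(query)
    if not terms:
        return []
    # Keep only pages that contain every query term (exact token match).
    ids = set.intersection(*(index.get(t, set()) for t in terms))
    by_id = {p.page_id: p for p in pages}
    hits = []
    for pid in sorted(ids):
        page = by_id[pid]
        if archive is not None and page.archive != archive:
            continue
        if year_from is not None and page.year < year_from:
            continue
        if year_to is not None and page.year > year_to:
            continue
        hits.append(page)
    return hits


# Invented sample data, for illustration only.
PAGES = [
    Page(1, "Moscow", 1795, "Ivan Petrov, peasant of the village of Kolomenskoye"),
    Page(2, "Orenburg", 1850, "Maria Petrova, widow, town of Orenburg"),
    Page(3, "Moscow", 1897, "Ivan Petrov the younger, merchant"),
]
INDEX = build_index(PAGES)

if __name__ == "__main__":
    for page in search(PAGES, INDEX, "Petrov", archive="Moscow", year_to=1800):
        print(page.page_id, page.year, page.text)
```

This sketch matches exact tokens only, so "Petrov" does not find "Petrova". A real service working with old orthography and inflected surnames would need normalization or fuzzy matching on top of this.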
Next to the scan of each page, the service displays a line-by-line transcription produced by the neural networks. If you hover over a fragment of the transcription, the corresponding area is immediately highlighted on the scan.