Creating a corpus of texts for AI

Artificial intelligence (AI) has recently taken an increasingly important place in the progress of ICT worldwide, driven by the rapid development of computer technology and the resulting huge demand for machine translation. Moreover, AI is entering a new phase of development, in which it is built not only by a limited number of technology centers but also by medium-sized and even small companies. In other words, AI development has become competitive: technology companies are creating their own AI systems and developing entirely new neural networks.

Against the backdrop of these developments, computational linguistics, which underlies AI training, is becoming even more important.
One of the pressing tasks of computational linguistics, addressed as part of a toolkit for automated text analysis, is the automatic classification of texts. Training a classifier over a large set of subject areas calls for fully automating this process, which in turn requires a labeled corpus of texts.

With the rapid growth in the amount of information processed in recent decades, the need for methods and tools of computational linguistics keeps increasing. Automatic classification of texts means assigning a text, by some algorithm and with some probability, to a particular domain or one of its subdomains. Some algorithms rely only on data obtained directly from the text being classified; they have low accuracy and often disagree with the classification a human would make. Other algorithms use additional information (training text samples, subject dictionaries, lists of characteristic words, etc.), which requires additional data preparation.
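
To make the second kind of algorithm concrete, here is a minimal sketch of a classifier that relies on subject dictionaries. It is only an illustration under stated assumptions: the two domains, their word lists, and the function name classify are hypothetical, and the returned score is a crude share of dictionary hits rather than a calibrated probability.

```python
import re
from collections import Counter

# Hypothetical subject dictionaries: each domain maps to a set of
# characteristic words. Real dictionaries would be far larger and
# would have to be prepared in advance.
SUBJECT_DICTIONARIES = {
    "medicine": {"patient", "dose", "clinical", "therapy"},
    "finance": {"bank", "loan", "interest", "market"},
}


def classify(text, dictionaries=SUBJECT_DICTIONARIES):
    """Return (domain, score) for the best-matching domain.

    score is the share of all dictionary hits that belong to the
    winning domain; (None, 0.0) is returned when no word matches.
    """
    # Lowercase the text and keep only runs of Latin letters as tokens.
    tokens = re.findall(r"[a-z]+", text.lower())

    # Count how many tokens fall into each domain's dictionary.
    hits = Counter()
    for token in tokens:
        for domain, words in dictionaries.items():
            if token in words:
                hits[domain] += 1

    total = sum(hits.values())
    if total == 0:
        return None, 0.0

    domain, count = hits.most_common(1)[0]
    return domain, count / total
```

For example, classify("The patient received a clinical dose; the bank was closed.") finds three medical words and one financial word, so it assigns the text to "medicine" with a score of 0.75. The sketch also shows why such methods need extra data preparation: the quality of the result depends entirely on the dictionaries supplied.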
