Grigori Sidorov is currently a full research professor at the Natural Language Processing Laboratory, Center for Computing Research, Instituto Politécnico Nacional, Mexico City, Mexico.
He obtained his PhD in Computational and Structural Linguistics from “Lomonosov” Moscow State University, Russia, in 1996.
Dr. Sidorov is a National Researcher of Mexico (SNI) of Excellence Level 3 (highest) and academician of the Mexican Academy of Sciences.
He is Editor-in-Chief of the research journal “Computación y Sistemas” (http://www.cys.cic.ipn.mx), which is listed in the Index of Excellence of CONACYT (Ministry of Science of Mexico), Thomson-Reuters Web of Science (SciELO collection), and Scopus, among other indexes.
He is the author of more than 250 scientific publications, including 35 in ISI-JCR-indexed journals, and 7 books.
Keynote: Authorship attribution using word embeddings and syntactic n-grams
We discuss the problem of authorship attribution. First, we discuss the idea of the vector space model as applied to computational linguistics and present related concepts: syntactic n-grams and soft cosine similarity. Then we present two methods for automatic authorship attribution based on distributed representations at the document level. In the first approach, we learn document embeddings from words and word n-grams and then train an SVM classifier on these embeddings.
We conducted experiments on six datasets used in state-of-the-art work, and for the majority of the datasets we obtained comparable or better results. In the second approach, we explored document embeddings for cross-topic authorship attribution. We learn document embeddings based on different types of n-grams, including character n-grams, word n-grams, and n-grams of POS tags. We ran experiments on The Guardian corpus.
Experimental results show that our method outperforms word-based embeddings and character n-gram-based linear models, which are among the most effective approaches for identifying the writing style of an author.
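
As an illustration of the soft cosine similarity mentioned in the abstract, the following is a minimal sketch, not the speaker's implementation. Soft cosine generalizes the cosine measure of the vector space model by weighting each pair of features with a similarity value s[i][j]; with the identity matrix it reduces to the ordinary cosine. The three-word vocabulary, the similarity value 0.5, and the function name soft_cosine are assumptions made only for this example.

```python
import math


def soft_cosine(a, b, s):
    """Soft cosine similarity of vectors a and b.

    s is a square feature-similarity matrix: s[i][j] says how similar
    feature i is to feature j (1.0 on the diagonal).
    """
    def bilinear(x, y):
        # sum over all feature pairs of s[i][j] * x[i] * y[j]
        return sum(
            s[i][j] * x[i] * y[j]
            for i in range(len(x))
            for j in range(len(y))
        )

    denom = math.sqrt(bilinear(a, a)) * math.sqrt(bilinear(b, b))
    return bilinear(a, b) / denom if denom else 0.0


# Hypothetical vocabulary: ["play", "game", "tax"]
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

# Assumed similarities: "play" and "game" are treated as related (0.5).
related = [[1.0, 0.5, 0.0],
           [0.5, 1.0, 0.0],
           [0.0, 0.0, 1.0]]

doc_a = [1.0, 0.0, 0.0]  # contains only "play"
doc_b = [0.0, 1.0, 0.0]  # contains only "game"

plain = soft_cosine(doc_a, doc_b, identity)  # ordinary cosine
soft = soft_cosine(doc_a, doc_b, related)    # soft cosine

print(plain, soft)
```

With the identity matrix the two documents share no features and score zero; once the two features are declared related, the score becomes positive. In practice the similarity matrix would be derived from data, for example from distances between feature representations.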