• Nhut

Detection and Normalization of Temporal Expressions in French Text — Part 1: Build a Dataset

Last updated: Apr 8

This series “Detection and Normalization of Temporal Expressions in French Text” describes how to detect and normalize temporal expressions (date expressions) in French text. This article, the first one in the series, describes how to build a dataset containing temporal expressions if we do not possess one.


1. Introduction to the Series

In human languages, the same notions of date/time can be expressed in various ways. For instance, at the moment this article is written (Monday 14 Feb 2022), the day 07 Feb 2022 can be referred to by the following expressions:

  • Last Monday

  • Monday of last week

  • One week ago

  • 7 days ago

  • Feb 7 2022

  • 02/07/2022

  • Feb 7th this year

  • etc.

By Temporal Expression Normalization, we mean the problem of writing a temporal expression in a “standard” form so that machines can interpret it as the same notion of date/time. This is a non-AI problem that has already been tackled, for example by TimeML, the Markup Language for Temporal and Event Expressions. With this kind of normalization, machines are expected to infer date and time points/intervals exactly as a human would understand them.
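To make the idea of normalization concrete, here is a minimal sketch that resolves a few of the relative expressions listed above against a reference date. It is neither TimeML nor the approach developed later in this series: the function name `resolve` and the handful of supported patterns are assumptions chosen purely for illustration.

```python
import re
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve(expression: str, reference: date) -> date:
    """Resolve a small, hand-picked set of relative expressions to a date."""
    text = expression.lower().strip()

    # "7 days ago", "1 week ago", "2 weeks ago", ...
    match = re.fullmatch(r"(\d+) (day|week)s? ago", text)
    if match:
        amount, unit = int(match.group(1)), match.group(2)
        return reference - timedelta(days=amount * (7 if unit == "week" else 1))

    # "last monday": the most recent such weekday strictly before the reference
    match = re.fullmatch(r"last (\w+)", text)
    if match and match.group(1) in WEEKDAYS:
        target = WEEKDAYS.index(match.group(1))
        days_back = (reference.weekday() - target) % 7 or 7
        return reference - timedelta(days=days_back)

    raise ValueError(f"Unsupported expression: {expression!r}")


reference = date(2022, 2, 14)  # the Monday on which the article was written
for expression in ["Last Monday", "7 days ago", "1 week ago"]:
    print(expression, "->", resolve(expression, reference).isoformat())
```

All three expressions resolve to the same normalized value, which is exactly what a standard form is meant to guarantee.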

By Temporal Expression Detection, we mean the problem of detecting time expressions in human text or speech. This is an AI NLP (Natural Language Processing) problem that can be tackled with various modelling methods, such as Named-Entity Recognition (a.k.a. Token Classification) or Encoding-Decoding pipelines, where the output is a list of the temporal expressions found in the text, together with their positions if required.

1.1 Objective of the Series

In this article, we would like to combine the two problems mentioned above. We expect the machine to look at a human text like:

then to output something like:

Then, in the next step, we can use a non-AI approach to post-process the above format into absolute dates (time moments, time intervals, durations and frequencies as well):

Some of these ideas are already covered by the library dateparser, although the scope of this article differs considerably from that of dateparser. The objective is not to reimplement that library, but to explain how we can approach and deal with this kind of problem.


1.2 Use Cases

We can think of different use cases for the problems above, some of which have been covered in our company’s clients’ projects.

  • Detection and normalization of moments/periods where a customer requests a document. E.g. “Je voudrais un document pour les trois derniers mois.”

  • Indexation of pages/parts of a book/document based on the time they cover. E.g. Given a history book about World War II, where the content is not organized in chronological order, can we create an index of pages/paragraphs or even events based on the date/time they mention?

  • Pre-annotation of time expressions for the relation classification problem. E.g. In a large document, we would like to detect the moments/dates when any kind of event happened and link them with the (absolute) time. It would be more convenient to automatically annotate the temporal expressions and the events first, then further carry on linking them together.


1.3 Roadmap

  • Creation of the dataset: Since it is not possible to use customer data to illustrate the article and not easy to find a French corpus adapted for the purpose, we will start explaining how to construct a toy dataset that has a rich presence of temporal expressions.

  • Defining a normalized format for time expressions: For the purpose of illustration, it is not necessary to use the full, complex format of the TimeML markup language. Instead, we introduce a lighter format which should be enough to demonstrate the AI methods and the use case application later. After this step, we will annotate the dataset so that each text is associated with the normalized terms of its temporal expressions.

  • Fine-tuning a Hugging Face Transformer model of type “Sequence to Sequence” (Seq2seq), which allows us to do the job.

  • Demonstrate a use case with an application.


The 4 steps above will be detailed within the 4 articles of this series.

A potential generalization of this problem is the possibility to link events with dates, which requires relation classification models and will be studied in a future series.




2. Construction of Dataset

2.1 How to construct a useful dataset if we don’t have one

In practice, for any given data science problem, it is not easy to find an adequate dataset (one that is full of elements of interest). For our problem, we could not find a dataset with a full variety of temporal expression formats. Luckily, temporal expressions appear a lot in daily language, so the idea was to look into one or several sufficiently large corpora and filter out a subset rich in such expressions. We introduce here two French-language datasets that I came across on the Hugging Face dataset hub. They are not that large, so they can easily be imported to illustrate the idea.

  • https://huggingface.co/datasets/asi/wikitext_fr: Text from Wikipedia

  • https://huggingface.co/datasets/giga_fren: A dataset for machine translation between English and French where we will only use the French texts.


The approaches that can be used to extract useful parts from those datasets:

  1. Keyword-based methods: Search for temporal expressions using regular expressions (like last [A-Za-z]{3,6}day, \d+ (days?|weeks?|months?|years?) ago, etc.).

  2. Pre-trained models method: Use a pre-trained model that tackles the same problem — extraction of temporal expressions — to predict a part of the whole corpus. There are such open-source models available like dateparser, datefinder etc. although they do not handle all or most of the cases we intend to extract.

  3. Similarity methods: Select a sample query (N weeks ago, last year, February 4th, etc). Use a text-embedding model to encode paragraphs/sentences in the corpus as vectors, then look for the most similar paragraphs/sentences to the sample queries. The notion "most similar" is typically translated into a mathematical notion like cosine-similarity.
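As an illustration of the first approach (keyword-based methods), which is not developed further in this article, the two regular expressions quoted above can be compiled and applied as follows. This is only a sketch: the helper name `find_temporal_expressions` and the French pattern starting with "il y a" are assumptions added for the example, not patterns taken from our actual pipeline.

```python
import re

TEMPORAL_PATTERNS = [
    r"\blast [a-z]{3,6}day\b",                        # "last Monday"
    r"\b\d+ (?:days?|weeks?|months?|years?) ago\b",   # "3 weeks ago"
    # Hypothetical French counterpart, added for illustration only:
    r"\bil y a \d+ (?:jours?|semaines?|mois|ans?)\b",  # "il y a 2 ans"
]

# One case-insensitive regex that matches any of the patterns above.
TEMPORAL_REGEX = re.compile("|".join(TEMPORAL_PATTERNS), flags=re.IGNORECASE)


def find_temporal_expressions(text: str) -> list:
    """Return every substring of `text` matched by one of the patterns."""
    return [match.group(0) for match in TEMPORAL_REGEX.finditer(text)]


print(find_temporal_expressions("We met last Monday and again 3 weeks ago."))
print(find_temporal_expressions("Il est parti il y a 2 ans."))
```

A text would then be kept for the dataset whenever the returned list is non-empty. The obvious limitation is that only the formats we thought of in advance are found.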

To avoid selection bias introduced by choosing a single method and/or a single-source dataset, cleverly combining various approaches on different corpora was our way of building a good final dataset.

In this article, we will only illustrate the 3rd approach (Similarity methods).

2.2 Download and Read the Original Dataset

Let us begin with asi/wikitext_fr. Since the dataset is registered on the Hugging Face dataset hub, we can easily download it using datasets.load_dataset.

All the code below is meant to be run in a Python 3 kernel of a Jupyter notebook.
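A minimal sketch of this step could look like the following. The helper names `load_corpus` and `count_records` are ours, and the optional `config` argument is an assumption: depending on how the dataset is published and on the installed version of the datasets library, a configuration name may have to be passed (check the dataset card). The download itself is left commented out so that the cell can be read without network access.

```python
from typing import Mapping, Optional, Sized


def load_corpus(name: str = "asi/wikitext_fr", config: Optional[str] = None):
    """Download a corpus from the Hugging Face hub (needs `pip install datasets`)."""
    # Imported lazily so that this cell loads even without the library installed.
    from datasets import load_dataset

    # `config` is an assumption: some hub datasets require a configuration name.
    return load_dataset(name, config)


def count_records(corpus: Mapping[str, Sized]) -> int:
    """Total number of records over all splits (train, validation, test, ...)."""
    return sum(len(split) for split in corpus.values())


# Usage in a notebook (triggers a download):
# corpus = load_corpus()
# print(corpus)                 # shows the splits and their sizes
# print(count_records(corpus))  # total number of records
```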


Output

The output corpus is a datasets.DatasetDict already split into "train", "validation" and "test" sections, together containing more than 380000 records.

Output

Let’s have a look at some parts of the dataset to see what it looks like.

Output

The output shows that each item in the dataset is an object with only one field — "paragraph". This structure will be used in the next section.

2.3 The Sentence Embedding Model

To look for items similar to our queries, we need a text-embedding model as a backbone. We will use a model trained by my colleague Van-Tuan DANG (see how this has been done in this article). The trained model has been uploaded to the Hugging Face model hub under the name: dangvantuan/sentence-camembert-large.


What is inside sim_model? Evaluating sim_model in a notebook cell displays each of its layers.


Output

We will not go into further details as it is not in the scope of this article. We just want to get an idea of what the similarity model looks like. It is in fact an object of type SentenceTransformer defined by the sentence-transformers library.

The lower layer says that only the first 128 tokens of the text will be taken into account.

The upper layer Pooling says that the output of this model should be a vector of dimension 1024 per item.
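The two values mentioned above can also be read programmatically rather than from the printed layers. The sketch below assumes the model is a `SentenceTransformer` object, which exposes `max_seq_length` and `get_sentence_embedding_dimension()`; the helper names `load_similarity_model` and `describe_model` are ours. Loading the model downloads its weights, so that call is left commented out.

```python
def load_similarity_model(name: str = "dangvantuan/sentence-camembert-large"):
    """Load the sentence-embedding model (needs `pip install sentence-transformers`)."""
    # Imported lazily so that this cell loads even without the library installed.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(name)


def describe_model(model) -> dict:
    """Summarize the two properties discussed above for a SentenceTransformer."""
    return {
        # Longer inputs are truncated to this many tokens.
        "max_seq_length": model.max_seq_length,
        # Size of the vector produced for each input text.
        "embedding_dimension": model.get_sentence_embedding_dimension(),
    }


# Usage in a notebook (triggers a download):
# sim_model = load_similarity_model()
# print(describe_model(sim_model))
```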

Preparing to save the embeddings (encoded texts under the format of tensors)

To encode every item in the corpus, we use the sentence-embedding model to convert the texts into a tensor (one vector of real numbers per text). We can store this large tensor with torch in a file with the extension .pt, say embeddings.pt. The embeddings are then used to select the paragraphs that are similar to our queries.

2.4 Encode the Texts

As the corpus contains lots of short texts of several characters with no possibility to contain dates, we may try focusing on the long texts first (say 100 characters or more). Also, as the corpus volume is large, we will only treat a subsample of it (say 200000 first items of the train split).
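The filtering described above can be sketched as a small helper. The name `select_long_texts` is ours, and the order of operations (take the first `max_items` records, then drop the short ones) is an assumption, since the text does not say which comes first.

```python
from itertools import islice
from typing import Iterable, List, Mapping


def select_long_texts(
    records: Iterable[Mapping[str, str]],
    field: str = "paragraph",
    min_chars: int = 100,
    max_items: int = 200_000,
) -> List[str]:
    """Keep texts of at least `min_chars` characters among the first `max_items` records."""
    texts = []
    for record in islice(records, max_items):
        text = record[field].strip()
        if len(text) >= min_chars:
            texts.append(text)
    return texts


# Usage in a notebook, assuming `corpus` is the loaded DatasetDict:
# corpus_sentences = select_long_texts(corpus["train"])
# print(len(corpus_sentences))
```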


The following embed function uses the model's encode method (SentenceTransformer.encode) to encode the texts, then saves the resulting tensor with torch.save once and for all. On later calls, the function checks whether the embeddings file already exists and, if so, simply loads it.
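A minimal sketch of such a compute-once, then-load function is given below. To keep the caching logic readable and testable on its own, the encoder and the save/load functions are passed as parameters; this signature is an assumption of the sketch, not necessarily that of the original notebook. By default, saving and loading rely on torch.save and torch.load.

```python
import os
from typing import Any, Callable, Sequence


def _torch_save(obj: Any, path: str) -> None:
    import torch  # lazy import: only needed when the default is used

    torch.save(obj, path)


def _torch_load(path: str) -> Any:
    import torch  # lazy import: only needed when the default is used

    return torch.load(path)


def embed(
    sentences: Sequence[str],
    encode: Callable[[Sequence[str]], Any],
    weights_file: str = "embeddings.pt",
    save: Callable[[Any, str], None] = _torch_save,
    load: Callable[[str], Any] = _torch_load,
) -> Any:
    """Return the embeddings of `sentences`, computing them only if no file exists."""
    if os.path.exists(weights_file):
        print(f"Embeddings file found, loading {weights_file}")
        return load(weights_file)

    print(f"Encoding {len(sentences)} texts")
    embeddings = encode(sentences)
    save(embeddings, weights_file)
    return embeddings


# Usage in a notebook, assuming `sim_model` and `corpus_sentences` exist:
# embeddings = embed(
#     corpus_sentences,
#     lambda texts: sim_model.encode(texts, convert_to_tensor=True),
#     "embeddings.pt",
# )
# print(embeddings.shape)
```

Note that the cache is keyed on the file name only: if the list of texts or the model changes, the file must be deleted or renamed by hand.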


Output (when the embeddings file is found)

Each item should now be a vector of 1024 coordinates, where 1024 is the output size of the similarity model, as expected.

Output

2.5 Search by Similarity

The following is an example of queries that can be used as a reference to look for similar text in the dataset.


The following function search_by_query loops on each query in the list queries, encodes it using the loaded model, then looks for the items that have the highest similarity in the corpus corpus_sentences (whose encoded tensors are stored in embedded_weights_file). However, it will only keep the results with cosine similarity above a particular threshold. If there are still too many examples, it will return only the top nb_examples if specified.
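The ranking logic of such a function can be sketched without any dependency, on plain lists of numbers. This is a simplification and not the notebook's implementation: here the queries are assumed to be already encoded (`query_vectors` maps each query text to its vector) and the corpus embeddings are passed directly instead of being loaded from embedded_weights_file. The default threshold of 0.5 is an arbitrary assumption.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either one is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search_by_query(
    query_vectors: Dict[str, Sequence[float]],
    corpus_vectors: Sequence[Sequence[float]],
    corpus_sentences: Sequence[str],
    threshold: float = 0.5,
    nb_examples: Optional[int] = None,
) -> Dict[str, List[Tuple[int, float, str]]]:
    """For each query, return (index, score, sentence) hits, best first."""
    results = {}
    for query, query_vector in query_vectors.items():
        scored = [
            (index, cosine_similarity(query_vector, vector))
            for index, vector in enumerate(corpus_vectors)
        ]
        # Keep only the items whose similarity is above the threshold.
        kept = [item for item in scored if item[1] > threshold]
        kept.sort(key=lambda item: item[1], reverse=True)
        if nb_examples is not None:
            kept = kept[:nb_examples]
        results[query] = [(i, score, corpus_sentences[i]) for i, score in kept]
    return results
```

On the real corpus, the pairwise loop would be replaced by a single matrix operation on the stored tensors, but the filtering and truncation steps stay the same.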



Let’s apply the function to search for the examples expected to contain temporal expressions.

Let’s print the result and store the selected items in a text file. The following script prints the result in a clear HTML format: each query is followed by its top 3 examples, together with their index and similarity score.
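A possible way to produce such a display is sketched below. The structure of `results` (a dictionary mapping each query to a list of `(index, score, sentence)` tuples sorted by decreasing score) and the function name `render_results_html` are assumptions made for this example.

```python
from html import escape
from typing import Dict, List, Tuple


def render_results_html(
    results: Dict[str, List[Tuple[int, float, str]]],
    top_k: int = 3,
) -> str:
    """Build an HTML fragment listing the `top_k` best hits of each query."""
    parts = []
    for query, hits in results.items():
        parts.append(f"<h4>Query: {escape(query)}</h4>")
        parts.append("<ol>")
        for index, score, sentence in hits[:top_k]:
            # escape() keeps corpus text from being interpreted as HTML.
            parts.append(
                f"<li><b>#{index}</b> (score {score:.3f}): {escape(sentence)}</li>"
            )
        parts.append("</ol>")
    return "\n".join(parts)


# Usage in a notebook:
# from IPython.display import HTML, display
# display(HTML(render_results_html(results)))
```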


Output



We observe that most of the examples contain some form of time expression. For example, the query term “en 1970” is considered by the model to be similar to “769” and “1655” in the first result and to “1969” in the third result. There are still false positives, which is why we should not rely on a single approach and a single model to look for examples.


2.6 Save the Selected Items

We can write the selected texts to a text file, which will be used in the next steps.
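A minimal sketch of this step follows. It assumes, as an illustration, that the search results are stored as a dictionary mapping each query to a list of `(index, score, sentence)` tuples. Writing one text per line and dropping duplicates (the same paragraph may be returned for several queries) are design choices of this sketch.

```python
from typing import Dict, List, Tuple


def save_selected_texts(
    results: Dict[str, List[Tuple[int, float, str]]],
    output_path: str,
) -> int:
    """Write each selected text once, one per line; return the number written."""
    seen = set()
    with open(output_path, "w", encoding="utf-8") as handle:
        for hits in results.values():
            for _index, _score, sentence in hits:
                # Collapse whitespace so that each text fits on a single line.
                line = " ".join(sentence.split())
                if line and line not in seen:
                    seen.add(line)
                    handle.write(line + "\n")
    return len(seen)


# Usage in a notebook (the file name is only an example):
# nb_saved = save_selected_texts(results, "selected-raw-text.txt")
# print(f"{nb_saved} texts saved")
```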


Output


2.7 Repeat with Other Datasets

The first dataset seems to consist of texts from historical documents. We can diversify our data by adding other datasets. Let’s look at the second one, introduced in section 2.1, which is a dataset used for machine translation.


Output

Let’s see some content.

Output


We apply exactly the same processing to the field "translation.fr" of this dataset: retrieve the text, encode it as tensors, search for items similar to the queries and finally store the result. Concatenating the items selected from the two datasets, we get something like this file: nhutljn-temporal-expression-selected-raw-text.txt. Of course, to complete the dataset, we should try other approaches (keywords, pre-trained models, other similarity models) on more datasets.
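The only step that differs from the first dataset is the retrieval of the text, since the French sentence is nested one level deeper. The sketch below assumes that each record has the shape `{"translation": {"en": ..., "fr": ...}}`, which is what the field name "translation.fr" suggests; the helper name `french_texts` is ours.

```python
from typing import Iterable, List, Mapping


def french_texts(
    records: Iterable[Mapping[str, Mapping[str, str]]],
    lang: str = "fr",
) -> List[str]:
    """Extract the `lang` side of each translation pair."""
    return [record["translation"][lang] for record in records]


# Toy records mimicking the assumed structure of the dataset:
sample_records = [
    {"translation": {"en": "Last year", "fr": "L'année dernière"}},
    {"translation": {"en": "Two weeks ago", "fr": "Il y a deux semaines"}},
]
print(french_texts(sample_records))
```

The resulting list of strings can then go through the same length filter, encoding and similarity search as the Wikipedia paragraphs.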

2.8 Recap

During this section, we presented the text similarity method via text embedding to look for useful data from a large text corpus. The output file nhutljn-temporal-expression-selected-raw-text.txt will be used in the next section for annotation.

References

  • [1] TimeML, http://timeml.org/site/index.html

  • [2] dateparser, https://dateparser.readthedocs.io/en/latest/

  • [3] “Sequence to Sequence” in Hugging Face Transformers

  • [4] Dataset asi/wikitext_fr, https://huggingface.co/datasets/asi/wikitext_fr

  • [5] Dataset giga_fren, https://huggingface.co/datasets/giga_fren

  • [6] Sentence-camembert-large, a model by Van-Tuan DANG, https://huggingface.co/dangvantuan/sentence-camembert-large

Acknowledgement

Thanks to our colleagues Al Houceine KILANI and Ismail EL HATIMI for the article review.

About

Nhut DOAN NGUYEN has been a data scientist at La Javaness since March 2021.


