Update README.md
README.md
CHANGED
@@ -2,8 +2,23 @@
In contrast with most text-segmentation approaches, Estienne is based on token classification. Editorial structures are identified similarly to named-entity recognition.

+Estienne was trained on 2,000 examples of manually annotated texts, excerpted at random from three very large datasets collected by Pleias: Common Corpus (cultural heritage texts in the public domain), Marianne-OpenData (French/English administrative documents) and OpenScientificPile (scientific publications under free licenses, indexed on OpenAlex).
+
+Given the diversity of the corpus, Estienne should work on a wide range of document formats in European languages.
+
+As DeBERTa removes newlines by default and its tokenizer has no support for them, they should be replaced with pilcrows (¶).
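
The newline replacement described above can be sketched as a small preprocessing step. This is a minimal illustration, not code from the Estienne repository: the function names are hypothetical, and the one-for-one replacement without surrounding spaces is an assumption about the expected input format.

```python
PILCROW = "\u00b6"  # the pilcrow character (¶)


def newlines_to_pilcrows(text: str) -> str:
    """Replace every line break with a pilcrow so it survives tokenization.

    Assumption: each newline maps to exactly one pilcrow, with no added spaces.
    """
    # Normalize Windows (\r\n) and old Mac (\r) line endings first.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", PILCROW)


def pilcrows_to_newlines(text: str) -> str:
    """Inverse step, for restoring line breaks after segmentation."""
    return text.replace(PILCROW, "\n")
```

Note that the inverse step would also convert any pilcrow already present in the source text, so documents that use ¶ as a genuine character may need a different placeholder.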

Estienne supports the following segmentations:
+* **Text**
+* **Separator** - a segmentation separator, generally based on newlines (that is, ¶), with some variation depending on how the model understands the text's segmentation.
+* **Title**
+* **Table**
+* **Dialog** - any kind of speaker-attributed intervention.
+* **Bibliography** - a specific bibliographic reference, either in a bibliography section or in a footnote.
+* **Contact** - personal information; especially useful for PII removal.
+* **Paratext** - any non-meaningful text included in standard documents, such as headers, page numbers, or section recalls.
+* **Author** - author names and signatures.
+* **Date** - statements of date and time, common in letters and newspaper articles.
+* **Keyword** - lists of keywords, especially common in scientific publications.
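
Because the labels above are assigned per token, downstream code usually needs to merge consecutive tokens that share a label into segments. The sketch below shows one way to do that, under stated assumptions: the input is a list of `(token, label)` pairs with plain label names (no `B-`/`I-` prefixes), tokens are rejoined with single spaces, and `group_segments` and `drop_labels` are hypothetical helpers rather than part of the model's tooling.

```python
from itertools import groupby
from typing import List, Sequence, Tuple

TaggedToken = Tuple[str, str]  # (token, label), an assumed output shape
Segment = Tuple[str, str]      # (label, text)


def group_segments(tagged_tokens: Sequence[TaggedToken]) -> List[Segment]:
    """Merge runs of consecutive tokens that carry the same label."""
    segments = []
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        # Rejoin the tokens of this run with single spaces (an assumption).
        text = " ".join(token for token, _ in run)
        segments.append((label, text))
    return segments


def drop_labels(segments: Sequence[Segment],
                unwanted: Sequence[str] = ("Paratext", "Contact")) -> List[Segment]:
    """Filter out segments whose label is in `unwanted`, e.g. for PII removal."""
    return [(label, text) for label, text in segments if label not in unwanted]


tagged = [
    ("Chapter", "Title"), ("One", "Title"),
    ("\u00b6", "Separator"),
    ("It", "Text"), ("was", "Text"), ("late.", "Text"),
    ("\u00b6", "Separator"),
    ("12", "Paratext"),
]
segments = group_segments(tagged)
cleaned = drop_labels(segments)
```

Here `segments` holds five entries (a title, two separators, a text run, and a page number), and `cleaned` keeps everything except the `Paratext` entry.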

The model is named after the humanist Henri Estienne, who introduced many text-segmentation practices still in use in scholarly editing today.