nielsr (HF Staff) committed · Commit 2d0b09c · verified · 1 Parent(s): 9094568

Improve model card: Add fill-mask pipeline tag, license, language, and domain tags


This PR improves the model card by:

* Adding the `license: apache-2.0` metadata.
* Specifying the `pipeline_tag: fill-mask`, enabling better discoverability at https://huggingface.co/models?pipeline_tag=fill-mask.
* Including relevant `language: ja` and additional `tags` such as `japanese`, `pharmaceutical`, `bert`, and `continual-pretraining`.
* Adding a direct link to the paper and the GitHub repository at the top of the model card for better visibility.

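The added front-matter fields are what the Hub's model search filters key on. As a rough illustration of the discoverability point above, here is a minimal sketch assuming a recent `huggingface_hub` client; the filter values simply mirror the metadata added in this PR:

```python
from huggingface_hub import HfApi

api = HfApi()

# List fill-mask models tagged as Japanese + pharmaceutical, which is how this
# model card becomes filterable once the new metadata is merged.
for model in api.list_models(
    pipeline_tag="fill-mask",
    language="ja",
    tags=["pharmaceutical"],
    limit=20,
):
    print(model.id)
```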
Files changed (1)
  1. README.md +24 -17
README.md CHANGED
@@ -1,16 +1,23 @@
  ---
  library_name: transformers
- tags: []
+ license: apache-2.0
+ language: ja
+ pipeline_tag: fill-mask
+ tags:
+ - japanese
+ - pharmaceutical
+ - bert
+ - continual-pretraining
  ---

- # Model Card
+ # JpharmaBERT: A Japanese Language Model for Pharmaceutical NLP

- <!-- Provide a quick summary of what the model is/does. -->
- Our **JpharmaBERT (base)** is a continually pre-trained version of the BERT model ([tohoku-nlp/bert-base-japanese-v3](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3)), further trained on pharmaceutical data — the same dataset used for [eques/jpharmatron](https://huggingface.co/EQUES/JPharmatron-7B).
+ [📚 Paper](https://huggingface.co/papers/2505.16661) - [💻 Code](https://github.com/EQUES-AI/JpharmaBERT)

- # Examoke Usage
+ This is the **JpharmaBERT (base)** model, presented in the paper [A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP](https://huggingface.co/papers/2505.16661). It is a continually pre-trained version of the BERT model ([tohoku-nlp/bert-base-japanese-v3](https://huggingface.co/tohoku-nlp/bert-base-japanese-v3)), further trained on pharmaceutical data.
+
+ # Example Usage

- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
  ```python
  import torch
  from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
@@ -34,25 +41,25 @@ for result in results:

  ### Training Data

- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
  We used the same dataset as [eques/jpharmatron](https://huggingface.co/EQUES/JPharmatron-7B) for training our JpharmaBERT, which consists of:
- - Japanese text data (2B tokens) collected from pharmaceutical documents such as academic papers and package inserts
- - English data (8B tokens) obtained from PubMed abstracts
- - Pharmaceutical-related data (1.2B tokens) extracted from the multilingual CC100 dataset

- After removing duplicate entries across these sources, the final dataset contains approximately 9 billion tokens.
+ * Japanese text data (2B tokens) collected from pharmaceutical documents such as academic papers and package inserts
+ * English data (8B tokens) obtained from PubMed abstracts
+ * Pharmaceutical-related data (1.2B tokens) extracted from the multilingual CC100 dataset
+
+ After removing duplicate entries across these sources, the final dataset contains approximately 9 billion tokens.
  (For details, please refer to our paper about Jpharmatron: [link](https://arxiv.org/abs/2505.16661))

  #### Training Hyperparameters

  The model was continually pre-trained with the following settings:

- - Mask probability: 15%
- - Maximum sequence length: 512 tokens
- - Number of training epochs: 6
- - Learning rate: 1e-4
- - Warm-up steps: 10,000
- - Per-device training batch size: 64
+ * Mask probability: 15%
+ * Maximum sequence length: 512 tokens
+ * Number of training epochs: 6
+ * Learning rate: 1e-4
+ * Warm-up steps: 10,000
+ * Per-device training batch size: 64

  ## Model Card Authors
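The hyperparameters listed in the updated card correspond to a fairly standard masked-LM continual-pretraining setup. Below is a minimal sketch of how they could be wired into the `transformers` Trainer; this is an illustration under that assumption, not the authors' actual training script, and the corpus file and output directory are placeholders:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Base checkpoint being continually pre-trained (its Japanese tokenizer needs
# fugashi and unidic-lite installed).
BASE_MODEL = "tohoku-nlp/bert-base-japanese-v3"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL)

# Placeholder corpus file; the pharmaceutical dataset itself is not released here.
dataset = load_dataset("text", data_files={"train": "pharma_corpus.txt"})["train"]

def tokenize(batch):
    # Maximum sequence length: 512 tokens
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Mask probability: 15%
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="jpharmabert-base",      # placeholder output directory
    num_train_epochs=6,                 # Number of training epochs: 6
    learning_rate=1e-4,                 # Learning rate: 1e-4
    warmup_steps=10_000,                # Warm-up steps: 10,000
    per_device_train_batch_size=64,     # Per-device training batch size: 64
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```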