Commit b8e0ab4
Parent: 0df683e
Update README.md
README.md CHANGED
@@ -9,21 +9,21 @@ metrics:
 - wer
 - cer
 widget:
-- src: https://raw.githubusercontent.com/
+- src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/random_2.jpg
   example_title: 랜덤 문장 1
-- src: https://raw.githubusercontent.com/
+- src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/random_6.jpg
   example_title: 랜덤 문장 2
-- src: https://raw.githubusercontent.com/
+- src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/chatbot_3.jpg
   example_title: 챗봇 1
-- src: https://raw.githubusercontent.com/
+- src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/chatbot_5.jpg
   example_title: 챗봇 2
-- src: https://raw.githubusercontent.com/
+- src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/news_1.jpg
   example_title: 뉴스 1
-- src: https://raw.githubusercontent.com/
+- src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/news_3.jpg
   example_title: 뉴스 2
-- src: https://raw.githubusercontent.com/
+- src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/nsmc_1.jpg
   example_title: 영화 리뷰 1
-- src: https://raw.githubusercontent.com/
+- src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/nsmc_2.jpg
   example_title: 영화 리뷰 2
 ---
 
@@ -37,9 +37,11 @@ TrOCR has not yet released a multilingual model including Korean, so we trained
 
 ### Text data
 We created training data by processing three types of datasets.
+
 - News summarization dataset: https://huggingface.co/datasets/daekeun-ml/naver-news-summarization-ko
 - Naver Movie Sentiment Classification: https://github.com/e9t/nsmc
 - Chatbot dataset: https://github.com/songys/Chatbot_data
+
 For efficient data collection, each sentence was separated by a sentence separator library (Kiwi Python wrapper; https://github.com/bab2min/kiwipiepy), and as a result, 637,401 samples were collected.
 
 ### Image Data
@@ -76,7 +78,7 @@ processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
 model = VisionEncoderDecoderModel.from_pretrained("daekeun-ml/ko-trocr-base-nsmc-news-chatbot")
 tokenizer = AutoTokenizer.from_pretrained("daekeun-ml/ko-trocr-base-nsmc-news-chatbot")
 
-url = "https://raw.githubusercontent.com/aws-samples/
+url = "https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/news_1.jpg"
 response = requests.get(url)
 img = Image.open(BytesIO(response.content))
 
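Reviewer note: the "Text data" hunk says each source document was split into sentences with Kiwi before the samples were counted. The following is a minimal sketch of what that step might look like, not the author's actual preprocessing script. The helper names (`regex_splitter`, `make_kiwi_splitter`, `collect_samples`) and the sample documents are hypothetical. The default splitter is a naive regex stand-in so the sketch runs without extra dependencies; the Kiwi-backed splitter assumes `kiwipiepy` is installed and uses its `Kiwi.split_into_sents` method.

```python
import re
from typing import Callable, Iterable, List

# A splitter maps one document to a list of sentence strings.
Splitter = Callable[[str], List[str]]


def regex_splitter(text: str) -> List[str]:
    """Naive stand-in (assumption): split after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def make_kiwi_splitter() -> Splitter:
    """Build a Kiwi-backed splitter; requires `pip install kiwipiepy`."""
    from kiwipiepy import Kiwi  # imported lazily so the sketch runs without it

    kiwi = Kiwi()  # create the analyzer once and reuse it for every document

    def split(text: str) -> List[str]:
        # split_into_sents returns sentence objects; keep only their text
        return [sent.text for sent in kiwi.split_into_sents(text)]

    return split


def collect_samples(documents: Iterable[str],
                    splitter: Splitter = regex_splitter) -> List[str]:
    """Flatten all documents into one list of sentence-level samples."""
    samples: List[str] = []
    for doc in documents:
        samples.extend(splitter(doc))
    return samples


# Hypothetical input standing in for the news, review, and chatbot corpora.
documents = ["오늘 날씨가 좋다. 산책을 가자!", "영화가 재미있었다."]
samples = collect_samples(documents)
print(len(samples), samples)
```

To use Kiwi instead of the regex stand-in, pass `splitter=make_kiwi_splitter()`. Sentence boundaries from Kiwi may differ from the regex on informal text such as chat logs and reviews, so the resulting sample count would not necessarily match.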