Spaces: Running on Zero
baseline08_beta0.3.0_01Oct25: refactor OAuth login and Marker converter; drop llm_client; add force_ocr to phase in Marker CLI options
Files changed:
- README.md +17 -14
- converters/extraction_converter.py +24 -36
- converters/pdf_to_md.py +18 -39
- file_handler/file_utils.py +29 -10
- llm/llm_login.py +14 -0
- requirements.txt +8 -5
- ui/gradio_ui.py +113 -119
- utils/config.py +7 -0
README.md
CHANGED
@@ -82,11 +82,11 @@ requires-python: ">=3.12"
 [](https://www.python.org/)
 [](LICENSE)
 
-A Gradio-based web application for converting PDF and
+A Gradio-based web application for converting PDF, HTML and Word documents to Markdown format. Powered by the Marker library (a pipeline of deep learning models for document parsing) and optional LLM integration for enhanced processing. Supports batch processing of files and directories via an intuitive UI.
 
 ## Features
-- **PDF to Markdown**: Extract text, tables, and images from PDFs using Marker.
-- **HTML to Markdown**: Convert HTML files to clean Markdown.
+- **PDF to Markdown**: Extract text, tables, and images from PDFs, HTMLs and Word documents using Marker.
+- **HTML to Markdown**: Convert HTML files to clean Markdown. #Deprecated
 - **Batch Processing**: Upload multiple files or entire directories.
 - **LLM Integration**: Optional use of Hugging Face or OpenAI models for advanced conversion (e.g., via Llama or GPT models).
 - **Customizable Settings**: Adjust model parameters, output formats (Markdown/HTML), page ranges, and more via the UI.

@@ -104,21 +104,21 @@ parserpdf/
 ├── converters/                 # Conversion logic
 │   ├── __init__.py
 │   ├── extraction_converter.py # Document extraction utilities
-│   ├── pdf_to_md.py            # Marker-based PDF → Markdown
-│   ├── html_to_md.py           # HTML → Markdown
+│   ├── pdf_to_md.py            # Marker-based PDF, HTML, Word → Markdown
+│   ├── html_to_md.py           # HTML → Markdown #Deprecated
 │   ├── md_to_pdf.py            # Markdown → PDF (pending full implementation)
 ├── file_handler/               # File handling utilities
 │   ├── __init__.py
 │   ├── file_utils.py           # Helpers for files, directories, and paths
 ├── llm/                        # LLM client integrations
 │   ├── __init__.py
-│   ├── hf_client.py            # Hugging Face client wrapper
-│   ├── openai_client.py        # Marker OpenAI client
+│   ├── hf_client.py            # Hugging Face client wrapper ##PutOnHold
+│   ├── openai_client.py        # Marker OpenAI client ##NotFullyImplemented
 │   ├── llm_login.py            # Authentication handlers
 │   ├── provider_validator.py   # Provider validation
 ├── ui/                         # Gradio UI components
 │   ├── __init__.py
-│   ├── gradio_ui.py            # UI layout
+│   ├── gradio_ui.py            # UI layout, event handlers and coordination
 ├── utils/                      # Utility modules
 │   ├── __init__.py
 │   ├── config.py               # Configuration constants

@@ -132,8 +132,8 @@ parserpdf/
 │   ├── output_dir/             # Output directory
 │   ├── pdf/                    # Sample PDFs
 ├── logs/                       # Log files (gitignored)
-├── tests/                      # Unit tests
-├── tests_converter.py          # tests for converters
+├── tests/                      # Unit tests ##ToBeUpdated
+│   ├── tests_converter.py      # tests for converters
 ├── scrapyard/                  # Development scraps

@@ -165,10 +165,11 @@ parserpdf/
 HF_TOKEN=hf_xxx
 OPENAI_API_KEY=sk-xxx
 ```
+- HuggingFace login (oauth) integrated with Gradio:
 
 4. Install Marker (if not in requirements.txt):
 ```
-pip install marker-pdf
+pip install marker-pdf[full]
 ```
 
 ## Usage

@@ -180,7 +181,7 @@ parserpdf/
 2. Open the provided local URL (e.g., http://127.0.0.1:7860) in your browser.
 
 3. In the UI:
-   - Upload PDF/HTML files or directories via the "PDF &
+   - Upload PDF/HTML/Word files or directories via the "PDF, HTML & Word → Markdown" tab.
    - Configure LLM/Marker settings in the accordions (e.g., select provider, model, tokens).
    - Click "Process All Uploaded Files" to convert.
    - View logs, JSON output, and download generated Markdown files.

@@ -208,13 +209,15 @@ parserpdf/
 ## Limitations & TODO
 - Markdown → PDF is pending full implementation.
 - HTML tab is deprecated; use main tab for mixed uploads.
-- Large files/directories may require increased `max_workers
+- Large files/directories may require increased `max_workers` and higher processing power.
 - No JSON/chunks output yet (flagged for future).
 
 ## Contributing
 Fork the repo, create a branch, and submit a PR.
+- GitHub
+- HuggingFace Space Community
 
-Ensure tests pass: - verify the application's functionality.
+Ensure tests pass: - verify the application's functionality. ##TardyOutdated
 ```
 pytest tests/
 ```
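The `.env` variables documented above feed a fallback chain: when an LLM service is selected, the converter exports the first available token as `OPENAI_API_KEY` for Marker (checking `OPENAI_API_KEY`, then `GEMINI_API_KEY`, `HF_TOKEN`, `HUGGINGFACEHUB_API_TOKEN`). A minimal sketch of that lookup; the helper name `resolve_api_token` is hypothetical, not from the repo:

```python
import os
from typing import Optional

def resolve_api_token() -> Optional[str]:
    """Return the first non-empty token found, mirroring the fallback
    order the converter uses when exporting OPENAI_API_KEY for Marker."""
    for var in ("OPENAI_API_KEY", "GEMINI_API_KEY",
                "HF_TOKEN", "HUGGINGFACEHUB_API_TOKEN"):
        token = os.environ.get(var)
        if token:
            return token
    return None
```

An explicitly passed token would still take precedence in the app itself; this helper only covers the environment-variable half of the chain.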
converters/extraction_converter.py
CHANGED
@@ -53,6 +53,7 @@ class DocumentConverter:
         output_format: str = "markdown",
         output_dir: Optional[Union[str, Path]] = "output_dir",
         use_llm: Optional[bool] = None, #bool = False, #Optional[bool] = False, #True,
+        force_ocr: Optional[bool] = None, #bool = False,
         page_range: Optional[str] = None, #str = None #Optional[str] = None,
     ):

@@ -68,6 +69,7 @@ class DocumentConverter:
         self.max_retries = max_retries ## pass to __call__
         self.output_dir = output_dir ## "output_dir": settings.DEBUG_DATA_FOLDER if debug else output_dir,
         self.use_llm = use_llm if use_llm else False #use_llm[0] if isinstance(use_llm, tuple) else use_llm, #False, #True,
+        self.force_ocr = force_ocr if force_ocr else False
         #self.page_range = page_range[0] if isinstance(page_range, tuple) else page_range ##SMY: iterating twice because self.page casting as hint type tuple!
         self.page_range = page_range if page_range else None
         # self.page_range = page_range[0] if isinstance(page_range, tuple) else page_range if isinstance(page_range, str) else None, ##Example: "0,4-8,16" ##Marker parses as List[int] #]debug #len(pdf_file)

@@ -80,6 +82,7 @@ class DocumentConverter:
 
         # 0) Instantiate the LLM Client (OPENAIChatClient): Get a provider-agnostic chat function
         ##SMY: #future. Plan to integrate into Marker: uses its own LLM services (clients). As at 1.9.2, there's no huggingface client service.
+        '''
         try:
             self.client = OpenAIChatClient(
                 model_id=model_id,

@@ -95,16 +98,17 @@ class DocumentConverter:
             tb = traceback.format_exc() #exc.__traceback__
             logger.exception(f"❌ Error initialising OpenAIChatClient: {exc}\n{tb}")
             raise RuntimeError(f"❌ Error initialising OpenAIChatClient: {exc}\n{tb}") #.with_traceback(tb)
-
+        '''
         # 1) # Define the custom configuration for the Hugging Face LLM.
         # Use typing.Dict and typing.Any for flexible dictionary type hints
         try:
             self.config_dict: Dict[str, Any] = self.get_config_dict(model_id=model_id, llm_service=str(self.llm_service), output_format=output_format)
-
-
+
+            ##SMY: execute if page_range is none. `else None` ensures valid syntactic expression
             ##SMY: if falsely empty tuple () or None, pop the "page_range" key-value pair, else do nothing if truthy tuple value (i.e. keep as-is)
             self.config_dict.pop("page_range", None) if not self.config_dict.get("page_range") else None
             self.config_dict.pop("use_llm", None) if not self.config_dict.get("use_llm") or self.config_dict.get("use_llm") is False or self.config_dict.get("use_llm") == 'False' else None
+            self.config_dict.pop("force_ocr", None) if not self.config_dict.get("force_ocr") or self.config_dict.get("force_ocr") is False or self.config_dict.get("force_ocr") == 'False' else None
 
             logger.log(level=20, msg="✔️ config_dict custom configured:", extra={"service": "openai"}) #, "config": str(self.config_dict)})
 

@@ -124,27 +128,17 @@ class DocumentConverter:
             logger.exception(f"❌ Error parsing/processing custom config_dict: {exc}\n{tb}")
             raise RuntimeError(f"❌ Error parsing/processing custom config_dict: {exc}\n{tb}") #.with_traceback(tb)
 
-        # 3)
-        try:
-            ##self.artifact_dict: Dict[str, Any] = self.get_create_model_dict ##SMY: Might have to eliminate function afterall
-            #self.artifact_dict: Dict[str, Type[BaseModel]] = create_model_dict() ##SMY: BaseModel for Any??
-            self.artifact_dict = {} ##dummy
-            ##logger.log(level=20, msg="✔️ Create artifact_dict and llm_service retrieved:", extra={"llm_service": self.llm_service})
-
-        except Exception as exc:
-            tb = traceback.format_exc() #exc.__traceback__
-            logger.exception(f"❌ Error creating artifact_dict or retrieving LLM service: {exc}\n{tb}")
-            raise RuntimeError(f"❌ Error creating artifact_dict or retrieving LLM service: {exc}\n{tb}") #.with_traceback(tb)
-
-        # 4) Load models if not already loaded in reload mode
+        # 3) Load models if not already loaded in reload mode
         from globals import config_load_models
         try:
-            if
+            if config_load_models.model_dict:
+                model_dict = config_load_models.model_dict
+            #elif not config_load_models.model_dict or 'model_dict' not in globals():
+            else:
                 model_dict = load_models()
             '''if 'model_dict' not in globals():
                 #model_dict = self.load_models()
                 model_dict = load_models()'''
-            else: model_dict = config_load_models.model_dict
         except OSError as exc_ose:
             tb = traceback.format_exc() #exc.__traceback__
             logger.warning(f"⚠️ OSError: the paging file is too small (to complete reload): {exc_ose}\n{tb}")

@@ -153,30 +147,28 @@ class DocumentConverter:
             tb = traceback.format_exc() #exc.__traceback__
             logger.exception(f"❌ Error loading models (reload): {exc}\n{tb}")
             raise RuntimeError(f"❌ Error loading models (reload): {exc}\n{tb}") #.with_traceback(tb)
-
 
-        #
+        # 4) Instantiate Marker's MarkerConverter (PdfConverter) with config managed by config_parser
         try: # Assign llm_service if api_token. ##SMY: split and slicing ##Gets the string value
             llm_service_str = None if api_token == '' or api_token is None or self.use_llm is False else str(self.llm_service).split("'")[1] #
 
             # sets api_key required by Marker ## to handle Marker's assertion test on OpenAI
-
-
-
+            if llm_service_str:
+                os.environ["OPENAI_API_KEY"] = api_token if api_token and api_token != '' else os.getenv("OPENAI_API_KEY") or os.getenv("GEMINI_API_KEY") or os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACEHUB_API_TOKEN")
+                #logger.log(level=20, msg="self.converter: instantiating MarkerConverter:", extra={"llm_service_str": llm_service_str, "api_token": api_token}) ##debug
 
             config_dict = config_parser.generate_config_dict()
-            #config_dict["pdftext_worker"] = self.max_workers #1 ##SMY:
+            #config_dict["pdftext_worker"] = self.max_workers #1 ##SMY: moved to get_config_dicts()
 
-            #self.converter:
+            #self.converter: marker.converters.pdf.PdfConverter
             self.converter = MarkerConverter(
-                ##artifact_dict=self.artifact_dict,
                 #artifact_dict=create_model_dict(),
                 artifact_dict=model_dict if model_dict else create_model_dict(),
                 config=config_dict,
                 #config=config_parser.generate_config_dict(),
                 #llm_service=self.llm_service ##SMY expecting str but self.llm_service, is service object marker.services of type BaseServices
                 llm_service=llm_service_str, ##resolve
-
+            )
 
             logger.log(level=20, msg="✔️ MarkerConverter instantiated successfully:", extra={"converter.config": str(self.converter.config.get("openai_base_url")), "use_llm":self.converter.use_llm})
             #return self.converter ##SMY: to query why did I comment out?. Bingo: "__init__() should return None, not 'PdfConverter'"

@@ -187,21 +179,20 @@ class DocumentConverter:
 
     # Define the custom configuration for HF LLM.
     def get_config_dict(self, model_id: str, llm_service=MarkerOpenAIService, output_format: Optional[str] = "markdown" ) -> Dict[str, Any]:
-        """ Define the custom configuration for the Hugging Face LLM. """
+        """ Define the custom configuration for the Hugging Face LLM: combining Marker's cli_options and LLM. """
 
         try:
-            ## Enable higher quality processing
-            # llm_service disused here
+            ## LLM Enable higher quality processing. ## See MarkerOpenAIService,
             ##llm_service = llm_service.removeprefix("<class '").removesuffix("'>") # e.g <class 'marker.services.openai.OpenAIService'>
             #llm_service = str(llm_service).split("'")[1] ## SMY: split and slicing
             self.use_llm = self.use_llm[0] if isinstance(self.use_llm, tuple) else self.use_llm
             self.page_range = self.page_range[0] if isinstance(self.page_range, tuple) else self.page_range #if isinstance(self.page_range, str) else None, ##SMY: passing as hint type tuple!
 
-
+            ##SMY: TODO: convert to {inputs} and called from gradio_ui
             config_dict = {
                 "output_format"  : output_format, #"markdown",
                 "openai_model"   : self.model_id, #self.client.model_id, #"model_name"
-                "openai_api_key" : self.
+                "openai_api_key" : self.openai_api_key, #self.client.openai_api_key, #self.api_token,
                 "openai_base_url": self.openai_base_url, #self.client.base_url, #self.base_url,
                 "temperature"    : self.temperature, #self.client.temperature,
                 "top_p"          : self.top_p, #self.client.top_p,

@@ -210,6 +201,7 @@ class DocumentConverter:
                 "max_retries"    : self.max_retries, #3, ## pass to __call__
                 "output_dir"     : self.output_dir,
                 "use_llm"        : self.use_llm, #False, #True,
+                "force_ocr"      : self.force_ocr, #False,
                 "page_range"     : self.page_range, ##debug #len(pdf_file)
             }
             return config_dict

@@ -219,10 +211,6 @@ class DocumentConverter:
             raise RuntimeError(f"❌ Error configuring custom config_dict: {exc}\n{tb}") #").with_traceback(tb)
             #raise
 
-    ''' # create/load models. Called to curtail reloading models at each instance
-    def load_models():
-        return create_model_dict()'''
-
     ##SMY: flagged for deprecation
     ##SMY: marker prefer default artifact dictionary (marker.models.create_model_dict) instead of overridding
     #def get_extraction_converter(self, chat_fn):
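The conditional `pop(...) if ... else None` expressions in `DocumentConverter` drop unset keys one at a time so Marker only receives explicit settings. The same pruning can be written as a single dict comprehension; a sketch under that reading (the helper name `prune_config` is mine, not from the repo):

```python
from typing import Any, Dict

def prune_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Drop entries whose values are falsy (None, False, '', empty tuple)
    or the literal string 'False', keeping only explicit settings."""
    return {key: value for key, value in config.items()
            if value and value != "False"}

cfg = {
    "output_format": "markdown",
    "use_llm": False,       # falsy -> dropped
    "force_ocr": "False",   # string sentinel -> dropped
    "page_range": None,     # unset -> dropped
}
print(prune_config(cfg))  # {'output_format': 'markdown'}
```

This keeps a truthy `page_range` such as `"0,4-8,16"` intact while removing the unset keys in one pass.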
converters/pdf_to_md.py
CHANGED
@@ -1,13 +1,13 @@
 # converters/pdf_to_md.py
 import os
 from pathlib import Path
-from typing import List, Dict,
+from typing import List, Dict, Union, Optional
 import traceback ## Extract, format and print information about Python stack traces.
 import time
 
-
+from ui.gradio_ui import gr
 from converters.extraction_converter import DocumentConverter #, DocumentExtractor #as docextractor #ExtractionConverter #get_extraction_converter ## SMY: should disuse
-from file_handler.file_utils import collect_pdf_paths, collect_html_paths, collect_markdown_paths, create_outputdir
+from file_handler.file_utils import write_markdown, dump_images, collect_pdf_paths, collect_html_paths, collect_markdown_paths, create_outputdir
 
 
 from utils import config

@@ -43,7 +43,9 @@ def init_worker(#self,
     output_format: str, #: str = "markdown",
     output_dir: str, #: Union | None = "output_dir",
     use_llm: bool, #: bool | None = False,
+    force_ocr: bool,
     page_range: str, #: str | None = None
+    progress: gr.Progress = gr.Progress(),
 ):
 
     #'''

@@ -58,35 +60,6 @@ def init_worker(#self,
     # Define global variables
     global docconverter
     global converter
-
-
-    ##SMY: kept for future implementation. Replaced with DocumentConverter.
-    '''
-    # 1) Instantiate the DocumentExtractor
-    logger.log(level=20, msg="initialising docextractor:", extra={"model_id": model_id, "hf_provider": hf_provider})
-    try:
-        docextractor = DocumentExtractor(
-            provider=provider,
-            model_id=model_id,
-            hf_provider=hf_provider,
-            endpoint_url=endpoint_url,
-            backend_choice=backend_choice,
-            system_message=system_message,
-            max_tokens=max_tokens,
-            temperature=temperature,
-            top_p=top_p,
-            stream=stream,
-            api_token=api_token,
-        )
-        logger.log(level=20, msg="✔️ docextractor initialised:", extra={"model_id": model_id, "hf_provider": hf_provider})
-    except Exception as exc:
-        #logger.error(f"Failed to initialise DocumentExtractor: {exc}")
-        tb = traceback.format_exc()
-        logger.exception(f"init_worker: Error initialising DocumentExtractor - {exc}\n{tb}", exc_info=True)
-        return f"❌ init_worker: error initialising DocumentExtractor - {exc}\n{tb}"
-
-    self.docextractor = docextractor
-    '''
 
     #'''
     # 1) Instantiate the DocumentConverter

@@ -105,6 +78,7 @@ def init_worker(#self,
         output_format, #: str = "markdown",
         output_dir, #: Union | None = "output_dir",
         use_llm, #: bool | None = False,
+        force_ocr,
         page_range, #: str | None = None
     )
     logger.log(level=20, msg="✔️ docextractor initialised:", extra={"docconverter model_id": docconverter.converter.config.get("openai_model"), "docconverter use_llm": docconverter.converter.use_llm, "docconverter output_dir": docconverter.output_dir})

@@ -127,8 +101,9 @@ class PdfToMarkdownConverter:
 
     #def __init__(self, options: Dict | None = None):
     def __init__(self, options: Dict | None = None): #extractor: DocumentExtractor, options: Dict | None = None):
-        self.options = options or {}
+        self.options = options or {} ##SMY: TOBE implemented - bring all Marker's options
         self.output_dir_string = ''
+        self.output_dir = self.output_dir_string ## placeholder
         #self.OUTPUT_DIR = config.OUTPUT_DIR ##flag unused
         #self.MAX_RETRIES = config.MAX_RETRIES ##flag unused
         #self.docconverter = None #DocumentConverter

@@ -197,25 +172,28 @@ class PdfToMarkdownConverter:
         return {"file": md_file.name, "images": images_count, "filepath": md_file, "image_path": image_path} ####SMY should be Dict[str, int, str]. Dicts are not necessarily ordered.
 
     #def convert_files(src_path: str, output_dir: str, max_retries: int = 2) -> str:
-    def convert_files(self, src_path: str, output_dir_string: str = None, max_retries: int = 2) -> Union[Dict, str]: #str:
+    #def convert_files(self, src_path: str, output_dir_string: str = None, max_retries: int = 2, progress = gr.Progress()) -> Union[Dict, str]: #str:
+    def convert_files(self, src_path: str, max_retries: int = 2, progress = gr.Progress()) -> Union[Dict, str]:
         #def convert_files(self, src_path: str) -> str:
         """
         Worker task: use `extractor` to convert file with retry/backoff.
         Returns a short log line.
         """
 
-        try:
+        '''try: ##moved to gradio_ui. sets to PdfToMarkdownConverter.output_dir_string
             output_dir = create_outputdir(root=src_path, output_dir_string=self.output_dir_string)
             logger.info(f"✅ output_dir created: {output_dir}") #{create_outputdir(src_path)}"
         except Exception as exc:
             tb = traceback.format_exc()
             logger.exception("❌ error creating output_dir - {exc}\n{tb}", exc_info=True)
-            return f"❌ error creating output_dir - {exc}\n{tb}"
 
         try:
             #if Path(src_path).suffix.lower() not in {".pdf", ".html", ".htm"}:
             #if not Path(src_path).name.endswith(tuple({".pdf", ".html"})): #,".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls"})):
-            if not Path(src_path).name.endswith((".pdf", ".html")): #,".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls"})):
             logger.log(level=20, msg=f"skipped {Path(src_path).name}", exc_info=True)
             return f"skipped {Path(src_path).name}"
         except Exception as exc:

@@ -226,7 +204,8 @@ class PdfToMarkdownConverter:
         #max_retries = self.MAX_RETRIES
         for attempt in range(1, max_retries + 1):
             try:
-                info = self.extract(str(src_path), str(output_dir.stem)) #extractor.converter(str(src_path), str(output_dir)) #
                 logger.log(level=20, msg=f"✅ : info about extracted {Path(src_path).name}: ", extra={"info": str(info)})
                 ''' ##SMY: moving formating to calling Gradio
                 img_count = info.get("images", 0)

@@ -239,7 +218,7 @@ class PdfToMarkdownConverter:
             except Exception as exc:
                 if attempt == max_retries:
                     tb = traceback.format_exc()
-                    return f"❌ {info.get('file')} - {exc}\n{tb}"
                     #return f"❌ {md_filename} - {exc}\n{tb}"
 
                 #time.sleep(2 ** attempt)
|
| 189 |
+
return f"β error creating output_dir β {exc}\n{tb}"'''
|
| 190 |
+
output_dir = Path(self.output_dir) ## takes the value from gradio_ui
|
| 191 |
|
| 192 |
try:
|
| 193 |
#if Path(src_path).suffix.lower() not in {".pdf", ".html", ".htm"}:
|
| 194 |
#if not Path(src_path).name.endswith(tuple({".pdf", ".html"})): #,".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls"})):
|
| 195 |
+
#if not Path(src_path).name.endswith((".pdf", ".html", ".docx", ".doc")): #,".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls"})):
|
| 196 |
+
if not Path(src_path).name.endswith(config.file_types_tuple): #,".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls"})):
|
| 197 |
logger.log(level=20, msg=f"skipped {Path(src_path).name}", exc_info=True)
|
| 198 |
return f"skipped {Path(src_path).name}"
|
| 199 |
except Exception as exc:
|
|
|
|
| 204 |
#max_retries = self.MAX_RETRIES
|
| 205 |
for attempt in range(1, max_retries + 1):
|
| 206 |
try:
|
| 207 |
+
#info = self.extract(str(src_path), str(output_dir.stem)) #extractor.converter(str(src_path), str(output_dir)) #
|
| 208 |
+
info = self.extract(str(src_path), str(output_dir)) #extractor.converter(str(src_path), str(output_dir)) #
|
| 209 |
logger.log(level=20, msg=f"β : info about extracted {Path(src_path).name}: ", extra={"info": str(info)})
|
| 210 |
''' ##SMY: moving formating to calling Gradio
|
| 211 |
img_count = info.get("images", 0)
|
|
|
|
| 218 |
except Exception as exc:
|
| 219 |
if attempt == max_retries:
|
| 220 |
tb = traceback.format_exc()
|
| 221 |
+
return f"β {info.get('file', 'UnboundlocalError: info is None')} β {exc}\n{tb}"
|
| 222 |
#return f"β {md_filename} β {exc}\n{tb}"
|
| 223 |
|
| 224 |
#time.sleep(2 ** attempt)
|
file_handler/file_utils.py CHANGED

@@ -6,15 +6,15 @@ import shutil
 import tempfile

 from itertools import chain
-from typing import List, Union, Any, Mapping
+from typing import List, Optional, Union, Any, Mapping
 from PIL import Image

 #import utils.config as config ##SMY: currently unused

-##SMY:
+##SMY: flagged: deprecated vis duplicated. See create_temp_folder() and marker/marker/config/parser.py ~ https://github.com/datalab-to/marker/blob/master/marker/config/parser.py#L169
 #def create_outputdir(root: Union[str, Path], out_dir:Union[str, Path] = None) -> Path: #List[Path]:
 def create_outputdir(root: Union[str, Path], output_dir_string:str = None) -> Path: #List[Path]:
-    """ Create output dir
+    """ Create output dir default to Temp """

     ''' ##preserved for future implementation if needed again
     root = root if isinstance(root, Path) else Path(root)
@@ -24,10 +24,12 @@ def create_outputdir(root: Union[str, Path], output_dir_string:str = None) -> Pa
     out_dir = out_dir if out_dir else "output_md" ## SMY: default to outputdir in config file = "output_md"
     output_dir = root.parent / out_dir #"md_output" ##SMY: concatenating output str with src Path
     '''
+    root = create_temp_folder()

     ## map to img_path. Opt to putting output within same output_md folder rather than individual source folders
     output_dir_string = output_dir_string if output_dir_string else "output_dir" ##redundant SMY: default to outputdir in config file = "output_md"
-    output_dir = Path("data") / output_dir_string #"output_md" ##SMY: concatenating output str with src Path
+    #output_dir = Path("data") / output_dir_string #"output_md" ##SMY: concatenating output str with src Path
+    output_dir = Path(root) / output_dir_string #"output_md" ##SMY: concatenating output str with src Path
     output_dir.mkdir(mode=0o2755, parents=True, exist_ok=True) #,mode=0o2755
     return output_dir

@@ -225,6 +227,17 @@ def check_create_file(filename: Union[str, Path]) -> Path:

     return filename_path

+def create_temp_folder(tempfolder: Optional[str | Path] = ''):
+    """ Create a temp folder Gradio and output_dir if supplied"""
+    # Create a temporary directory in a location where Gradio can access it.
+    #gradio_output_dir = Path(tempfile.gettempdir()) / "gradio_temp_output"/ tempfolder if tempfolder else Path(tempfile.gettempdir()) / "gradio_temp_output"
+    #gradio_output_dir.mkdir(exist_ok=True)
+    #gradio_output_dir = check_create_dir(gradio_output_dir)
+
+    gradio_output_dir = check_create_dir(Path(tempfile.gettempdir()) / "gradio_temp_output" / tempfolder if tempfolder else Path(tempfile.gettempdir()) / "gradio_temp_output")
+
+    return gradio_output_dir
+
 def zip_processed_files(root_dir: str, file_paths: list[str], tz_hours=None, date_format='%d%b%Y_%H-%M-%S') -> Path:
     """
     Creates a zip file from a list of file paths (strings) and returns the Path object.
@@ -247,11 +260,14 @@ def zip_processed_files(root_dir: str, file_paths: list[str], tz_hours=None, dat
     raise ValueError(f"Root directory does not exist: {root_path}")

     # Create a temporary directory in a location where Gradio can access it.
-    gradio_output_dir = Path(tempfile.gettempdir()) / "gradio_temp_output"
+    ##SMY: synced with create_temp_folder()
+    '''gradio_output_dir = Path(tempfile.gettempdir()) / "gradio_temp_output"
     #gradio_output_dir.mkdir(exist_ok=True)
     file_utils.check_create_dir(gradio_output_dir)
     final_zip_path = gradio_output_dir / f"outputs_processed_{utils.get_time_now_str(tz_hours=tz_hours, date_format=date_format)}.zip"
+    '''
+    final_zip_path = Path(root_dir).parent / f"outputs_processed_{utils.get_time_now_str(tz_hours=tz_hours, date_format=date_format)}.zip"
+
     # Use a context manager to create the zip file: use zipfile() opposed to shutil.make_archive
     # 'w' mode creates a new file, overwriting if it already exists.
     zip_unprocessed = 0

@@ -442,7 +458,7 @@ def write_markdown(
     Notes
     -----
     The function is intentionally lightweight: it only handles path resolution,
-    directory creation, and file I/O. All rendering logic
+    directory creation, and file I/O. All rendering logic are performed before
     calling this helper.
     """
     src = Path(src_path)

@@ -460,9 +476,11 @@ def write_markdown(
     ## Opt to putting output within same output_md folder rather than individual source folders
     #md_path = Path("data\\pdf") / "output_md" / f"{src.stem}" / md_name ##debug
-    md_path = Path("data") / output_dir / f"{src.stem}" / md_name ##debug
+    #md_path = Path("data") / output_dir / f"{src.stem}" / md_name ##debug
+    md_path = Path(output_dir) / f"{src.stem}" / md_name ##debug
     ##SMY: [resolved] Permission Errno13 - https://stackoverflow.com/a/57454275
-    md_path.parent.mkdir(mode=0o2755, parents=True, exist_ok=True) ##SMY: create nested md_path if not exists
+    #md_path.parent.mkdir(mode=0o2755, parents=True, exist_ok=True) ##SMY: create nested md_path if not exists
+    md_path.parent.mkdir(parents=True, exist_ok=True) ##SMY: md_path now resides in Temp
     md_path.parent.chmod(0)

     try:

@@ -531,7 +549,8 @@ def dump_images(
     #img_path = Path(src.parent) / f"{Path(output_dir).stem}" / f"{src.stem}" / img_name

     #img_path = Path("data\\pdf") / "output_md" / f"{src.stem}" / img_name ##debug
-    img_path = Path("data") / output_dir / f"{src.stem}" / img_name ##debug
+    #img_path = Path("data") / output_dir / f"{src.stem}" / img_name ##debug
+    img_path = Path(output_dir) / f"{src.stem}" / img_name
     #img_path.mkdir(mode=0o777, parents=True, exist_ok=True) ##SMY: create nested img_path if not exists
     #img_path.parent.mkdir(parents=True, exist_ok=True)
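The new `create_temp_folder()` helper moves all output under the system temp directory so Gradio can serve the files it produces. A standalone sketch of the same idea, with a hypothetical helper name and without this repo's `check_create_dir`:

```python
# Minimal sketch: build a Gradio-servable output folder under the system temp
# directory, optionally nested one level deeper, creating it if missing.
import tempfile
from pathlib import Path

def make_temp_output_dir(subfolder: str = "") -> Path:
    base = Path(tempfile.gettempdir()) / "gradio_temp_output"
    target = base / subfolder if subfolder else base
    target.mkdir(parents=True, exist_ok=True)  # idempotent: safe to call repeatedly
    return target

out = make_temp_output_dir("run1")
print(out.is_dir())  # True
```

Writing under `tempfile.gettempdir()` also sidesteps the `Permission Errno13` issue the diff comments mention, since the temp directory is writable without custom modes.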
llm/llm_login.py CHANGED

@@ -5,6 +5,7 @@ from time import sleep
 from typing import Optional

 from utils.logger import get_logger
+from ui.gradio_ui import gr

 ## Get logger instance
 logger = get_logger(__name__)

@@ -14,6 +15,19 @@ def disable_immplicit_token():
     # Explicitly disable implicit token propagation; we rely on explicit auth or env var
     os.environ["HF_HUB_DISABLE_IMPLICIT_TOKEN"] = "1"

+#def get_login_token( api_token_arg, oauth_token: gr.OAuthToken | None=None,):
+def get_login_token( api_token_arg, oauth_token):
+    """ Use user's supplied token or Get token from logged-in users, else from token stored on the machine. Return token"""
+    #oauth_token = get_token() if oauth_token is not None else api_token_arg
+    if api_token_arg != '': # or not None: #| None:
+        oauth_token = api_token_arg
+    elif oauth_token:
+        oauth_token = oauth_token.token
+    else: oauth_token = '' if not get_token() else get_token()
+
+    #return str(oauth_token) if oauth_token else '' ##token value or empty string
+    return oauth_token if oauth_token else '' ##token value or empty string
+
 def login_huggingface(token: Optional[str] = None):
     """
     Login to Hugging Face account. Prioritize CLI login for privacy and determinism.
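The relocated `get_login_token()` encodes a simple precedence: an explicitly supplied token wins, then the Gradio OAuth token, then whatever token is already stored on the machine. A dependency-free sketch of that order (`resolve_token` and `FakeOAuth` are illustrative stand-ins; the real code uses `gr.OAuthToken` and `huggingface_hub.get_token`):

```python
# Sketch of the token-resolution order implemented by get_login_token above:
# explicit argument > OAuth token from the logged-in session > stored token.
from typing import Optional

def resolve_token(explicit: str, oauth_token: Optional[object],
                  stored_token: Optional[str]) -> str:
    if explicit != '':
        return explicit              # user-supplied token wins
    if oauth_token is not None:
        return oauth_token.token     # token from the Gradio OAuth login
    return stored_token or ''        # fall back to stored token, else ''

class FakeOAuth:                     # stand-in for gr.OAuthToken
    def __init__(self, token): self.token = token

print(resolve_token('abc', FakeOAuth('x'), 'y'))  # abc
print(resolve_token('', FakeOAuth('x'), 'y'))     # x
```

Keeping this pure function out of the UI module (as the commit does) makes the precedence testable without spinning up Gradio.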
requirements.txt CHANGED

@@ -1,5 +1,8 @@
-gradio>=5.44.0
-
-
-
-
+gradio>=5.44.0 # gradio[mcp]>=5.44.0
+#mcp>=1.15.0 # MCP Python SDK (Model Context Protocol)
+marker-pdf[full]>=1.10.0 # pip install marker (GitHub: https://github.com/datalab-to/marker)
+weasyprint>=59.0 # optional fallback if pandoc is not available
+#pandoc==2.3 # for Markdown → PDF conversion
+python-magic==0.4.27 # file-type detection
+#pypdfium2 # Python binding to PDFium for PDF rendering, inspection, manipulation and creation
+#huggingface_hub>=0.34.0 # HuggingFace integration
ui/gradio_ui.py
CHANGED
|
@@ -1,4 +1,5 @@
|
|
| 1 |
# ui/gradio_ui.py
|
|
|
|
| 2 |
import gradio as gr
|
| 3 |
from concurrent.futures import ProcessPoolExecutor, as_completed
|
| 4 |
import asyncio
|
|
@@ -7,23 +8,21 @@ from pathlib import Path, WindowsPath
|
|
| 7 |
from typing import Optional, Union #, Dict, List, Any, Tuple
|
| 8 |
|
| 9 |
from huggingface_hub import get_token
|
| 10 |
-
from numpy import append, iterable
|
| 11 |
|
| 12 |
#import file_handler
|
|
|
|
| 13 |
import file_handler.file_utils
|
| 14 |
-
from utils.config import TITLE, DESCRIPTION, DESCRIPTION_PDF_HTML, DESCRIPTION_PDF, DESCRIPTION_HTML, DESCRIPTION_MD
|
| 15 |
from utils.utils import is_dict, is_list_of_dicts
|
| 16 |
from file_handler.file_utils import zip_processed_files, process_dicts_data, collect_pdf_paths, collect_html_paths, collect_markdown_paths, create_outputdir ## should move to handling file
|
| 17 |
from file_handler.file_utils import find_file
|
| 18 |
from utils.get_config import get_config_value
|
| 19 |
|
| 20 |
-
#from llm.hf_client import HFChatClient ## SMY: unused. See converters.extraction_converter
|
| 21 |
from llm.provider_validator import is_valid_provider, suggest_providers
|
| 22 |
-
from llm.llm_login import is_loggedin_huggingface, login_huggingface
|
| 23 |
from converters.extraction_converter import DocumentConverter as docconverter #DocumentExtractor #as docextractor
|
| 24 |
from converters.pdf_to_md import PdfToMarkdownConverter, init_worker
|
| 25 |
-
#from converters.md_to_pdf import MarkdownToPdfConverter
|
| 26 |
-
#from converters.html_to_md import HtmlToMarkdownConverter ##SMY: PENDING: implementation
|
| 27 |
|
| 28 |
import traceback ## Extract, format and print information about Python stack traces.
|
| 29 |
from utils.logger import get_logger
|
|
@@ -32,7 +31,6 @@ logger = get_logger(__name__) ##NB: setup_logging() ## set logging
|
|
| 32 |
|
| 33 |
# Instantiate converters class once β they are stateless
|
| 34 |
pdf2md_converter = PdfToMarkdownConverter()
|
| 35 |
-
#html2md_converter = HtmlToMarkdownConverter()
|
| 36 |
#md2pdf_converter = MarkdownToPdfConverter()
|
| 37 |
|
| 38 |
|
|
@@ -42,25 +40,18 @@ from converters.extraction_converter import load_models
|
|
| 42 |
from globals import config_load_models
|
| 43 |
try:
|
| 44 |
if not config_load_models.model_dict:
|
| 45 |
-
|
|
|
|
| 46 |
'''if 'model_dict' not in globals():
|
| 47 |
global model_dict
|
| 48 |
model_dict = load_models()'''
|
|
|
|
| 49 |
except Exception as exc:
|
| 50 |
#tb = traceback.format_exc() #exc.__traceback__
|
| 51 |
logger.exception(f"β Error loading models (reload): {exc}") #\n{tb}")
|
| 52 |
raise RuntimeError(f"β Error loading models (reload): {exc}") #\n{tb}")
|
| 53 |
|
| 54 |
-
def get_login_token( api_token_arg, oauth_token: gr.OAuthToken | None=None,):
|
| 55 |
-
""" Use user's supplied token or Get token from logged-in users, else from token stored on the machine. Return token"""
|
| 56 |
-
#oauth_token = get_token() if oauth_token is not None else api_token_arg
|
| 57 |
-
if api_token_arg != '': # or not None: #| None:
|
| 58 |
-
oauth_token = api_token_arg
|
| 59 |
-
elif oauth_token:
|
| 60 |
-
oauth_token = oauth_token
|
| 61 |
-
else: get_token()
|
| 62 |
-
|
| 63 |
-
return oauth_token.token if oauth_token else '' ##token value or empty string
|
| 64 |
|
| 65 |
# pool executor to convert files called by Gradio
|
| 66 |
##SMY: TODO: future: refactor to gradio_process.py and
|
|
@@ -90,6 +81,7 @@ def convert_batch(
|
|
| 90 |
#output_dir: Optional[Union[str, Path]] = "output_dir",
|
| 91 |
output_dir_string: str = "output_dir_default",
|
| 92 |
use_llm: bool = False, #Optional[bool] = False, #True,
|
|
|
|
| 93 |
page_range: str = None, #Optional[str] = None,
|
| 94 |
tz_hours: str = None,
|
| 95 |
oauth_token: gr.OAuthToken | None=None,
|
|
@@ -103,15 +95,16 @@ def convert_batch(
|
|
| 103 |
"""
|
| 104 |
|
| 105 |
# login: Update the Gradio UI to improve user-friendly eXperience - commencing
|
| 106 |
-
#
|
| 107 |
-
|
|
|
|
| 108 |
|
| 109 |
# get token from logged-in user:
|
| 110 |
api_token = get_login_token(api_token_arg=api_token_gr, oauth_token=oauth_token)
|
| 111 |
##SMY: Strictly debug. Must not be live
|
| 112 |
-
#logger.log(level=30, msg="Commencing: get_login_token", extra={"api_token
|
| 113 |
|
| 114 |
-
try:
|
| 115 |
##SMY: might deprecate. To replace with oauth login from Gradio ui or integrate cleanly.
|
| 116 |
#login_huggingface(api_token) ## attempt login if not already logged in. NB: HF CLI login prompt would not display in Process Worker.
|
| 117 |
|
|
@@ -131,9 +124,8 @@ def convert_batch(
|
|
| 131 |
tb = traceback.format_exc()
|
| 132 |
logger.exception(f"β Error during login_huggingface β {exc}\n{tb}", exc_info=True) # Log the full traceback
|
| 133 |
return [gr.update(interactive=True), f"β An error occurred during login_huggingface β {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"] # return the exception message
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
## debug
|
| 138 |
#logger.log(level=30, msg="pdf_files_inputs", extra={"input_arg[0]:": pdf_files[0]})
|
| 139 |
|
|
@@ -143,22 +135,23 @@ def convert_batch(
|
|
| 143 |
#outputs=[log_output, files_individual_JSON, files_individual_downloads],
|
| 144 |
return [gr.update(interactive=True), "Initialising ProcessPool: No files uploaded.", {"Upload":"No files uploaded"}, f"dummy_log.log"]
|
| 145 |
|
| 146 |
-
|
| 147 |
# Get config values if not provided
|
| 148 |
-
config_file = find_file("config.ini") ##from file_handler.file_utils
|
| 149 |
-
model_id = get_config_value(config_file, "MARKER_CAP", "MODEL_ID")
|
| 150 |
-
openai_base_url = get_config_value(config_file, "MARKER_CAP", "OPENAI_BASE_URL")
|
| 151 |
-
openai_image_format = get_config_value(config_file, "MARKER_CAP", "OPENAI_IMAGE_FORMAT")
|
| 152 |
-
max_workers = get_config_value(config_file, "MARKER_CAP", "MAX_WORKERS")
|
| 153 |
-
max_retries = get_config_value(config_file, "MARKER_CAP", "MAX_RETRIES")
|
| 154 |
-
output_format = get_config_value(config_file, "MARKER_CAP", "OUTPUT_FORMAT")
|
| 155 |
-
output_dir_string = str(get_config_value(config_file, "MARKER_CAP", "OUTPUT_DIR")
|
| 156 |
-
use_llm = get_config_value(config_file, "MARKER_CAP", "USE_LLM")
|
| 157 |
-
page_range = get_config_value(config_file,"MARKER_CAP", "PAGE_RANGE")
|
| 158 |
-
|
|
|
|
| 159 |
|
| 160 |
# Create the initargs tuple from the Gradio inputs: # 'files' is an iterable, and handled separately.
|
| 161 |
-
|
| 162 |
yield gr.update(interactive=False), f"Initialising init_args", {"process": "Processing files ..."}, f"dummy_log.log"
|
| 163 |
init_args = (
|
| 164 |
provider,
|
|
@@ -180,83 +173,91 @@ def convert_batch(
|
|
| 180 |
output_format,
|
| 181 |
output_dir_string,
|
| 182 |
use_llm,
|
|
|
|
| 183 |
page_range,
|
| 184 |
)
|
| 185 |
|
| 186 |
-
#
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 187 |
try:
|
| 188 |
results = [] ## initialised pool result holder
|
| 189 |
-
# Create a pool with init_worker initialiser
|
| 190 |
logger.log(level=30, msg="Initialising ProcessPoolExecutor: pool:", extra={"pdf_files": pdf_files, "files_len": len(pdf_files), "model_id": model_id, "output_dir": output_dir_string}) #pdf_files_count
|
| 191 |
-
#progress((5,16), desc=f"Initialising ProcessPoolExecutor: Processing Files ...")
|
| 192 |
yield gr.update(interactive=False), f"Initialising ProcessPoolExecutor: Processing Files ...", {"process": "Processing files ..."}, f"dummy_log.log"
|
|
|
|
| 193 |
|
|
|
|
| 194 |
with ProcessPoolExecutor(
|
| 195 |
max_workers=max_workers,
|
| 196 |
initializer=init_worker,
|
| 197 |
initargs=init_args
|
| 198 |
) as pool:
|
| 199 |
-
|
| 200 |
-
|
| 201 |
-
|
| 202 |
-
# Update the Gradio UI to improve user-friendly eXperience
|
| 203 |
-
#outputs=[process_button, log_output, files_individual_JSON, files_individual_downloads],
|
| 204 |
-
|
| 205 |
-
|
| 206 |
# Map the files (pdf_files) to the conversion function (pdf2md_converter.convert_file)
|
| 207 |
# The 'docconverter' argument is implicitly handled by the initialiser
|
| 208 |
#futures = [pool.map(pdf2md_converter.convert_files, f) for f in pdf_files]
|
| 209 |
#logs = [f.result() for f in as_completed(futures)]
|
| 210 |
#futures = [pool.submit(pdf2md_converter.convert_files, file) for file in pdf_files]
|
| 211 |
#logs = [f.result() for f in futures]
|
| 212 |
-
|
| 213 |
try:
|
| 214 |
-
#(7,16), desc=f"ProcessPoolExecutor: Creating output_dir")
|
| 215 |
-
yield gr.update(interactive=False), f"Creating output_dir ...", {"process": "Processing files ..."}, f"dummy_log.log"
|
| 216 |
-
pdf2md_converter.output_dir_string = output_dir_string ##SMY: attempt setting directly to resolve pool.map iterable
|
| 217 |
-
#progress((8,16), desc=f"ProcessPoolExecutor: Created output_dir.")
|
| 218 |
-
yield gr.update(interactive=False), f"Created output_dir ...", {"process": "Processing files ..."}, f"dummy_log.log"
|
| 219 |
-
|
| 220 |
-
except Exception as exc:
|
| 221 |
-
# Raise the exception to stop the Gradio app: exception to halt execution
|
| 222 |
-
logger.exception("Error during creating output_dir", exc_info=True) # Log the full traceback
|
| 223 |
-
traceback.print_exc() # Print the exception traceback
|
| 224 |
-
#return f"An error occurred during pool.map: {str(exc)}", f"Error: {exc}", f"Error: {exc}" ## return the exception message
|
| 225 |
-
# Update the Gradio UI to improve user-friendly eXperience
|
| 226 |
-
yield gr.update(interactive=True), f"An error occurred creating output_dir: {str(exc)}", {"Error":f"Error: {exc}"}, f"dummy_log.log" ## return the exception message
|
| 227 |
-
|
| 228 |
-
try:
|
| 229 |
-
#progress((9,16), desc=f"ProcessPoolExecutor: Pooling file conversion ...")
|
| 230 |
yield gr.update(interactive=True), f"ProcessPoolExecutor: Pooling file conversion ...", {"process": "Processing files ..."}, f"dummy_log.log"
|
|
|
|
|
|
|
| 231 |
# Use progress.tqdm to integrate with the executor map
|
| 232 |
#results = pool.map(pdf2md_converter.convert_files, pdf_files) ##SMY iterables #max_retries #output_dir_string)
|
| 233 |
for result_interim in progress.tqdm(
|
| 234 |
-
iterable=pool.map(pdf2md_converter.convert_files, pdf_files), total=len(pdf_files)
|
| 235 |
):
|
| 236 |
results.append(result_interim)
|
| 237 |
-
#progress((10,16), desc=f"ProcessPoolExecutor: Pooling file conversion result: [{str(result_interim)}[:20]]")
|
| 238 |
# Update the Gradio UI to improve user-friendly eXperience
|
| 239 |
yield gr.update(interactive=True), f"ProcessPoolExecutor: Pooling file conversion result: [{str(result_interim)}[:20]]", {"process": "Processing files ..."}, f"dummy_log.log"
|
|
|
|
| 240 |
|
| 241 |
-
|
| 242 |
-
|
| 243 |
except Exception as exc:
|
| 244 |
# Raise the exception to stop the Gradio app: exception to halt execution
|
| 245 |
logger.exception("Error during pooling file conversion", exc_info=True) # Log the full traceback
|
| 246 |
-
traceback.print_exc() # Print the exception traceback
|
| 247 |
-
return [gr.update(interactive=True), f"An error occurred during pool.map: {str(exc)}", {"Error":f"Error: {exc}"}, f"dummy_log.log"] ## return the exception message
|
| 248 |
# Update the Gradio UI to improve user-friendly eXperience
|
| 249 |
-
|
|
|
|
| 250 |
|
| 251 |
-
#
|
| 252 |
try:
|
|
|
|
| 253 |
logger.log(level=20, msg="ProcessPoolExecutor pool result:", extra={"results": str(results)})
|
|
|
|
| 254 |
logs = []
|
| 255 |
logs_files_images = []
|
|
|
|
| 256 |
#logs.extend(results) ## performant pythonic
|
| 257 |
#logs = list[results] ##
|
| 258 |
logs = [result for result in results] ## pythonic list comprehension
|
| 259 |
-
## logs : [file , images , filepath, image_path]
|
| 260 |
|
| 261 |
#logs_files_images = logs_files.extend(logs_images) #zip(logs_files, logs_images) ##SMY: in progress
|
| 262 |
logs_count = 0
|
|
@@ -268,64 +269,48 @@ def convert_batch(
|
|
| 268 |
# Update the Gradio UI to improve user-friendly eXperience
|
| 269 |
#yield gr.update(interactive=False), f"Processing files: {logs_files_images[logs_count]}", {"process": "Processing files"}, f"dummy_log.log"
|
| 270 |
logs_count = i+i_image
|
| 271 |
-
|
| 272 |
-
#progress((12,16), desc="Processing results from files conversion") ##rekickin
|
| 273 |
-
#logs_files_images.append(logs_filepath) ## to del
|
| 274 |
-
#logs_files_images.extend(logs_images) ## to del
|
| 275 |
except Exception as exc:
|
| 276 |
-
|
| 277 |
-
|
| 278 |
return [gr.update(interactive=True), f"An error occurred during processing results logs: {str(exc)}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"] ## return the exception message
|
| 279 |
#yield gr.update(interactive=True), f"An error occurred during processing results logs: {str(exc)}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log" ## return the exception message
|
| 280 |
-
|
| 281 |
-
#'''
|
| 282 |
except Exception as exc:
|
| 283 |
tb = traceback.format_exc()
|
| 284 |
logger.exception(f"β Error during ProcessPoolExecutor β {exc}\n{tb}" , exc_info=True) # Log the full traceback
|
| 285 |
#traceback.print_exc() # Print the exception traceback
|
| 286 |
-
yield gr.update(interactive=True), f"β An error occurred during ProcessPoolExecutorβ {exc}
|
| 287 |
|
| 288 |
-
|
| 289 |
-
logger.log(level=20, msg="ProcessPoolExecutor pool result:", extra={"results": str(results)})
|
| 290 |
-
logs = []
|
| 291 |
-
#logs.extend(results) ## performant pythonic
|
| 292 |
-
#logs = list[results] ##
|
| 293 |
-
logs = [result for result in results] ## pythonic list comprehension
|
| 294 |
-
'''
|
| 295 |
-
|
| 296 |
-
# Zip Processed md Files and images. Insert to first index
|
| 297 |
try: ##from file_handler.file_utils
|
| 298 |
-
|
| 299 |
-
zipped_processed_files = zip_processed_files(root_dir=f"
|
| 300 |
logs_files_images.insert(0, zipped_processed_files)
|
| 301 |
-
            #logs_files_images.insert(1, "====================")

-
            #yield gr.update(interactive=False), f"Processing zip and files: {logs_files_images}", {"process": "Processing files"}, f"dummy_log.log"

        except Exception as exc:
            tb = traceback.format_exc()
            logger.exception(f"❌ Error during zipping processed files — {exc}\n{tb}", exc_info=True)  # Log the full traceback
            #traceback.print_exc()  # Print the exception traceback
-           #return gr.update(interactive=True), f"❌ An error occurred during zipping files — {exc}\n{tb}", f"Error: {exc}", f"Error: {exc}"  # return the exception message
            yield gr.update(interactive=True), f"❌ An error occurred during zipping files — {exc}\n{tb}", {"Error": f"Error: {exc}"}, f"dummy_log.log"  # return the exception message

        # Return processed files log
        try:
-
            ## Convert logs list of dicts to formatted json string
            logs_return_formatted_json_string = file_handler.file_utils.process_dicts_data(logs)  ##SMY outputs to gr.JSON component with no need for json.dumps(data, indent=)
-           #logs_files_images_return = "\n".join(path for path in logs_files_images)  ##TypeError: sequence item 0: expected str instance, WindowsPath found
-
-           ##convert the List of Path objects to List of string for gr.Files output
-           #logs_files_images_return = list(str(path) for path in logs_files_images)

            ## Convert any Path objects to strings, but leave strings as-is
            logs_files_images_return = list(str(path) if isinstance(path, Path) else path for path in logs_files_images)
            logger.log(level=20, msg="File conversion complete. Sending outcome to Gradio:", extra={"logs_files_image_return": str(logs_files_images_return)})  ## debug: FileNotFoundError: [WinError 2] The system cannot find the file specified: 'Error or no image_path'

-
            #outputs=[process_button, log_output, files_individual_JSON, files_individual_downloads],
            #return "\n".join(logs), "\n".join(logs_files_images)  #"\n".join(logs_files)
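The hunk above swaps a single `return` for `yield` in the error path, so the handler keeps streaming status tuples to Gradio instead of ending the event. A minimal sketch of that generator pattern without Gradio, with dicts standing in for `gr.update(...)` and illustrative names throughout:

```python
# Sketch of the yield-based status pattern: a generator event handler emits
# (button_state, status, payload, log_path) tuples as work progresses,
# instead of one return at the end. Names are illustrative, not the app's API.

def convert_with_status(files):
    """Yield UI updates while processing; the caller streams each tuple."""
    yield ({"interactive": False}, "Commencing processing ...", {"process": "start"}, "dummy_log.log")
    try:
        results = [f.upper() for f in files]  # stand-in for the real conversion
        yield ({"interactive": True}, f"Done: {len(results)} file(s)", {"results": results}, "dummy_log.log")
    except Exception as exc:
        yield ({"interactive": True}, f"Error: {exc}", {"Error": str(exc)}, "dummy_log.log")

updates = list(convert_with_status(["a.pdf", "b.pdf"]))
```

Gradio treats a generator event handler as a stream: each yielded tuple updates the bound outputs, which is why the commit prefers `yield` over `return` inside the progress-reporting path.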
@@ -338,8 +323,8 @@ def convert_batch(
            tb = traceback.format_exc()
            logger.exception(f"❌ Error during returning result logs — {exc}\n{tb}", exc_info=True)  # Log the full traceback
            #traceback.print_exc()  # Print the exception traceback
-
-

        #return "\n".join(log for log in logs), "\n".join(str(path) for path in logs_files_images)
        #print(f'logs_files_images: {"\n".join(str(path) for path in logs_files_images)}')
@@ -517,7 +502,7 @@ def build_interface() -> gr.Blocks:
        #message = f"Accumulated {len(updated_files)} file(s) total.\n\nAll file paths:\n{file_info}"
        message = f"Accumulated {len(updated_files)} file(s) total: \n{filename_info}"

-       return updated_files, message

    # with gr.Blocks(title=TITLE) as demo
    with gr.Blocks(title=TITLE, css=custom_css) as demo:
@@ -592,7 +577,7 @@ def build_interface() -> gr.Blocks:
            )

            # Clean UI: Model parameters hidden in collapsible accordion
-           with gr.Accordion("⚙️ Marker Settings", open=False):
                gr.Markdown(f"#### **Marker Configuration**")
                with gr.Row():
                    openai_base_url_tb = gr.Textbox(
@@ -607,7 +592,7 @@ def build_interface() -> gr.Blocks:
                        value="webp",
                    )
                    output_format_dd = gr.Dropdown(
-                       choices=["markdown", "html"],  #, "json", "chunks"], ##SMY: To be enabled later
                        #choices=["markdown", "html", "json", "chunks"],
                        label="Output Format",
                        value="markdown",
@@ -633,10 +618,15 @@ def build_interface() -> gr.Blocks:
                        value=2,
                        step=1  #0.01
                    )
-
-
-
-
                    page_range_tb = gr.Textbox(
                        label="Page Range (Optional)",
                        placeholder="Example: 0,1-5,8,12-15",
@@ -677,13 +667,14 @@ def build_interface() -> gr.Blocks:
            btn_pdf_convert = gr.Button("Convert PDF(s)")
        '''

        with gr.Column(elem_classes=["file-or-directory-area"]):
            with gr.Row():
                file_btn = gr.UploadButton(
                #file_btn = gr.File(
                    label="Upload Multiple Files",
                    file_count="multiple",
-                   file_types=["file"],
                    #height=25, #"sm",
                    size="sm",
                    elem_classes=["gradio-upload-btn"]
@@ -692,7 +683,7 @@ def build_interface() -> gr.Blocks:
                #dir_btn = gr.File(
                    label="Upload a Directory",
                    file_count="directory",
-
                    #height=25, #"0.5",
                    size="sm",
                    elem_classes=["gradio-upload-btn"]
@@ -702,8 +693,8 @@ def build_interface() -> gr.Blocks:
            output_textbox = gr.Textbox(label="Accumulated Files", lines=3)  #, max_lines=4) #10

            with gr.Row():
-               process_button = gr.Button("Process All Uploaded Files", variant="primary")
-               clear_button = gr.Button("Clear All Uploads", variant="secondary")


        # --- PDF → Markdown tab ---
@@ -890,8 +881,10 @@ def build_interface() -> gr.Blocks:
            """
            #msg = f"Files list cleared: {do_logout()}" ## use as needed
            msg = f"Files list cleared."
-           yield [], msg, '', ''
            #return [], f"Files list cleared.", [], []

        #hf_login_logout_btn.click(fn=custom_do_logout, inputs=None, outputs=hf_login_logout_btn)
        ##unused
@@ -905,14 +898,14 @@ def build_interface() -> gr.Blocks:
        file_btn.upload(
            fn=accumulate_files,
            inputs=[file_btn, uploaded_file_list],
-           outputs=[uploaded_file_list, output_textbox]
        )

        # Event handler for the directory upload button
        dir_btn.upload(
            fn=accumulate_files,
            inputs=[dir_btn, uploaded_file_list],
-           outputs=[uploaded_file_list, output_textbox]
        )

        # Event handler for the "Clear" button
@@ -957,6 +950,7 @@ def build_interface() -> gr.Blocks:
            output_format_dd,
            output_dir_tb,
            use_llm_cb,
            page_range_tb,
            tz_hours_num,  #state_tz_hours
        ]
  1      # ui/gradio_ui.py
  2  +   from ast import Interactive
  3      import gradio as gr
  4      from concurrent.futures import ProcessPoolExecutor, as_completed
  5      import asyncio

  8      from typing import Optional, Union  #, Dict, List, Any, Tuple
  9
 10      from huggingface_hub import get_token
 11
 12      #import file_handler
 13  +   from file_handler import file_utils
 14      import file_handler.file_utils
 15  +   from utils.config import TITLE, DESCRIPTION, DESCRIPTION_PDF_HTML, DESCRIPTION_PDF, DESCRIPTION_HTML, DESCRIPTION_MD, file_types_list, file_types_tuple
 16      from utils.utils import is_dict, is_list_of_dicts
 17      from file_handler.file_utils import zip_processed_files, process_dicts_data, collect_pdf_paths, collect_html_paths, collect_markdown_paths, create_outputdir  ## should move to handling file
 18      from file_handler.file_utils import find_file
 19      from utils.get_config import get_config_value
 20
 21      from llm.provider_validator import is_valid_provider, suggest_providers
 22  +   from llm.llm_login import get_login_token, is_loggedin_huggingface, login_huggingface
 23      from converters.extraction_converter import DocumentConverter as docconverter  #DocumentExtractor #as docextractor
 24      from converters.pdf_to_md import PdfToMarkdownConverter, init_worker
 25  +   #from converters.md_to_pdf import MarkdownToPdfConverter  ##SMY: PENDING: implementation
 26
 27      import traceback  ## Extract, format and print information about Python stack traces.
 28      from utils.logger import get_logger

 31
 32      # Instantiate converters class once — they are stateless
 33      pdf2md_converter = PdfToMarkdownConverter()
 34      #md2pdf_converter = MarkdownToPdfConverter()
 35
 36

 40      from globals import config_load_models
 41      try:
 42          if not config_load_models.model_dict:
 43  +           model_dict = load_models()
 44  +           config_load_models.model_dict = model_dict
 45          '''if 'model_dict' not in globals():
 46              global model_dict
 47              model_dict = load_models()'''
 48  +       logger.log(level=30, msg="Config_load_model: ", extra={"model_dict": str(model_dict)})
 49      except Exception as exc:
 50          #tb = traceback.format_exc()  #exc.__traceback__
 51          logger.exception(f"❌ Error loading models (reload): {exc}")  #\n{tb}")
 52          raise RuntimeError(f"❌ Error loading models (reload): {exc}")  #\n{tb}")
 53
 54  +   #def get_login_token( api_token_arg, oauth_token: gr.OAuthToken | None=None,):  ##moved to llm_login

 55
 56      # pool executor to convert files called by Gradio
 57      ##SMY: TODO: future: refactor to gradio_process.py and
 81      #output_dir: Optional[Union[str, Path]] = "output_dir",
 82      output_dir_string: str = "output_dir_default",
 83      use_llm: bool = False,  #Optional[bool] = False, #True,
 84  +   force_ocr: bool = True,  #Optional[bool] = False,
 85      page_range: str = None,  #Optional[str] = None,
 86      tz_hours: str = None,
 87      oauth_token: gr.OAuthToken | None=None,
 95      """
 96
 97      # login: Update the Gradio UI to improve user-friendly eXperience - commencing
 98  +   # [template]: #outputs=[process_button, log_output, files_individual_JSON, files_individual_downloads],
 99  +   yield gr.update(interactive=False), f"Commencing Processing ... Getting login", {"process": "Commencing Processing"}, f"dummy_log.log"
100  +   progress((0,16), f"Commencing Processing ...")
101
102      # get token from logged-in user:
103      api_token = get_login_token(api_token_arg=api_token_gr, oauth_token=oauth_token)
104      ##SMY: Strictly debug. Must not be live
105  +   #logger.log(level=30, msg="Commencing: get_login_token", extra={"api_token": api_token, "api_token_gr": api_token_gr})
106
107  +   '''try:
108      ##SMY: might deprecate. To replace with oauth login from Gradio ui or integrate cleanly.
109      #login_huggingface(api_token)  ## attempt login if not already logged in. NB: HF CLI login prompt would not display in Process Worker.
110
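`get_login_token` (moved into `llm/llm_login.py` by this commit) resolves a token at line 103 from either the explicit textbox argument or the Gradio OAuth session. A hedged sketch of that resolution order, with a stand-in dataclass for `gr.OAuthToken` and an extra cached-login fallback as an assumption, not the app's exact logic:

```python
# Hedged sketch of a token-resolution chain like get_login_token: prefer an
# explicit argument, then the OAuth session token, then a cached login.
# OAuthToken here is a stand-in for gr.OAuthToken, not Gradio's class.

from dataclasses import dataclass
from typing import Optional

@dataclass
class OAuthToken:
    token: str

def resolve_token(api_token_arg: Optional[str],
                  oauth_token: Optional[OAuthToken],
                  cached_token: Optional[str] = None) -> Optional[str]:
    if api_token_arg:               # explicit textbox/CLI value wins
        return api_token_arg
    if oauth_token is not None:     # then the OAuth session
        return oauth_token.token
    return cached_token             # finally whatever a prior login stored

token = resolve_token(None, OAuthToken("hf_session"))
```

In the real app the final fallback can come from `huggingface_hub.get_token()`, which reads the locally cached `huggingface-cli login` credential.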
124      tb = traceback.format_exc()
125      logger.exception(f"❌ Error during login_huggingface — {exc}\n{tb}", exc_info=True)  # Log the full traceback
126      return [gr.update(interactive=True), f"❌ An error occurred during login_huggingface — {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"]  # return the exception message
127  +   '''
128  +   progress((1,16), desc=f"Log in: {is_loggedin_huggingface(api_token)}")
129      ## debug
130      #logger.log(level=30, msg="pdf_files_inputs", extra={"input_arg[0]:": pdf_files[0]})
131
135      #outputs=[log_output, files_individual_JSON, files_individual_downloads],
136      return [gr.update(interactive=True), "Initialising ProcessPool: No files uploaded.", {"Upload":"No files uploaded"}, f"dummy_log.log"]
137
138  +   progress((2,16), desc=f"Getting configuration values")
139      # Get config values if not provided
140  +   config_file = find_file("config.ini")  ##from file_handler.file_utils ##takes a bit of time to process. #NeedOptimise
141  +   model_id = model_id if model_id else get_config_value(config_file, "MARKER_CAP", "MODEL_ID")
142  +   openai_base_url = openai_base_url if openai_base_url else get_config_value(config_file, "MARKER_CAP", "OPENAI_BASE_URL")
143  +   openai_image_format = openai_image_format if openai_image_format else get_config_value(config_file, "MARKER_CAP", "OPENAI_IMAGE_FORMAT")
144  +   max_workers = max_workers if max_workers else get_config_value(config_file, "MARKER_CAP", "MAX_WORKERS")
145  +   max_retries = max_retries if max_retries else get_config_value(config_file, "MARKER_CAP", "MAX_RETRIES")
146  +   output_format = output_format if output_format else get_config_value(config_file, "MARKER_CAP", "OUTPUT_FORMAT")
147  +   output_dir_string = output_dir_string if output_dir_string else str(get_config_value(config_file, "MARKER_CAP", "OUTPUT_DIR"))
148  +   use_llm = use_llm if use_llm else get_config_value(config_file, "MARKER_CAP", "USE_LLM")
149  +   page_range = page_range if page_range else get_config_value(config_file, "MARKER_CAP", "PAGE_RANGE")
150  +
151  +   progress((3,16), desc=f"Retrieved configuration values")
152
153      # Create the initargs tuple from the Gradio inputs:  # 'files' is an iterable, and handled separately.
154  +   progress((4,16), desc=f"Initialising init_args")
155      yield gr.update(interactive=False), f"Initialising init_args", {"process": "Processing files ..."}, f"dummy_log.log"
156      init_args = (
157          provider,
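Lines 141-149 above apply the same "UI value, else config.ini" fallback to each setting. A minimal self-contained sketch of that pattern with stdlib `configparser`; the `MARKER_CAP` section and key names mirror the diff, but this `get_config_value` is an assumed simplification of the app's helper in `utils/get_config.py`:

```python
# "UI value, else config.ini" fallback, as in lines 141-149. Note that any
# falsy UI value ('', None, False, 0) defers to the config file, which is
# worth keeping in mind for boolean settings like use_llm.

import configparser, os, tempfile

def get_config_value(config_file, section, key):
    parser = configparser.ConfigParser()
    parser.read(config_file)
    return parser.get(section, key, fallback=None)

def with_fallback(ui_value, config_file, section, key):
    return ui_value if ui_value else get_config_value(config_file, section, key)

# Build a throwaway config.ini to exercise the fallback
cfg = os.path.join(tempfile.mkdtemp(), "config.ini")
with open(cfg, "w") as fh:
    fh.write("[MARKER_CAP]\nMODEL_ID = some-model\nOUTPUT_FORMAT = markdown\n")

model_id = with_fallback("", cfg, "MARKER_CAP", "MODEL_ID")
```

The falsy check is a deliberate trade-off: it keeps each line short, at the cost of never being able to override a config value with an explicit empty string or `False` from the UI.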
173      output_format,
174      output_dir_string,
175      use_llm,
176  +   force_ocr,
177      page_range,
178      )
179
180  +   # create output_dir
181  +   try:
182  +       yield gr.update(interactive=False), f"Creating output_dir ...", {"process": "Processing files ..."}, f"dummy_log.log"
183  +       progress((5,16), desc=f"ProcessPoolExecutor: Creating output_dir")
184  +
185  +       #pdf2md_converter.output_dir_string = output_dir_string  ##SMY: attempt setting directly to resolve pool.map iterable
186  +
187  +       # Create Marker output_dir in temporary directory where Gradio can access it.
188  +       output_dir = file_utils.create_temp_folder(output_dir_string)
189  +       pdf2md_converter.output_dir = output_dir
190  +
191  +       logger.info(f"✅ output_dir created: ", extra={"output_dir": pdf2md_converter.output_dir.name, "in": str(pdf2md_converter.output_dir.parent)})
192  +       yield gr.update(interactive=False), f"Created output_dir ...", {"process": "Processing files ..."}, f"dummy_log.log"
193  +       progress((6,16), desc=f"✅ Created output_dir.")
194  +   except Exception as exc:
195  +       tb = traceback.format_exc()
196  +       tbp = traceback.print_exc()  # Print the exception traceback
197  +       logger.exception("❌ error creating output_dir — {exc}\n{tb}", exc_info=True)  # Log the full traceback
198  +
199  +       # Update the Gradio UI to improve user-friendly eXperience
200  +       yield gr.update(interactive=True), f"❌ An error occurred creating output_dir: {str(exc)}", {"Error":f"Error: {exc}"}, f"dummy_log.log"  ## return the exception message
201  +       return f"An error occurred creating output_dir: {str(exc)}", f"Error: {exc}", f"Error: {exc}"  ## return the exception message
202  +
203  +   # Process file conversion leveraging ProcessPoolExecutor for efficiency
204      try:
205          results = []  ## initialised pool result holder
206          logger.log(level=30, msg="Initialising ProcessPoolExecutor: pool:", extra={"pdf_files": pdf_files, "files_len": len(pdf_files), "model_id": model_id, "output_dir": output_dir_string})  #pdf_files_count
207          yield gr.update(interactive=False), f"Initialising ProcessPoolExecutor: Processing Files ...", {"process": "Processing files ..."}, f"dummy_log.log"
208  +       progress((7,16), desc=f"Initialising ProcessPoolExecutor: Processing Files ...")
209
210  +       # Create a pool with init_worker initialiser
211          with ProcessPoolExecutor(
212              max_workers=max_workers,
213              initializer=init_worker,
214              initargs=init_args
215          ) as pool:
216  +           logger.log(level=30, msg="Initialising ProcessPoolExecutor: pool:", extra={"pdf_files": pdf_files, "files_len": len(pdf_files), "model_id": model_id, "output_dir": output_dir_string})  #pdf_files_count
217  +           progress((8,16), desc=f"Starting ProcessPool queue: Processing Files ...")
218  +
219          # Map the files (pdf_files) to the conversion function (pdf2md_converter.convert_file)
220          # The 'docconverter' argument is implicitly handled by the initialiser
221          #futures = [pool.map(pdf2md_converter.convert_files, f) for f in pdf_files]
222          #logs = [f.result() for f in as_completed(futures)]
223          #futures = [pool.submit(pdf2md_converter.convert_files, file) for file in pdf_files]
224          #logs = [f.result() for f in futures]
225          try:
226              yield gr.update(interactive=True), f"ProcessPoolExecutor: Pooling file conversion ...", {"process": "Processing files ..."}, f"dummy_log.log"
227  +           progress((9,16), desc=f"ProcessPoolExecutor: Pooling file conversion ...")
228  +
229              # Use progress.tqdm to integrate with the executor map
230              #results = pool.map(pdf2md_converter.convert_files, pdf_files)  ##SMY iterables #max_retries #output_dir_string)
231              for result_interim in progress.tqdm(
232  +               iterable=pool.map(pdf2md_converter.convert_files, pdf_files)  #, max_retries), total=len(pdf_files)
233              ):
234                  results.append(result_interim)
235                  # Update the Gradio UI to improve user-friendly eXperience
236                  yield gr.update(interactive=True), f"ProcessPoolExecutor: Pooling file conversion result: [{str(result_interim)}[:20]]", {"process": "Processing files ..."}, f"dummy_log.log"
237  +               progress((10,16), desc=f"ProcessPoolExecutor: Pooling file conversion result: [{str(result_interim)}[:20]]")
238
239  +           yield gr.update(interactive=True), f"ProcessPoolExecutor: Got Results from files conversion: [{str(result_interim)}[:20]]", {"process": "Processing files ..."}, f"dummy_log.log"
240  +           progress((11,16), desc=f"ProcessPoolExecutor: Got Results from files conversion")
241          except Exception as exc:
242              # Raise the exception to stop the Gradio app: exception to halt execution
243              logger.exception("Error during pooling file conversion", exc_info=True)  # Log the full traceback
244  +           tbp = traceback.print_exc()  # Print the exception traceback
245              # Update the Gradio UI to improve user-friendly eXperience
246  +           yield gr.update(interactive=True), f"An error occurred during pool.map: {str(exc)}", {"Error":f"Error: {exc}\n{tbp}"}, f"dummy_log.log"  ## return the exception message
247  +           return [gr.update(interactive=True), f"An error occurred during pool.map: {str(exc)}", {"Error":f"Error: {exc}\n{tbp}"}, f"dummy_log.log"]  ## return the exception message
248
249  +   # Process file conversion results
250      try:
251  +       progress((12,16), desc="Processing results from files conversion")  ##rekickin
252          logger.log(level=20, msg="ProcessPoolExecutor pool result:", extra={"results": str(results)})
253  +
254          logs = []
255          logs_files_images = []
256  +
257          #logs.extend(results)  ## performant pythonic
258          #logs = list[results]  ##
259          logs = [result for result in results]  ## pythonic list comprehension
260  +       # [template]  ## logs : [file , images , filepath, image_path]
261
262          #logs_files_images = logs_files.extend(logs_images)  #zip(logs_files, logs_images)  ##SMY: in progress
263          logs_count = 0
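The loop at lines 231-234 consumes `pool.map` lazily, so each finished file can be appended and reported before the whole batch completes. The same idea with a plain generator standing in for `progress.tqdm` and the process pool:

```python
# Stream results as they arrive instead of waiting for the full batch:
# a generator wraps any iterable and invokes a progress callback per item.
# `fake_convert` stands in for the app's convert_files.

def with_progress(iterable, total, report):
    """Yield items from `iterable`, calling report(done, total) before each yield."""
    for done, item in enumerate(iterable, start=1):
        report(done, total)
        yield item

def fake_convert(name):
    return f"{name}.md"

files = ["a", "b", "c"]
seen = []
results = []
for out in with_progress(map(fake_convert, files), len(files), lambda d, t: seen.append((d, t))):
    results.append(out)
```

`map(...)` here, like `pool.map(...)` in the diff, is itself lazy, so the progress callback fires as each conversion finishes rather than after all of them.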
269              # Update the Gradio UI to improve user-friendly eXperience
270              #yield gr.update(interactive=False), f"Processing files: {logs_files_images[logs_count]}", {"process": "Processing files"}, f"dummy_log.log"
271              logs_count = i+i_image
272      except Exception as exc:
273  +       tbp = traceback.print_exc()  # Print the exception traceback
274  +       logger.exception("Error during processing results logs — {exc}\n{tbp}", exc_info=True)  # Log the full traceback
275          return [gr.update(interactive=True), f"An error occurred during processing results logs: {str(exc)}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"]  ## return the exception message
276          #yield gr.update(interactive=True), f"An error occurred during processing results logs: {str(exc)}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"  ## return the exception message
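One pitfall worth flagging in the added lines 244 and 273: `traceback.print_exc()` writes the traceback to stderr and returns `None`, so assigning its result and interpolating it into a message produces the literal string "None". `traceback.format_exc()` is the variant that returns the text:

```python
# print_exc vs format_exc: only format_exc returns the traceback string.

import traceback

def describe_failure():
    try:
        1 / 0
    except ZeroDivisionError:
        printed = traceback.print_exc()   # side effect only; returns None
        text = traceback.format_exc()     # the traceback as a str
        return printed, text

printed, text = describe_failure()
```

`logger.exception(...)` with `exc_info=True` already records the full traceback, so inside an `except` block neither call is strictly needed for logging.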
277      except Exception as exc:
278          tb = traceback.format_exc()
279          logger.exception(f"❌ Error during ProcessPoolExecutor — {exc}\n{tb}", exc_info=True)  # Log the full traceback
280          #traceback.print_exc()  # Print the exception traceback
281  +       yield gr.update(interactive=True), f"❌ An error occurred during ProcessPoolExecutor — {exc}", {"Error":f"Error: {exc}"}, f"dummy_log.log"  # return the exception message
282
283  +   # Zip Processed Files and images. Insert to first index
284      try:  ##from file_handler.file_utils
285  +       progress((13,16), desc="Zipping processed files and images")
286  +       zipped_processed_files = zip_processed_files(root_dir=f"{output_dir}", file_paths=logs_files_images, tz_hours=tz_hours, date_format='%d%b%Y_%H-%M-%S')  #date_format='%d%b%Y'
287          logs_files_images.insert(0, zipped_processed_files)
288
289  +       progress((14,16), desc="Zipped processed files and images")
290          #yield gr.update(interactive=False), f"Processing zip and files: {logs_files_images}", {"process": "Processing files"}, f"dummy_log.log"
291
292      except Exception as exc:
293          tb = traceback.format_exc()
294          logger.exception(f"❌ Error during zipping processed files — {exc}\n{tb}", exc_info=True)  # Log the full traceback
295          #traceback.print_exc()  # Print the exception traceback
296          yield gr.update(interactive=True), f"❌ An error occurred during zipping files — {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"  # return the exception message
297  +       return gr.update(interactive=True), f"❌ An error occurred during zipping files — {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"  # return the exception message
298
299
300      # Return processed files log
301      try:
302  +       progress((15,16), desc="Formatting processed log results")
303  +
304          ## # Convert logs list of dicts to formatted json string
305          logs_return_formatted_json_string = file_handler.file_utils.process_dicts_data(logs)  ##SMY outputs to gr.JSON component with no need for json.dumps(data, indent=)
306  +       #logs_files_images_return = "\n".join(path for path in logs_files_images)  ##TypeError: sequence item 0: expected str instance, WindowsPath found
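`zip_processed_files` (line 286, defined in `file_handler/file_utils.py`) bundles the produced files into one archive whose name carries a timezone-shifted timestamp built from `tz_hours` and `date_format`. A hedged stdlib-only sketch of what such a helper can look like; this signature and the archive naming are assumptions, not the app's exact implementation:

```python
# Stdlib sketch of a zip_processed_files-style helper: bundle output files
# into a single archive named with a timezone-shifted timestamp.

import tempfile, zipfile
from datetime import datetime, timedelta, timezone
from pathlib import Path

def zip_outputs(root_dir, file_paths, tz_hours=0, date_format="%d%b%Y_%H-%M-%S"):
    stamp = datetime.now(timezone(timedelta(hours=float(tz_hours)))).strftime(date_format)
    archive = Path(root_dir) / f"processed_{stamp}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in file_paths:
            zf.write(path, arcname=Path(path).name)  # flatten into archive root
    return archive

root = tempfile.mkdtemp()
sample = Path(root) / "doc.md"
sample.write_text("# converted")
archive = zip_outputs(root, [sample], tz_hours=1)
```

Inserting the archive at index 0 of the download list (line 287) makes the combined zip the first entry the user sees in the `gr.Files` output.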
307
308          ## # Convert any Path objects to strings, but leave strings as-is
309          logs_files_images_return = list(str(path) if isinstance(path, Path) else path for path in logs_files_images)
310          logger.log(level=20, msg="File conversion complete. Sending outcome to Gradio:", extra={"logs_files_image_return": str(logs_files_images_return)})  ## debug: FileNotFoundError: [WinError 2] The system cannot find the file specified: 'Error or no image_path'
311
312  +       progress((16,16), desc="Complete processing and formatting file processing results")
313  +       # [templates]
314          #outputs=[process_button, log_output, files_individual_JSON, files_individual_downloads],
315          #return "\n".join(logs), "\n".join(logs_files_images)  #"\n".join(logs_files)
316
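Line 309 normalizes a mixed list of `Path` objects and plain strings before it reaches the download component; joining the raw list is what raised the "expected str instance, WindowsPath found" error noted at line 306. The conversion in isolation:

```python
# Normalize a mixed Path/str list to plain strings, as line 309 does,
# so "\n".join(...) and file-output components accept every entry.

from pathlib import Path

def normalize_paths(items):
    return [str(item) if isinstance(item, Path) else item for item in items]

mixed = [Path("out") / "doc.md", "out/figure1.webp", Path("out/processed.zip")]
as_strings = normalize_paths(mixed)
joined = "\n".join(as_strings)   # safe now: every element is a str
```

Converting only the `Path` entries (rather than `str(x)` on everything) leaves pre-formatted string entries, such as error placeholders, untouched.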
323      tb = traceback.format_exc()
324      logger.exception(f"❌ Error during returning result logs — {exc}\n{tb}", exc_info=True)  # Log the full traceback
325      #traceback.print_exc()  # Print the exception traceback
326  +   yield gr.update(interactive=True), f"❌ An error occurred during returning result logs — {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"  # return the exception message
327  +   return [gr.update(interactive=True), f"❌ An error occurred during returning result logs — {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"]  # return the exception message
328
329      #return "\n".join(log for log in logs), "\n".join(str(path) for path in logs_files_images)
330      #print(f'logs_files_images: {"\n".join(str(path) for path in logs_files_images)}')
502      #message = f"Accumulated {len(updated_files)} file(s) total.\n\nAll file paths:\n{file_info}"
503      message = f"Accumulated {len(updated_files)} file(s) total: \n{filename_info}"
504
505  +   return updated_files, message, gr.update(interactive=True), gr.update(interactive=True)
506
507      # with gr.Blocks(title=TITLE) as demo
508      with gr.Blocks(title=TITLE, css=custom_css) as demo:
577      )
578
579      # Clean UI: Model parameters hidden in collapsible accordion
580  +   with gr.Accordion("⚙️ Marker Converter Settings", open=False):
581          gr.Markdown(f"#### **Marker Configuration**")
582          with gr.Row():
583              openai_base_url_tb = gr.Textbox(
592                  value="webp",
593              )
594              output_format_dd = gr.Dropdown(
595  +               choices=["markdown", "html", "json"],  #, "json", "chunks"], ##SMY: To be enabled later
596                  #choices=["markdown", "html", "json", "chunks"],
597                  label="Output Format",
598                  value="markdown",
618                  value=2,
619                  step=1  #0.01
620              )
621  +           with gr.Column():
622  +               use_llm_cb = gr.Checkbox(
623  +                   label="Use LLM for Marker conversion",
624  +                   value=False
625  +               )
626  +               force_ocr_cb = gr.Checkbox(
627  +                   label="force OCR on all pages",
628  +                   value=True,
629  +               )
630              page_range_tb = gr.Textbox(
631                  label="Page Range (Optional)",
632                  placeholder="Example: 0,1-5,8,12-15",
667          btn_pdf_convert = gr.Button("Convert PDF(s)")
668      '''
669
670  +   file_types_list.extend(file_types_tuple)
671      with gr.Column(elem_classes=["file-or-directory-area"]):
672          with gr.Row():
673              file_btn = gr.UploadButton(
674              #file_btn = gr.File(
675                  label="Upload Multiple Files",
676                  file_count="multiple",
677  +               file_types= file_types_list,  #["file"], ##config.file_types_list
678                  #height=25, #"sm",
679                  size="sm",
680                  elem_classes=["gradio-upload-btn"]
683              #dir_btn = gr.File(
684                  label="Upload a Directory",
685                  file_count="directory",
686  +               file_types= file_types_list,  #["file"], #Warning: The `file_types` parameter is ignored when `file_count` is 'directory'
687                  #height=25, #"0.5",
688                  size="sm",
689                  elem_classes=["gradio-upload-btn"]
693          output_textbox = gr.Textbox(label="Accumulated Files", lines=3)  #, max_lines=4) #10
694
695          with gr.Row():
696  +           process_button = gr.Button("Process All Uploaded Files", variant="primary", interactive=False)
697  +           clear_button = gr.Button("Clear All Uploads", variant="secondary", interactive=False)
698
699
700      # --- PDF → Markdown tab ---
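Lines 696-697 now create both action buttons disabled, and `accumulate_files` returns two extra `gr.update(interactive=True)` values (line 505) so the first upload unlocks them. A pure-Python sketch of that handler's shape; the dedup step and the dict standing in for `gr.update` are illustrative assumptions, not the app's exact logic:

```python
# Sketch of an accumulate-then-enable upload handler: merge new uploads
# into the running state list and return "enable" updates for the two
# buttons. {"interactive": ...} stands in for gr.update(interactive=...).

def accumulate_files(new_files, current_files):
    updated = list(current_files) + [f for f in new_files if f not in current_files]
    message = f"Accumulated {len(updated)} file(s) total."
    enable = {"interactive": bool(updated)}
    return updated, message, enable, enable

state = []
state, msg, btn1, btn2 = accumulate_files(["a.pdf"], state)
state, msg, btn1, btn2 = accumulate_files(["a.pdf", "b.pdf"], state)
```

Wiring the same handler to both `file_btn.upload` and `dir_btn.upload` (lines 898-909) keeps one source of truth for the accumulated list and the button state.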
881          """
882          #msg = f"Files list cleared: {do_logout()}" ## use as needed
883          msg = f"Files list cleared."
884  +       #yield [], msg, '', ''
885          #return [], f"Files list cleared.", [], []
886  +       yield [], msg, None, None
887  +       return [], f"Files list cleared.", None, None
888
889      #hf_login_logout_btn.click(fn=custom_do_logout, inputs=None, outputs=hf_login_logout_btn)
890      ##unused
898      file_btn.upload(
899          fn=accumulate_files,
900          inputs=[file_btn, uploaded_file_list],
901  +       outputs=[uploaded_file_list, output_textbox, process_button, clear_button]
902      )
903
904      # Event handler for the directory upload button
905      dir_btn.upload(
906          fn=accumulate_files,
907          inputs=[dir_btn, uploaded_file_list],
908  +       outputs=[uploaded_file_list, output_textbox, process_button, clear_button]
909      )
910
911      # Event handler for the "Clear" button
950          output_format_dd,
951          output_dir_tb,
952          use_llm_cb,
953  +       force_ocr_cb,
954          page_range_tb,
955          tz_hours_num,  #state_tz_hours
956      ]
utils/config.py
CHANGED
@@ -28,6 +28,13 @@ DESCRIPTION_MD = (
     "Upload Markdown/LaTeX files and generate a polished PDF."
 )

+# File types
+file_types_list = []
+file_types_tuple = (".pdf", ".html", ".docx", ".doc")
+#file_types_list = list[file_types_tuple]
+#file_types_list.extend(file_types_tuple)
+
+
 # Conversion defaults
 DEFAULT_MARKER_OPTIONS = {
     "include_images": True,