Spaces: Running on Zero
baseline08_beta0.3.0_01Oct25: refactor OAuth login and Marker converter; drop llm_client; add force_ocr to phase in Marker CLI options
Files changed:
- README.md +17 -14
- converters/extraction_converter.py +24 -36
- converters/pdf_to_md.py +18 -39
- file_handler/file_utils.py +29 -10
- llm/llm_login.py +14 -0
- requirements.txt +8 -5
- ui/gradio_ui.py +113 -119
- utils/config.py +7 -0
README.md
CHANGED
@@ -82,11 +82,11 @@ requires-python: ">=3.12"
 [](https://www.python.org/)
 [](LICENSE)
 
-A Gradio-based web application for converting PDF and
+A Gradio-based web application for converting PDF, HTML and Word documents to Markdown format. Powered by the Marker library (a pipeline of deep learning models for document parsing) and optional LLM integration for enhanced processing. Supports batch processing of files and directories via an intuitive UI.
 
 ## Features
-- **PDF to Markdown**: Extract text, tables, and images from PDFs using Marker.
-- **HTML to Markdown**: Convert HTML files to clean Markdown.
+- **PDF to Markdown**: Extract text, tables, and images from PDFs, HTMLs and Word documents using Marker.
+- **HTML to Markdown**: Convert HTML files to clean Markdown. #Deprecated
 - **Batch Processing**: Upload multiple files or entire directories.
 - **LLM Integration**: Optional use of Hugging Face or OpenAI models for advanced conversion (e.g., via Llama or GPT models).
 - **Customizable Settings**: Adjust model parameters, output formats (Markdown/HTML), page ranges, and more via the UI.

@@ -104,21 +104,21 @@ parserpdf/
 ├── converters/                 # Conversion logic
 │   ├── __init__.py
 │   ├── extraction_converter.py # Document extraction utilities
-│   ├── pdf_to_md.py            # Marker-based PDF → Markdown
-│   ├── html_to_md.py           # HTML → Markdown
+│   ├── pdf_to_md.py            # Marker-based PDF, HTML, Word → Markdown
+│   ├── html_to_md.py           # HTML → Markdown #Deprecated
 │   ├── md_to_pdf.py            # Markdown → PDF (pending full implementation)
 ├── file_handler/               # File handling utilities
 │   ├── __init__.py
 │   ├── file_utils.py           # Helpers for files, directories, and paths
 ├── llm/                        # LLM client integrations
 │   ├── __init__.py
-│   ├── hf_client.py            # Hugging Face client wrapper
-│   ├── openai_client.py        # Marker OpenAI client
+│   ├── hf_client.py            # Hugging Face client wrapper ##PutOnHold
+│   ├── openai_client.py        # Marker OpenAI client ##NotFullyImplemented
 │   ├── llm_login.py            # Authentication handlers
 │   ├── provider_validator.py   # Provider validation
 ├── ui/                         # Gradio UI components
 │   ├── __init__.py
-│   ├── gradio_ui.py            # UI layout
+│   ├── gradio_ui.py            # UI layout, event handlers and coordination
 ├── utils/                      # Utility modules
 │   ├── __init__.py
 │   ├── config.py               # Configuration constants

@@ -132,8 +132,8 @@ parserpdf/
 │   ├── output_dir/             # Output directory
 │   ├── pdf/                    # Sample PDFs
 ├── logs/                       # Log files (gitignored)
-├── tests/                      # Unit tests
-├── tests_converter.py          # tests for converters
+├── tests/                      # Unit tests ##ToBeUpdated
+│   ├── tests_converter.py      # tests for converters
 ├── scrapyard/                  # Development scraps

@@ -165,10 +165,11 @@ parserpdf/
 HF_TOKEN=hf_xxx
 OPENAI_API_KEY=sk-xxx
 ```
+- HuggingFace login (oauth) integrated with Gradio:
 
 4. Install Marker (if not in requirements.txt):
 ```
-pip install marker-pdf
+pip install marker-pdf[full]
 ```
 
 ## Usage

@@ -180,7 +181,7 @@ parserpdf/
 2. Open the provided local URL (e.g., http://127.0.0.1:7860) in your browser.
 
 3. In the UI:
-   - Upload PDF/HTML files or directories via the "PDF &
+   - Upload PDF/HTML/Word files or directories via the "PDF, HTML & Word → Markdown" tab.
    - Configure LLM/Marker settings in the accordions (e.g., select provider, model, tokens).
    - Click "Process All Uploaded Files" to convert.
    - View logs, JSON output, and download generated Markdown files.

@@ -208,13 +209,15 @@ parserpdf/
 ## Limitations & TODO
 - Markdown → PDF is pending full implementation.
 - HTML tab is deprecated; use main tab for mixed uploads.
-- Large files/directories may require increased `max_workers
+- Large files/directories may require increased `max_workers` and higher processing power.
 - No JSON/chunks output yet (flagged for future).
 
 ## Contributing
 Fork the repo, create a branch, and submit a PR.
+- GitHub
+- HuggingFace Space Community
 
-Ensure tests pass: - verify the application's functionality.
+Ensure tests pass: - verify the application's functionality. ##TardyOutdated
 ```
 pytest tests/
 ```
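The `.env` variables documented above feed a fallback chain: when an LLM service is selected, the converter exports the first available token as `OPENAI_API_KEY` for Marker (checking `OPENAI_API_KEY`, then `GEMINI_API_KEY`, `HF_TOKEN`, `HUGGINGFACEHUB_API_TOKEN`). A minimal sketch of that lookup; the helper name `resolve_api_token` is hypothetical, not from the repo:

```python
import os
from typing import Optional

def resolve_api_token() -> Optional[str]:
    """Return the first non-empty token found, mirroring the fallback
    order the converter uses when exporting OPENAI_API_KEY for Marker."""
    for var in ("OPENAI_API_KEY", "GEMINI_API_KEY",
                "HF_TOKEN", "HUGGINGFACEHUB_API_TOKEN"):
        token = os.environ.get(var)
        if token:
            return token
    return None
```

An explicitly passed token would still take precedence in the app itself; this helper only covers the environment-variable half of the chain.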
converters/extraction_converter.py
CHANGED
@@ -53,6 +53,7 @@ class DocumentConverter:
         output_format: str = "markdown",
         output_dir: Optional[Union[str, Path]] = "output_dir",
         use_llm: Optional[bool] = None, #bool = False, #Optional[bool] = False, #True,
+        force_ocr: Optional[bool] = None, #bool = False,
         page_range: Optional[str] = None, #str = None #Optional[str] = None,
     ):

@@ -68,6 +69,7 @@ class DocumentConverter:
         self.max_retries = max_retries ## pass to __call__
         self.output_dir = output_dir ## "output_dir": settings.DEBUG_DATA_FOLDER if debug else output_dir,
         self.use_llm = use_llm if use_llm else False #use_llm[0] if isinstance(use_llm, tuple) else use_llm, #False, #True,
+        self.force_ocr = force_ocr if force_ocr else False
         #self.page_range = page_range[0] if isinstance(page_range, tuple) else page_range ##SMY: iterating twice because self.page casting as hint type tuple!
         self.page_range = page_range if page_range else None
         # self.page_range = page_range[0] if isinstance(page_range, tuple) else page_range if isinstance(page_range, str) else None, ##Example: "0,4-8,16" ##Marker parses as List[int] #]debug #len(pdf_file)

@@ -80,6 +82,7 @@ class DocumentConverter:
 
         # 0) Instantiate the LLM Client (OPENAIChatClient): Get a provider-agnostic chat function
         ##SMY: #future. Plan to integrate into Marker: uses its own LLM services (clients). As at 1.9.2, there's no huggingface client service.
+        '''
         try:
             self.client = OpenAIChatClient(
                 model_id=model_id,

@@ -95,16 +98,17 @@ class DocumentConverter:
             tb = traceback.format_exc() #exc.__traceback__
             logger.exception(f"❌ Error initialising OpenAIChatClient: {exc}\n{tb}")
             raise RuntimeError(f"❌ Error initialising OpenAIChatClient: {exc}\n{tb}") #.with_traceback(tb)
-
+        '''
         # 1) # Define the custom configuration for the Hugging Face LLM.
         # Use typing.Dict and typing.Any for flexible dictionary type hints
         try:
             self.config_dict: Dict[str, Any] = self.get_config_dict(model_id=model_id, llm_service=str(self.llm_service), output_format=output_format)
-
-
+
+            ##SMY: execute if page_range is none. `else None` ensures valid syntactic expression
             ##SMY: if falsely empty tuple () or None, pop the "page_range" key-value pair, else do nothing if truthy tuple value (i.e. keep as-is)
             self.config_dict.pop("page_range", None) if not self.config_dict.get("page_range") else None
             self.config_dict.pop("use_llm", None) if not self.config_dict.get("use_llm") or self.config_dict.get("use_llm") is False or self.config_dict.get("use_llm") == 'False' else None
+            self.config_dict.pop("force_ocr", None) if not self.config_dict.get("force_ocr") or self.config_dict.get("force_ocr") is False or self.config_dict.get("force_ocr") == 'False' else None
 
             logger.log(level=20, msg="✔️ config_dict custom configured:", extra={"service": "openai"}) #, "config": str(self.config_dict)})
 

@@ -124,27 +128,17 @@ class DocumentConverter:
             logger.exception(f"❌ Error parsing/processing custom config_dict: {exc}\n{tb}")
             raise RuntimeError(f"❌ Error parsing/processing custom config_dict: {exc}\n{tb}") #.with_traceback(tb)
 
-        # 3)
-        try:
-            ##self.artifact_dict: Dict[str, Any] = self.get_create_model_dict ##SMY: Might have to eliminate function afterall
-            #self.artifact_dict: Dict[str, Type[BaseModel]] = create_model_dict() ##SMY: BaseModel for Any??
-            self.artifact_dict = {} ##dummy
-            ##logger.log(level=20, msg="✔️ Create artifact_dict and llm_service retrieved:", extra={"llm_service": self.llm_service})
-
-        except Exception as exc:
-            tb = traceback.format_exc() #exc.__traceback__
-            logger.exception(f"❌ Error creating artifact_dict or retrieving LLM service: {exc}\n{tb}")
-            raise RuntimeError(f"❌ Error creating artifact_dict or retrieving LLM service: {exc}\n{tb}") #.with_traceback(tb)
-
-        # 4) Load models if not already loaded in reload mode
+        # 3) Load models if not already loaded in reload mode
         from globals import config_load_models
         try:
-            if
+            if config_load_models.model_dict:
+                model_dict = config_load_models.model_dict
+            #elif not config_load_models.model_dict or 'model_dict' not in globals():
+            else:
                 model_dict = load_models()
             '''if 'model_dict' not in globals():
                 #model_dict = self.load_models()
                 model_dict = load_models()'''
-            else: model_dict = config_load_models.model_dict
         except OSError as exc_ose:
             tb = traceback.format_exc() #exc.__traceback__
             logger.warning(f"⚠️ OSError: the paging file is too small (to complete reload): {exc_ose}\n{tb}")

@@ -153,30 +147,28 @@ class DocumentConverter:
             tb = traceback.format_exc() #exc.__traceback__
             logger.exception(f"❌ Error loading models (reload): {exc}\n{tb}")
             raise RuntimeError(f"❌ Error loading models (reload): {exc}\n{tb}") #.with_traceback(tb)
-
 
-        #
+        # 4) Instantiate Marker's MarkerConverter (PdfConverter) with config managed by config_parser
         try: # Assign llm_service if api_token. ##SMY: split and slicing ##Gets the string value
             llm_service_str = None if api_token == '' or api_token is None or self.use_llm is False else str(self.llm_service).split("'")[1] #
 
             # sets api_key required by Marker ## to handle Marker's assertion test on OpenAI
-
-
-
+            if llm_service_str:
+                os.environ["OPENAI_API_KEY"] = api_token if api_token and api_token != '' else os.getenv("OPENAI_API_KEY") or os.getenv("GEMINI_API_KEY") or os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACEHUB_API_TOKEN")
+                #logger.log(level=20, msg="self.converter: instantiating MarkerConverter:", extra={"llm_service_str": llm_service_str, "api_token": api_token}) ##debug
 
             config_dict = config_parser.generate_config_dict()
-            #config_dict["pdftext_worker"] = self.max_workers #1 ##SMY:
+            #config_dict["pdftext_worker"] = self.max_workers #1 ##SMY: moved to get_config_dicts()
 
-            #self.converter:
+            #self.converter: marker.converters.pdf.PdfConverter
             self.converter = MarkerConverter(
-                ##artifact_dict=self.artifact_dict,
                 #artifact_dict=create_model_dict(),
                 artifact_dict=model_dict if model_dict else create_model_dict(),
                 config=config_dict,
                 #config=config_parser.generate_config_dict(),
                 #llm_service=self.llm_service ##SMY expecting str but self.llm_service, is service object marker.services of type BaseServices
                 llm_service=llm_service_str, ##resolve
-
+            )
 
             logger.log(level=20, msg="✔️ MarkerConverter instantiated successfully:", extra={"converter.config": str(self.converter.config.get("openai_base_url")), "use_llm":self.converter.use_llm})
             #return self.converter ##SMY: to query why did I comment out?. Bingo: "__init__() should return None, not 'PdfConverter'"

@@ -187,21 +179,20 @@ class DocumentConverter:
 
     # Define the custom configuration for HF LLM.
     def get_config_dict(self, model_id: str, llm_service=MarkerOpenAIService, output_format: Optional[str] = "markdown" ) -> Dict[str, Any]:
-        """ Define the custom configuration for the Hugging Face LLM. """
+        """ Define the custom configuration for the Hugging Face LLM: combining Marker's cli_options and LLM. """
 
         try:
-            ## Enable higher quality processing
-            # llm_service disused here
+            ## LLM Enable higher quality processing. ## See MarkerOpenAIService,
             ##llm_service = llm_service.removeprefix("<class '").removesuffix("'>") # e.g <class 'marker.services.openai.OpenAIService'>
             #llm_service = str(llm_service).split("'")[1] ## SMY: split and slicing
             self.use_llm = self.use_llm[0] if isinstance(self.use_llm, tuple) else self.use_llm
             self.page_range = self.page_range[0] if isinstance(self.page_range, tuple) else self.page_range #if isinstance(self.page_range, str) else None, ##SMY: passing as hint type tuple!
 
-
+            ##SMY: TODO: convert to {inputs} and called from gradio_ui
             config_dict = {
                 "output_format"  : output_format, #"markdown",
                 "openai_model"   : self.model_id, #self.client.model_id, #"model_name"
-                "openai_api_key" : self.
+                "openai_api_key" : self.openai_api_key, #self.client.openai_api_key, #self.api_token,
                 "openai_base_url": self.openai_base_url, #self.client.base_url, #self.base_url,
                 "temperature"    : self.temperature, #self.client.temperature,
                 "top_p"          : self.top_p, #self.client.top_p,

@@ -210,6 +201,7 @@ class DocumentConverter:
                 "max_retries"    : self.max_retries, #3, ## pass to __call__
                 "output_dir"     : self.output_dir,
                 "use_llm"        : self.use_llm, #False, #True,
+                "force_ocr"      : self.force_ocr, #False,
                 "page_range"     : self.page_range, ##debug #len(pdf_file)
             }
             return config_dict

@@ -219,10 +211,6 @@ class DocumentConverter:
             raise RuntimeError(f"❌ Error configuring custom config_dict: {exc}\n{tb}") #").with_traceback(tb)
             #raise
 
-    ''' # create/load models. Called to curtail reloading models at each instance
-    def load_models():
-        return create_model_dict()'''
-
     ##SMY: flagged for deprecation
     ##SMY: marker prefer default artifact dictionary (marker.models.create_model_dict) instead of overridding
     #def get_extraction_converter(self, chat_fn):
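The conditional `pop(...) if ... else None` expressions in `DocumentConverter` drop unset keys one at a time so Marker only receives explicit settings. The same pruning can be written as a single dict comprehension; a sketch under that reading (the helper name `prune_config` is mine, not from the repo):

```python
from typing import Any, Dict

def prune_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Drop entries whose values are falsy (None, False, '', empty tuple)
    or the literal string 'False', keeping only explicit settings."""
    return {key: value for key, value in config.items()
            if value and value != "False"}

cfg = {
    "output_format": "markdown",
    "use_llm": False,       # falsy -> dropped
    "force_ocr": "False",   # string sentinel -> dropped
    "page_range": None,     # unset -> dropped
}
print(prune_config(cfg))  # {'output_format': 'markdown'}
```

This keeps a truthy `page_range` such as `"0,4-8,16"` intact while removing the unset keys in one pass.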
converters/pdf_to_md.py
CHANGED
@@ -1,13 +1,13 @@
 # converters/pdf_to_md.py
 import os
 from pathlib import Path
-from typing import List, Dict,
+from typing import List, Dict, Union, Optional
 import traceback ## Extract, format and print information about Python stack traces.
 import time
 
-
+from ui.gradio_ui import gr
 from converters.extraction_converter import DocumentConverter #, DocumentExtractor #as docextractor #ExtractionConverter #get_extraction_converter ## SMY: should disuse
-from file_handler.file_utils import collect_pdf_paths, collect_html_paths, collect_markdown_paths, create_outputdir
+from file_handler.file_utils import write_markdown, dump_images, collect_pdf_paths, collect_html_paths, collect_markdown_paths, create_outputdir
 
 
 from utils import config

@@ -43,7 +43,9 @@ def init_worker(#self,
     output_format: str, #: str = "markdown",
     output_dir: str, #: Union | None = "output_dir",
     use_llm: bool, #: bool | None = False,
+    force_ocr: bool,
     page_range: str, #: str | None = None
+    progress: gr.Progress = gr.Progress(),
 ):
 
     #'''

@@ -58,35 +60,6 @@ def init_worker(#self,
     # Define global variables
     global docconverter
     global converter
-
-
-    ##SMY: kept for future implementation. Replaced with DocumentConverter.
-    '''
-    # 1) Instantiate the DocumentExtractor
-    logger.log(level=20, msg="initialising docextractor:", extra={"model_id": model_id, "hf_provider": hf_provider})
-    try:
-        docextractor = DocumentExtractor(
-            provider=provider,
-            model_id=model_id,
-            hf_provider=hf_provider,
-            endpoint_url=endpoint_url,
-            backend_choice=backend_choice,
-            system_message=system_message,
-            max_tokens=max_tokens,
-            temperature=temperature,
-            top_p=top_p,
-            stream=stream,
-            api_token=api_token,
-        )
-        logger.log(level=20, msg="✔️ docextractor initialised:", extra={"model_id": model_id, "hf_provider": hf_provider})
-    except Exception as exc:
-        #logger.error(f"Failed to initialise DocumentExtractor: {exc}")
-        tb = traceback.format_exc()
-        logger.exception(f"init_worker: Error initialising DocumentExtractor - {exc}\n{tb}", exc_info=True)
-        return f"❌ init_worker: error initialising DocumentExtractor - {exc}\n{tb}"
-
-    self.docextractor = docextractor
-    '''
 
     #'''
     # 1) Instantiate the DocumentConverter

@@ -105,6 +78,7 @@ def init_worker(#self,
         output_format, #: str = "markdown",
         output_dir, #: Union | None = "output_dir",
         use_llm, #: bool | None = False,
+        force_ocr,
         page_range, #: str | None = None
     )
     logger.log(level=20, msg="✔️ docextractor initialised:", extra={"docconverter model_id": docconverter.converter.config.get("openai_model"), "docconverter use_llm": docconverter.converter.use_llm, "docconverter output_dir": docconverter.output_dir})

@@ -127,8 +101,9 @@ class PdfToMarkdownConverter:
 
     #def __init__(self, options: Dict | None = None):
     def __init__(self, options: Dict | None = None): #extractor: DocumentExtractor, options: Dict | None = None):
-        self.options = options or {}
+        self.options = options or {} ##SMY: TOBE implemented - bring all Marker's options
         self.output_dir_string = ''
+        self.output_dir = self.output_dir_string ## placeholder
         #self.OUTPUT_DIR = config.OUTPUT_DIR ##flag unused
         #self.MAX_RETRIES = config.MAX_RETRIES ##flag unused
         #self.docconverter = None #DocumentConverter

@@ -197,25 +172,28 @@ class PdfToMarkdownConverter:
         return {"file": md_file.name, "images": images_count, "filepath": md_file, "image_path": image_path} ####SMY should be Dict[str, int, str]. Dicts are not necessarily ordered.
 
     #def convert_files(src_path: str, output_dir: str, max_retries: int = 2) -> str:
-    def convert_files(self, src_path: str, output_dir_string: str = None, max_retries: int = 2) -> Union[Dict, str]: #str:
+    #def convert_files(self, src_path: str, output_dir_string: str = None, max_retries: int = 2, progress = gr.Progress()) -> Union[Dict, str]: #str:
+    def convert_files(self, src_path: str, max_retries: int = 2, progress = gr.Progress()) -> Union[Dict, str]:
         #def convert_files(self, src_path: str) -> str:
         """
         Worker task: use `extractor` to convert file with retry/backoff.
         Returns a short log line.
         """
 
-        try:
+        '''try: ##moved to gradio_ui. sets to PdfToMarkdownConverter.output_dir_string
             output_dir = create_outputdir(root=src_path, output_dir_string=self.output_dir_string)
             logger.info(f"✅ output_dir created: {output_dir}") #{create_outputdir(src_path)}"
         except Exception as exc:
             tb = traceback.format_exc()
             logger.exception("❌ error creating output_dir - {exc}\n{tb}", exc_info=True)
-            return f"❌ error creating output_dir - {exc}\n{tb}"
 
         try:
             #if Path(src_path).suffix.lower() not in {".pdf", ".html", ".htm"}:
             #if not Path(src_path).name.endswith(tuple({".pdf", ".html"})): #,".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls"})):
-            if not Path(src_path).name.endswith((".pdf", ".html")): #,".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls"})):
             logger.log(level=20, msg=f"skipped {Path(src_path).name}", exc_info=True)
             return f"skipped {Path(src_path).name}"
         except Exception as exc:

@@ -226,7 +204,8 @@ class PdfToMarkdownConverter:
         #max_retries = self.MAX_RETRIES
         for attempt in range(1, max_retries + 1):
             try:
-                info = self.extract(str(src_path), str(output_dir.stem)) #extractor.converter(str(src_path), str(output_dir)) #
                 logger.log(level=20, msg=f"✅ : info about extracted {Path(src_path).name}: ", extra={"info": str(info)})
                 ''' ##SMY: moving formating to calling Gradio
                 img_count = info.get("images", 0)

@@ -239,7 +218,7 @@ class PdfToMarkdownConverter:
             except Exception as exc:
                 if attempt == max_retries:
                     tb = traceback.format_exc()
-                    return f"❌ {info.get('file')} - {exc}\n{tb}"
                     #return f"❌ {md_filename} - {exc}\n{tb}"
 
                 #time.sleep(2 ** attempt)
|
| 189 |
+
return f"β error creating output_dir β {exc}\n{tb}"'''
|
| 190 |
+
output_dir = Path(self.output_dir) ## takes the value from gradio_ui
|
| 191 |
|
| 192 |
try:
|
| 193 |
#if Path(src_path).suffix.lower() not in {".pdf", ".html", ".htm"}:
|
| 194 |
#if not Path(src_path).name.endswith(tuple({".pdf", ".html"})): #,".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls"})):
|
| 195 |
+
#if not Path(src_path).name.endswith((".pdf", ".html", ".docx", ".doc")): #,".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls"})):
|
| 196 |
+
if not Path(src_path).name.endswith(config.file_types_tuple): #,".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls"})):
|
| 197 |
logger.log(level=20, msg=f"skipped {Path(src_path).name}", exc_info=True)
|
| 198 |
return f"skipped {Path(src_path).name}"
|
| 199 |
except Exception as exc:
|
|
|
|
| 204 |
#max_retries = self.MAX_RETRIES
|
| 205 |
for attempt in range(1, max_retries + 1):
|
| 206 |
try:
|
| 207 |
+
#info = self.extract(str(src_path), str(output_dir.stem)) #extractor.converter(str(src_path), str(output_dir)) #
|
| 208 |
+
info = self.extract(str(src_path), str(output_dir)) #extractor.converter(str(src_path), str(output_dir)) #
|
| 209 |
logger.log(level=20, msg=f"β : info about extracted {Path(src_path).name}: ", extra={"info": str(info)})
|
| 210 |
''' ##SMY: moving formating to calling Gradio
|
| 211 |
img_count = info.get("images", 0)
|
|
|
|
| 218 |
except Exception as exc:
|
| 219 |
if attempt == max_retries:
|
| 220 |
tb = traceback.format_exc()
|
| 221 |
+
return f"β {info.get('file', 'UnboundlocalError: info is None')} β {exc}\n{tb}"
|
| 222 |
#return f"β {md_filename} β {exc}\n{tb}"
|
| 223 |
|
| 224 |
#time.sleep(2 ** attempt)
|
file_handler/file_utils.py CHANGED

@@ -6,15 +6,15 @@ import shutil
 import tempfile

 from itertools import chain
-from typing import List, Union, Any, Mapping
+from typing import List, Optional, Union, Any, Mapping
 from PIL import Image

 #import utils.config as config ##SMY: currently unused

-##SMY:
+##SMY: flagged: deprecated vis duplicated. See create_temp_folder() and marker/marker/config/parser.py ~ https://github.com/datalab-to/marker/blob/master/marker/config/parser.py#L169
 #def create_outputdir(root: Union[str, Path], out_dir:Union[str, Path] = None) -> Path: #List[Path]:
 def create_outputdir(root: Union[str, Path], output_dir_string:str = None) -> Path: #List[Path]:
-    """ Create output dir
+    """ Create output dir default to Temp """

     ''' ##preserved for future implementation if needed again
     root = root if isinstance(root, Path) else Path(root)
@@ -24,10 +24,12 @@ def create_outputdir(root: Union[str, Path], output_dir_string:str = None) -> Pa
     out_dir = out_dir if out_dir else "output_md" ## SMY: default to outputdir in config file = "output_md"
     output_dir = root.parent / out_dir #"md_output" ##SMY: concatenating output str with src Path
     '''
+    root = create_temp_folder()

     ## map to img_path. Opt to putting output within same output_md folder rather than individual source folders
     output_dir_string = output_dir_string if output_dir_string else "output_dir" ##redundant SMY: default to outputdir in config file = "output_md"
-    output_dir = Path("data") / output_dir_string #"output_md" ##SMY: concatenating output str with src Path
+    #output_dir = Path("data") / output_dir_string #"output_md" ##SMY: concatenating output str with src Path
+    output_dir = Path(root) / output_dir_string #"output_md" ##SMY: concatenating output str with src Path
     output_dir.mkdir(mode=0o2755, parents=True, exist_ok=True) #,mode=0o2755
     return output_dir

@@ -225,6 +227,17 @@ def check_create_file(filename: Union[str, Path]) -> Path:

     return filename_path

+def create_temp_folder(tempfolder: Optional[str | Path] = ''):
+    """ Create a temp folder Gradio and output_dir if supplied"""
+    # Create a temporary directory in a location where Gradio can access it.
+    #gradio_output_dir = Path(tempfile.gettempdir()) / "gradio_temp_output"/ tempfolder if tempfolder else Path(tempfile.gettempdir()) / "gradio_temp_output"
+    #gradio_output_dir.mkdir(exist_ok=True)
+    #gradio_output_dir = check_create_dir(gradio_output_dir)
+
+    gradio_output_dir = check_create_dir(Path(tempfile.gettempdir()) / "gradio_temp_output" / tempfolder if tempfolder else Path(tempfile.gettempdir()) / "gradio_temp_output")
+
+    return gradio_output_dir
+
 def zip_processed_files(root_dir: str, file_paths: list[str], tz_hours=None, date_format='%d%b%Y_%H-%M-%S') -> Path:
     """
     Creates a zip file from a list of file paths (strings) and returns the Path object.
@@ -247,11 +260,14 @@ def zip_processed_files(root_dir: str, file_paths: list[str], tz_hours=None, dat
     raise ValueError(f"Root directory does not exist: {root_path}")

     # Create a temporary directory in a location where Gradio can access it.
-    gradio_output_dir = Path(tempfile.gettempdir()) / "gradio_temp_output"
+    ##SMY: synced with create_temp_folder()
+    '''gradio_output_dir = Path(tempfile.gettempdir()) / "gradio_temp_output"
     #gradio_output_dir.mkdir(exist_ok=True)
     file_utils.check_create_dir(gradio_output_dir)
     final_zip_path = gradio_output_dir / f"outputs_processed_{utils.get_time_now_str(tz_hours=tz_hours, date_format=date_format)}.zip"
+    '''
+    final_zip_path = Path(root_dir).parent / f"outputs_processed_{utils.get_time_now_str(tz_hours=tz_hours, date_format=date_format)}.zip"
+
     # Use a context manager to create the zip file: use zipfile() opposed to shutil.make_archive
     # 'w' mode creates a new file, overwriting if it already exists.
     zip_unprocessed = 0

@@ -442,7 +458,7 @@ def write_markdown(
     Notes
     -----
     The function is intentionally lightweight: it only handles path resolution,
-    directory creation, and file I/O. All rendering logic
+    directory creation, and file I/O. All rendering logic are performed before
     calling this helper.
     """
     src = Path(src_path)

@@ -460,9 +476,11 @@ def write_markdown(
     ## Opt to putting output within same output_md folder rather than individual source folders
     #md_path = Path("data\\pdf") / "output_md" / f"{src.stem}" / md_name ##debug
-    md_path = Path("data") / output_dir / f"{src.stem}" / md_name ##debug
+    #md_path = Path("data") / output_dir / f"{src.stem}" / md_name ##debug
+    md_path = Path(output_dir) / f"{src.stem}" / md_name ##debug
     ##SMY: [resolved] Permission Errno13 - https://stackoverflow.com/a/57454275
-    md_path.parent.mkdir(mode=0o2755, parents=True, exist_ok=True) ##SMY: create nested md_path if not exists
+    #md_path.parent.mkdir(mode=0o2755, parents=True, exist_ok=True) ##SMY: create nested md_path if not exists
+    md_path.parent.mkdir(parents=True, exist_ok=True) ##SMY: md_path now resides in Temp
     md_path.parent.chmod(0)

     try:

@@ -531,7 +549,8 @@ def dump_images(
     #img_path = Path(src.parent) / f"{Path(output_dir).stem}" / f"{src.stem}" / img_name

     #img_path = Path("data\\pdf") / "output_md" / f"{src.stem}" / img_name ##debug
-    img_path = Path("data") / output_dir / f"{src.stem}" / img_name ##debug
+    #img_path = Path("data") / output_dir / f"{src.stem}" / img_name ##debug
+    img_path = Path(output_dir) / f"{src.stem}" / img_name
     #img_path.mkdir(mode=0o777, parents=True, exist_ok=True) ##SMY: create nested img_path if not exists
     #img_path.parent.mkdir(parents=True, exist_ok=True)
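The new `create_temp_folder()` helper moves all output under the system temp directory so Gradio can serve the files it produces. A standalone sketch of the same idea, with a hypothetical helper name and without this repo's `check_create_dir`:

```python
# Minimal sketch: build a Gradio-servable output folder under the system temp
# directory, optionally nested one level deeper, creating it if missing.
import tempfile
from pathlib import Path

def make_temp_output_dir(subfolder: str = "") -> Path:
    base = Path(tempfile.gettempdir()) / "gradio_temp_output"
    target = base / subfolder if subfolder else base
    target.mkdir(parents=True, exist_ok=True)  # idempotent: safe to call repeatedly
    return target

out = make_temp_output_dir("run1")
print(out.is_dir())  # True
```

Writing under `tempfile.gettempdir()` also sidesteps the `Permission Errno13` issue the diff comments mention, since the temp directory is writable without custom modes.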
llm/llm_login.py CHANGED

@@ -5,6 +5,7 @@ from time import sleep
 from typing import Optional

 from utils.logger import get_logger
+from ui.gradio_ui import gr

 ## Get logger instance
 logger = get_logger(__name__)

@@ -14,6 +15,19 @@ def disable_immplicit_token():
     # Explicitly disable implicit token propagation; we rely on explicit auth or env var
     os.environ["HF_HUB_DISABLE_IMPLICIT_TOKEN"] = "1"

+#def get_login_token( api_token_arg, oauth_token: gr.OAuthToken | None=None,):
+def get_login_token( api_token_arg, oauth_token):
+    """ Use user's supplied token or Get token from logged-in users, else from token stored on the machine. Return token"""
+    #oauth_token = get_token() if oauth_token is not None else api_token_arg
+    if api_token_arg != '': # or not None: #| None:
+        oauth_token = api_token_arg
+    elif oauth_token:
+        oauth_token = oauth_token.token
+    else: oauth_token = '' if not get_token() else get_token()
+
+    #return str(oauth_token) if oauth_token else '' ##token value or empty string
+    return oauth_token if oauth_token else '' ##token value or empty string
+
 def login_huggingface(token: Optional[str] = None):
     """
     Login to Hugging Face account. Prioritize CLI login for privacy and determinism.
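The relocated `get_login_token()` encodes a simple precedence: an explicitly supplied token wins, then the Gradio OAuth token, then whatever token is already stored on the machine. A dependency-free sketch of that order (`resolve_token` and `FakeOAuth` are illustrative stand-ins; the real code uses `gr.OAuthToken` and `huggingface_hub.get_token`):

```python
# Sketch of the token-resolution order implemented by get_login_token above:
# explicit argument > OAuth token from the logged-in session > stored token.
from typing import Optional

def resolve_token(explicit: str, oauth_token: Optional[object],
                  stored_token: Optional[str]) -> str:
    if explicit != '':
        return explicit              # user-supplied token wins
    if oauth_token is not None:
        return oauth_token.token     # token from the Gradio OAuth login
    return stored_token or ''        # fall back to stored token, else ''

class FakeOAuth:                     # stand-in for gr.OAuthToken
    def __init__(self, token): self.token = token

print(resolve_token('abc', FakeOAuth('x'), 'y'))  # abc
print(resolve_token('', FakeOAuth('x'), 'y'))     # x
```

Keeping this pure function out of the UI module (as the commit does) makes the precedence testable without spinning up Gradio.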
requirements.txt CHANGED

@@ -1,5 +1,8 @@
-gradio>=5.44.0
-
-
-
-
+gradio>=5.44.0 # gradio[mcp]>=5.44.0
+#mcp>=1.15.0 # MCP Python SDK (Model Context Protocol)
+marker-pdf[full]>=1.10.0 # pip install marker (GitHub: https://github.com/datalab-to/marker)
+weasyprint>=59.0 # optional fallback if pandoc is not available
+#pandoc==2.3 # for Markdown → PDF conversion
+python-magic==0.4.27 # file-type detection
+#pypdfium2 # Python binding to PDFium for PDF rendering, inspection, manipulation and creation
+#huggingface_hub>=0.34.0 # HuggingFace integration
ui/gradio_ui.py
CHANGED
|
@@ -1,4 +1,5 @@
|
|
| 1 |
# ui/gradio_ui.py
|
|
|
|
| 2 |
import gradio as gr
|
| 3 |
from concurrent.futures import ProcessPoolExecutor, as_completed
|
| 4 |
import asyncio
|
|
@@ -7,23 +8,21 @@ from pathlib import Path, WindowsPath
|
|
| 7 |
from typing import Optional, Union #, Dict, List, Any, Tuple
|
| 8 |
|
| 9 |
from huggingface_hub import get_token
|
| 10 |
-
from numpy import append, iterable
|
| 11 |
|
| 12 |
#import file_handler
|
|
|
|
| 13 |
import file_handler.file_utils
|
| 14 |
-
from utils.config import TITLE, DESCRIPTION, DESCRIPTION_PDF_HTML, DESCRIPTION_PDF, DESCRIPTION_HTML, DESCRIPTION_MD
|
| 15 |
from utils.utils import is_dict, is_list_of_dicts
|
| 16 |
from file_handler.file_utils import zip_processed_files, process_dicts_data, collect_pdf_paths, collect_html_paths, collect_markdown_paths, create_outputdir ## should move to handling file
|
| 17 |
from file_handler.file_utils import find_file
|
| 18 |
from utils.get_config import get_config_value
|
| 19 |
|
| 20 |
-
#from llm.hf_client import HFChatClient ## SMY: unused. See converters.extraction_converter
|
| 21 |
from llm.provider_validator import is_valid_provider, suggest_providers
|
| 22 |
-
from llm.llm_login import is_loggedin_huggingface, login_huggingface
|
| 23 |
from converters.extraction_converter import DocumentConverter as docconverter #DocumentExtractor #as docextractor
|
| 24 |
from converters.pdf_to_md import PdfToMarkdownConverter, init_worker
|
| 25 |
-
#from converters.md_to_pdf import MarkdownToPdfConverter
|
| 26 |
-
#from converters.html_to_md import HtmlToMarkdownConverter ##SMY: PENDING: implementation
|
| 27 |
|
| 28 |
import traceback ## Extract, format and print information about Python stack traces.
|
| 29 |
from utils.logger import get_logger
|
|
@@ -32,7 +31,6 @@ logger = get_logger(__name__) ##NB: setup_logging() ## set logging
|
|
| 32 |
|
| 33 |
# Instantiate converters class once β they are stateless
|
| 34 |
pdf2md_converter = PdfToMarkdownConverter()
|
| 35 |
-
#html2md_converter = HtmlToMarkdownConverter()
|
| 36 |
#md2pdf_converter = MarkdownToPdfConverter()
|
| 37 |
|
| 38 |
|
|
@@ -42,25 +40,18 @@ from converters.extraction_converter import load_models
|
|
| 42 |
from globals import config_load_models
|
| 43 |
try:
|
| 44 |
if not config_load_models.model_dict:
|
| 45 |
-
|
|
|
|
| 46 |
'''if 'model_dict' not in globals():
|
| 47 |
global model_dict
|
| 48 |
model_dict = load_models()'''
|
|
|
|
| 49 |
except Exception as exc:
|
| 50 |
#tb = traceback.format_exc() #exc.__traceback__
|
| 51 |
logger.exception(f"β Error loading models (reload): {exc}") #\n{tb}")
|
| 52 |
raise RuntimeError(f"β Error loading models (reload): {exc}") #\n{tb}")
|
| 53 |
|
| 54 |
-
def get_login_token( api_token_arg, oauth_token: gr.OAuthToken | None=None,):
|
| 55 |
-
""" Use user's supplied token or Get token from logged-in users, else from token stored on the machine. Return token"""
|
| 56 |
-
#oauth_token = get_token() if oauth_token is not None else api_token_arg
|
| 57 |
-
if api_token_arg != '': # or not None: #| None:
|
| 58 |
-
oauth_token = api_token_arg
|
| 59 |
-
elif oauth_token:
|
| 60 |
-
oauth_token = oauth_token
|
| 61 |
-
else: get_token()
|
| 62 |
-
|
| 63 |
-
return oauth_token.token if oauth_token else '' ##token value or empty string
|
| 64 |
|
| 65 |
# pool executor to convert files called by Gradio
|
| 66 |
##SMY: TODO: future: refactor to gradio_process.py and
|
|
@@ -90,6 +81,7 @@ def convert_batch(
|
|
| 90 |
#output_dir: Optional[Union[str, Path]] = "output_dir",
|
| 91 |
output_dir_string: str = "output_dir_default",
|
| 92 |
use_llm: bool = False, #Optional[bool] = False, #True,
|
|
|
|
| 93 |
page_range: str = None, #Optional[str] = None,
|
| 94 |
tz_hours: str = None,
|
| 95 |
oauth_token: gr.OAuthToken | None=None,
|
|
@@ -103,15 +95,16 @@ def convert_batch(
|
|
| 103 |
"""
|
| 104 |
|
| 105 |
# login: Update the Gradio UI to improve user-friendly eXperience - commencing
|
| 106 |
-
#
|
| 107 |
-
|
|
|
|
| 108 |
|
| 109 |
# get token from logged-in user:
|
| 110 |
api_token = get_login_token(api_token_arg=api_token_gr, oauth_token=oauth_token)
|
| 111 |
##SMY: Strictly debug. Must not be live
|
| 112 |
-
#logger.log(level=30, msg="Commencing: get_login_token", extra={"api_token
|
| 113 |
|
| 114 |
-
try:
|
| 115 |
##SMY: might deprecate. To replace with oauth login from Gradio ui or integrate cleanly.
|
| 116 |
#login_huggingface(api_token) ## attempt login if not already logged in. NB: HF CLI login prompt would not display in Process Worker.
|
| 117 |
|
|
@@ -131,9 +124,8 @@ def convert_batch(
|
|
| 131 |
tb = traceback.format_exc()
|
| 132 |
logger.exception(f"β Error during login_huggingface β {exc}\n{tb}", exc_info=True) # Log the full traceback
|
| 133 |
return [gr.update(interactive=True), f"β An error occurred during login_huggingface β {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"] # return the exception message
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
## debug
|
| 138 |
#logger.log(level=30, msg="pdf_files_inputs", extra={"input_arg[0]:": pdf_files[0]})
|
| 139 |
|
|
@@ -143,22 +135,23 @@ def convert_batch(
|
|
| 143 |
#outputs=[log_output, files_individual_JSON, files_individual_downloads],
|
| 144 |
return [gr.update(interactive=True), "Initialising ProcessPool: No files uploaded.", {"Upload":"No files uploaded"}, f"dummy_log.log"]
|
| 145 |
|
| 146 |
-
|
| 147 |
# Get config values if not provided
|
| 148 |
-
config_file = find_file("config.ini") ##from file_handler.file_utils
|
| 149 |
-
model_id = get_config_value(config_file, "MARKER_CAP", "MODEL_ID")
|
| 150 |
-
openai_base_url = get_config_value(config_file, "MARKER_CAP", "OPENAI_BASE_URL")
|
| 151 |
-
openai_image_format = get_config_value(config_file, "MARKER_CAP", "OPENAI_IMAGE_FORMAT")
|
| 152 |
-
max_workers = get_config_value(config_file, "MARKER_CAP", "MAX_WORKERS")
|
| 153 |
-
max_retries = get_config_value(config_file, "MARKER_CAP", "MAX_RETRIES")
|
| 154 |
-
output_format = get_config_value(config_file, "MARKER_CAP", "OUTPUT_FORMAT")
|
| 155 |
-
output_dir_string = str(get_config_value(config_file, "MARKER_CAP", "OUTPUT_DIR")
|
| 156 |
-
use_llm = get_config_value(config_file, "MARKER_CAP", "USE_LLM")
|
| 157 |
-
page_range = get_config_value(config_file,"MARKER_CAP", "PAGE_RANGE")
|
| 158 |
-
|
|
|
|
| 159 |
|
| 160 |
# Create the initargs tuple from the Gradio inputs: # 'files' is an iterable, and handled separately.
|
| 161 |
-
|
| 162 |
yield gr.update(interactive=False), f"Initialising init_args", {"process": "Processing files ..."}, f"dummy_log.log"
|
| 163 |
init_args = (
|
| 164 |
provider,
|
|
@@ -180,83 +173,91 @@ def convert_batch(
|
|
| 180 |
output_format,
|
| 181 |
output_dir_string,
|
| 182 |
use_llm,
|
|
|
|
| 183 |
page_range,
|
| 184 |
)
|
| 185 |
|
| 186 |
-
#
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 187 |
try:
|
| 188 |
results = [] ## initialised pool result holder
|
| 189 |
-
# Create a pool with init_worker initialiser
|
| 190 |
logger.log(level=30, msg="Initialising ProcessPoolExecutor: pool:", extra={"pdf_files": pdf_files, "files_len": len(pdf_files), "model_id": model_id, "output_dir": output_dir_string}) #pdf_files_count
|
| 191 |
-
#progress((5,16), desc=f"Initialising ProcessPoolExecutor: Processing Files ...")
|
| 192 |
yield gr.update(interactive=False), f"Initialising ProcessPoolExecutor: Processing Files ...", {"process": "Processing files ..."}, f"dummy_log.log"
|
|
|
|
| 193 |
|
|
|
|
| 194 |
with ProcessPoolExecutor(
|
| 195 |
max_workers=max_workers,
|
| 196 |
initializer=init_worker,
|
| 197 |
initargs=init_args
|
| 198 |
) as pool:
|
| 199 |
-
|
| 200 |
-
|
| 201 |
-
|
| 202 |
-
# Update the Gradio UI to improve user-friendly eXperience
|
| 203 |
-
#outputs=[process_button, log_output, files_individual_JSON, files_individual_downloads],
|
| 204 |
-
|
| 205 |
-
|
| 206 |
# Map the files (pdf_files) to the conversion function (pdf2md_converter.convert_file)
|
| 207 |
# The 'docconverter' argument is implicitly handled by the initialiser
|
| 208 |
#futures = [pool.map(pdf2md_converter.convert_files, f) for f in pdf_files]
|
| 209 |
#logs = [f.result() for f in as_completed(futures)]
|
| 210 |
#futures = [pool.submit(pdf2md_converter.convert_files, file) for file in pdf_files]
|
| 211 |
#logs = [f.result() for f in futures]
|
| 212 |
-
|
| 213 |
try:
|
| 214 |
-
#(7,16), desc=f"ProcessPoolExecutor: Creating output_dir")
|
| 215 |
-
yield gr.update(interactive=False), f"Creating output_dir ...", {"process": "Processing files ..."}, f"dummy_log.log"
|
| 216 |
-
pdf2md_converter.output_dir_string = output_dir_string ##SMY: attempt setting directly to resolve pool.map iterable
|
| 217 |
-
#progress((8,16), desc=f"ProcessPoolExecutor: Created output_dir.")
|
| 218 |
-
yield gr.update(interactive=False), f"Created output_dir ...", {"process": "Processing files ..."}, f"dummy_log.log"
|
| 219 |
-
|
| 220 |
-
except Exception as exc:
|
| 221 |
-
# Raise the exception to stop the Gradio app: exception to halt execution
|
| 222 |
-
logger.exception("Error during creating output_dir", exc_info=True) # Log the full traceback
|
| 223 |
-
traceback.print_exc() # Print the exception traceback
|
| 224 |
-
#return f"An error occurred during pool.map: {str(exc)}", f"Error: {exc}", f"Error: {exc}" ## return the exception message
|
| 225 |
-
# Update the Gradio UI to improve user-friendly eXperience
|
| 226 |
-
yield gr.update(interactive=True), f"An error occurred creating output_dir: {str(exc)}", {"Error":f"Error: {exc}"}, f"dummy_log.log" ## return the exception message
|
| 227 |
-
|
| 228 |
-
try:
|
| 229 |
-
#progress((9,16), desc=f"ProcessPoolExecutor: Pooling file conversion ...")
|
| 230 |
yield gr.update(interactive=True), f"ProcessPoolExecutor: Pooling file conversion ...", {"process": "Processing files ..."}, f"dummy_log.log"
|
|
|
|
|
|
|
| 231 |
# Use progress.tqdm to integrate with the executor map
|
| 232 |
#results = pool.map(pdf2md_converter.convert_files, pdf_files) ##SMY iterables #max_retries #output_dir_string)
|
| 233 |
for result_interim in progress.tqdm(
|
| 234 |
-
iterable=pool.map(pdf2md_converter.convert_files, pdf_files), total=len(pdf_files)
|
| 235 |
):
|
| 236 |
results.append(result_interim)
|
| 237 |
-
#progress((10,16), desc=f"ProcessPoolExecutor: Pooling file conversion result: [{str(result_interim)}[:20]]")
|
| 238 |
# Update the Gradio UI to improve user-friendly eXperience
|
| 239 |
yield gr.update(interactive=True), f"ProcessPoolExecutor: Pooling file conversion result: [{str(result_interim)}[:20]]", {"process": "Processing files ..."}, f"dummy_log.log"
|
|
|
|
| 240 |
|
| 241 |
-
|
| 242 |
-
|
| 243 |
except Exception as exc:
|
| 244 |
# Raise the exception to stop the Gradio app: exception to halt execution
|
| 245 |
logger.exception("Error during pooling file conversion", exc_info=True) # Log the full traceback
|
| 246 |
-
traceback.print_exc() # Print the exception traceback
|
| 247 |
-
return [gr.update(interactive=True), f"An error occurred during pool.map: {str(exc)}", {"Error":f"Error: {exc}"}, f"dummy_log.log"] ## return the exception message
|
| 248 |
# Update the Gradio UI to improve user-friendly eXperience
|
| 249 |
-
|
|
|
|
| 250 |
|
| 251 |
-
#
|
| 252 |
try:
|
|
|
|
| 253 |
logger.log(level=20, msg="ProcessPoolExecutor pool result:", extra={"results": str(results)})
|
|
|
|
| 254 |
logs = []
|
| 255 |
logs_files_images = []
|
|
|
|
| 256 |
#logs.extend(results) ## performant pythonic
|
| 257 |
#logs = list[results] ##
|
| 258 |
logs = [result for result in results] ## pythonic list comprehension
|
| 259 |
-
## logs : [file , images , filepath, image_path]
|
| 260 |
|
| 261 |
#logs_files_images = logs_files.extend(logs_images) #zip(logs_files, logs_images) ##SMY: in progress
|
| 262 |
logs_count = 0
|
|
@@ -268,64 +269,48 @@ def convert_batch(
|
|
| 268 |
# Update the Gradio UI to improve user-friendly eXperience
|
| 269 |
#yield gr.update(interactive=False), f"Processing files: {logs_files_images[logs_count]}", {"process": "Processing files"}, f"dummy_log.log"
|
| 270 |
logs_count = i+i_image
|
| 271 |
-
|
| 272 |
-
#progress((12,16), desc="Processing results from files conversion") ##rekickin
|
| 273 |
-
#logs_files_images.append(logs_filepath) ## to del
|
| 274 |
-
#logs_files_images.extend(logs_images) ## to del
|
| 275 |
except Exception as exc:
|
| 276 |
-
|
| 277 |
-
|
| 278 |
return [gr.update(interactive=True), f"An error occurred during processing results logs: {str(exc)}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"] ## return the exception message
|
| 279 |
#yield gr.update(interactive=True), f"An error occurred during processing results logs: {str(exc)}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log" ## return the exception message
|
| 280 |
-
|
| 281 |
-
#'''
|
| 282 |
except Exception as exc:
|
| 283 |
tb = traceback.format_exc()
|
| 284 |
logger.exception(f"β Error during ProcessPoolExecutor β {exc}\n{tb}" , exc_info=True) # Log the full traceback
|
| 285 |
#traceback.print_exc() # Print the exception traceback
|
| 286 |
-
yield gr.update(interactive=True), f"β An error occurred during ProcessPoolExecutorβ {exc}
|
| 287 |
|
| 288 |
-
|
| 289 |
-
logger.log(level=20, msg="ProcessPoolExecutor pool result:", extra={"results": str(results)})
|
| 290 |
-
logs = []
|
| 291 |
-
#logs.extend(results) ## performant pythonic
|
| 292 |
-
#logs = list[results] ##
|
| 293 |
-
logs = [result for result in results] ## pythonic list comprehension
|
| 294 |
-
'''
|
| 295 |
-
|
| 296 |
-
# Zip Processed md Files and images. Insert to first index
|
| 297 |
try: ##from file_handler.file_utils
|
| 298 |
-
|
| 299 |
-
zipped_processed_files = zip_processed_files(root_dir=f"
|
| 300 |
logs_files_images.insert(0, zipped_processed_files)
|
| 301 |
-
            #logs_files_images.insert(1, "====================")

-
            #yield gr.update(interactive=False), f"Processing zip and files: {logs_files_images}", {"process": "Processing files"}, f"dummy_log.log"

        except Exception as exc:
            tb = traceback.format_exc()
            logger.exception(f"❌ Error during zipping processed files — {exc}\n{tb}", exc_info=True)  # Log the full traceback
            #traceback.print_exc()  # Print the exception traceback
-           #return gr.update(interactive=True), f"❌ An error occurred during zipping files — {exc}\n{tb}", f"Error: {exc}", f"Error: {exc}"  # return the exception message
            yield gr.update(interactive=True), f"❌ An error occurred during zipping files — {exc}\n{tb}", {"Error": f"Error: {exc}"}, f"dummy_log.log"  # return the exception message

        # Return processed files log
        try:
-
            ## Convert logs list of dicts to formatted json string
            logs_return_formatted_json_string = file_handler.file_utils.process_dicts_data(logs)  ##SMY outputs to gr.JSON component with no need for json.dumps(data, indent=)
-           #logs_files_images_return = "\n".join(path for path in logs_files_images)  ##TypeError: sequence item 0: expected str instance, WindowsPath found
-
-           ##convert the List of Path objects to List of string for gr.Files output
-           #logs_files_images_return = list(str(path) for path in logs_files_images)

            ## Convert any Path objects to strings, but leave strings as-is
            logs_files_images_return = list(str(path) if isinstance(path, Path) else path for path in logs_files_images)
            logger.log(level=20, msg="File conversion complete. Sending outcome to Gradio:", extra={"logs_files_image_return": str(logs_files_images_return)})  ## debug: FileNotFoundError: [WinError 2] The system cannot find the file specified: 'Error or no image_path'

-
            #outputs=[process_button, log_output, files_individual_JSON, files_individual_downloads],
            #return "\n".join(logs), "\n".join(logs_files_images)  #"\n".join(logs_files)
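The hunk above swaps a single `return` for `yield` in the error path, so the handler keeps streaming status tuples to Gradio instead of ending the event. A minimal sketch of that generator pattern without Gradio, with dicts standing in for `gr.update(...)` and illustrative names throughout:

```python
# Sketch of the yield-based status pattern: a generator event handler emits
# (button_state, status, payload, log_path) tuples as work progresses,
# instead of one return at the end. Names are illustrative, not the app's API.

def convert_with_status(files):
    """Yield UI updates while processing; the caller streams each tuple."""
    yield ({"interactive": False}, "Commencing processing ...", {"process": "start"}, "dummy_log.log")
    try:
        results = [f.upper() for f in files]  # stand-in for the real conversion
        yield ({"interactive": True}, f"Done: {len(results)} file(s)", {"results": results}, "dummy_log.log")
    except Exception as exc:
        yield ({"interactive": True}, f"Error: {exc}", {"Error": str(exc)}, "dummy_log.log")

updates = list(convert_with_status(["a.pdf", "b.pdf"]))
```

Gradio treats a generator event handler as a stream: each yielded tuple updates the bound outputs, which is why the commit prefers `yield` over `return` inside the progress-reporting path.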
@@ -338,8 +323,8 @@ def convert_batch(
            tb = traceback.format_exc()
            logger.exception(f"❌ Error during returning result logs — {exc}\n{tb}", exc_info=True)  # Log the full traceback
            #traceback.print_exc()  # Print the exception traceback
-
-

        #return "\n".join(log for log in logs), "\n".join(str(path) for path in logs_files_images)
        #print(f'logs_files_images: {"\n".join(str(path) for path in logs_files_images)}')
@@ -517,7 +502,7 @@ def build_interface() -> gr.Blocks:
        #message = f"Accumulated {len(updated_files)} file(s) total.\n\nAll file paths:\n{file_info}"
        message = f"Accumulated {len(updated_files)} file(s) total: \n{filename_info}"

-       return updated_files, message

    # with gr.Blocks(title=TITLE) as demo
    with gr.Blocks(title=TITLE, css=custom_css) as demo:
@@ -592,7 +577,7 @@ def build_interface() -> gr.Blocks:
            )

            # Clean UI: Model parameters hidden in collapsible accordion
-           with gr.Accordion("⚙️ Marker Settings", open=False):
                gr.Markdown(f"#### **Marker Configuration**")
                with gr.Row():
                    openai_base_url_tb = gr.Textbox(
@@ -607,7 +592,7 @@ def build_interface() -> gr.Blocks:
                        value="webp",
                    )
                    output_format_dd = gr.Dropdown(
-                       choices=["markdown", "html"],  #, "json", "chunks"], ##SMY: To be enabled later
                        #choices=["markdown", "html", "json", "chunks"],
                        label="Output Format",
                        value="markdown",
@@ -633,10 +618,15 @@ def build_interface() -> gr.Blocks:
                        value=2,
                        step=1  #0.01
                    )
-
-
-
-
                    page_range_tb = gr.Textbox(
                        label="Page Range (Optional)",
                        placeholder="Example: 0,1-5,8,12-15",
@@ -677,13 +667,14 @@ def build_interface() -> gr.Blocks:
            btn_pdf_convert = gr.Button("Convert PDF(s)")
        '''

        with gr.Column(elem_classes=["file-or-directory-area"]):
            with gr.Row():
                file_btn = gr.UploadButton(
                #file_btn = gr.File(
                    label="Upload Multiple Files",
                    file_count="multiple",
-                   file_types=["file"],
                    #height=25, #"sm",
                    size="sm",
                    elem_classes=["gradio-upload-btn"]
@@ -692,7 +683,7 @@ def build_interface() -> gr.Blocks:
                #dir_btn = gr.File(
                    label="Upload a Directory",
                    file_count="directory",
-
                    #height=25, #"0.5",
                    size="sm",
                    elem_classes=["gradio-upload-btn"]
@@ -702,8 +693,8 @@ def build_interface() -> gr.Blocks:
            output_textbox = gr.Textbox(label="Accumulated Files", lines=3)  #, max_lines=4) #10

            with gr.Row():
-               process_button = gr.Button("Process All Uploaded Files", variant="primary")
-               clear_button = gr.Button("Clear All Uploads", variant="secondary")


        # --- PDF → Markdown tab ---
@@ -890,8 +881,10 @@ def build_interface() -> gr.Blocks:
            """
            #msg = f"Files list cleared: {do_logout()}" ## use as needed
            msg = f"Files list cleared."
-           yield [], msg, '', ''
            #return [], f"Files list cleared.", [], []

        #hf_login_logout_btn.click(fn=custom_do_logout, inputs=None, outputs=hf_login_logout_btn)
        ##unused
@@ -905,14 +898,14 @@ def build_interface() -> gr.Blocks:
        file_btn.upload(
            fn=accumulate_files,
            inputs=[file_btn, uploaded_file_list],
-           outputs=[uploaded_file_list, output_textbox]
        )

        # Event handler for the directory upload button
        dir_btn.upload(
            fn=accumulate_files,
            inputs=[dir_btn, uploaded_file_list],
-           outputs=[uploaded_file_list, output_textbox]
        )

        # Event handler for the "Clear" button
@@ -957,6 +950,7 @@ def build_interface() -> gr.Blocks:
            output_format_dd,
            output_dir_tb,
            use_llm_cb,
            page_range_tb,
            tz_hours_num,  #state_tz_hours
        ]
  1      # ui/gradio_ui.py
  2  +   from ast import Interactive
  3      import gradio as gr
  4      from concurrent.futures import ProcessPoolExecutor, as_completed
  5      import asyncio

  8      from typing import Optional, Union  #, Dict, List, Any, Tuple
  9
 10      from huggingface_hub import get_token
 11
 12      #import file_handler
 13  +   from file_handler import file_utils
 14      import file_handler.file_utils
 15  +   from utils.config import TITLE, DESCRIPTION, DESCRIPTION_PDF_HTML, DESCRIPTION_PDF, DESCRIPTION_HTML, DESCRIPTION_MD, file_types_list, file_types_tuple
 16      from utils.utils import is_dict, is_list_of_dicts
 17      from file_handler.file_utils import zip_processed_files, process_dicts_data, collect_pdf_paths, collect_html_paths, collect_markdown_paths, create_outputdir  ## should move to handling file
 18      from file_handler.file_utils import find_file
 19      from utils.get_config import get_config_value
 20
 21      from llm.provider_validator import is_valid_provider, suggest_providers
 22  +   from llm.llm_login import get_login_token, is_loggedin_huggingface, login_huggingface
 23      from converters.extraction_converter import DocumentConverter as docconverter  #DocumentExtractor #as docextractor
 24      from converters.pdf_to_md import PdfToMarkdownConverter, init_worker
 25  +   #from converters.md_to_pdf import MarkdownToPdfConverter  ##SMY: PENDING: implementation
 26
 27      import traceback  ## Extract, format and print information about Python stack traces.
 28      from utils.logger import get_logger

 31
 32      # Instantiate converters class once — they are stateless
 33      pdf2md_converter = PdfToMarkdownConverter()
 34      #md2pdf_converter = MarkdownToPdfConverter()
 35
 36

 40      from globals import config_load_models
 41      try:
 42          if not config_load_models.model_dict:
 43  +           model_dict = load_models()
 44  +           config_load_models.model_dict = model_dict
 45          '''if 'model_dict' not in globals():
 46              global model_dict
 47              model_dict = load_models()'''
 48  +       logger.log(level=30, msg="Config_load_model: ", extra={"model_dict": str(model_dict)})
 49      except Exception as exc:
 50          #tb = traceback.format_exc()  #exc.__traceback__
 51          logger.exception(f"❌ Error loading models (reload): {exc}")  #\n{tb}")
 52          raise RuntimeError(f"❌ Error loading models (reload): {exc}")  #\n{tb}")
 53
 54  +   #def get_login_token( api_token_arg, oauth_token: gr.OAuthToken | None=None,):  ##moved to llm_login

 55
 56      # pool executor to convert files called by Gradio
 57      ##SMY: TODO: future: refactor to gradio_process.py and
 81      #output_dir: Optional[Union[str, Path]] = "output_dir",
 82      output_dir_string: str = "output_dir_default",
 83      use_llm: bool = False,  #Optional[bool] = False, #True,
 84  +   force_ocr: bool = True,  #Optional[bool] = False,
 85      page_range: str = None,  #Optional[str] = None,
 86      tz_hours: str = None,
 87      oauth_token: gr.OAuthToken | None=None,
 95      """
 96
 97      # login: Update the Gradio UI to improve user-friendly eXperience - commencing
 98  +   # [template]: #outputs=[process_button, log_output, files_individual_JSON, files_individual_downloads],
 99  +   yield gr.update(interactive=False), f"Commencing Processing ... Getting login", {"process": "Commencing Processing"}, f"dummy_log.log"
100  +   progress((0,16), f"Commencing Processing ...")
101
102      # get token from logged-in user:
103      api_token = get_login_token(api_token_arg=api_token_gr, oauth_token=oauth_token)
104      ##SMY: Strictly debug. Must not be live
105  +   #logger.log(level=30, msg="Commencing: get_login_token", extra={"api_token": api_token, "api_token_gr": api_token_gr})
106
107  +   '''try:
108      ##SMY: might deprecate. To replace with oauth login from Gradio ui or integrate cleanly.
109      #login_huggingface(api_token)  ## attempt login if not already logged in. NB: HF CLI login prompt would not display in Process Worker.
110
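`get_login_token` (moved into `llm/llm_login.py` by this commit) resolves a token at line 103 from either the explicit textbox argument or the Gradio OAuth session. A hedged sketch of that resolution order, with a stand-in dataclass for `gr.OAuthToken` and an extra cached-login fallback as an assumption, not the app's exact logic:

```python
# Hedged sketch of a token-resolution chain like get_login_token: prefer an
# explicit argument, then the OAuth session token, then a cached login.
# OAuthToken here is a stand-in for gr.OAuthToken, not Gradio's class.

from dataclasses import dataclass
from typing import Optional

@dataclass
class OAuthToken:
    token: str

def resolve_token(api_token_arg: Optional[str],
                  oauth_token: Optional[OAuthToken],
                  cached_token: Optional[str] = None) -> Optional[str]:
    if api_token_arg:               # explicit textbox/CLI value wins
        return api_token_arg
    if oauth_token is not None:     # then the OAuth session
        return oauth_token.token
    return cached_token             # finally whatever a prior login stored

token = resolve_token(None, OAuthToken("hf_session"))
```

In the real app the final fallback can come from `huggingface_hub.get_token()`, which reads the locally cached `huggingface-cli login` credential.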
124      tb = traceback.format_exc()
125      logger.exception(f"❌ Error during login_huggingface — {exc}\n{tb}", exc_info=True)  # Log the full traceback
126      return [gr.update(interactive=True), f"❌ An error occurred during login_huggingface — {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"]  # return the exception message
127  +   '''
128  +   progress((1,16), desc=f"Log in: {is_loggedin_huggingface(api_token)}")
129      ## debug
130      #logger.log(level=30, msg="pdf_files_inputs", extra={"input_arg[0]:": pdf_files[0]})
131
135      #outputs=[log_output, files_individual_JSON, files_individual_downloads],
136      return [gr.update(interactive=True), "Initialising ProcessPool: No files uploaded.", {"Upload":"No files uploaded"}, f"dummy_log.log"]
137
138  +   progress((2,16), desc=f"Getting configuration values")
139      # Get config values if not provided
140  +   config_file = find_file("config.ini")  ##from file_handler.file_utils ##takes a bit of time to process. #NeedOptimise
141  +   model_id = model_id if model_id else get_config_value(config_file, "MARKER_CAP", "MODEL_ID")
142  +   openai_base_url = openai_base_url if openai_base_url else get_config_value(config_file, "MARKER_CAP", "OPENAI_BASE_URL")
143  +   openai_image_format = openai_image_format if openai_image_format else get_config_value(config_file, "MARKER_CAP", "OPENAI_IMAGE_FORMAT")
144  +   max_workers = max_workers if max_workers else get_config_value(config_file, "MARKER_CAP", "MAX_WORKERS")
145  +   max_retries = max_retries if max_retries else get_config_value(config_file, "MARKER_CAP", "MAX_RETRIES")
146  +   output_format = output_format if output_format else get_config_value(config_file, "MARKER_CAP", "OUTPUT_FORMAT")
147  +   output_dir_string = output_dir_string if output_dir_string else str(get_config_value(config_file, "MARKER_CAP", "OUTPUT_DIR"))
148  +   use_llm = use_llm if use_llm else get_config_value(config_file, "MARKER_CAP", "USE_LLM")
149  +   page_range = page_range if page_range else get_config_value(config_file, "MARKER_CAP", "PAGE_RANGE")
150  +
151  +   progress((3,16), desc=f"Retrieved configuration values")
152
153      # Create the initargs tuple from the Gradio inputs:  # 'files' is an iterable, and handled separately.
154  +   progress((4,16), desc=f"Initialising init_args")
155      yield gr.update(interactive=False), f"Initialising init_args", {"process": "Processing files ..."}, f"dummy_log.log"
156      init_args = (
157          provider,
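Lines 141-149 above apply the same "UI value, else config.ini" fallback to each setting. A minimal self-contained sketch of that pattern with stdlib `configparser`; the `MARKER_CAP` section and key names mirror the diff, but this `get_config_value` is an assumed simplification of the app's helper in `utils/get_config.py`:

```python
# "UI value, else config.ini" fallback, as in lines 141-149. Note that any
# falsy UI value ('', None, False, 0) defers to the config file, which is
# worth keeping in mind for boolean settings like use_llm.

import configparser, os, tempfile

def get_config_value(config_file, section, key):
    parser = configparser.ConfigParser()
    parser.read(config_file)
    return parser.get(section, key, fallback=None)

def with_fallback(ui_value, config_file, section, key):
    return ui_value if ui_value else get_config_value(config_file, section, key)

# Build a throwaway config.ini to exercise the fallback
cfg = os.path.join(tempfile.mkdtemp(), "config.ini")
with open(cfg, "w") as fh:
    fh.write("[MARKER_CAP]\nMODEL_ID = some-model\nOUTPUT_FORMAT = markdown\n")

model_id = with_fallback("", cfg, "MARKER_CAP", "MODEL_ID")
```

The falsy check is a deliberate trade-off: it keeps each line short, at the cost of never being able to override a config value with an explicit empty string or `False` from the UI.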
173      output_format,
174      output_dir_string,
175      use_llm,
176  +   force_ocr,
177      page_range,
178      )
179
180  +   # create output_dir
181  +   try:
182  +       yield gr.update(interactive=False), f"Creating output_dir ...", {"process": "Processing files ..."}, f"dummy_log.log"
183  +       progress((5,16), desc=f"ProcessPoolExecutor: Creating output_dir")
184  +
185  +       #pdf2md_converter.output_dir_string = output_dir_string  ##SMY: attempt setting directly to resolve pool.map iterable
186  +
187  +       # Create Marker output_dir in temporary directory where Gradio can access it.
188  +       output_dir = file_utils.create_temp_folder(output_dir_string)
189  +       pdf2md_converter.output_dir = output_dir
190  +
191  +       logger.info(f"✅ output_dir created: ", extra={"output_dir": pdf2md_converter.output_dir.name, "in": str(pdf2md_converter.output_dir.parent)})
192  +       yield gr.update(interactive=False), f"Created output_dir ...", {"process": "Processing files ..."}, f"dummy_log.log"
193  +       progress((6,16), desc=f"✅ Created output_dir.")
194  +   except Exception as exc:
195  +       tb = traceback.format_exc()
196  +       tbp = traceback.print_exc()  # Print the exception traceback
197  +       logger.exception("❌ error creating output_dir — {exc}\n{tb}", exc_info=True)  # Log the full traceback
198  +
199  +       # Update the Gradio UI to improve user-friendly eXperience
200  +       yield gr.update(interactive=True), f"❌ An error occurred creating output_dir: {str(exc)}", {"Error":f"Error: {exc}"}, f"dummy_log.log"  ## return the exception message
201  +       return f"An error occurred creating output_dir: {str(exc)}", f"Error: {exc}", f"Error: {exc}"  ## return the exception message
202  +
203  +   # Process file conversion leveraging ProcessPoolExecutor for efficiency
204      try:
205          results = []  ## initialised pool result holder
206          logger.log(level=30, msg="Initialising ProcessPoolExecutor: pool:", extra={"pdf_files": pdf_files, "files_len": len(pdf_files), "model_id": model_id, "output_dir": output_dir_string})  #pdf_files_count
207          yield gr.update(interactive=False), f"Initialising ProcessPoolExecutor: Processing Files ...", {"process": "Processing files ..."}, f"dummy_log.log"
208  +       progress((7,16), desc=f"Initialising ProcessPoolExecutor: Processing Files ...")
209
210  +       # Create a pool with init_worker initialiser
211          with ProcessPoolExecutor(
212              max_workers=max_workers,
213              initializer=init_worker,
214              initargs=init_args
215          ) as pool:
216  +           logger.log(level=30, msg="Initialising ProcessPoolExecutor: pool:", extra={"pdf_files": pdf_files, "files_len": len(pdf_files), "model_id": model_id, "output_dir": output_dir_string})  #pdf_files_count
217  +           progress((8,16), desc=f"Starting ProcessPool queue: Processing Files ...")
218  +
219          # Map the files (pdf_files) to the conversion function (pdf2md_converter.convert_file)
220          # The 'docconverter' argument is implicitly handled by the initialiser
221          #futures = [pool.map(pdf2md_converter.convert_files, f) for f in pdf_files]
222          #logs = [f.result() for f in as_completed(futures)]
223          #futures = [pool.submit(pdf2md_converter.convert_files, file) for file in pdf_files]
224          #logs = [f.result() for f in futures]
225          try:
226              yield gr.update(interactive=True), f"ProcessPoolExecutor: Pooling file conversion ...", {"process": "Processing files ..."}, f"dummy_log.log"
227  +           progress((9,16), desc=f"ProcessPoolExecutor: Pooling file conversion ...")
228  +
229              # Use progress.tqdm to integrate with the executor map
230              #results = pool.map(pdf2md_converter.convert_files, pdf_files)  ##SMY iterables #max_retries #output_dir_string)
231              for result_interim in progress.tqdm(
232  +               iterable=pool.map(pdf2md_converter.convert_files, pdf_files)  #, max_retries), total=len(pdf_files)
233              ):
234                  results.append(result_interim)
235                  # Update the Gradio UI to improve user-friendly eXperience
236                  yield gr.update(interactive=True), f"ProcessPoolExecutor: Pooling file conversion result: [{str(result_interim)}[:20]]", {"process": "Processing files ..."}, f"dummy_log.log"
237  +               progress((10,16), desc=f"ProcessPoolExecutor: Pooling file conversion result: [{str(result_interim)}[:20]]")
238
239  +           yield gr.update(interactive=True), f"ProcessPoolExecutor: Got Results from files conversion: [{str(result_interim)}[:20]]", {"process": "Processing files ..."}, f"dummy_log.log"
240  +           progress((11,16), desc=f"ProcessPoolExecutor: Got Results from files conversion")
241          except Exception as exc:
242              # Raise the exception to stop the Gradio app: exception to halt execution
243              logger.exception("Error during pooling file conversion", exc_info=True)  # Log the full traceback
244  +           tbp = traceback.print_exc()  # Print the exception traceback
245              # Update the Gradio UI to improve user-friendly eXperience
246  +           yield gr.update(interactive=True), f"An error occurred during pool.map: {str(exc)}", {"Error":f"Error: {exc}\n{tbp}"}, f"dummy_log.log"  ## return the exception message
247  +           return [gr.update(interactive=True), f"An error occurred during pool.map: {str(exc)}", {"Error":f"Error: {exc}\n{tbp}"}, f"dummy_log.log"]  ## return the exception message
248
249  +   # Process file conversion results
250      try:
251  +       progress((12,16), desc="Processing results from files conversion")  ##rekickin
252          logger.log(level=20, msg="ProcessPoolExecutor pool result:", extra={"results": str(results)})
253  +
254          logs = []
255          logs_files_images = []
256  +
257          #logs.extend(results)  ## performant pythonic
258          #logs = list[results]  ##
259          logs = [result for result in results]  ## pythonic list comprehension
260  +       # [template]  ## logs : [file , images , filepath, image_path]
261
262          #logs_files_images = logs_files.extend(logs_images)  #zip(logs_files, logs_images)  ##SMY: in progress
263          logs_count = 0
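The loop at lines 231-234 consumes `pool.map` lazily, so each finished file can be appended and reported before the whole batch completes. The same idea with a plain generator standing in for `progress.tqdm` and the process pool:

```python
# Stream results as they arrive instead of waiting for the full batch:
# a generator wraps any iterable and invokes a progress callback per item.
# `fake_convert` stands in for the app's convert_files.

def with_progress(iterable, total, report):
    """Yield items from `iterable`, calling report(done, total) before each yield."""
    for done, item in enumerate(iterable, start=1):
        report(done, total)
        yield item

def fake_convert(name):
    return f"{name}.md"

files = ["a", "b", "c"]
seen = []
results = []
for out in with_progress(map(fake_convert, files), len(files), lambda d, t: seen.append((d, t))):
    results.append(out)
```

`map(...)` here, like `pool.map(...)` in the diff, is itself lazy, so the progress callback fires as each conversion finishes rather than after all of them.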
269              # Update the Gradio UI to improve user-friendly eXperience
270              #yield gr.update(interactive=False), f"Processing files: {logs_files_images[logs_count]}", {"process": "Processing files"}, f"dummy_log.log"
271              logs_count = i+i_image
272      except Exception as exc:
273  +       tbp = traceback.print_exc()  # Print the exception traceback
274  +       logger.exception("Error during processing results logs — {exc}\n{tbp}", exc_info=True)  # Log the full traceback
275          return [gr.update(interactive=True), f"An error occurred during processing results logs: {str(exc)}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"]  ## return the exception message
276          #yield gr.update(interactive=True), f"An error occurred during processing results logs: {str(exc)}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"  ## return the exception message
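One pitfall worth flagging in the added lines 244 and 273: `traceback.print_exc()` writes the traceback to stderr and returns `None`, so assigning its result and interpolating it into a message produces the literal string "None". `traceback.format_exc()` is the variant that returns the text:

```python
# print_exc vs format_exc: only format_exc returns the traceback string.

import traceback

def describe_failure():
    try:
        1 / 0
    except ZeroDivisionError:
        printed = traceback.print_exc()   # side effect only; returns None
        text = traceback.format_exc()     # the traceback as a str
        return printed, text

printed, text = describe_failure()
```

`logger.exception(...)` with `exc_info=True` already records the full traceback, so inside an `except` block neither call is strictly needed for logging.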
277      except Exception as exc:
278          tb = traceback.format_exc()
279          logger.exception(f"❌ Error during ProcessPoolExecutor — {exc}\n{tb}", exc_info=True)  # Log the full traceback
280          #traceback.print_exc()  # Print the exception traceback
281  +       yield gr.update(interactive=True), f"❌ An error occurred during ProcessPoolExecutor — {exc}", {"Error":f"Error: {exc}"}, f"dummy_log.log"  # return the exception message
282
283  +   # Zip Processed Files and images. Insert to first index
284      try:  ##from file_handler.file_utils
285  +       progress((13,16), desc="Zipping processed files and images")
286  +       zipped_processed_files = zip_processed_files(root_dir=f"{output_dir}", file_paths=logs_files_images, tz_hours=tz_hours, date_format='%d%b%Y_%H-%M-%S')  #date_format='%d%b%Y'
287          logs_files_images.insert(0, zipped_processed_files)
288
289  +       progress((14,16), desc="Zipped processed files and images")
290          #yield gr.update(interactive=False), f"Processing zip and files: {logs_files_images}", {"process": "Processing files"}, f"dummy_log.log"
291
292      except Exception as exc:
293          tb = traceback.format_exc()
294          logger.exception(f"❌ Error during zipping processed files — {exc}\n{tb}", exc_info=True)  # Log the full traceback
295          #traceback.print_exc()  # Print the exception traceback
296          yield gr.update(interactive=True), f"❌ An error occurred during zipping files — {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"  # return the exception message
297  +       return gr.update(interactive=True), f"❌ An error occurred during zipping files — {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"  # return the exception message
298
299
300      # Return processed files log
301      try:
302  +       progress((15,16), desc="Formatting processed log results")
303  +
304          ## # Convert logs list of dicts to formatted json string
305          logs_return_formatted_json_string = file_handler.file_utils.process_dicts_data(logs)  ##SMY outputs to gr.JSON component with no need for json.dumps(data, indent=)
306  +       #logs_files_images_return = "\n".join(path for path in logs_files_images)  ##TypeError: sequence item 0: expected str instance, WindowsPath found
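`zip_processed_files` (line 286, defined in `file_handler/file_utils.py`) bundles the produced files into one archive whose name carries a timezone-shifted timestamp built from `tz_hours` and `date_format`. A hedged stdlib-only sketch of what such a helper can look like; this signature and the archive naming are assumptions, not the app's exact implementation:

```python
# Stdlib sketch of a zip_processed_files-style helper: bundle output files
# into a single archive named with a timezone-shifted timestamp.

import tempfile, zipfile
from datetime import datetime, timedelta, timezone
from pathlib import Path

def zip_outputs(root_dir, file_paths, tz_hours=0, date_format="%d%b%Y_%H-%M-%S"):
    stamp = datetime.now(timezone(timedelta(hours=float(tz_hours)))).strftime(date_format)
    archive = Path(root_dir) / f"processed_{stamp}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in file_paths:
            zf.write(path, arcname=Path(path).name)  # flatten into archive root
    return archive

root = tempfile.mkdtemp()
sample = Path(root) / "doc.md"
sample.write_text("# converted")
archive = zip_outputs(root, [sample], tz_hours=1)
```

Inserting the archive at index 0 of the download list (line 287) makes the combined zip the first entry the user sees in the `gr.Files` output.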
307
308          ## # Convert any Path objects to strings, but leave strings as-is
309          logs_files_images_return = list(str(path) if isinstance(path, Path) else path for path in logs_files_images)
310          logger.log(level=20, msg="File conversion complete. Sending outcome to Gradio:", extra={"logs_files_image_return": str(logs_files_images_return)})  ## debug: FileNotFoundError: [WinError 2] The system cannot find the file specified: 'Error or no image_path'
311
312  +       progress((16,16), desc="Complete processing and formatting file processing results")
313  +       # [templates]
314          #outputs=[process_button, log_output, files_individual_JSON, files_individual_downloads],
315          #return "\n".join(logs), "\n".join(logs_files_images)  #"\n".join(logs_files)
316
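Line 309 normalizes a mixed list of `Path` objects and plain strings before it reaches the download component; joining the raw list is what raised the "expected str instance, WindowsPath found" error noted at line 306. The conversion in isolation:

```python
# Normalize a mixed Path/str list to plain strings, as line 309 does,
# so "\n".join(...) and file-output components accept every entry.

from pathlib import Path

def normalize_paths(items):
    return [str(item) if isinstance(item, Path) else item for item in items]

mixed = [Path("out") / "doc.md", "out/figure1.webp", Path("out/processed.zip")]
as_strings = normalize_paths(mixed)
joined = "\n".join(as_strings)   # safe now: every element is a str
```

Converting only the `Path` entries (rather than `str(x)` on everything) leaves pre-formatted string entries, such as error placeholders, untouched.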
323      tb = traceback.format_exc()
324      logger.exception(f"❌ Error during returning result logs — {exc}\n{tb}", exc_info=True)  # Log the full traceback
325      #traceback.print_exc()  # Print the exception traceback
326  +   yield gr.update(interactive=True), f"❌ An error occurred during returning result logs — {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"  # return the exception message
327  +   return [gr.update(interactive=True), f"❌ An error occurred during returning result logs — {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"]  # return the exception message
328
329      #return "\n".join(log for log in logs), "\n".join(str(path) for path in logs_files_images)
330      #print(f'logs_files_images: {"\n".join(str(path) for path in logs_files_images)}')
502      #message = f"Accumulated {len(updated_files)} file(s) total.\n\nAll file paths:\n{file_info}"
503      message = f"Accumulated {len(updated_files)} file(s) total: \n{filename_info}"
504
505  +   return updated_files, message, gr.update(interactive=True), gr.update(interactive=True)
506
507      # with gr.Blocks(title=TITLE) as demo
508      with gr.Blocks(title=TITLE, css=custom_css) as demo:
577      )
578
579      # Clean UI: Model parameters hidden in collapsible accordion
580  +   with gr.Accordion("⚙️ Marker Converter Settings", open=False):
581          gr.Markdown(f"#### **Marker Configuration**")
582          with gr.Row():
583              openai_base_url_tb = gr.Textbox(
592                  value="webp",
593              )
594              output_format_dd = gr.Dropdown(
595  +               choices=["markdown", "html", "json"],  #, "json", "chunks"], ##SMY: To be enabled later
596                  #choices=["markdown", "html", "json", "chunks"],
597                  label="Output Format",
598                  value="markdown",
618                  value=2,
619                  step=1  #0.01
620              )
621  +           with gr.Column():
622  +               use_llm_cb = gr.Checkbox(
623  +                   label="Use LLM for Marker conversion",
624  +                   value=False
625  +               )
626  +               force_ocr_cb = gr.Checkbox(
627  +                   label="force OCR on all pages",
628  +                   value=True,
629  +               )
630              page_range_tb = gr.Textbox(
631                  label="Page Range (Optional)",
632                  placeholder="Example: 0,1-5,8,12-15",
667          btn_pdf_convert = gr.Button("Convert PDF(s)")
668      '''
669
670  +   file_types_list.extend(file_types_tuple)
671      with gr.Column(elem_classes=["file-or-directory-area"]):
672          with gr.Row():
673              file_btn = gr.UploadButton(
674              #file_btn = gr.File(
675                  label="Upload Multiple Files",
676                  file_count="multiple",
677  +               file_types= file_types_list,  #["file"], ##config.file_types_list
678                  #height=25, #"sm",
679                  size="sm",
680                  elem_classes=["gradio-upload-btn"]
683              #dir_btn = gr.File(
684                  label="Upload a Directory",
685                  file_count="directory",
686  +               file_types= file_types_list,  #["file"], #Warning: The `file_types` parameter is ignored when `file_count` is 'directory'
687                  #height=25, #"0.5",
688                  size="sm",
689                  elem_classes=["gradio-upload-btn"]
693          output_textbox = gr.Textbox(label="Accumulated Files", lines=3)  #, max_lines=4) #10
694
695          with gr.Row():
696  +           process_button = gr.Button("Process All Uploaded Files", variant="primary", interactive=False)
697  +           clear_button = gr.Button("Clear All Uploads", variant="secondary", interactive=False)
698
699
700      # --- PDF → Markdown tab ---
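Lines 696-697 now create both action buttons disabled, and `accumulate_files` returns two extra `gr.update(interactive=True)` values (line 505) so the first upload unlocks them. A pure-Python sketch of that handler's shape; the dedup step and the dict standing in for `gr.update` are illustrative assumptions, not the app's exact logic:

```python
# Sketch of an accumulate-then-enable upload handler: merge new uploads
# into the running state list and return "enable" updates for the two
# buttons. {"interactive": ...} stands in for gr.update(interactive=...).

def accumulate_files(new_files, current_files):
    updated = list(current_files) + [f for f in new_files if f not in current_files]
    message = f"Accumulated {len(updated)} file(s) total."
    enable = {"interactive": bool(updated)}
    return updated, message, enable, enable

state = []
state, msg, btn1, btn2 = accumulate_files(["a.pdf"], state)
state, msg, btn1, btn2 = accumulate_files(["a.pdf", "b.pdf"], state)
```

Wiring the same handler to both `file_btn.upload` and `dir_btn.upload` (lines 898-909) keeps one source of truth for the accumulated list and the button state.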
881          """
882          #msg = f"Files list cleared: {do_logout()}" ## use as needed
883          msg = f"Files list cleared."
884  +       #yield [], msg, '', ''
885          #return [], f"Files list cleared.", [], []
886  +       yield [], msg, None, None
887  +       return [], f"Files list cleared.", None, None
888
889      #hf_login_logout_btn.click(fn=custom_do_logout, inputs=None, outputs=hf_login_logout_btn)
890      ##unused
898      file_btn.upload(
899          fn=accumulate_files,
900          inputs=[file_btn, uploaded_file_list],
901  +       outputs=[uploaded_file_list, output_textbox, process_button, clear_button]
902      )
903
904      # Event handler for the directory upload button
905      dir_btn.upload(
906          fn=accumulate_files,
907          inputs=[dir_btn, uploaded_file_list],
908  +       outputs=[uploaded_file_list, output_textbox, process_button, clear_button]
909      )
910
911      # Event handler for the "Clear" button
950          output_format_dd,
951          output_dir_tb,
952          use_llm_cb,
953  +       force_ocr_cb,
954          page_range_tb,
955          tz_hours_num,  #state_tz_hours
956      ]
utils/config.py
CHANGED
@@ -28,6 +28,13 @@ DESCRIPTION_MD = (
     "Upload Markdown/LaTeX files and generate a polished PDF."
 )

+# File types
+file_types_list = []
+file_types_tuple = (".pdf", ".html", ".docx", ".doc")
+#file_types_list = list[file_types_tuple]
+#file_types_list.extend(file_types_tuple)
+
+
 # Conversion defaults
 DEFAULT_MARKER_OPTIONS = {
     "include_images": True,