new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Oct 29

CoT-Driven Framework for Short Text Classification: Enhancing and Transferring Capabilities from Large to Smaller Model

Short Text Classification (STC) is crucial for processing and understanding the brief but substantial content prevalent on contemporary digital platforms. The STC encounters difficulties in grasping the semantic and syntactic intricacies, an issue that is apparent in traditional pre-trained language models. Although Graph Convolutional Networks enhance performance by integrating external knowledge bases, these methods are limited by the quality and extent of the knowledge applied. Recently, the emergence of Large Language Models (LLMs) and Chain-of-Thought (CoT) has significantly improved the performance of complex reasoning tasks. However, some studies have highlighted the limitations of their application in fundamental NLP tasks. Consequently, this study first employs CoT to investigate and enhance the capabilities of LLMs in STC tasks. We propose the Syntactic and Semantic Enrichment CoT (SSE-CoT) method, effectively decomposing the STC tasks into four distinct steps: (i) essential concept identification, (ii) common-sense knowledge retrieval, (iii) text rewriting, and (iv) classification. Furthermore, recognizing resource constraints in sectors like finance and healthcare, we then introduce the CoT-Driven Multi-Task Learning (CDMT) framework to extend these capabilities to smaller models. This framework begins by extracting rationales from LLMs and subsequently fine-tunes smaller models to optimize their performance. Extensive experimentation across six short-text benchmarks validated the efficacy of the proposed methods. In particular, SSE-CoT achieved state-of-the-art performance with substantial improvements on all datasets, particularly on the Ohsumed and TagMyNews datasets.

  • 8 authors
·
Jan 6, 2024

The Zwicky Transient Facility Bright Transient Survey. III. BTSbot: Automated Identification and Follow-up of Bright Transients with Deep Learning

The Bright Transient Survey (BTS) aims to obtain a classification spectrum for all bright (m_peak,leq,18.5,mag) extragalactic transients found in the Zwicky Transient Facility (ZTF) public survey. BTS critically relies on visual inspection ("scanning") to select targets for spectroscopic follow-up, which, while effective, has required a significant time investment over the past sim5 yr of ZTF operations. We present BTSbot, a multi-modal convolutional neural network, which provides a bright transient score to individual ZTF detections using their image data and 25 extracted features. BTSbot is able to eliminate the need for daily human scanning by automatically identifying and requesting spectroscopic follow-up observations of new bright transient candidates. BTSbot recovers all bright transients in our test split and performs on par with scanners in terms of identification speed (on average, sim1 hour quicker than scanners). We also find that BTSbot is not significantly impacted by any data shift by comparing performance across a concealed test split and a sample of very recent BTS candidates. BTSbot has been integrated into Fritz and Kowalski, ZTF's first-party marshal and alert broker, and now sends automatic spectroscopic follow-up requests for the new transients it identifies. During the month of October 2023, BTSbot selected 296 sources in real-time, 93% of which were real extragalactic transients. With BTSbot and other automation tools, the BTS workflow has produced the first fully automatic end-to-end discovery and classification of a transient, representing a significant reduction in the human-time needed to scan. Future development has tremendous potential for creating similar models to identify and request follow-up observations for specific types of transients.

  • 13 authors
·
Jan 26, 2024

2D Matryoshka Sentence Embeddings

Common approaches rely on fixed-length embedding vectors from language models as sentence embeddings for downstream tasks such as semantic textual similarity (STS). Such methods are limited in their flexibility due to unknown computational constraints and budgets across various applications. Matryoshka Representation Learning (MRL) (Kusupati et al., 2022) encodes information at finer granularities, i.e., with lower embedding dimensions, to adaptively accommodate ad hoc tasks. Similar accuracy can be achieved with a smaller embedding size, leading to speedups in downstream tasks. Despite its improved efficiency, MRL still requires traversing all Transformer layers before obtaining the embedding, which remains the dominant factor in time and memory consumption. This prompts consideration of whether the fixed number of Transformer layers affects representation quality and whether using intermediate layers for sentence representation is feasible. In this paper, we introduce a novel sentence embedding model called Two-dimensional Matryoshka Sentence Embedding (2DMSE). It supports elastic settings for both embedding sizes and Transformer layers, offering greater flexibility and efficiency than MRL. We conduct extensive experiments on STS tasks and downstream applications. The experimental results demonstrate the effectiveness of our proposed model in dynamically supporting different embedding sizes and Transformer layers, allowing it to be highly adaptable to various scenarios.

  • 5 authors
·
Feb 22, 2024

SPRINT: Script-agnostic Structure Recognition in Tables

Table Structure Recognition (TSR) is vital for various downstream tasks like information retrieval, table reconstruction, and document understanding. While most state-of-the-art (SOTA) research predominantly focuses on TSR in English documents, the need for similar capabilities in other languages is evident, considering the global diversity of data. Moreover, creating substantial labeled data in non-English languages and training these SOTA models from scratch is costly and time-consuming. We propose TSR as a language-agnostic cell arrangement prediction and introduce SPRINT, Script-agnostic Structure Recognition in Tables. SPRINT uses recently introduced Optimized Table Structure Language (OTSL) sequences to predict table structures. We show that when coupled with a pre-trained table grid estimator, SPRINT can improve the overall tree edit distance-based similarity structure scores of tables even for non-English documents. We experimentally evaluate our performance across benchmark TSR datasets including PubTabNet, FinTabNet, and PubTables-1M. Our findings reveal that SPRINT not only matches SOTA models in performance on standard datasets but also demonstrates lower latency. Additionally, SPRINT excels in accurately identifying table structures in non-English documents, surpassing current leading models by showing an absolute average increase of 11.12%. We also present an algorithm for converting valid OTSL predictions into a widely used HTML-based table representation. To encourage further research, we release our code and Multilingual Scanned and Scene Table Structure Recognition Dataset, MUSTARD labeled with OTSL sequences for 1428 tables in thirteen languages encompassing several scripts at https://github.com/IITB-LEAP-OCR/SPRINT

  • 5 authors
·
Mar 14

TESS Discovers a Second System of Transiting Exocomets in the Extreme Debris Disk of RZ Psc

We present the TESS discovery of only the second system of transiting exocomets with a sufficient number of events to measure the size distribution in the RZ Psc system, enabling comparisons with the beta Pictoris and Solar System size distributions. Twenty-four transits with absorption depths (AD) of 1--20\% were observed across three TESS sectors of the 20-50 Myr K0V star, detected as part of our TESS survey of extreme debris disks identified by their IR excess. We discover that the ADs (and hence exocomet radii) follow a broken power-law cumulative frequency distribution not previously seen in extrasolar contexts but similar to that observed in Solar System Kuiper Belt Object sizes, with power-law slopes above and below the break of gamma_AD>break=2.32pm0.12 and gamma_AD<break=0.11pm0.04, respectively. We derive size distributions of 1--7~km from two independent lines of evidence. We use the RZ Psc exocomet rate to predict exocomet yields for the Early eVolution Explorer (EVE) NASA astrophysics Small Explorer (SMEX) mission concept to obtain simultaneous photometry of 10^4 young stars in NUV, optical, and NIR bands. Assuming occurrence rates scaled from RZ Psc, EVE would detect 590 exocomets from approx70 young systems in the optical band, with approx120 simultaneous 5sigma detections in all three bands. These data would enable grain sizes of 200--700~nm and graphite--olivine compositions of dozens of events to be distinguished at 2.5--3sigma, as well as a 4sigma determination of the accuracy of the Herschel-derived M-debris disk fraction.

  • 12 authors
·
Oct 10

NMIXX: Domain-Adapted Neural Embeddings for Cross-Lingual eXploration of Finance

General-purpose sentence embedding models often struggle to capture specialized financial semantics, especially in low-resource languages like Korean, due to domain-specific jargon, temporal meaning shifts, and misaligned bilingual vocabularies. To address these gaps, we introduce NMIXX (Neural eMbeddings for Cross-lingual eXploration of Finance), a suite of cross-lingual embedding models fine-tuned with 18.8K high-confidence triplets that pair in-domain paraphrases, hard negatives derived from a semantic-shift typology, and exact Korean-English translations. Concurrently, we release KorFinSTS, a 1,921-pair Korean financial STS benchmark spanning news, disclosures, research reports, and regulations, designed to expose nuances that general benchmarks miss. When evaluated against seven open-license baselines, NMIXX's multilingual bge-m3 variant achieves Spearman's rho gains of +0.10 on English FinSTS and +0.22 on KorFinSTS, outperforming its pre-adaptation checkpoint and surpassing other models by the largest margin, while revealing a modest trade-off in general STS performance. Our analysis further shows that models with richer Korean token coverage adapt more effectively, underscoring the importance of tokenizer design in low-resource, cross-lingual settings. By making both models and the benchmark publicly available, we provide the community with robust tools for domain-adapted, multilingual representation learning in finance.

  • 7 authors
·
Jul 13

RAR-b: Reasoning as Retrieval Benchmark

Semantic textual similartiy (STS) and information retrieval tasks (IR) tasks have been the two major avenues to record the progress of embedding models in the past few years. Under the emerging Retrieval-augmented Generation (RAG) paradigm, we envision the need to evaluate next-level language understanding abilities of embedding models, and take a conscious look at the reasoning abilities stored in them. Addressing this, we pose the question: Can retrievers solve reasoning problems? By transforming reasoning tasks into retrieval tasks, we find that without specifically trained for reasoning-level language understanding, current state-of-the-art retriever models may still be far from being competent for playing the role of assisting LLMs, especially in reasoning-intensive tasks. Moreover, albeit trained to be aware of instructions, instruction-aware IR models are often better off without instructions in inference time for reasoning tasks, posing an overlooked retriever-LLM behavioral gap for the research community to align. However, recent decoder-based embedding models show great promise in narrowing the gap, highlighting the pathway for embedding models to achieve reasoning-level language understanding. We also show that, although current off-the-shelf re-ranker models fail on these tasks, injecting reasoning abilities into them through fine-tuning still appears easier than doing so to bi-encoders, and we are able to achieve state-of-the-art performance across all tasks by fine-tuning a reranking model. We release Reasoning as Retrieval Benchmark (RAR-b), a holistic suite of tasks and settings to evaluate the reasoning abilities stored in retriever models. RAR-b is available at https://github.com/gowitheflow-1998/RAR-b.

  • 3 authors
·
Apr 9, 2024

Discourse-Aware Text Simplification: From Complex Sentences to Linked Propositions

Sentences that present a complex syntax act as a major stumbling block for downstream Natural Language Processing applications whose predictive quality deteriorates with sentence length and complexity. The task of Text Simplification (TS) may remedy this situation. It aims to modify sentences in order to make them easier to process, using a set of rewriting operations, such as reordering, deletion, or splitting. State-of-the-art syntactic TS approaches suffer from two major drawbacks: first, they follow a very conservative approach in that they tend to retain the input rather than transforming it, and second, they ignore the cohesive nature of texts, where context spread across clauses or sentences is needed to infer the true meaning of a statement. To address these problems, we present a discourse-aware TS approach that splits and rephrases complex English sentences within the semantic context in which they occur. Based on a linguistically grounded transformation stage that uses clausal and phrasal disembedding mechanisms, complex sentences are transformed into shorter utterances with a simple canonical structure that can be easily analyzed by downstream applications. With sentence splitting, we thus address a TS task that has hardly been explored so far. Moreover, we introduce the notion of minimality in this context, as we aim to decompose source sentences into a set of self-contained minimal semantic units. To avoid breaking down the input into a disjointed sequence of statements that is difficult to interpret because important contextual information is missing, we incorporate the semantic context between the split propositions in the form of hierarchical structures and semantic relationships. In that way, we generate a semantic hierarchy of minimal propositions that leads to a novel representation of complex assertions that puts a semantic layer on top of the simplified sentences.

  • 4 authors
·
Aug 1, 2023

Deep Synoptic Array Science: Searching for Long Duration Radio Transients with the DSA-110

We describe the design and commissioning tests for the DSA-110 Not-So-Fast Radio Burst (NSFRB) search pipeline, a 1.4 GHz image-plane single-pulse search sensitive to 134 ms-160.8 s radio bursts. Extending the pulse width range of the Fast Radio Burst (FRB) search by 3 orders of magnitude, the NSFRB search is sensitive to the recently-discovered Galactic Long Period Radio Transients (LPRTs). The NSFRB search operates in real-time, utilizing a custom GPU-accelerated search code, cerberus, implemented in Python with JAX. We summarize successful commissioning sensitivity tests with continuum sources and pulsar B0329+54, estimating the 6sigma flux (fluence) threshold to be ~290 mJy (~40 Jy ms). Future tests of recovery of longer timescale transients, e.g. CHIME J1634+44, are planned to supplement injection testing and B0329+54 observations. An offline DSA-110 NSFRB Galactic Plane Survey was conducted to search for LPRTs, covering -3.5^circ<b<5.7^circ and 141^circ<l<225^circ (~770 square degrees) in Galactic coordinates. We estimate an upper limit Poissonian burst rate ~1 hr^{-1} per square degree (~7 hr^{-1} per 3^circtimes3^circ survey grid cell) maximized across the inner |b|<0.25^circ of the surveyed region. By imposing the ~290 mJy flux limit on two representative models (the magnetar plastic flow model and the White Dwarf-M Dwarf binary model), we reject with 95% confidence the presence of White Dwarf-M Dwarf binary LPRTs with periods between ~10-70s within ~95% of the surveyed region. Combined with the prevalence of LPRTs in the Galactic Plane, our results motivate further consideration of both White Dwarf-M Dwarf binary models and isolated magnetar models. We will continue to explore novel LPRT search strategies during real-time operations, such as triggered periodicity searches and additional targeted surveys.

  • 13 authors
·
Oct 20

L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi

Sentence representation from vanilla BERT models does not work well on sentence similarity tasks. Sentence-BERT models specifically trained on STS or NLI datasets are shown to provide state-of-the-art performance. However, building these models for low-resource languages is not straightforward due to the lack of these specialized datasets. This work focuses on two low-resource Indian languages, Hindi and Marathi. We train sentence-BERT models for these languages using synthetic NLI and STS datasets prepared using machine translation. We show that the strategy of NLI pre-training followed by STSb fine-tuning is effective in generating high-performance sentence-similarity models for Hindi and Marathi. The vanilla BERT models trained using this simple strategy outperform the multilingual LaBSE trained using a complex training strategy. These models are evaluated on downstream text classification and similarity tasks. We evaluate these models on real text classification datasets to show embeddings obtained from synthetic data training are generalizable to real datasets as well and thus represent an effective training strategy for low-resource languages. We also provide a comparative analysis of sentence embeddings from fast text models, multilingual BERT models (mBERT, IndicBERT, xlm-RoBERTa, MuRIL), multilingual sentence embedding models (LASER, LaBSE), and monolingual BERT models based on L3Cube-MahaBERT and HindBERT. We release L3Cube-MahaSBERT and HindSBERT, the state-of-the-art sentence-BERT models for Marathi and Hindi respectively. Our work also serves as a guide to building low-resource sentence embedding models.

  • 5 authors
·
Nov 21, 2022