
The 25th ACM Symposium on Document Engineering
September 2, 2025 to September 5, 2025
University of Nottingham, Nottingham, UK
Programme
DocEng'25 will take place at the University of Nottingham's School of Computer Science on the University's Jubilee Campus.
DocEng'25 will offer a full programme of events. Tuesday is dedicated to tutorials and/or workshops, while the main academic programme will run from Wednesday until Friday lunchtime.
DocEng'25 is delighted to announce two keynote speakers: Debora Weber-Wulff (HTW Berlin) and Charles Nicholas (UMBC). Details of their keynotes can be found below.
The papers for DocEng'25 are available for download at the ACM Digital Library.
Tuesday 2nd September
Tutorials
08:00 Registration and Networking
09:00 Tutorial 1: LLM-assisted Automatic Feature Extraction for Document Understanding and Analytics
Sirisha Velampalli (LTIMindTree)
10:30 Coffee Break
11:00 Tutorial 1 continued
12:30 Lunch
14:00 Tutorial 2: Well-Tagged PDF and Universal Accessibility with LaTeX
Frank Mittelbach
Ulrike Fischer
David Carlisle
Joseph Wright (The LaTeX Project)
15:30 Coffee Break
16:00 Tutorial 2 continued
Wednesday 3rd September
Main Programme (Day 1)
08:00 Registration and Networking
09:00 Welcome Message
Conference Chairs
09:15 25 Years of DocEng
Ethan Munson (University of Wisconsin-Milwaukee)
Detecting and Documenting Plagiarism and GenAI Use
Debora Weber-Wulff (HTW Berlin)
Abstract
Despite there being many software systems that appear to detect plagiarism and AI-generated text, they do not actually work as many people suppose. Plagiarism comes in many varieties and not all are easy to detect. There are also multiple algorithms used that do not produce the same results. There are, however, interesting forensic indicators that can point to plagiarism. The software systems can be seen as a potential tool, but not as a decision system for determining plagiarism. It is, however, very easy to document some forms of plagiarism.
It is a different story for AI-generated texts. There is no proof to be found that a text was produced by a large language model, only probabilities. Depending on the use case, the amount of false positives and false negatives can also preclude the use of such systems for the detection of potential AI use. Here, too, there are forensic indicators that can show the probable use of large language models, but still cannot provide absolute proof of use.
Biography
Prof. Dr. Debora Weber-Wulff is a retired professor of Media and Computing at the University of Applied Sciences HTW Berlin in Germany. She studied applied physics at the University of California, San Diego and computer science at the University of Kiel in Germany. She received her doctorate in theoretical computer science on mechanical theorem proving at the University of Kiel. She is an active member of the working group of the Gesellschaft für Informatik (German computing society) on Ethics and Computing and Fellow of the GI. She has been researching plagiarism since 2002 and is now using her time in retirement to discuss the use of AI in education. She published a well-received paper testing so-called AI detectors in December 2023 and has a paper on the use of AI in research currently under consideration.
10:30 Coffee Break
11:00 Paper Session 1: Document Information Retrieval (Chair: Patrick Healy)
11:00 Exploiting Query Reformulation and Reciprocal Rank Fusion in Math-Aware Search Engines
Abstract
Mathematical formulas introduce complications to the standard approaches used in information retrieval. By studying how traditional (sparse) search systems perform in matching queries to documents, we hope to gain insights into which features in the formulas and in the accompanying natural language text signal likely relevance.
In this paper, we focus on query rewriting for the ARQMath benchmarks recently developed as part of CLEF. In particular, we improve mathematical community question answering applications by using responses from a large language model (LLM) to reformulate queries. Beyond simply replacing the query by the LLM response or concatenating the response to the query, we explore whether improvements accrue from the LLM selecting a subset of the query terms, augmenting the query with additional terms, or re-weighting the query terms. We also examine whether such query reformulation is equally advantageous for math features extracted from formulas and for keyword terms. As a final step, we use reciprocal rank fusion to combine several component approaches in order to improve ranking results. In two experiments involving real-world mathematical questions, we show that combining four strategies for term selection, term augmentation, and term re-weighting improves nDCG'@1000 by 5%, MAP'@1000 by 7-8%, and P'@10 by 9-10% over using the question as given.
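A minimal sketch of reciprocal rank fusion as it is usually formulated (each document is scored by summing 1/(k + rank) over the component rankings); the document ids and the smoothing constant k=60 below are illustrative placeholders, not the paper's configuration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids using RRF.

    rankings: list of lists, each ordered from most to least relevant.
    k: smoothing constant; 60 is a common default in the literature.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: fuse a keyword-based run with a formula-based run.
keyword_run = ["d3", "d1", "d7", "d2"]
formula_run = ["d1", "d3", "d9"]
print(reciprocal_rank_fusion([keyword_run, formula_run]))
```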
11:25 Mining a Century of Swiss Trademark Data
Abstract
This paper presents an approach for extracting trademark registration events from the Swiss Official Gazette of Commerce (SOGC), an official daily journal published by the Swiss Confederation since January 1883. Until 2001, the data is only available as scanned documents, which constitute the target dataset of this study. Our approach is composed of a chain of three steps based on state-of-the-art deep learning techniques. We leverage image classification to identify pages containing trademarks (macro segmentation); we apply object detection to identify the portion of the page corresponding to a registration event (micro segmentation); last, we perform information extraction using a document AI technique.
We obtain a dataset of ca. 500,000 trademark registration events, extracted from a corpus of 430,000 pages. Each step of our workflow has relatively high accuracy: the macro and micro segmentation steps show precision and recall greater than 95% on a manually constructed dataset. The dataset offers a unique historical perspective on trademark registrations in Switzerland that is not available from any other source. Showcasing what can be achieved with the extracted information, we provide answers to a set of preliminary economics questions.
11:50 OPERA: An Environment Extending Coreference Annotation to Relations Between Entities
Abstract
The availability of annotated corpora on coreference is a requirement for linguistics and NLP. This presupposes the availability of suitable annotation environments. Yet most annotation tools for coreference are based on annotation models with limited expressiveness.
We present here the OPERA annotation tool, which is based on an extended model for coreference annotation that, in addition to allowing work on referring expressions in the text (a widespread feature), enables the relations between entities to be annotated and characterized.
12:15 Topic Modeling and Link-Prediction for Material Property Discovery
Abstract
Link prediction is a key network analysis technique that infers missing or future relations between nodes in a graph, based on observed patterns of connectivity. Scientific literature networks and knowledge graphs are typically large, sparse, and noisy, and often contain missing links—potential but unobserved connections—between concepts, entities, or methods. Link prediction is widely used in domains such as recommender systems, biology, social networks, and knowledge graph completion to uncover previously unseen but plausible associations. Here, we present an AI-driven hierarchical link prediction framework that integrates matrix factorization, uncertainty quantification, and human-in-the-loop visualization to infer hidden associations and steer discovery in complex material domains. Our method combines Hierarchical Nonnegative Matrix Factorization (HNMFk) and Boolean matrix factorization (BNMFk) with automatic model selection, as well as Logistic matrix factorization (LMF), which we use to construct a three-level topic tree from a 46,862-document corpus focused on 73 transition-metal dichalcogenides (TMDs). This fascinating class of materials has been studied in a variety of physics fields and has a multitude of current and potential applications. A perturbation-based uncertainty quantification (UQ) module provides calibrated confidence estimates and an abstention rule, while an ensemble BNMFk + LMF approach fuses discrete interpretability with probabilistic scoring. The resulting HNMFk clusters map each material onto coherent research themes—such as superconductivity, energy storage, and tribology—and highlight missing or weakly connected links between topics and materials, suggesting novel hypotheses for cross-disciplinary exploration. We validate our method by removing publications about superconductivity in well-known superconductors and demonstrate that the model correctly predicts their association with the superconducting TMD clusters. This highlights the ability of the method to find hidden connections in a graph and make predictions based on them. This is especially useful when examining a diverse corpus of scientific documents covering the same class of phenomena or materials but originating from distinct communities and perspectives. The inferred links produced by our method, which generate new hypotheses, are exposed through an interactive Streamlit dashboard, designed for human-in-the-loop scientific discovery.
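As a rough illustration of matrix-factorization-based link prediction in general (not the HNMFk/BNMFk/LMF pipeline described above), the sketch below factorizes a toy binary material-topic matrix with scikit-learn's NMF and scores a deliberately hidden entry from the reconstruction; all data and dimensions are made up.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy binary matrix: rows = materials, columns = topics; 1 = observed link.
rng = np.random.default_rng(0)
X = (rng.random((20, 8)) > 0.7).astype(float)

# Hide one observed link to mimic a "missing" edge we want to recover.
hidden = np.argwhere(X == 1)[0]
X_train = X.copy()
X_train[tuple(hidden)] = 0.0

# Low-rank nonnegative factorization of the observed matrix.
model = NMF(n_components=4, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X_train)   # material loadings
H = model.components_              # topic loadings
X_hat = W @ H                      # reconstructed scores

# A high reconstructed score at an unobserved position suggests a plausible link.
print("score for the hidden link:", X_hat[tuple(hidden)])
```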
12:30 Lunch
14:00 Paper Session 2: Optical Character Recognition (Chair: Steve Simske)
14:00 Improving Lightweight Named Entity Recognition in Handwritten Documents by Predicting Pyramidal Histograms of Characters
Abstract
Named Entity Recognition (NER) consists of tagging parts of an unstructured text containing particular semantic information. When applied to handwritten documents, it can be done as a two-step approach in which Handwritten Text Recognition (HTR) is performed prior to tagging the automatic transcription. However, it is also possible to do both tasks simultaneously by using an HTR model that learns to output the transcription and the tagging symbols. In this paper, we focus on improving the one-step approach by introducing the auxiliary task of predicting Pyramidal Histograms of Characters (PHOC) in a Convolutional Recurrent Neural Network (CRNN) model. Moreover, given the recent rise of models that digest large amounts of data, we also study the usage of synthetic data to pretrain the proposed architecture. Our experiments show that pretraining the PHOC-based architecture on synthetic data yields substantial improvements in both transcription and tagging quality without compromising the computational cost of the decoding step. The resulting model matches the NER performance of the state-of-the-art while keeping its lightweight nature.
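For readers unfamiliar with PHOC, a simplified sketch of the descriptor follows: a spatial pyramid of binary character-presence histograms over the word. The published descriptor uses a region-overlap criterion and larger alphabets; this version assigns each character by its centre position only.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def phoc(word, levels=(1, 2, 3)):
    """Simplified PHOC: binary presence of each character in each region
    of a spatial pyramid over the word."""
    word = word.lower()
    vec = []
    for level in levels:
        hist = np.zeros((level, len(ALPHABET)))
        for i, ch in enumerate(word):
            if ch not in ALPHABET:
                continue
            centre = (i + 0.5) / len(word)           # normalised position in [0, 1)
            region = min(int(centre * level), level - 1)
            hist[region, ALPHABET.index(ch)] = 1.0
        vec.append(hist.ravel())
    return np.concatenate(vec)

print(phoc("doceng").shape)   # (1 + 2 + 3) * 36 = 216 dimensions
```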
14:25 Text Image Super-Resolution for Improved OCR in Real-Life Scenarios using Swin Transformers
Abstract
Text recognition in real-life images poses a difficult task due to elements such as blur, distortion, and low resolution. This work presents an innovative method that integrates image super-resolution, image restoration, and optical character recognition techniques to enhance text recognition in real-life photographs. We specifically reviewed the processing of the TextZoom dataset and utilized transfer learning on an improved version of the image super-resolution model, SwinIR.
The findings of our experiment show that our text recognition scores are better than the current best scores, and there is a significant rise in the peak signal-to-noise ratio when dealing with deformed low-resolution images from the TextZoom dataset. This approach outperforms earlier research in the domain of scene text image super-resolution and offers a promising solution for text recognition in real-life images. The code can be accessed at this location: The URL will be included once it has been reviewed.
14:50 Lost in OCR Translation? Vision-Based Approaches to Robust Document Retrieval
Abstract
Retrieval-Augmented Generation (RAG) has become a popular technique for enhancing the reliability and utility of Large Language Models (LLMs) by grounding responses in external documents. Traditional RAG systems rely on Optical Character Recognition (OCR) to first process scanned documents into text. However, even state-of-the-art OCRs can introduce errors, especially in degraded or complex documents. Recent vision-language approaches, such as ColPali, propose direct visual embedding of documents, eliminating the need for OCR. This study presents a systematic comparison between a vision-based RAG system (ColPali) and more traditional OCR-based pipelines utilizing Llama 3.2 (90B) and Nougat OCR across varying document qualities. Beyond conventional retrieval accuracy metrics, we introduce a semantic answer evaluation benchmark to assess end-to-end question-answering performance. Our findings indicate that while vision-based RAG performs well on documents it has been fine-tuned on, OCR-based RAG is better able to generalize to unseen documents of varying quality. We highlight the key trade-offs between computational efficiency and semantic accuracy, offering practical guidance for RAG practitioners in selecting between OCR-dependent and vision-based document retrieval systems in production environments.
15:15 A Proposal of Post-OCR Spelling Correction Using Monolingual Byte-level Language Models
Abstract
This work presents a proposal for a spelling corrector using monolingual byte-level language models (Monobyte) for the post-OCR task in texts produced by Handwritten Text Recognition (HTR) systems. We evaluate three Monobyte models, based on Google’s ByT5 architecture, trained separately on English, French, and Brazilian Portuguese. The experiments evaluated three datasets with 21st century manuscripts: IAM, RIMES, and BRESSAY. In the IAM dataset, Monobyte achieves reductions of 2.24% in character error rate (CER) and 26.37% in word error rate (WER). In RIMES, reductions are 13.48% (CER) and 33.34% (WER), while in BRESSAY, Monobyte improves CER by 12.78% and WER by 40.62%. The BRESSAY results surpass results reported in previous work using a multilingual ByT5 model. Our findings demonstrate the effectiveness of byte-level tokenization for this task in noisy text and underscore the potential of computationally efficient, monolingual models.
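The byte-level seq2seq interface involved can be sketched with the public google/byt5-small checkpoint; this is not the paper's Monobyte models, and without their fine-tuning the untuned checkpoint will not actually correct spelling.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Illustrative only: google/byt5-small is a public byte-level checkpoint;
# a real post-OCR corrector would be fine-tuned on (noisy, clean) text pairs.
name = "google/byt5-small"
tokenizer = AutoTokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

noisy_line = "Tha quick brovvn fox jumps ovr the lazy dog."
inputs = tokenizer(noisy_line, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```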
15:30 Old Greek OCR Result Correction Using LLMs
Abstract
Recognition of historical documents is still an active research field due to the relatively low recognition accuracy achieved when processing old fonts or low-quality images. In this work, we investigate the use of Large Language Models (LLMs) for correcting OCR output for old Greek documents. We examine two different old Greek datasets, one machine printed and one typewritten, using a Deep Network based OCR together with several known and easy-to-use LLMs for the correction of the result. Additionally, we synthetically produce erroneous texts and change the LLM prompts in order to further study the behavior of LLMs for correcting old Greek noisy text. Experimental results highlight the potential of LLMs for OCR correction of old Greek documents, especially in cases where the recognition results are relatively poor.
15:45 Coffee Break
16:00 Birds of a Feather
Charles Nicholas
18:00 Welcome Reception
Thursday 4th September
Main Programme (Day 2)
08:30 Welcome and Networking
Issues in Document Security
Charles Nicholas (UMBC)
Abstract
When hitherto separate areas of science intersect, research opportunities tend to pop up. So it is with the fields of Document Engineering and Cybersecurity. We present an overview of certain problems that are related to these fields. This overview includes a brief summary of recent and ongoing work in our lab, which in turn includes some that has appeared at previous DocEng conferences. We summarize recent work on detecting and dealing with malicious PDF files, construction of useful malware data sets, and certain applications of tensor decomposition in the analysis of such data. We also describe some ongoing work in malware clustering, and symbolic computation as applied to offensive and defensive cyber. In recent months the topic of AI-generated documents, especially software, has created much discussion, and we will comment on this. We will conclude by pointing out certain themes in our work, as well as certain outstanding problems.
Biography
Charles Nicholas has been a faculty member at UMBC since 1988. He received the B.S. degree from the University of Michigan-Flint, and the M.S. and Ph.D. degrees from The Ohio State University, all in computer science. He has served five times as General Chair of the Conference on Information and Knowledge Management, and twice as Chair of the ACM Symposium on Principles of Document Processing. He served as chair of the department of Computer Science and Electrical Engineering at UMBC from 2003 to 2010. He is author or co-author of nearly 200 scholarly papers, and he has mentored nearly 200 M.S. students and 20 Ph.D. students. His work has been supported by several agencies within the U.S. Department of Defense as well as a few generous corporate sponsors.
10:00 Coffee Break
10:30 Paper Session 3: Document Organization and Generation (Chair: Ethan Munson)
10:30 A Hybrid, Neuro-symbolic Approach for Scholarly Knowledge Organization
Abstract
The rapid development of generative AI leveraging neural models, particularly with the introduction of large language models (LLMs), has fundamentally advanced natural language understanding and generation. However, such neural models are non-deterministic, opaque, and tend to confabulate. Knowledge Graphs (KGs), on the other hand, contain factual information represented in a symbolic way for humans and machines following formal knowledge representation formalisms. However, the creation and curation of KGs is time-consuming, cumbersome, and resource-demanding. A key research challenge now is how to synergistically combine both formalisms with the human in the loop (Hybrid AI) to obtain structured and machine-processable knowledge in a scalable way. We introduce an approach for a tight integration of Humans, Neural Models (LLM), and Symbolic Representations (KG) for the semi-automatic creation and curation of Scholarly Knowledge Graphs. We implement and integrate our approach comprising an intelligent user interface and prompt templates for interaction with an LLM in the Open Research Knowledge Graph. We perform a thorough analysis of our approach and implementation with a user evaluation to assess the merits of the neuro-symbolic, hybrid approach for organizing scholarly knowledge.
10:55 Preserving Measurement Data Records Long-term: A Field Study on Information Management in the Wake of the 1986 Chernobyl Disaster
Abstract
The long-term preservation of digital documents containing scientifically relevant data from measurements and observations remains challenging. In this paper, we focus on two critical aspects: first, the use of highly specialized (sometimes proprietary and obsolete) formats; and second, the limited usefulness of raw data without contextual information.
In the context of a field study on the preservation of gamma-spectroscopic food radioactivity measurements collected after the Chernobyl disaster in 1986, we examine these aspects and describe the problems encountered in an archival environment.
Specifically, we analyze current concepts and tools and demonstrate how memory organizations can improve long-term digital preservation by expanding file format repositories and developing new repositories for contextual information.
These repositories support the implementation of the FAIR guiding principles for scientific data management and stewardship, in particular the principle that "data are described with rich metadata," including "descriptive information about the context." We use the generic conceptual reference model OAIS (Open Archival Information System) as a methodological framework. OAIS is also widely applied to scientific data and related documents.
11:10 Towards More Homogeneous Paragraphs
Abstract
Paragraph justification is based primarily on shrinking or stretching the interword blanks. While the blanks on a line are all scaled by the same amount, the amount in question varies from line to line. Unfortunately, the quality of a paragraph’s typographic grey largely depends on the aforementioned variation being as small as possible.
In spite of its notoriously high quality, TeX’s paragraph justification algorithm addresses this problem in a rather coarse fashion. In this paper, we propose a refinement of the algorithm that improves the situation without disturbing its general behavior too much, and without the need for manual intervention.
We analyze the impact of our refinement on a large number of experiments through several statistical estimators. We also exhibit a number of typographical traits related to whitespace distribution that we believe may contribute to our perception of homogeneousness.
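For context on the per-line scaling the abstract refers to, here is a small sketch (not the paper's refinement) of how TeX-style justification quantifies it, via an adjustment ratio and a badness value of roughly 100 times the cube of that ratio; the glue values below are arbitrary.

```python
def adjustment_ratio(natural, stretch, shrink, target):
    """TeX-style adjustment ratio for one line's interword glue."""
    if target > natural:
        return (target - natural) / stretch if stretch > 0 else float("inf")
    if target < natural:
        return (target - natural) / shrink if shrink > 0 else float("-inf")
    return 0.0

def badness(r):
    """Approximation of TeX's badness: roughly 100 * |r|^3, capped at 10000."""
    if r < -1:                      # glue cannot shrink below its limit
        return float("inf")
    return min(round(100 * abs(r) ** 3), 10000)

# Two lines with the same glue specification but different target widths:
# the per-line scaling (and hence the typographic grey) differs noticeably.
for target in (252.0, 262.0):
    r = adjustment_ratio(natural=255.0, stretch=12.0, shrink=4.0, target=target)
    print(f"target={target}: r={r:+.3f}, badness={badness(r)}")
```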
11:35 MathML and other XML Technologies for Accessible PDF from LaTeX
Abstract
In this paper we describe the current approach to using MathML within Tagged PDF to enhance the accessibility of mathematical (STEM) documents. While MathML is specified by the PDF 2.0 specification as a standard namespace for PDF Structure Elements, the interaction of MathML, which is defined as an XML vocabulary, and PDF Structure Elements (which are not defined as XML) is left unspecified by the PDF standard. This has necessitated the development of formalizations to interpret and validate PDF Structure Trees as XML, which are also introduced in this paper.
11:50 Measuring Temporal Gains in Assisted Document Transcription
Abstract
Transcribing ancient manuscripts requires substantial manual effort and time, even with computer assistance. This work is mainly done by historians and paleographers whose time is precious. There are many tools that assist this transcription process, but to our knowledge there has been no study to evaluate their actual benefits in terms of human time spent on the transcription. This paper presents a study on quantifying the temporal efficiency of different transcription and segmentation workflows. The main contribution of our work is the addition of a tracing layer to a popular existing open-source text transcription web platform: eScriptorium. We explore and compare the efficiency of three workflows: fully manual, using a default annotation model, and using a fine-tuned model for both segmentation and transcription. In our experiments, we aim to observe and compare the temporal gains achieved by each workflow, highlighting the trade-offs between manual and automated processing of transcriptions. In each workflow, we record the user's actions to measure the time spent performing the transcriptions and error corrections. The paper describes the design of the tracing layer and presents the methodology and the results. This work is conducted as part of the ChEDiL French ANR project.
12:10 Demonstration Lightning Session (Chair: Cerstin Mahlow)
12:10 A Comprehensive AI-Powered Editing and Typesetting Platform for Enhancing Academic Writing
Abstract
This demo introduces Doenba Edit, a user-friendly, AI-powered platform developed by Librum Technologies, Inc., designed for seamless editing and typesetting of academic writing within a word processor-like interface. It supports the entire academic writing workflow, from outlining and idea development to drafting, revising, typesetting, and cross-referencing, assisted by integrated AI tools at every stage. This offers a comprehensive solution for enhancing both the quality and efficiency of producing scholarly work. The interface supports figures, tables, mathematical expressions, cross-references, and citations while generating native LaTeX output. For advanced typesetting features supported by LaTeX but not yet available through the interface's visual tools, users can insert custom LaTeX snippets to achieve the desired results. Snippets and draft sections can be individually compiled and previewed in real time. The platform also includes robust version history, an integrated file system, and automated conversion to various LaTeX templates, making the final document ready for submission to their intended venue.
12:15 Use Case Demonstration @ DocEng2025: Conversation-Driven Multi-LLM Framework for Web Document Sentiment Analysis
Abstract
In this use case demonstration, we show how a system of collaborative Large Language Models (LLMs) can be applied to the task of analyzing the sentiment of online news articles. The emergence of LLMs has proven to be highly valuable in interpreting unstructured text, offering nuanced and context-aware insights. While they cannot fully replace traditional machine learning approaches for sentiment analysis, our approach illustrates how collaborative LLM architectures can enrich the explainability and trustworthiness of the outcomes.
12:20 The Di2Win Document Intelligence Platform
Abstract
We present the Di2Win Document Intelligence Platform (DIP), a modular, AI-driven pipeline that transforms raw document images—captured by scanners or mobile phones—into structured data and business actions in a single pass. The system comprises five loosely coupled micro-services: (1) image-quality verification using a contrast-invariant model that flags blur, skew, and illumination issues above 100 ms per page; (2) document classification via a Transformer-based model with layout embeddings, delivering top-k types with calibrated confidence; (3) information extraction through (i) Dilbert, a multimodal Token-Layout-Language model fine-tuned on weakly-labeled forms, or (ii) Delfos, a Large Language Model Mixture of Experts fine-tuned with well-defined prompts; (4) DataDrift, a powerful rules engine to avoid outputs inconsistent with the business rules; and (5) process automation orchestrated by a BPMN-aware RPA engine that routes results to databases, APIs, or human-review queues. All AI components are orchestrated through a messaging service to control the information flow, and the application exposes REST/gRPC endpoints to communicate with outside consumers. This enables the hot-swapping of models without downstream code changes by just plugging a new message consumer into the messaging system. This also provides horizontal scalability since, to increase the application throughput, we only need to add new AI engine consumers to the messaging system. Deployed in banking, insurance, and healthcare, the Di2Win DIP has processed more than 30 million pages, reducing average handling time by 79% and re-keying errors by 86%, speeding up workflows by up to ten times. Our DocEng demonstration allows attendees to upload documents, observe live quality and confidence dashboards, and edit extracted fields with immediate feedback to the active-learning loop.
12:30 Lunch
13:45 Interactive Demonstrations
14:15 Competition report: Binarizing Photographed Document Images - 2025 Quality, Time and Space Assessment
Gustavo P. Chaves
Thaylor Vieira
Gabriel de F. P E Silva
Rafael Dueire Lins
Steven J. Simske
14:30 ACM SigWeb Town Hall
Alexandra Bonnici
15:00 Coffee Break
15:30 Paper Session 4: Document Classification (Chair: Besat Kassaie)
15:30 SoAC and SoACer: A Sector-Based Corpus and LLM-Based Framework for Sectoral Website Classification
Abstract
One approach to understanding the vastness and complexity of the web is to categorize websites into sectors that reflect the specific industries or domains in which they operate. However, existing website classification approaches often struggle to handle the noisy, unstructured, and lengthy nature of web content, and current datasets lack a universal sector classification labeling system specifically designed for the web. To address these issues, we introduce SoAC (Sector of Activity Corpus), a large-scale corpus comprising 195,495 websites categorized into 10 broad sectors tailored for web content, which serves as the benchmark for evaluating our proposed classification framework, SoACer (Sector of Activity Classifier). Building on this resource, SoACer is a novel end-to-end classification framework that first fetches website information, then incorporates extractive summarization to condense noisy and lengthy content into a concise representation, and finally employs large language model (LLM) embeddings (Llama3-8B) combined with a classification head to achieve accurate sectoral prediction. Through extensive experiments, including ablation studies and detailed error analysis, we demonstrate that SoACer achieves an overall accuracy of 72.6% on our proposed SoAC dataset. Our ablation study confirms that extractive summarization not only reduces computational overhead but also enhances classification performance, while our error analysis reveals meaningful sector overlaps that underscore the need for multi-label and hierarchical classification frameworks. These findings provide a robust foundation for future exploration of advanced classification techniques that better capture the complex nature of modern website content.
15:55 Robust Image Classifiers Fail Under Shifted Adversarial Perturbations
Abstract
Non-robustness of image classifiers to subtle, adversarial perturbations is a well-known failure mode. Defenses against such attacks are typically evaluated by measuring the error rate on perturbed versions of the natural test set, quantifying the worst-case performance within a specified perturbation budget. However, these evaluations often isolate specific perturbation types, underestimating the adaptability of real-world adversaries who can modify or compose attacks in unforeseen ways. In this work, we show that models considered robust to strong attacks, such as AutoAttack, can be compromised by a simple modification of the weaker FGSM attack, where the adversarial perturbation is slightly transformed prior to being added to the input. Despite the attack's simplicity, robust models that perform well against standard FGSM become vulnerable to this variant. These findings suggest that current defenses may generalize poorly beyond their assumed threat models and can achieve inflated robustness scores under narrowly defined evaluation settings.
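A minimal PyTorch sketch of the kind of attack variant described above, assuming the transform is a simple spatial roll of the FGSM perturbation; the shift size and epsilon are placeholders, and this is not the authors' code.

```python
import torch
import torch.nn.functional as F

def shifted_fgsm(model, x, y, eps=8 / 255, shift=1):
    """Craft an FGSM perturbation, then spatially shift (roll) it
    before adding it to the input.

    x: (N, C, H, W) image batch in [0, 1]; y: class labels.
    """
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    delta = eps * x.grad.sign()                        # standard FGSM step
    delta = torch.roll(delta, shifts=shift, dims=-1)   # shift along image width
    return (x + delta).clamp(0, 1).detach()
```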
16:20 Document Classification using File Names
Abstract
Rapid document classification is critical in several time-sensitive applications like digital forensics and large-scale media classification. Traditional approaches that rely on heavy-duty deep learning models fall short due to high inference times over vast input datasets and computational resources associated with analyzing whole documents. In this paper, we present a method using lightweight supervised learning models, combined with a TF-IDF feature extraction-based tokenization method, to accurately and efficiently classify documents based solely on their file names, which substantially reduces inference time. Experiments on two datasets introduced in this paper show that our file name classifiers correctly predict more than 90% of in-scope documents with 99.63% and 96.57% accuracy while being 442x faster than more complex models such as DiT. Our results demonstrate that incorporating lightweight file name classification as a front-end to document analysis pipelines can efficiently process vast document datasets in critical scenarios, enabling fast and more reliable document classification.
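A small sketch of the general idea of lightweight file-name classification with TF-IDF features (the paper's exact tokenization, features, and models may differ); the file names and labels below are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: real deployments would use thousands of labeled names.
filenames = ["invoice_2023_04.pdf", "meeting_minutes_jan.docx",
             "scan_receipt_store.png", "q3_financial_report.xlsx"]
labels = ["invoice", "minutes", "receipt", "report"]

clf = make_pipeline(
    # Character n-grams cope with underscores, digits, and truncated words.
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(filenames, labels)
print(clf.predict(["invoice_2024_11.pdf"]))
```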
16:45 Spurious Cues in RVL-CDIP and Tobacco3482 Document Classification: The Case of ID Codes
Abstract
RVL-CDIP and Tobacco3482 are commonly used document classification benchmarks, but recent work on explainability has revealed that ID codes stamped on the documents in these datasets may be used by machine learning models to learn shortcuts on the classification task. In this paper, we present an in-depth investigation into the influence and impact of these ID codes on model performance. We annotate ID codes in documents from RVL-CDIP and Tobacco3482 and find that shallow learning models can achieve classification accuracy scores of roughly 40% on RVL-CDIP and 60% on Tobacco3482 using only features derived from the ID codes. We also find that a state-of-the-art document classifier sees a performance drop of 11 accuracy points on RVL-CDIP when ID codes are removed from the data. Finally, we train an ID code detection model in order to remove ID codes from RVL-CDIP and Tobacco3482 and make this data publicly available.
18:00 DocEng Dinner and Best Paper Award Ceremony
Friday 5th September
Main Programme (Day 3)
08:30 Welcome and Networking
09:00 Paper Session 5: Document Analysis and Generation (Chair: Didier Verna)
09:00 BioReadNet: A Transformer-Driven Hybrid Model for Target Audience-Aware Biomedical Text Readability Assessment
Abstract
The perception of the readability of biomedical texts varies depending on the reader's profile, a disparity further amplified by the intrinsic complexity of these documents and the unequal distribution of health literacy within the population. Although 72% of Internet users consult medical information online, a significant proportion have difficulty understanding it. To ensure that texts are accessible to a diverse audience, it is crucial to assess their readability. However, conventional readability formulas, designed for general texts, do not take this diversity into account, underlining the need to adapt evaluation tools to the specific needs of biomedical texts and the heterogeneity of readers. To address this gap, we propose a novel readability assessment method tailored to three distinct audiences: expert adults, non‑expert adults, and children. Our approach is built upon a structured, bilingual biomedical corpus of 20,008 documents (8,854 in French, 11,154 in English), compiled from multiple sources to ensure diversity in both content and audience. Specifically, the French corpus combines texts from Cochrane and Wikipedia/Vikidia, both of which are subsets of the CLEAR corpus, while the English corpus merges documents from the Cochrane Library, Plaba, and Science Journal for Kids. For each original expert-level text, domain specialists produced simplified variants calibrated specifically to the comprehension abilities of non‑expert adults or children. Every document is therefore explicitly labeled by its target audience. Leveraging this resource, we trained a diverse suite of classifiers, from classical approaches (e.g., XGBoost, SVM) to classifiers built upon language models (e.g., BERT, CamemBERT, BioBERT, DrBERT). We then designed a hybrid architecture "BioReadNet" that integrates transformer embeddings with expert‑driven linguistic features, achieving a macro‑averaged F1 score of 0.987.
09:25 Visual Large Language Models for Graphics Understanding: a Case Study on Floorplan Images
Abstract
This study explores the use of Vision Large Language Models (VLLMs) for identifying items in complex graphical documents. In particular, we focus on looking for furniture objects (e.g. beds, tables, and chairs) and structural items (doors and windows) in floorplan images. We evaluate one object detection model (YOLO) and state-of-the-art VLLMs on two datasets featuring diverse floorplan layouts and symbols. The experiments with VLLMs are performed in a zero-shot setting, meaning the models are tested without any training or fine-tuning, as well as with a few-shot approach, where examples of items to be found in the image are given to the models in the prompt. The results highlight the strengths and limitations of VLLMs in recognizing architectural elements, providing guidance for future research in the use of multimodal vision-language models for graphics recognition.
09:40 Designing Visual Tools for Writing Process Analysis
Abstract
Understanding how texts are produced is crucial. However, the writing process itself---including intermediate versions, copy-paste actions, input from co-authors or LLMs---remains invisible in the final text. This study addresses this gap by visualizing fine-grained keystroke logging data to capture both the product (final text) and the process (writer’s actions) at sentence and text level. We design and implement custom JavaScript visualizations of linguistically processed keystroke logging data. Our pilot study examines data from nine students writing under identical conditions; we analyze temporal, spatial, and structural aspects of writing. The results reveal diverse, non-linear writing strategies and suggest that individualized process visualizations can inform both document engineering and writing analytics. The novel visualization types we present demonstrate how process and product can be meaningfully integrated.
09:55 Synthetic Document Generation with Full Annotation: A Framework Utilizing Open-Weight Large Language Models
Abstract
Advances in generative AI have made creating synthetic semi-structured documents much easier. Models like GPT-4o can now generate realistic receipts with a simple prompt; however, even the most realistic receipts are not enough to train document understanding models without good annotations. This work presents a framework that uses open-source large language models to create fully-annotated receipts. The framework can be extended or modified, and includes a step for self-assessment. We use the open dataset SROIE to test the usefulness of the generated receipts, and show that mixing both datasets can improve global information extraction up to 8%, and up to 32.9% for specific fields of the SROIE dataset.
10:10 An Adaptive Agentic Tool Building Architecture leveraging Expert-in-the-Loop Guidance, applied to Document Generation
Abstract
We introduce a general-purpose LLM agentic architecture with expert-in-the-loop guidance that iteratively learns to create tools for searching information and generating documents while minimizing time for task domain adaptation and human feedback. We illustrate it with preliminary experiments on scientific synthesis processes (e.g. generating state-of-the-art research papers or patents given an abstract) but it could be applied to very different non-factual questions expecting long nuanced answers.
10:30 Coffee Break
11:00 Paper Session 6: Document Trust and Security (Chair: Charlotte Curtis)
11:00 Reinforcing Document Privacy in Nigeria: A Framework for Trust in National Data Systems
Abstract
Nigeria has seen gradual growth in this era of digital transformation. Currently, digital technology is being used to improve governance and drive business processes within the country. There have been significant efforts from the Nigerian government to encourage businesses to adopt technology in running day-to-day operations. This does not come without challenges.
In this paper, we examine the constraints surrounding the impact of digital infrastructure on documents and privacy in Nigeria. In this study, we analyse the inefficiencies surrounding document processes. Instead of focusing solely on expenses, we examine factors such as (1) policies, namely the objective of the Nigerian Data Protection Act and its influence; (2) corruption and insider threats; (3) the over-reliance on paper-based processing, and distrust surrounding the emergence of digital infrastructure; and (4) the legal framework in the country.
We analyze various articles that validate or contradict the idea that Nigeria currently faces a regression in securing documents and data processing. We evaluated the references, examined the current policies and frameworks the country is implementing to address these issues, and also explored potential solutions to mitigate them.
11:15 Document Encryption in Practice: A Comparative Framework and Evaluation
Abstract
Document files with sensitive information are used across nearly every industry. In recent years, cyberattacks have resulted in millions of sensitive documents being exposed. Although document encryption methods exist, they are often flawed in terms of usability, security, or deployability. We present a structured framework for evaluating document encryption methods, adapting the usability-deployability-security (“UDS”) model to the document encryption context. We apply this framework to compare current methods, performing a comprehensive evaluation of nine document protection methods, including password-based, passwordless, and cloud-based approaches. Our analysis across 15 design properties highlights the benefits and limitations of current methods. We propose strategies and design recommendations to address key limitations such as memory-wise effort, granular protection, and shareability.
11:30 Hierarchical Clustering of the SOREL Malware Corpus
Abstract
We discuss the use of hierarchical clustering to identify similar specimens in a large malware corpus. Clustering of any kind requires the use of a distance function, and evaluation of clustering algorithms requires criteria that involve some sort of ground truth. We use Jaccard Distance as the ground truth, and we compare the results of clustering when using MinHash and SuperMinHash, both of which are approximations of Jaccard, while supposedly being faster. This work therefore is a study of this tradeoff between speed and clustering quality.
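For reference, a compact sketch of the exact Jaccard distance and a hand-rolled MinHash approximation of it (the paper compares MinHash and SuperMinHash implementations; the salted-hash scheme and the toy feature sets below are illustrative only).

```python
import random

def jaccard_distance(a, b):
    """Exact Jaccard distance between two feature sets."""
    return 1.0 - len(a & b) / len(a | b)

def minhash_signature(features, num_hashes=128, seed=42):
    """Approximate a set with its MinHash signature using salted hashes."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash((salt, f)) for f in features) for salt in salts]

def minhash_distance(sig_a, sig_b):
    """Estimated Jaccard distance: 1 - fraction of matching signature slots."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return 1.0 - matches / len(sig_a)

# Toy "malware feature" sets (e.g. imported API names); purely illustrative.
sample_a = {"CreateFileW", "ReadFile", "WriteFile", "VirtualAlloc", "CloseHandle"}
sample_b = {"CreateFileW", "ReadFile", "WriteFile", "CloseHandle", "RegOpenKeyExW"}

print("exact  :", jaccard_distance(sample_a, sample_b))
print("minhash:", minhash_distance(minhash_signature(sample_a),
                                   minhash_signature(sample_b)))
```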
11:45 DocEng Book Series
Steve Simske
12:00 DocEng 2025 Closing Ceremony
12:30 End of DocEng 2025
Note for presenters: Long papers are allocated 20 minutes for the presentation and 5 minutes for questions. Short papers are allocated 10 minutes for the presentation and 5 minutes for questions.