Richard Cook
MA Linguistics - EN/ES/DE/FR/PT - Artificial Intelligence Analyst/Developer
The Iberian dataset featured here, from a paper by Jiaming Luo, Frederik Hartmann, Enrico Santus, Yuan Cao and Regina Barzilay, plays a crucial role in advancing historical and computational linguistics by enabling the automated decipherment and interpretation of complex undersegmented scripts using phonetic priors and machine learning techniques. https://lnkd.in/dQavAVFD

Most undeciphered lost languages exhibit two characteristics that pose significant decipherment challenges: (1) the scripts are not fully segmented into words; (2) the closest known language is not determined. The authors propose a decipherment model that handles both challenges by building on rich linguistic constraints reflecting consistent patterns in historical sound change. They capture the natural phonological geometry by learning character embeddings based on the International Phonetic Alphabet (IPA). The resulting generative framework jointly models word segmentation and cognate alignment, informed by phonological constraints. They evaluate the model on both deciphered languages (Gothic, Ugaritic) and an undeciphered one (Iberian). The experiments show that incorporating phonetic geometry leads to clear and consistent gains. Additionally, they propose a measure of language closeness which correctly identifies related languages for Gothic and Ugaritic. For Iberian, the method does not show strong evidence supporting Basque as a related language, concurring with the position favored by current scholarship.

I have provided an overview of the main features of this dataset, such as accessibility, suggested uses and labelled features, at https://lnkd.in/dCRuGHdc
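To make the "phonological geometry" idea concrete, here is a minimal, self-contained sketch of what IPA-based character embeddings buy you: sounds that share articulatory features end up close in vector space. The feature values below are illustrative, not the paper's actual representation.

```python
import numpy as np

# Toy articulatory feature vectors: [voiced, bilabial, alveolar, plosive, nasal]
IPA_FEATURES = {
    "p": np.array([0.0, 1.0, 0.0, 1.0, 0.0]),
    "b": np.array([1.0, 1.0, 0.0, 1.0, 0.0]),
    "t": np.array([0.0, 0.0, 1.0, 1.0, 0.0]),
    "m": np.array([1.0, 1.0, 0.0, 0.0, 1.0]),
}

def phonetic_distance(a: str, b: str) -> float:
    """Euclidean distance between two sounds in articulatory feature space."""
    return float(np.linalg.norm(IPA_FEATURES[a] - IPA_FEATURES[b]))

# /p/ and /b/ differ only in voicing, so they sit closer together than
# /p/ and /m/, which differ in voicing, manner and nasality.
print(phonetic_distance("p", "b"))  # 1.0
print(phonetic_distance("p", "m"))  # ~1.73
```

Under such a geometry, a sound-change hypothesis like /p/ → /b/ costs less than /p/ → /m/, which is exactly the kind of constraint that regular historical sound change obeys.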
CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus, a dataset by Ye Jia, Michelle Tadmor Ramanovich, Quan Wang and Heiga Zen (Byungha Chun) for sentence-level parallel speech-to-speech translation, was used as part of "FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs". https://lnkd.in/dqvE5TiZ

FunAudioLLM is a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multilingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction following. The SenseVoice and CosyVoice models have been open-sourced on ModelScope and Hugging Face, with the corresponding training, inference, and fine-tuning code released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://lnkd.in/dGcAHRCZ, and the code can be accessed at https://lnkd.in/d6DFngb4.

I have provided an overview of the main features of this dataset, such as accessibility, suggested uses and labelled features, at https://lnkd.in/dCRuGHdc
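For readers who want to inspect the corpus itself, here is a hedged sketch of loading a sentence-level speech-to-speech pair with the Hugging Face `datasets` library. The dataset identifier, config name, and keyword arguments are assumptions for illustration; consult the CVSS release page for the actual loading instructions.

```python
from datasets import load_dataset

# ASSUMPTION: "google/cvss" / "cvss_c" / languages=["fr"] are hypothetical
# identifiers. CVSS pairs source speech (drawn from CoVoST 2) with synthesized
# target speech and target-language text, here notionally French-to-English.
cvss = load_dataset("google/cvss", "cvss_c", languages=["fr"], split="train")

example = cvss[0]
# Fields one would expect in a sentence-level speech-to-speech pair:
# source audio, target (translation) audio, and the target-language text.
print(example.keys())
```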
The UCF101 action recognition dataset by Khurram Soomro, Amir Roshan Zamir and Mubarak Shah was used as part of "CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers". https://lnkd.in/dkj9xRpN

Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. Their application to video generation still faces many challenges: the potentially huge computation cost makes training from scratch unaffordable, and the scarcity and weak relevance of text-video datasets hinder models from understanding complex movement semantics. The authors of this work present CogVideo, a 9B-parameter transformer trained by inheriting a pretrained text-to-image model, CogView2. They also propose a multi-frame-rate hierarchical training strategy to better align text and video clips. As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models by a large margin in machine and human evaluations.

I have provided an overview of the main features of this dataset, such as accessibility, suggested uses and labelled features, at https://lnkd.in/dCRuGHdc
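As a starting point for working with the dataset itself, torchvision ships a UCF101 loader. A minimal sketch, assuming the videos and the official train/test split files have already been downloaded from the UCF101 site (the local paths are placeholders):

```python
import torchvision

ucf101 = torchvision.datasets.UCF101(
    root="data/UCF-101",                      # placeholder video directory
    annotation_path="data/ucfTrainTestlist",  # placeholder split annotations
    frames_per_clip=16,
    step_between_clips=16,
    train=True,
)

video, audio, label = ucf101[0]
print(video.shape, label)  # a 16-frame clip tensor and its action-class index
```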
Introducing the Bamboogle dataset by Chia Yew Ken, which was used as part of "MindSearch: Mimicking Human Minds Elicits Deep AI Searcher" by Zehui Chen, Kuikun Liu, Qiuchen Wang, Jiangning Liu, Wenwei Zhang, Kai Chen and Feng Zhao. https://lnkd.in/dPhHX8BB

Information seeking and integration is a complex cognitive task that consumes enormous time and effort. Inspired by the remarkable progress of Large Language Models, recent works attempt to solve this task by combining LLMs and search engines. However, these methods still obtain unsatisfying performance due to three challenges: (1) complex requests often cannot be accurately and completely retrieved by the search engine in a single query; (2) the information to be integrated is spread over multiple web pages along with massive noise; and (3) a large number of web pages with long contents may quickly exceed the maximum context length of LLMs. Inspired by the cognitive process humans use to solve these problems, the authors introduce MindSearch, which mimics the human mind in web information seeking and integration and can be instantiated by a simple yet effective LLM-based multi-agent framework. The WebPlanner models the human mind's multi-step information seeking as a dynamic graph construction process: it decomposes the user query into atomic sub-questions as nodes in the graph and progressively extends the graph based on the search results from WebSearcher. Tasked with each sub-question, WebSearcher performs hierarchical information retrieval with search engines and collects valuable information for WebPlanner. This multi-agent design enables the whole framework to seek and integrate information in parallel from a large number of web pages (e.g., more than 300) in 3 minutes, equivalent to roughly 3 hours of human effort. MindSearch demonstrates significant improvement in response quality in terms of depth and breadth, on both closed-set and open-set QA problems. Moreover, responses from MindSearch based on InternLM2.5-7B are preferred by human evaluators over those of the ChatGPT-Web and Perplexity.ai applications, which implies that MindSearch can already deliver a competitive alternative to proprietary AI search engines.

I have provided an overview of the main features of this dataset, such as accessibility, suggested uses and labelled features, at https://lnkd.in/dCRuGHdc
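To illustrate the dynamic graph construction behind WebPlanner, here is a minimal sketch. The canned `decompose` and `search` functions are toy stand-ins for the LLM planning call and the WebSearcher retrieval call; the paper's actual agents are far richer.

```python
from collections import deque

# Canned multi-hop decomposition, Bamboogle-style: answering the root question
# requires chaining two atomic sub-questions.
CANNED_SUBQUESTIONS = {
    "Who directed the film that won Best Picture in 1998?": [
        "Which film won Best Picture in 1998?",
        "Who directed Titanic?",
    ],
}

def decompose(question: str) -> list[str]:
    # Toy stand-in for the WebPlanner's LLM call.
    return CANNED_SUBQUESTIONS.get(question, [])

def search(question: str) -> str:
    # Toy stand-in for a WebSearcher retrieval call.
    return f"<answer retrieved for: {question}>"

def plan(root: str, max_nodes: int = 20) -> dict[str, str]:
    graph: dict[str, str] = {}            # sub-question node -> answer
    frontier = deque(decompose(root))
    while frontier and len(graph) < max_nodes:
        node = frontier.popleft()
        graph[node] = search(node)        # one WebSearcher task per node
        frontier.extend(decompose(node))  # extend the graph with follow-ups
    return graph

print(plan("Who directed the film that won Best Picture in 1998?"))
```

Because each node is an independent WebSearcher task, the sub-questions in the frontier can be dispatched in parallel, which is what lets the real system cover hundreds of pages in minutes.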
The Translated Wikipedia Biographies dataset has been designed by Google to analyze common gender errors in machine translation, such as incorrect gender choices in pro-drop, possessives, and gender agreement. Each instance of the dataset represents a person (identified in the biographies as feminine or masculine), a rock band, or a sports team (the latter two considered genderless). Each entity is represented by a long text translation (8 to 15 connected sentences referring to that central entity). Articles were originally written in English and have been professionally translated into Spanish and German. For Spanish, translations were optimized for pronoun drop, so the same set can be used to analyze both pro-drop (Spanish → English) and gender agreement (English → Spanish). https://lnkd.in/dtCtfETY

I have provided an overview of the main features of this dataset, such as accessibility, suggested uses and labelled features, at https://lnkd.in/dCRuGHdc
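A toy sketch of the kind of error analysis this dataset enables: given an entity's annotated gender, flag pronouns in the English output that contradict it. The regex heuristic is illustrative only, not the dataset's official evaluation.

```python
import re

MASCULINE = {"he", "him", "his"}
FEMININE = {"she", "her", "hers"}

def pronoun_gender_errors(translation: str, entity_gender: str) -> int:
    """Count pronouns whose gender contradicts the annotated entity gender."""
    tokens = re.findall(r"[a-z']+", translation.lower())
    wrong = MASCULINE if entity_gender == "feminine" else FEMININE
    return sum(token in wrong for token in tokens)

# Spanish drops the subject pronoun ("Estudió en Madrid"), so an MT system
# translating into English must infer the gender from sentence context;
# a wrong guess surfaces as an error here.
print(pronoun_gender_errors("He studied in Madrid.", "feminine"))  # 1
```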
The EVOBC dataset (used as part of "An open dataset for the evolution of oracle bone characters: EVOBC" by Haisu Guan, Jinpeng Wan et al.) plays a crucial role in advancing research into the evolution of Oracle Bone characters, fostering interdisciplinary collaboration, and preserving cultural heritage through digital means. https://lnkd.in/dmwnNfFd

The earliest extant Chinese characters originate from oracle bone inscriptions, which are closely related to other East Asian languages. These inscriptions hold immense value for anthropology and archaeology. However, deciphering oracle bone script remains a formidable challenge, with only approximately 1,600 of the over 4,500 extant characters elucidated to date; further scholarly investigation is required to comprehensively understand this ancient writing system. Artificial Intelligence technology is a promising avenue for deciphering oracle bone characters, particularly concerning their evolution, but one of the obstacles is the lack of datasets mapping the evolution of these characters over time. In this study, the authors systematically collected ancient characters from authoritative texts and websites spanning six historical stages: Oracle Bone Characters - OBC (15th century B.C.), Bronze Inscriptions - BI (13th century to 221 B.C.), Seal Script - SS (11th to 8th centuries B.C.), Spring and Autumn period Characters - SAC (770 to 476 B.C.), Warring States period Characters - WSC (475 to 221 B.C.), and Clerical Script - CS (221 B.C. to 220 A.D.). They then constructed an extensive dataset, EVolution Oracle Bone Characters (EVOBC), consisting of 229,170 images representing 13,714 distinct character categories. Validation and simulated decipherment on the constructed dataset demonstrate its high efficacy in aiding the study of oracle bone script. This openly accessible dataset aims to digitalize ancient Chinese scripts across multiple eras, facilitating the decipherment of oracle bone script by examining the evolution of glyph forms.

I have provided an overview of the main features of this dataset, such as accessibility, suggested uses and labelled features, at https://lnkd.in/dCRuGHdc
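For a character-image dataset organized by historical stage and character category, a standard torchvision loader is a natural starting point. A hedged sketch; the directory layout below is an assumption for illustration, so consult the EVOBC release for the real structure:

```python
from torchvision import datasets, transforms

to_tensor = transforms.Compose([
    transforms.Grayscale(),
    transforms.Resize((96, 96)),
    transforms.ToTensor(),
])

# ASSUMED layout: data/EVOBC/OBC/<character_category>/<image>.png, with one
# top-level folder per historical stage (OBC, BI, SS, SAC, WSC, CS).
oracle_stage = datasets.ImageFolder("data/EVOBC/OBC", transform=to_tensor)

image, label = oracle_stage[0]
print(image.shape, oracle_stage.classes[label])  # [1, 96, 96] + category name
```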
This dataset of modern Chinese characters was used by Haisu Guan, Huanxin Yang, Xinyu Wang, Shengwei Han, Yongge Liu, Lianwen Jin, Xiang Bai and Yuliang Liu for research in computational and historical linguistics, focusing on the ancient Chinese script known as Oracle Bone Script. Overall, the dataset is a valuable resource for advancing the study of ancient scripts through modern computational techniques, preserving historical information, and fostering interdisciplinary research between computer science and the humanities. https://lnkd.in/d8Jnj6Z5

Originating in China's Shang Dynasty approximately 3,000 years ago, Oracle Bone Script (OBS) is a cornerstone in the annals of linguistic history, predating many established writing systems. Despite the discovery of thousands of inscriptions, a vast expanse of OBS remains undeciphered, casting a veil of mystery over this ancient language. The emergence of modern AI technologies presents a novel frontier for OBS decipherment, challenging traditional NLP methods that rely heavily on large textual corpora, a luxury not afforded by historical languages. This paper introduces a novel approach by adopting image generation techniques, specifically through the development of Oracle Bone Script Decipher (OBSD). Utilizing a conditional diffusion-based strategy, OBSD generates vital clues for decipherment, charting a new course for AI-assisted analysis of ancient languages. To validate its efficacy, extensive experiments were conducted on an oracle bone script dataset, with quantitative results demonstrating the effectiveness of OBSD.

I have provided an overview of the main features of this dataset, such as accessibility, suggested uses and labelled features, at https://lnkd.in/dCRuGHdc
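A schematic sketch of the conditional-diffusion idea behind OBSD: a denoising network, conditioned on an oracle-bone glyph image, iteratively refines noise into a modern-character "clue" image. The `denoiser` signature is hypothetical and the update rule is deliberately simplified; this is not the paper's actual architecture or sampler.

```python
import torch

def sample(denoiser, obs_glyph: torch.Tensor, steps: int = 50) -> torch.Tensor:
    """Refine pure noise into a modern-character image, conditioned on an
    oracle-bone glyph. `denoiser` is a hypothetical noise-prediction net."""
    x = torch.randn_like(obs_glyph)            # start from pure noise
    for t in reversed(range(steps)):
        t_batch = torch.full((x.shape[0],), t)
        eps = denoiser(x, t_batch, obs_glyph)  # condition on the ancient glyph
        x = x - eps / steps                    # schematic update only; real
                                               # samplers follow a DDPM/DDIM
                                               # noise schedule
    return x                                   # proposed modern-form clue
```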
An interesting dataset by Felipe Soares and Martin Krallinger, designed to serve as a comprehensive resource for advancing multilingual translation research, improving biomedical text processing, and supporting global health communication by providing a robust collection of parallel biomedical texts in multiple languages. https://lnkd.in/dZb8AEcd

The BVS database (Health Virtual Library) is a centralized source of biomedical information for Latin America and the Caribbean, created in 1998 and coordinated by BIREME in agreement with the Pan American Health Organization (OPAS). Abstracts are available in English, Spanish, and Portuguese, with a subset in more than one language, making BVS a possible source of parallel corpora. In this article, the authors present the development of parallel corpora from BVS in three languages: English, Portuguese, and Spanish. Their parallel corpus is freely available, with complementary information regarding article metadata.

I have provided an overview of the main features of this dataset, such as accessibility, suggested uses and labelled features, at https://lnkd.in/dCRuGHdc
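A toy sketch of how parallel pairs can be built from abstracts available in several languages, as in BVS. The input records and the naive 1:1 sentence alignment are simplifying assumptions; real parallel-corpus construction uses proper sentence splitters and alignment tools.

```python
# Toy records standing in for multilingual BVS article abstracts.
records = [
    {"id": "art-001",
     "en": "The study evaluated vaccine coverage. Coverage was high.",
     "es": "El estudio evaluó la cobertura. La cobertura fue alta."},
]

def split_sentences(text: str) -> list[str]:
    # Naive splitter for illustration; real pipelines use proper tokenizers.
    return [s.strip() + "." for s in text.split(".") if s.strip()]

pairs = []
for rec in records:
    en, es = split_sentences(rec["en"]), split_sentences(rec["es"])
    if len(en) == len(es):          # naive 1:1 alignment assumption
        pairs.extend(zip(en, es))

for en_sent, es_sent in pairs:
    print(en_sent, "|||", es_sent)
```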
An interesting dataset designed to provide a high-quality, standardized benchmark for the development, testing, and evaluation of structure-from-motion (SfM) and related computer vision algorithms. https://lnkd.in/dPAVjnUy

Recovering 3D structure and camera motion from images has been a long-standing focus of computer vision research and is known as Structure-from-Motion (SfM). Solutions to this problem are categorized into incremental and global approaches. Until now, the most popular systems have followed the incremental paradigm due to its superior accuracy and robustness, while global approaches are drastically more scalable and efficient. With this work, the authors revisit the problem of global SfM and propose GLOMAP, a new general-purpose system that outperforms the state of the art in global SfM. In terms of accuracy and robustness, it achieves results on par with or superior to COLMAP, the most widely used incremental SfM system, while being orders of magnitude faster.

I have provided an overview of the main features of this dataset, such as accessibility, suggested uses and labelled features, at https://lnkd.in/dCRuGHdc
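To show the "global" idea in miniature: instead of growing a reconstruction camera by camera, global SfM estimates all camera poses at once from pairwise measurements. This toy single-axis version reduces rotation averaging to linear least squares; real systems like GLOMAP work on full 3D rotations with robust losses, and the measurement values below are made up for illustration.

```python
import numpy as np

# Relative yaw measurements theta_j - theta_i for camera pairs (i, j, value).
edges = [(0, 1, 0.52), (1, 2, 0.48), (0, 2, 1.05)]

A = np.zeros((len(edges) + 1, 3))
b = np.zeros(len(edges) + 1)
for row, (i, j, rel) in enumerate(edges):
    A[row, i], A[row, j], b[row] = -1.0, 1.0, rel
A[-1, 0] = 1.0                 # gauge fixing: anchor camera 0 at 0 rad

# Solve for all absolute yaws simultaneously, reconciling the noisy,
# mutually inconsistent pairwise measurements in one shot.
thetas, *_ = np.linalg.lstsq(A, b, rcond=None)
print(thetas)                  # consistent absolute yaws, ~[0.0, 0.54, 1.03]
```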