Chinese Internet Corpus Resource Platform

Provides high-quality Chinese corpus resources for pre-training large artificial intelligence models.

#Artificial Intelligence
#Data security
#pre-training
#corpus
#Build and share together

Product Details

The Chinese Internet Corpus Resource Platform is a website hosted by the China Cyberspace Security Association. It provides high-quality, safe, and compliant Chinese corpus resources for pre-training large artificial intelligence models. The platform pools contributions from enterprises, universities, and research institutions through a "co-construction and sharing" mechanism. Its corpora include the Chinese Internet Basic Corpus 2.0, the People's Daily Online Mainstream Value Dataset, and the National Version Library's Ming and Qing literature corpora. Each corpus goes through source screening, format cleaning, language filtering, deduplication, content filtering, privacy filtering, and other processing steps intended to ensure that the data is legal, authentic, accurate, and objective. The platform aims to support national AI innovation and industrial development by helping large models understand and generate Chinese content and by improving their knowledge and value alignment.
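
The processing steps named above can be illustrated with a small sketch. This is not the platform's actual pipeline, which the listing does not describe; the threshold, the regular expressions, and every function name below are hypothetical and chosen only to show how format cleaning, language filtering, exact deduplication, and privacy masking might be chained.

```python
import hashlib
import re
import unicodedata

# Hypothetical settings for illustration only; the platform's real rules are not published here.
MIN_CJK_RATIO = 0.5
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")  # 11-digit mobile-style numbers


def clean_format(text: str) -> str:
    """Format cleaning: normalize Unicode (NFC) and collapse whitespace runs."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def cjk_ratio(text: str) -> float:
    """Language filtering helper: share of non-space characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u4e00" <= c <= "\u9fff" for c in chars) / len(chars)


def mask_private(text: str) -> str:
    """Privacy filtering: replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def process(docs):
    """Run cleaning, language filtering, exact deduplication, then privacy masking."""
    seen = set()
    kept = []
    for raw in docs:
        text = clean_format(raw)
        if cjk_ratio(text) < MIN_CJK_RATIO:
            continue  # mostly non-Chinese text is dropped
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after cleaning
        seen.add(digest)
        kept.append(mask_private(text))
    return kept
```

Calling `process` on a list of raw strings returns the surviving documents in their original order. Exact hashing only catches identical text after cleaning; near-duplicate detection and content filtering would need additional techniques that this sketch leaves out.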

Main Features

1. Provides a variety of high-quality Chinese corpora to meet different pre-training needs.
2. Applies strict data processing procedures to keep the corpora safe and compliant.
3. Covers multiple fields, including culture, politics, and the economy.
4. Supports a co-construction and sharing mechanism so that corpus resources are continuously updated and enriched.
5. Uses a standardized corpus format that is easy to download and use.
6. Releases new corpora regularly.
7. Provides policy information to help users follow industry trends.
8. Showcases the results of co-construction and sharing and promotes cooperation among industry, universities, and research institutions.

How to Use

1. Visit the platform website https://corpus.cybersac.cn/#/home.
2. Register and log in to the platform to obtain more resources and services.
3. Browse and select the required corpus on the homepage or dataset page.
4. Click on the corpus of interest to view detailed information and data samples.
5. Download the corpus as needed and use it according to the format and instructions provided by the platform.
6. Refer to the policy information page to understand industry trends and relevant policies and regulations, so that research and development work meets requirements.
7. Participate in co-construction and sharing activities by contributing your own data or research results.

Target Users

The platform is aimed mainly at researchers and developers in enterprises, universities, and research institutions who work on large artificial intelligence models. It offers them rich, rigorously screened and processed Chinese corpus resources that can improve large-model training, help address ideological security, knowledge and capability building, and value alignment, and so support the development of artificial intelligence technology for Chinese-language settings.

Examples

An artificial intelligence company uses the Chinese Internet Basic Corpus 2.0 to train its natural language processing model, improving the model's ability to understand and generate Chinese text.

A university research team uses the People's Daily Online Mainstream Value Dataset to build knowledge graphs for specific fields, supporting the application of artificial intelligence in those fields.

A research institution uses the National Version Library's Ming and Qing literature corpora for digital research on ancient documents, bringing traditional culture together with modern technology.

Related Recommendations

Discover more high-quality AI tools like this one

gpt oss

GPT OSS is an open source language model released by OpenAI under the Apache 2.0 license. It offers strong reasoning capabilities, with an emphasis on efficiency, security, and API compatibility.

Artificial Intelligence Open source model

Dyad

Dyad is an application-building tool based on open source technology that lets users freely customize and build AI applications. Its main advantages are flexibility and support for local development and customization.

Open source plug-in

SandboxAQ

SandboxAQ is an advanced computing product that applies AI simulation, cryptography management, and AI sensing to help global organizations address major societal challenges.

AI simulation

Dia AI

Dia is a text-to-speech (TTS) model with 1.6 billion parameters developed by Nari Labs. It generates highly realistic dialogue directly from text, supports emotion and intonation control, and can produce non-verbal sounds such as laughter and coughs. Its pre-trained weights are hosted on Hugging Face and support English generation. The model is intended for research and educational use.

AI Open source

GenPRM

GenPRM is a process reward model (PRM) that generates explicit reasoning at test time to arrive at its reward judgments. This can give more accurate reward evaluation on complex tasks. Its main advantage is improving model performance under limited resources and reducing computational cost in practical applications.

Artificial Intelligence machine learning

EasyControl Ghibli

EasyControl Ghibli is a recently released model hosted on the Hugging Face platform. According to the listing, it is designed to simplify controlling and managing AI tasks through a user-friendly interface, and it targets both beginners and professionals.

AI Model

Hunyuan T1

Hunyuan T1 is a very large-scale reasoning model launched by Tencent. It uses reinforcement learning and extensive post-training to significantly improve its reasoning capabilities. It performs well at long-text processing and context capture while keeping compute consumption low, and it suits reasoning tasks, especially in mathematics and logic. The model is continuously optimized based on real-world feedback and targets applications in scientific research, education, and other fields.

Artificial Intelligence educate

MC-Bench

MC-Bench is an online platform for evaluating and comparing AI-generated builds in the Minecraft game environment. Users vote on the results and so take part in the evaluation. Its main advantage is that it is fun and interactive, giving users an easy way to learn what AI models can do.

AI interactive

SpatialLM

SpatialLM is a large language model designed to process 3D point cloud data and produce structured 3D scene understanding output, including semantic categories for architectural elements and objects. It accepts point clouds from a variety of sources, including monocular video sequences, RGBD images, and LiDAR sensors, without requiring specialized equipment. It is useful for autonomous navigation and complex 3D scene analysis tasks that depend on spatial reasoning.

machine learning spatial reasoning

Mistral Small 3.1

Mistral-Small-3.1-24B-Base-2503 is an open source model with 24 billion parameters and the base model of Mistral Small 3.1. It supports multiple languages and long contexts, handles both text and vision tasks, and is aimed at enterprise needs.

Artificial Intelligence Open source

Agent Network Protocol

Agent Network Protocol (ANP) aims to define how intelligent agents connect and communicate with each other. It protects data security and privacy through decentralized identity authentication and end-to-end encrypted communication, and its dynamic protocol negotiation lets agent networks organize themselves for efficient collaboration. The goal of ANP is to break down data silos so that AI can access complete contextual information. The protocol is open, secure, and efficient, and it suits scenarios that require agents to collaborate.

Intelligent agent Decentralization

Meta FAIR AI Demos

Meta FAIR AI Demos showcases Meta's latest AI research across fields such as vision and language. The demos are free to try and are positioned as a showcase of cutting-edge AI technology.

AI demo Multi-field applications

Project Aria

Project Aria is a Meta research project focused on the first-person perspective, intended to advance augmented reality (AR) and artificial intelligence (AI). It collects data from the wearer's perspective with devices such as the Aria Gen 2 glasses to support machine perception and AR research. Its key strengths include innovative hardware design, rich open source datasets and challenges, and close collaboration with research partners worldwide. The project is part of Meta's long-term investment in AR technology and aims to drive industry progress through open research.

Artificial Intelligence augmented reality

Scira AI

Scira AI is an AI platform that integrates multiple APIs to support a wide range of applications. It offers a variety of data processing and analysis functions for different users and scenarios. Its main advantages are flexibility, breadth of functionality, and quick deployment. It suits users and businesses that need several AI capabilities in one place; pricing and positioning may vary with user needs.

Data processing Multifunctional

Elimination Game

Elimination Game is a benchmarking framework for evaluating how large language models (LLMs) perform in complex social environments. It simulates a multi-player competition similar to 'Werewolf' and tests a model's social reasoning, strategy selection, and deception capabilities through public discussion, private communication, and elimination by vote. The framework gives researchers a tool for studying AI behavior in social games and gives developers insight into how models might behave in real-life social scenarios. Its main features are multi-round interaction, dynamic alliance and defection mechanisms, and detailed evaluation metrics for measuring the social ability of AI.

Artificial Intelligence Benchmark

Evo 2

Evo 2 is an AI foundation model launched by NVIDIA that uses deep learning to analyze the genetic code of biomolecules. Developed on the NVIDIA DGX Cloud platform, it processes large-scale genomic data and can handle gene sequences of up to 1 million tokens, which allows a more complete view of genomic complexity. Potential applications in biomedicine include disease diagnosis, drug development, and gene editing. Evo 2 was developed with support from the Arc Institute and Stanford University.

AI high performance computing