Mobvoi announced that it will open to the public part of the training dataset of its hyperscale language model "Sequence Monkey", under the name "Sequence Monkey Open Source Dataset 1.0".
Sequence Monkey is one of Mobvoi's core technologies. It has strong general representation and reasoning capabilities and has performed well in areas such as question answering, natural language processing, machine translation, and text summarization, greatly improving productivity and data processing capability.
To promote the continued progress of large language model technology, Mobvoi decided to open source part of its training data. The "Sequence Monkey Open Source Dataset 1.0" includes a Chinese general text corpus, a corpus of ancient poetry with modern translations, and a text generation corpus. All of these have been carefully selected and organized to ensure high quality and an easy-to-use data format. The company has also adopted a permissive license agreement, which makes the data easy for developers and researchers to access.
With this release, Mobvoi hopes to attract more people and teams to the research and application of large language models and to jointly advance this cutting-edge technology. The company firmly believes that releasing the open source dataset will promote academic exchange and cooperation and accelerate innovation in related fields.
Project address: https://github.com/mobvoi/seq-monkey-data