AB2013 Documentation

MIDJOURNEY TRAINING DATA DOCUMENTATION 

Published pursuant to California Civil Code Section 3111 (AB2013)

Last Updated: January 20, 2026

1. Purpose of Training Data 

Midjourney’s mission is to explore new mediums of thought and expand the imaginative powers of the human species. We offer publicly available generative AI models in the state of California that are designed to create beautiful images and videos in response to our users’ prompts. We select our training data to enable the model to learn visual patterns, artistic styles, and the relationship between visual characteristics and textual descriptions across a wide range of subjects and categories. Our training dataset supports our models in learning diverse visual content and translating words into works of art, unlocking the creativity of our users.

2. Dataset Sources and Ownership, Synthetic Data  

Midjourney models are trained on a dataset that consists of a mixture of the following source categories: 

  • Publicly available data: Content crawled from the public web and content from publicly accessible repositories.

  • Data from third parties and Midjourney users: Non-public data from third-party providers and data our users provide through use of the Midjourney service.

  • Internally generated data: Data generated and annotated internally by Midjourney, including synthetic data to help support various training objectives and supplement real-world data.

3. Dataset Characteristics

The data used to train Midjourney models encompass a diverse range of content types, including images across a variety of subjects and categories, the textual metadata associated with these images, and human-provided annotations, ratings, and preferences to help reinforce model learning.

4. Dataset Scale 

Midjourney models are trained on datasets composed of billions of images, text, and audiovisual content. The exact number of data points varies depending on the model version and phase of training.

5. Intellectual Property Status 

Given the varied sources of our training data, Midjourney’s training dataset contains a mixture of licensed data, data that may be protected by copyright and is used under fair use, data in the public domain, and data that is not eligible for copyright protection.

6. Personal Information 

A large portion of Midjourney’s training dataset consists of data from the internet, which often relates to people. Although we take steps to reduce the amount of personal information included in our training dataset, some of our data may incidentally contain personal information as defined in California Civil Code Section 1798.140.

7. Dataset Processing 

Midjourney training data undergo several processing steps during training, including: 

  • Deduplication
  • Removal of low-quality images
  • Safety filtering to remove certain data with known risk of containing child sexual abuse material (CSAM) and other categories of sensitive or disallowed content
  • Privacy processing to filter or remove sensitive personal information
  • Categorization based on relevance, quality, or image formats
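
For illustration only, the first two steps listed above (deduplication and removal of low-quality images) can be sketched in Python. This is a minimal sketch under stated assumptions and does not describe Midjourney's actual pipeline: the Record type and its fields, the use of exact-match SHA-256 hashing to detect duplicates, and the 256-pixel minimum side length are all hypothetical choices made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical training record; field names are illustrative only."""
    image_bytes: bytes
    width: int
    height: int
    caption: str


def deduplicate(records):
    """Keep the first record for each distinct image payload (exact byte match)."""
    seen = set()
    unique = []
    for record in records:
        digest = hashlib.sha256(record.image_bytes).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


def drop_low_resolution(records, min_side=256):
    """Drop records whose shorter side is below an assumed pixel threshold."""
    return [r for r in records if min(r.width, r.height) >= min_side]


def process(records, min_side=256):
    """Apply deduplication first, then the assumed resolution filter."""
    return drop_low_resolution(deduplicate(records), min_side)
```

Exact hashing only catches byte-identical copies; resized or re-encoded duplicates would need a perceptual or embedding-based comparison, which this sketch does not attempt.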

8. Training Timeline 

Midjourney began collecting data to develop Midjourney models in 2022, and continues to collect data today. These datasets were first incorporated into model development in 2022.