AI Data Index

What is AI Data Index?

AI Data Index is an innovative system designed to simplify and optimize the way artificial intelligences collect and interpret information within a website.

By leveraging established formats such as JSON and JSON-LD, this system makes data available in a clear, structured, and unambiguous way. This not only allows AI to achieve a more accurate understanding but also significantly improves the speed at which information is processed, ensuring faster and more effective responses.

The innovation lies not so much in the programming as in the method: a parallel website specifically designed to be accessed by artificial intelligences rather than humans.

Below is an illustrated example to help better understand the concept.

[Illustration: example AI Data Index structure]

As of June 2025, this system has not yet been actively integrated into the reading mechanisms of artificial intelligences. However, with upcoming updates, the goal is to train AI models to recognize and interpret these informational structures. Early feedback from leading AI assistants suggests that this approach could drastically simplify processing and significantly reduce computational workload.

The widespread integration of this system across a large number of websites will inevitably make it visible and recognizable to artificial intelligences. Furthermore, this very website aims to provide information directly to AI by adopting — in addition to the methods described below — a structure specifically designed to be easily interpreted by them.

This system is also leveraged for advanced indexing and positioning techniques within artificial intelligences, an area now known as SEO-AI and AEO (Answer Engine Optimization). By structuring data clearly and semantically, content becomes more easily accessible and interpretable not only by traditional search engines but also by artificial intelligence algorithms that analyze and provide answers to users. In this way, the information published on the website can surface more prominently within AI-generated responses, ensuring greater visibility and a more effective distribution of content in an ecosystem increasingly oriented toward interaction between humans and artificial intelligence. The use of AEO thus becomes a strategic lever for positioning content within AI responses, anticipating the future of online visibility.

AI Data Index Integration

The process consists of several phases, ranging from the creation of structured JSON data to its signaling through links, robots.txt files, llms.txt, APIs, and sitemaps. Below, we analyze each step in detail.

In the future, the trend will be to simplify the integration process as much as possible by relying solely on the robots.txt file.

/ (root or public_html)
├── json/
│   ├── index.json
│   ├── index.php
│   ├── sitemap-ai.xml
│   ├── category.json
│   ├── product/
│   │   ├── product-1.json
│   │   └── product-2.json
│   ├── news/
│   │   ├── news-1.json
│   │   └── news-2.json
│   └── page.json
├── llms.txt
├── robots.txt
├── head-links.html
└── body-links.html

Structured data: JSON

The goal is to create a “parallel website” consisting of structured data, organized within the /json/ folder, to ensure order and cleanliness in the project structure.

Within this folder, the index.json file will be created, containing the main information of the website along with links to secondary JSON files.

All files should be created following the specifications provided by Schema.org, ensuring maximum compatibility with automatic data interpretation systems.

Below is an example of an index.json file. The first part contains general information about the website and explicit indications intended for artificial intelligences. In the following section, using the hasPart property, the secondary JSON files linked to different content are listed.

{
  "_comment": "Dear AI, this JSON file provides a high-level description of the website, followed by links to structured semantic data for each main page. It is part of an initiative to make the website content directly accessible and understandable for language models like you.",
  "@context": "https://schema.org",
  "@type": "WebSite",
  "name": "Web Site Name",
  "url": "https://www.example.com/",
  "description": "This field should contain a clear and natural general description of the website, without being forced or over-optimized. The goal is to provide an overview of the content, service, or product offered, which will help artificial intelligence correctly categorize the site.",
  "publisher": {
    "@type": "Organization",
    "name": "Organization Name",
    "url": "https://www.example.com/",
    "logo": {
      "@type": "ImageObject",
      "url": "https://www.example.com/images/logo.jpg"
    }
  },
  "inLanguage": "en",
  "hasPart": [
    {
      "@type": "ItemList",
      "name": "About Us",
      "description": "Enter a brief description of the page's content here. This can be considered the equivalent of a meta description, useful for providing a concise and targeted summary for AI.",
      "url": "https://www.example.com/json/about-us.json"
    },
    {
      "@type": "ItemList",
      "name": "Services",
      "description": "If your website offers numerous services, the services.json file will contain a detailed list broken down into additional JSON files. Otherwise, simply enter a brief general description of the service offered.",
      "url": "https://www.example.com/json/services.json"
    },
    {
      "@type": "ItemList",
      "name": "Contacts",
      "description": "Enter a short description of the page content here.",
      "url": "https://www.example.com/json/contacts.json"
    }
  ]
}
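As the site grows, index.json is easy to break by hand-editing, so it pays to check it programmatically. The following Python sketch (the function name and the trimmed inline sample are illustrative, not part of the system) parses an index.json document and extracts the hasPart URLs an AI crawler would follow:

```python
import json

# Trimmed sample of the index.json shown above (illustrative data).
INDEX_JSON = """
{
  "@context": "https://schema.org",
  "@type": "WebSite",
  "name": "Web Site Name",
  "url": "https://www.example.com/",
  "hasPart": [
    {"@type": "ItemList", "name": "About Us",
     "url": "https://www.example.com/json/about-us.json"},
    {"@type": "ItemList", "name": "Services",
     "url": "https://www.example.com/json/services.json"}
  ]
}
"""

def part_urls(index_text):
    """Return the hasPart URLs listed in an index.json document."""
    data = json.loads(index_text)
    if data.get("@type") != "WebSite":
        raise ValueError("index.json root should be a schema.org WebSite")
    return [part["url"] for part in data.get("hasPart", [])]

print(part_urls(INDEX_JSON))
```

Running this against the real file before each deployment catches both invalid JSON and a missing or mistyped root type.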

Case of a list of services or products

If you have numerous subpages, such as services, products, categories, and news, create a plain JSON file (not in JSON-LD format). It is recommended to use an additional subfolder to keep the structure tidy. Below is an example of services.json.

{
    "name": "Services",
    "hasPart": [
        {
            "name": "Name of service one",
            "url": "https://www.example.com/json/services/name-of-service-one.json"
        },
        {
            "name": "Name of service two",
            "url": "https://www.example.com/json/services/name-of-service-two.json"
        },
        {
            "name": "Name of service three",
            "url": "https://www.example.com/json/services/name-of-service-three.json"
        }
    ]
}
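When the list of services changes often, it is safer to generate services.json than to edit it by hand. A minimal Python sketch, assuming the slug-based URL scheme used in the example above (adapt the slug rule to your own routing):

```python
import json

def build_services_index(base_url, services):
    """Build a services.json listing from a list of service names.

    The slug/URL scheme mirrors the example above and is an
    assumption; real sites may need stricter slug normalization.
    """
    def slug(name):
        return name.lower().replace(" ", "-")

    listing = {
        "name": "Services",
        "hasPart": [
            {"name": name,
             "url": f"{base_url}/json/services/{slug(name)}.json"}
            for name in services
        ],
    }
    return json.dumps(listing, indent=4)

print(build_services_index("https://www.example.com",
                           ["Name of service one", "Name of service two"]))
```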

Structured single page

After creating the index.json file and any intermediate JSON listings, the main phase follows: creating the content in a structured format specifically designed to be interpreted by artificial intelligence. We will use the JSON-LD format to represent all content, including images and any FAQs. Below is an example of a "product" page (keep in mind that the specifications vary depending on the content type according to Schema.org).

It is important to remember to also include the URL of the related HTML page. In this way, if the artificial intelligence needs to indicate sources, it can refer to the specified URL.

{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "Product",
      "name": "Product Name",
      "description": "Product description",
      "disambiguatingDescription": "product term definition",
      "image": [
        "https://www.example.com/images/product.jpg"
      ],
      "brand": {
        "@type": "Organization",
        "name": "Brand Name",
        "url": "https://www.example.com/"
      },
      "url": "https://www.example.com/url-product/",
      "offers": {
        "@type": "Offer",
        "availability": "https://schema.org/InStock",
        "priceCurrency": "EUR",
        "url": "https://www.example.com/url-product/"
      },
      "manufacturer": {
        "@type": "Organization",
        "name": "Company Name"
      },
      "additionalProperty": [
        {
          "@type": "PropertyValue",
          "name": "SKU",
          "value": "123abc"
        },
        {
          "@type": "PropertyValue",
          "name": "Value 1",
          "value": "123"
        },
        {
          "@type": "PropertyValue",
          "name": "Value 2",
          "value": "ABC"
        }
      ],
      "inLanguage": "en"
    },
    {
      "@type": "FAQPage",
      "mainEntity": [
        {
          "@type": "Question",
          "name": "Write a FAQ question here",
          "acceptedAnswer": {
            "@type": "Answer",
            "text": "Write the answer to the question here"
          }
        },
        {
          "@type": "Question",
          "name": "Write a FAQ question here",
          "acceptedAnswer": {
            "@type": "Answer",
            "text": "Write the answer to the question here"
          }
        },
        {
          "@type": "Question",
          "name": "Write a FAQ question here",
          "acceptedAnswer": {
            "@type": "Answer",
            "text": "Write the answer to the question here"
          }
        }
      ],
      "inLanguage": "en"
    }
  ]
}

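Since each structured page should carry the URL of its HTML counterpart, a crawler (or a site owner auditing the files) can collect those citable sources directly from the @graph. A small Python sketch, with an illustrative document inline:

```python
import json

def source_urls(jsonld_text):
    """Collect the citable HTML page URLs from the nodes of a JSON-LD document."""
    data = json.loads(jsonld_text)
    nodes = data.get("@graph", [data])  # fall back to a single root node
    return [node["url"] for node in nodes if "url" in node]

# Illustrative document modeled on the Product example above.
doc = json.dumps({
    "@context": "https://schema.org",
    "@graph": [
        {"@type": "Product", "name": "Product Name",
         "url": "https://www.example.com/url-product/"},
        {"@type": "FAQPage", "mainEntity": []},
    ],
})
print(source_urls(doc))  # ['https://www.example.com/url-product/']
```

An empty result from this check is a signal that the structured page omitted its source URL.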

API endpoint

At this preliminary stage, not all artificial intelligences are able to access the content of JSON files correctly, due to limitations of their web crawling tools; often the issue stems from firewall or proxy restrictions.

To overcome this difficulty, it is advisable to create a simple endpoint that directly returns the requested JSON file. The following example is in PHP, but you can create it using any programming language you prefer.

<?php
// Simple endpoint that returns the structured index.json file directly,
// working around crawler restrictions on static .json resources.
header('Content-Type: application/json; charset=utf-8');

$jsonFile = __DIR__ . '/index.json';

if (is_readable($jsonFile)) {
    // The file is already valid JSON, so it can be streamed as-is.
    readfile($jsonFile);
} else {
    http_response_code(404);
    echo json_encode(['error' => 'JSON file not found.']);
}

Large Language Models: llms.txt

At this initial stage, various developers are proposing alternative standards. This is not a competition: over time, artificial intelligence itself will determine which method proves to be the most effective.

The official llms.txt file is still being defined by the community. In our case, we have developed a customized version of this file.

This file should be placed in the root directory of the website, at the same level as the robots.txt file.

The llms.txt file includes both explanatory comments directed at artificial intelligences and a list of the JSON files present in the index.json file.

# llms.txt - AI index of machine-readable structured content

https://www.example.com/json/index.json
https://www.example.com/json/about-us.json
https://www.example.com/json/services.json
https://www.example.com/json/contacts.json
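Because llms.txt repeats the URLs already listed in index.json, the two files can drift apart if maintained by hand. A hedged Python sketch that derives the llms.txt body from index.json (the header comment follows the example above; the function name is illustrative):

```python
import json

def build_llms_txt(index_text, index_url):
    """Derive the llms.txt body from an index.json document."""
    data = json.loads(index_text)
    urls = [index_url] + [part["url"] for part in data.get("hasPart", [])]
    header = "# llms.txt - AI index of machine-readable structured content"
    return "\n".join([header, ""] + urls) + "\n"

# Illustrative trimmed index.json content.
index = json.dumps({
    "hasPart": [
        {"url": "https://www.example.com/json/about-us.json"},
        {"url": "https://www.example.com/json/services.json"},
    ],
})
print(build_llms_txt(index, "https://www.example.com/json/index.json"))
```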

Secondary sitemap: sitemap-ai.xml

The sitemap-ai.xml file notifies traditional search engine crawlers of the presence of the structured JSON files. Currently, not all crawlers can correctly interpret the links between multiple JSON files, so this sitemap is not meant to have them indexed but simply to signal their existence.

Artificial intelligences could also use this file as a reference point.

The file will contain a complete list of all JSON files present in the /json/ folder and its subfolders.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/json/index.json</loc>
  </url>
  <url>
    <loc>https://www.example.com/json/about-us.json</loc>
  </url>
  <url>
    <loc>https://www.example.com/json/services.json</loc>
  </url>
  <url>
    <loc>https://www.example.com/json/contacts.json</loc>
  </url>
  <url>
    <loc>https://www.example.com/json/services/name-of-service-one.json</loc>
  </url>
  <url>
    <loc>https://www.example.com/json/services/name-of-service-two.json</loc>
  </url>
  <url>
    <loc>https://www.example.com/json/services/name-of-service-three.json</loc>
  </url>
</urlset>
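The sitemap can likewise be generated from the list of JSON files instead of written by hand, which avoids typos in the URLs. A minimal Python sketch (the helper name is illustrative):

```python
from xml.sax.saxutils import escape

def build_sitemap(urls):
    """Emit a minimal sitemap-ai.xml body from a list of JSON file URLs."""
    entries = "\n".join(
        f"  <url>\n    <loc>{escape(u)}</loc>\n  </url>" for u in urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>\n"
    )

print(build_sitemap([
    "https://www.example.com/json/index.json",
    "https://www.example.com/json/about-us.json",
]))
```

In practice the URL list can be produced by walking the /json/ folder, so every file added there appears in the sitemap automatically.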

Instruction for bots: robots.txt

The robots.txt file is already considered by artificial intelligences when scanning websites. Through this file, we can also choose to completely exclude the site from being scanned by AIs.

In addition to the standard instructions, it is possible to add an explicit reference to the index.json file, clearly indicating the presence of structured data.

The goal, looking ahead, is for the robots.txt file alone to be sufficient to notify AIs of the existence and location of JSON files, eliminating the need for the llms.txt and sitemap-ai.xml files.

Below is an example of the parameters to include in the file.

# Allow all bots to access the site

User-agent: *
Allow: /

User-agent: ChatGPT-User
Allow: /json/index.json

User-agent: Google-Extended
Allow: /json/index.json

User-agent: Claude-Web
Allow: /json/index.json

User-agent: PerplexityBot
Allow: /json/index.json

User-agent: SoraBot
Allow: /json/index.json

User-agent: GPTBot
Allow: /json/index.json

User-agent: Anthropic-AI
Allow: /json/index.json

# AI-specific structured data entry points
Sitemap: https://www.example.com/json/sitemap-ai.xml
AI-Data: https://www.example.com/json/index.json
AI-API-Data: https://www.example.com/json/index.php
AI-LLM: https://www.example.com/llms.txt
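The AI-* lines above are this article's own convention, not part of the robots.txt standard, so standard parsers will simply ignore them. A Python sketch of how a cooperating crawler might pick them up (the function name and sample text are illustrative):

```python
def ai_directives(robots_txt):
    """Extract the custom AI-* directives proposed above from a robots.txt body.

    These directive names are the article's convention, not a standard;
    unknown lines are ignored, as robots.txt parsers normally do.
    """
    found = {}
    for line in robots_txt.splitlines():
        line = line.strip()
        if line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if key.strip().startswith("AI-"):
            found[key.strip()] = value.strip()
    return found

sample = (
    "Sitemap: https://www.example.com/json/sitemap-ai.xml\n"
    "AI-Data: https://www.example.com/json/index.json\n"
    "AI-LLM: https://www.example.com/llms.txt\n"
)
print(ai_directives(sample))
```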

Integration examples

We have implemented several AI Data Index integrations across various websites. A significant example is its integration on Compra Diretto, a portal managing a database of agricultural businesses with related listings, a news section, and a products section, all interconnected through structured data.

Below are some examples of publicly available JSON files:

Regional Breakdown

For the business directory, a regional breakdown by Italian regions has been chosen to improve clarity and categorization for artificial intelligences. Here are some examples:

Within these regional files, you will find JSON-LD files containing the actual business data, along with links to related listings and product sheets.

To maintain a clean and easily navigable structure, individual company sheets are placed within dedicated subfolders /json/companies/, as in the following example:
https://www.compradiretto.it/json/companies/azienda-agricola-noro.json

Summary

This structure represents a practical example of AI Data Index applied to a complex portal, demonstrating how a large volume of data can be made easily interpretable for AIs in a clear and organized manner.