A Guide to Formatting Documents for Indexers, LangChain Document Loaders, Embeddings, and HNSW Vector Storage

How you format a document determines how well it can be ingested by tools such as indexers, LangChain document loaders, embedding models, and HNSW vector stores. Clean, consistent structure gives these tools clearer boundaries and metadata to work with, which improves the quality of the output and the experience of the people who use it. In this guide, we'll discuss three ways to improve the quality of your output by formatting documents well and three things to avoid in document formatting.

3 Ways to Improve Quality of Output by Formatting Documents Well

1. Use Semantic HTML

Semantic HTML is the practice of using HTML elements that describe the meaning and role of your content, not just its appearance. By using semantic HTML tags, you make the structure of your content explicit to indexers, document loaders, and other tools. Some examples of semantic HTML tags include <header>, <nav>, <article>, and <section>.

<article>
  <header>
    <h1>A Guide to Formatting Documents</h1>
  </header>
  <section>
    <h2>Use Semantic HTML</h2>
    <p>...</p>
  </section>
</article>
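To illustrate why this structure helps a loader, here is a minimal sketch in Python that turns heading-delimited HTML into chunks using only the standard library. The names `SectionExtractor` and `extract_sections` are hypothetical and chosen for this example; real document loaders may split content differently.

```python
from html.parser import HTMLParser


class SectionExtractor(HTMLParser):
    """Collect heading/text pairs, starting a new chunk at every h1-h6."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.chunks = []          # list of {"heading": str, "text": str}
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            # A heading opens a new chunk.
            self.chunks.append({"heading": "", "text": ""})
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.chunks:
            return  # skip whitespace and any text before the first heading
        key = "heading" if self._in_heading else "text"
        current = self.chunks[-1]
        current[key] = (current[key] + " " + text).strip()


def extract_sections(html: str):
    """Return one chunk per heading found in the given HTML string."""
    parser = SectionExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Each chunk keeps its heading next to its body text, so a chunk that is later embedded on its own still carries the context of the section it came from.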

2. Provide Clear and Concise Text

Well-written, easy-to-read text is crucial for ensuring that your content is easily ingested by various tools. Make sure to use proper grammar, spelling, and punctuation. Additionally, break your content into smaller, logical chunks using paragraphs and subheadings.

## 3 Ways to Improve Quality of Output by Formatting Documents Well

### 1. Use Semantic HTML

Semantic HTML uses elements that describe the meaning of your content...
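Subheadings like these also give a splitter natural chunk boundaries. The following is a minimal sketch, assuming Markdown input with `#`-style headings; `split_by_headings` is a hypothetical helper written for this example, and it does not handle headings that appear inside fenced code blocks.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str):
    """Split Markdown into chunks, one per heading, keeping the heading path."""
    chunks = []
    path = []   # stack of (level, title) for the headings currently in scope
    body = []   # lines collected for the current chunk

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before pushing this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

Storing the full heading path with each chunk lets a retrieval step show where a passage sits in the document, even when only that passage is returned.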

3. Add Structured Data

Structured data helps indexers and other tools better understand the content of your document. Use JSON-LD, Microdata, or RDFa to add structured data to your content. This can include information such as author, publication date, and article type.

<script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "A Guide to Formatting Documents",
    "author": "John Doe",
    "datePublished": "2022-01-01"
  }
</script>
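On the ingestion side, a JSON-LD block like the one above can be read back out as metadata. This is a minimal sketch using only the Python standard library; `JsonLdExtractor` and `extract_json_ld` are hypothetical names for this example, not part of any loader's API.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None   # text pieces collected while inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # ignore malformed blocks rather than failing the whole load
            self._buffer = None


def extract_json_ld(html: str):
    """Return a list with one parsed object per JSON-LD script block."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items
```

The resulting dictionaries can be stored alongside the document text, for example as filterable fields in a vector store.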

3 Things to Avoid in Document Formatting

1. Inconsistent Formatting

Inconsistent formatting can make it difficult for indexers and other tools to understand your content. Stick to a consistent style throughout your document, such as using the same heading levels, font sizes, and spacing.

2. Overuse of Non-Semantic HTML

Avoid using non-semantic HTML tags, such as <div> and <span>, for content that should be marked up with semantic HTML. Non-semantic tags can make it harder for indexers and other tools to understand the structure and meaning of your content.

3. Embedding Text in Images or Videos

Text embedded in images or videos is not easily accessible by indexers, document loaders, and other tools. Instead, provide a textual alternative for any visual content, such as using the alt attribute for images or providing a transcript for videos.

<img src="example.jpg" alt="An example image with a description" />
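Before ingesting a page, it can help to check which images carry no textual alternative at all. Here is a minimal sketch using the Python standard library; `AltTextAuditor` and `images_missing_alt` are hypothetical names for this example.

```python
from html.parser import HTMLParser


class AltTextAuditor(HTMLParser):
    """Record the src of every <img> whose alt attribute is missing or empty."""

    def __init__(self):
        super().__init__()
        self.missing_alt = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are routed here as well.
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = attributes.get("alt") or ""
        if not alt.strip():
            self.missing_alt.append(attributes.get("src", "(no src)"))


def images_missing_alt(html: str):
    """Return the src values of images that have no usable alt text."""
    auditor = AltTextAuditor()
    auditor.feed(html)
    auditor.close()
    return auditor.missing_alt
```

An empty alt attribute is sometimes intentional for purely decorative images, so the result is best treated as a review list, not a list of errors.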

By following these best practices and avoiding common pitfalls, you can ensure that your documents are well-formatted and easily ingested by indexers, LangChain document loaders, embeddings, and HNSW vector storage. This will ultimately lead to higher-quality output and a better user experience.

General Machine Learning Data Structuring Guide

  1. Clean and structure the text: Ensure that the text is free of unnecessary characters, such as HTML tags, excessive whitespace, or special characters that don't contribute to the meaning. Structure the text using headings, bullet points, and numbered lists to make it easier for the model to identify key points and sections.

  2. Use consistent formatting: Maintain a consistent format for similar types of information throughout the document. For example, if you're transcribing an interview, use a consistent format for questions and answers, such as:

    Q: What is your background in machine learning?
    A: I have a PhD in computer science and have been working in the field for over 10 years.

  3. Break down complex sentences: Simplify long and complex sentences into shorter, more straightforward sentences. This can help the model better understand the meaning and context of each sentence.

  4. Include context: Provide sufficient context for the model to understand the document. For example, if the document is an interview, include an introduction that explains the purpose of the interview, the participants, and the topics discussed.

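  5. Preprocess the text: Match the preprocessing to the model. For classical keyword or bag-of-words pipelines, tokenizing, lowercasing, and removing stop words and punctuation can help the model focus on the most important words and phrases in the document. Transformer-based embedding models apply their own tokenizer and generally work best on natural, fully punctuated text, so for them limit preprocessing to removing markup and normalizing whitespace.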

  6. Use metadata: If your documents have metadata, such as titles, authors, or publication dates, include this information in the text fed to the model. This can help the model better understand the context and relevance of the document.

  7. Experiment with different models and settings: Different models and settings may yield better results for your specific use case. Experiment with different models, such as GPT-3.5-turbo, and adjust settings like temperature and max tokens to find the best configuration for your task.
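Steps 1 and 6 in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive pipeline: `clean_text` and `with_metadata` are hypothetical helpers, the regex-based tag stripping is deliberately crude (a real HTML parser is more robust), and the `Key: value` header format is just one possible convention.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
SPACES_RE = re.compile(r"[ \t]+")
BLANK_LINES_RE = re.compile(r"\n\s*\n+")


def clean_text(raw: str) -> str:
    """Strip tags, decode entities, and normalize whitespace, keeping paragraph breaks."""
    text = TAG_RE.sub(" ", raw)            # crude tag removal (assumption: simple markup)
    text = html.unescape(text)             # &amp; -> &, &nbsp; -> non-breaking space
    text = text.replace("\u00a0", " ")     # treat non-breaking spaces as ordinary spaces
    text = SPACES_RE.sub(" ", text)        # collapse runs of spaces and tabs
    text = "\n".join(line.strip() for line in text.splitlines())
    text = BLANK_LINES_RE.sub("\n\n", text)  # at most one blank line between paragraphs
    return text.strip()


def with_metadata(text: str, metadata: dict) -> str:
    """Prepend 'Key: value' lines for the non-empty metadata fields."""
    header = [f"{key.capitalize()}: {value}" for key, value in metadata.items() if value]
    return "\n".join(header + ["", text]) if header else text
```

For example, `with_metadata(clean_text(raw_html), {"title": "...", "author": "..."})` produces a single string that carries both the cleaned body and its context, ready to be passed to an embedding model.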

By following these best practices, you can improve the ability of machine learning models like GPT-4 to understand your documents, create accurate embeddings, and perform high-functioning vector searches.

Here are 8 more sources that focus on formatting data specifically for large language model embeddings and vector storage input:

  1. OpenAI Cookbook: OpenAI Cookbook - Working with Embeddings. This Jupyter Notebook from OpenAI provides examples and explanations on how to work with embeddings, including generating and using embeddings for various tasks.

  2. Pinecone: Pinecone - Preprocessing Text Data for Embeddings. Pinecone's guide on preprocessing text data for embeddings covers techniques such as tokenization, stemming, and stopword removal, which are essential for preparing text data for large language models and vector storage systems.

  3. Towards Data Science: Towards Data Science - Text Preprocessing for BERT. This article discusses text preprocessing techniques specifically for BERT, a popular large-scale language model. The techniques covered include tokenization, masking, and segment IDs, which are crucial for preparing data for embeddings and vector storage input.

  4. Semantic HTML: Mozilla Developer Network (MDN) - HTML elements reference. This resource provides an extensive list of HTML elements, including semantic tags, and their usage.

  5. Structured Data: Google Search Central - Introduction to Structured Data. This guide from Google Search Central explains the importance of structured data, how to implement it, and the various formats available.

  6. JSON-LD: JSON-LD - JSON for Linking Data. The official JSON-LD website provides an introduction to JSON-LD, examples, and a playground to test your structured data.

  7. Web Content Accessibility: Web Accessibility Initiative (WAI) - Web Content Accessibility Guidelines (WCAG) Overview. The WAI provides an overview of the WCAG, which includes guidelines for creating accessible web content, including the use of alternative text for images and transcripts for videos.

  8. Writing for the Web: Nielsen Norman Group - Writing Digital Copy for Domain Experts. This article from the Nielsen Norman Group discusses best practices for writing clear and concise text for the web, which is essential for creating content that can be easily ingested by various tools and systems.