AI Document Processing & Parser

Parse Any Document into LLM-Ready Output in Seconds

99.99% Parse Accuracy
5s Per 100 PDF Pages
1,000 Pages Per PDF Max
100+ File Formats

Get your documents ready for gen AI

Drop your document here

or browse files · PDF, Word, Excel, Image & more

PDF · DOCX · XLSX · PNG · JPG · TIFF · HTML · TXT

Free & unlimited — sign up to get started

Create a free account to instantly parse PDFs and images with no usage limits. Supports all formats including scanned documents and handwritten forms.

Try Playground
Sign in with Google or Email · Unlimited free usage · No credit card required

Sample use cases

7-1_merged.pdf
Parsed · PDF Parser
CHAPTER 7
Large Language Models
“How much do we know at any time? Much more, or so I believe, than we know we know.”
Agatha Christie, The Moving Finger
The literature of the fantastic abounds in inanimate objects magically endowed with the gift of speech. From Ovid's statue of Pygmalion to Mary Shelley's story about
Frankenstein, we continually reinvent stories about creating something and then having a chat with it. Legend has it that after finishing his sculpture Moses, Michelangelo thought it so lifelike that he tapped it on the knee and commanded it to speak. Perhaps this shouldn't be surprising. Language is the mark of humanity and sentience. Conversation is the most fundamental arena of language, the first kind of language we learn as children, and the kind we engage in constantly, whether we are teaching or learning, ordering lunch, or talking with our families or friends.
This chapter introduces the Large Language Model, or LLM, a computational agent that can interact conversationally with people. The fact that LLMs are designed for interaction with people has strong implications for their design and use.
Many of these implications already became clear in a computational system from 60 years ago, ELIZA (Weizenbaum, 1966). ELIZA, designed to simulate a Rogerian psychologist, illustrates a number of important issues with chatbots. For example, people became deeply emotionally involved and conducted very personal conversations, even to the extent of asking Weizenbaum to leave the room while they were typing. These issues of emotional engagement and privacy mean we need to think carefully about how we deploy language models and consider their effect on the people who are interacting with them.
In this chapter we begin by introducing the computational principles of LLMs; we'll discuss their implementation in the transformer architecture in the following chapter. The central new idea that makes LLMs possible is the idea of pretraining, so let's begin by thinking about the idea of learning from text, the basic way that LLMs are trained.
We know that fluent speakers of a language bring an enormous amount of knowledge to bear during comprehension and production. This knowledge is embodied in many forms, perhaps most obviously in the vocabulary, the rich representations we have of words and their meanings and usage. This makes the vocabulary a useful lens to explore the acquisition of knowledge from text, by both people and machines.
Estimates of the size of adult vocabularies vary widely both within and across languages. For example, estimates of the vocabulary size of young adult speakers of American English range from 30,000 to 100,000 depending on the resources used to make the estimate and the definition of what it means to know a word. A simple consequence of these facts is that children have to learn about 7 to 10 words a day, every single day, to arrive at observed vocabulary levels by the time they are 20 years of age. And indeed empirical estimates of vocabulary growth in late elementary through high school are consistent with this rate. How do children achieve this rate of vocabulary growth? Research suggests that the bulk of this knowledge acquisition happens as a by-product of reading. Reading is a process of rich contextual processing; we don't learn words one at a time in isolation. In fact, at some points during learning the rate of vocabulary growth exceeds the rate at which new words are appearing to the learner! That suggests that every time we read a word, we are also strengthening our understanding of other words that are associated with it.
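The learning-rate arithmetic here is easy to check; a quick sketch, assuming 60,000 words as a midpoint of the 30,000 to 100,000 range:

```python
# Back-of-the-envelope check: words learned per day to reach an
# adult vocabulary by age 20 (the midpoint figure is an assumption).
vocab_size = 60_000          # midpoint of the 30,000-100,000 range
days = 20 * 365              # roughly 20 years of learning

words_per_day = vocab_size / days
print(f"{words_per_day:.1f} words/day")  # → 8.2 words/day
```

Taking the lower or upper end of the vocabulary range instead gives roughly 4 to 14 words per day, bracketing the 7 to 10 figure quoted above.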
Such facts are consistent with the distributional hypothesis of Chapter 5, which proposes that some aspects of meaning can be learned solely from the texts we encounter over our lives, based on the complex association of words with the words they co-occur with (and with the words that those words occur with). The distributional hypothesis suggests both that we can acquire remarkable amounts of knowledge from text, and that this knowledge can be brought to bear long after its initial acquisition. Of course, grounding from real-world interaction or other modalities can help build even more powerful models, but even text alone is remarkably useful.
pretraining
What made the modern NLP revolution possible is that large language models can learn all this knowledge of language, context, and the world simply by being taught to predict the next word, again and again, based on context, in a (very) large corpus of text. In this chapter and the next we formalize this idea that we'll call pretraining (learning knowledge about language and the world from iteratively predicting tokens in vast amounts of text) and call the resulting pretrained models large language models. Large language models exhibit remarkable performance on natural language tasks because of the knowledge they learn in pretraining.
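To make the objective concrete: the word-prediction loss at each position is just the negative log probability the model assigns to the token that actually comes next. A minimal worked example with made-up probabilities (not from any real model):

```python
import math

# Toy next-token distribution after some prefix
# (illustrative probabilities, not from a real model).
p_next = {"mat": 0.4, "floor": 0.2, "sofa": 0.1, "dog": 0.05}

# Cross-entropy loss for one prediction step: -log P(actual next token).
actual = "mat"
loss = -math.log(p_next[actual])
print(f"loss = {loss:.3f}")  # → loss = 0.916
```

Pretraining sums this loss over every token position in the training corpus; the better the model's predictions, the lower the total loss.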
What can language models learn from word prediction? Consider the examples below. What kinds of knowledge do you think the model might pick up from learning to predict what word fills the underbar (the correct answer is shown after the blank)? Think about this for each example before you read ahead to the next paragraph:
With roses, dahlias, and peonies, I was surrounded by ___ (flowers)
The room wasn't just big, it was ___ (enormous)
The square root of 4 is ___ (2)
The author of "A Room of One's Own" is ___ (Virginia Woolf)
The professor said that ___ (he)
From the first sentence a model can learn ontological facts, like that roses and dahlias and peonies are all kinds of flowers. From the second, a model could learn that "enormous" means something on the same scale as big but further along on the scale. From the third sentence, the system could learn math, while from the fourth sentence it could learn facts about the world and historical authors. Finally, from the last sentence, if a model was exposed to such sentences repeatedly, it might learn to associate professors only with male pronouns, or other kinds of associations that might cause models to act unfairly to different people.
What is a large language model? As we saw back in Chapter 3, a language model is simply a computational system that can predict the next word from previous words. That is, given a context or prefix of words, a language model assigns a probability distribution over the possible next words. Fig. 7.1 sketches this idea.
Of course we've already seen language models! We saw n-gram language models in Chapter 3 and briefly touched on the feedforward network applied to language
Figure 7.1 A large language model is a neural network that takes as input a context or prefix, and outputs a distribution over possible next words.
modeling in Chapter 6. A large language model is just a (much) larger version of these. For example, in Chapter 3 we introduced bigram and trigram language models that can predict words from the previous word or handful of words. By contrast, large language models can predict words given contexts of thousands or even tens of thousands of words!
The fundamental intuition of language models is that a model that can predict text (assigning a distribution over following words) can also be used to generate text by sampling from the distribution. Recall from Chapter 3 that sampling means to choose a word from a distribution.
Figure 7.2 Turning a predictive model that gives a probability distribution over next words into a generative model by repeatedly sampling from the distribution. The result is a left-to-right (also called autoregressive) language model. As each token is generated, it gets added onto the context as a prefix for generating the next token.
Fig. 7.2 shows the same example from Fig. 7.1, in which a language model is given a text prefix and generates a possible completion. The model selects the word "all", adds that to the context, uses the updated context to get a new predictive distribution, and then selects "the" from that distribution and generates it, and so on. Notice that the model is conditioning on both the priming context and its own subsequently generated outputs.
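The generate-and-append loop just described can be sketched in a few lines of Python; `next_word_distribution` here is a hypothetical stand-in for whatever trained model supplies the predictive distribution, backed by a toy lookup table so the sketch runs end to end:

```python
import random

def next_word_distribution(context):
    """Hypothetical stand-in for a trained LM: maps a context
    (list of tokens) to a dict of next-word probabilities."""
    # A toy fixed table (made-up probabilities) so the sketch runs.
    table = {
        ("so", "long", "and", "thanks", "for"): {"all": 0.7, "the": 0.3},
        ("so", "long", "and", "thanks", "for", "all"): {"the": 0.9, "that": 0.1},
    }
    return table.get(tuple(context), {"<eos>": 1.0})

def generate(context, max_tokens=10):
    context = list(context)
    for _ in range(max_tokens):
        dist = next_word_distribution(context)
        # Sampling: choose a word in proportion to its probability.
        word = random.choices(list(dist), weights=dist.values())[0]
        if word == "<eos>":
            break
        context.append(word)  # the new token becomes part of the prefix
    return context

print(generate(["so", "long", "and", "thanks", "for"]))
```

The key line is `context.append(word)`: each sampled token is fed back in as part of the prefix for the next prediction, which is exactly the left-to-right conditioning described above.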
This kind of setting, in which we iteratively predict and generate words left-to-right from earlier words, is the basis of what are often called causal or autoregressive language models. (We will introduce alternative non-autoregressive models, like BERT and other masked language models that predict words using information from both the left and the right, in Chapter 9.)
generative AI
This idea of using computational models to generate text, as well as code, speech, and images, constitutes the important new area called generative AI. Applying LLMs to generate text has vastly broadened the scope of NLP, which historically was focused more on algorithms for parsing or understanding text rather than generating it.
In the rest of the chapter, we'll see that almost any NLP task can be modeled as word prediction in a large language model, if we think about it in the right way, and we'll motivate and introduce the idea of prompting language models. We'll introduce specific algorithms for generating text from a language model, like greedy decoding and sampling. We'll introduce the details of pretraining, the way that language models are self-trained by iteratively being taught to guess the next word in the text from the prior words. We'll sketch out the other two stages of language model training: instruction tuning (also called supervised finetuning or SFT), and alignment, concepts that we'll return to in Chapter 10. And we'll see how to evaluate these models. Let's begin, though, by talking about different kinds of language models.
7.1 Three architectures for language models
The architecture we sketched above for a left-to-right or autoregressive language model, which is the language model architecture we will define in this chapter, is actually only one of three common LM architectures.
The three architectures are the encoder, the decoder, and the encoder-decoder. Fig. 7.3 gives a schematic picture of the three.
Figure 7.3 Three architectures for language models: decoders, encoders, and encoder-decoders. The arrows sketch out the information flow in the three architectures. Decoders take tokens as input and generate tokens as output. Encoders take tokens as input and produce an encoding (a vector representation of each token) as output. Encoder-decoders take tokens as input and generate a series of tokens as output.
decoder
The decoder is the architecture we've introduced above. It takes as input a series of tokens, and iteratively generates an output token one at a time. The decoder is the architecture used to create large language models like GPT, Claude, Llama, and Mistral. The information flow in decoders goes left-to-right, meaning that the model
Figure 7.5 Answering a question by computing the probabilities of the tokens after a prefix stating the question; in this example the correct token Charles has the highest probability.
follow instructions. This extra training is called instruction-tuning. In instruction-tuning we take a base language model that has been trained to predict words, and continue training it on a special dataset of instructions together with the appropriate response to each. The dataset has many examples of questions together with their answers, commands with their responses, and other examples of how to carry on a conversation. We'll discuss the details of instruction-tuning in Chapter 10.
prompt
Language models that have been instruction-tuned are very good at following instructions, answering questions, and carrying on a conversation, and can be prompted. A prompt is a text string that a user issues to a language model to get the model to do something useful. In prompting, the user's prompt string is passed to the language model, which iteratively generates tokens conditioned on the prompt. The process of finding effective prompts for a task is known as prompt engineering.
As suggested above when we introduced conditional generation, a prompt can be a question (like "What is a transformer network?"), possibly in a structured format (like "Q: What is a transformer network? A:"). A prompt can also be an instruction (like "Translate the following sentence into Hindi: 'Chop the garlic finely'").
More explicit prompts that specify the set of possible answers lead to better performance. For example, here is a prompt template to do sentiment analysis that prespecifies the potential answers:
A prompt consisting of a review plus an incomplete statement
Human: Do you think that “input” has negative or positive sentiment?
Choices:
(P) Positive
(N) Negative
Assistant: I believe the best answer is: (
This prompt uses a number of more sophisticated prompting characteristics. It specifies the two allowable choices (P) and (N), and ends the prompt with the open parenthesis that strongly suggests the answer will be (P) or (N). Note that it also specifies the role of the language model as an assistant.
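Assembling such a constrained prompt is ordinary string templating. A minimal sketch mirroring the template above (the function name and sample review are ours):

```python
def sentiment_prompt(review: str) -> str:
    """Build a constrained sentiment-analysis prompt: the closing
    '(' nudges the model toward answering (P) or (N)."""
    return (
        f'Human: Do you think that "{review}" has negative or '
        "positive sentiment?\n"
        "Choices:\n"
        "(P) Positive\n"
        "(N) Negative\n"
        "Assistant: I believe the best answer is: ("
    )

print(sentiment_prompt("The pasta was cold and the service slow."))
```

Because the prompt ends mid-token at the open parenthesis, the model's highest-probability continuation is almost forced to be one of the two prespecified labels, which is what makes this style of prompt easy to score automatically.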
Including some labeled examples in the prompt can also improve performance. We call such examples demonstrations. The task of prompting with examples is sometimes called few-shot prompting, as contrasted with zero-shot prompting, which means instructions that don't include labeled examples. For example, Fig. 7.6
common crawl
Web text is usually taken from corpora of automatically-crawled web pages like the common crawl, a series of snapshots of the entire web produced by the non-profit Common Crawl (https://commoncrawl.org/) that each have billions of webpages. Various versions of common crawl data exist, such as the Colossal Clean Crawled Corpus (C4; Raffel et al. 2020), a corpus of 156 billion tokens of English that is filtered in various ways (deduplicated, removing non-natural language like code, sentences with offensive words from a blocklist). This C4 corpus seems to consist in large part of patent text documents, Wikipedia, and news sites (Dodge et al., 2021).
The Pile
Wikipedia plays a role in lots of language model training, as do corpora of books. The Pile (Gao et al., 2020) is an 825 GB English text corpus that is constructed by publicly released code, containing again a large amount of text scraped from the web as well as books and Wikipedia; Fig. 7.14 shows its composition. Dolma is a larger open corpus of English, created with public tools, containing three trillion tokens, which similarly consists of web text, academic papers, code, books, encyclopedic materials, and social media (Soldaini et al., 2024).
Figure 7.14 The Pile corpus, showing the size of different components, color coded as academic (articles from PubMed and ArXiv, patents from the USPTO), internet (webtext including a subset of the common crawl as well as Wikipedia), prose (a large corpus of books), dialogue (including movie subtitles and chat data), and misc. Figure from Gao et al. (2020).
Filtering for quality and safety. Pretraining data drawn from the web is filtered for both quality and safety. Quality filters are classifiers that assign a score to each document. Quality is of course subjective, so different quality filters are trained in different ways, but often to value high-quality reference corpora like Wikipedia, books, and particular websites, and to avoid websites with lots of PII (Personally Identifiable Information) or adult content. Filters also remove boilerplate text, which is very frequent on the web. Another kind of quality filtering is deduplication, which can be done at various levels, so as to remove duplicate documents, duplicate web pages, or duplicate text. Quality filtering generally improves language model performance (Longpre et al., 2024b; Llama Team, 2024).
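Document-level deduplication, the simplest of the levels just mentioned, can be sketched with exact content hashing (an illustration only; production pipelines typically add fuzzy matching such as MinHash):

```python
import hashlib

def dedup_documents(docs):
    """Keep the first occurrence of each exact-duplicate document,
    comparing hashes of whitespace- and case-normalized content."""
    seen = set()
    kept = []
    for doc in docs:
        # Normalize whitespace and case so trivial variants collide.
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

corpus = ["The cat sat.", "the  cat sat.", "A different page."]
print(dedup_documents(corpus))  # → ['The cat sat.', 'A different page.']
```

Hashing normalized content rather than raw bytes means near-identical pages that differ only in spacing or capitalization are also collapsed, while genuinely different documents are kept.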
Safety filtering is again a subjective decision, and often includes toxicity detection based on running off-the-shelf toxicity classifiers. This can have mixed results. One problem is that current toxicity classifiers mistakenly flag non-toxic data if it
69 elements · 2 tables · 8 images
PDF Parser

Supported Input Formats

📄PDF
📝Word
📊Excel
📋PowerPoint
🖼️JPEG / PNG
🗂️TIFF
🌐HTML
📃TXT
✍️Markdown
📚EPUB

Processing Pipeline

01 Parse · OCR + Layout Analysis
02 Chunk · Semantic Segmentation
03 Embed · Vector Embedding
04 Extract · Structured Output

Use Cases

AI Document Parsing & Data Extraction for Every Workflow

From extracting data from PDFs to parsing resumes and bank statements — one API handles every document type your application needs.

💰
Finance & Banking
Annual reports, 10-K filings, balance sheets
🧾
Invoice & Receipts
VAT invoices, purchase orders, receipts
⚕️
Healthcare
Medical records, lab reports, prescriptions
⚖️
Legal & Contracts
Agreements, NDAs, regulatory filings
Most Popular

PDF Parser & Data Extraction

Extract structured data from any PDF — financial reports, research papers, scanned invoices, and more. Output as JSON, Markdown, CSV, or XML with normalized coordinates for every element. Ideal for building RAG pipelines, LLM applications, and automated document workflows.

pdf parsing · extract data from pdf · pdf to json converter · pdf to csv · pdf data extraction · pdf parser
Try this use case free
output.json
{
"document_type": "financial_report",
"pages": 12,
"confidence": 0.995,
"tables_detected": 3,
"elements": ["Title", "Table", "NarrativeText", ...],
"output_formats": ["markdown", "json", "csv", "xml"]
}
HR & Recruiting

AI Resume Parser & CV Extraction

Automatically extract candidate information from resumes and CVs in any format — PDF, Word, or image scans. Structured output includes name, contact, skills, work experience, and education fields. Plug directly into your ATS or HR automation pipeline.

resume parser · ai resume parser · resume data extraction · cv extraction
Try this use case free
output.json
{
"name": "Sarah Chen",
"email": "sarah.chen@email.com",
"skills": ["Python", "Machine Learning", "SQL", ...],
"experience_years": 6,
"current_title": "Senior Data Scientist",
"education": ["M.S. Computer Science, MIT"]
}
Finance & Compliance

Bank Statement Parser

Convert bank statements from any institution into clean, structured CSV or JSON. Automatically extract transaction dates, descriptions, debit/credit amounts, and running balances. Supports multi-page statements, foreign currency accounts, and scanned PDFs.

bank statement parser · bank statement to csv
Try this use case free
Extracted Table — CSV Preview · ✓ Parsed
Date        Description       Debit        Credit       Balance
2024-09-01  STRIPE PAYOUT                  $12,400.00   $48,320.00
2024-09-03  AWS INVOICE       $2,340.00                 $45,980.00
2024-09-05  PAYROLL — ACME    $18,500.00                $27,480.00
2024-09-07  CUSTOMER REFUND   $450.00                   $27,030.00
2024-09-10  VENDOR PAYMENT    $3,200.00                 $23,830.00
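Rows like those above can be written out as CSV with nothing but the standard library; a sketch (the field names are ours, and only the first two transactions are shown):

```python
import csv
import io

# Two of the extracted transactions from the preview above.
rows = [
    {"date": "2024-09-01", "description": "STRIPE PAYOUT",
     "debit": "", "credit": "12400.00", "balance": "48320.00"},
    {"date": "2024-09-03", "description": "AWS INVOICE",
     "debit": "2340.00", "credit": "", "balance": "45980.00"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["date", "description",
                                         "debit", "credit", "balance"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Using `csv.DictWriter` keeps the column order explicit and quotes fields automatically, which matters once descriptions contain commas.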
Enterprise

Document Extraction Software & Automation

Scale document extraction across your entire organization. Batch process thousands of documents via API, trigger webhooks on completion, and integrate with your existing data pipelines. Supports Word, Excel, PowerPoint, HTML, and 50+ formats alongside PDF.

document extraction · document parser · document extraction software · pdf document automation
Try this use case free
output.mdMarkdown
## Batch Processing Pipeline
POST /api/xparse/pipeline
✓ 1,240 documents queued
✓ Processing: 12 concurrent workers
✓ Completed: 1,198 / 1,240
PDF · DOCX · XLSX · PPTX · HTML
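From client code, a batch submission like the one shown might look roughly like this. This is a sketch only: the endpoint path comes from the example above, but the host, payload fields, and auth header are assumptions, not documented API:

```python
import json
import urllib.request

def submit_batch(file_urls, api_key):
    """Sketch of a batch submission to the pipeline endpoint shown
    above; the host and payload field names are assumptions."""
    payload = json.dumps({"documents": file_urls}).encode()
    req = urllib.request.Request(
        "https://api.example.com/api/xparse/pipeline",  # hypothetical host
        data=payload,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    return req  # caller would urllib.request.urlopen(req) to send

req = submit_batch(["https://example.com/report.pdf"], "YOUR_API_KEY")
print(req.method, req.full_url)
```

Returning the built `Request` instead of sending it keeps the sketch runnable offline; in real use, a webhook URL in the payload would let the service signal completion instead of the client polling.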
OCR & Digitization

Handwriting to Text Converter

Digitize handwritten forms, notes, prescriptions, and receipts with high accuracy. Our multi-engine OCR handles mixed printed and handwritten content on the same page, supporting both English and multilingual documents.

handwriting to text converterocr handwritinghandwriting recognition
Try this use case free
output.json
{
"type": "HandwrittenText",
"text": "Patient: John D. / DOB: 1985-03-12",
"confidence": 0.97,
"language": "en",
"mixed_content": true,
"coordinates": [0.08, 0.12, 0.72, 0.18]
}
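The coordinates field above is normalized to the page (values between 0 and 1), so converting to pixels only needs the page dimensions. A sketch, assuming the values are ordered [left, top, right, bottom] (the ordering is our assumption):

```python
def to_pixels(box, page_width, page_height):
    """Convert a normalized [left, top, right, bottom] box to pixel
    coordinates. The ordering of the four values is assumed here."""
    left, top, right, bottom = box
    return (round(left * page_width), round(top * page_height),
            round(right * page_width), round(bottom * page_height))

# The box from the sample output, on a 1700x2200 (200 DPI letter) page.
print(to_pixels([0.08, 0.12, 0.72, 0.18], 1700, 2200))  # → (136, 264, 1224, 396)
```

Normalized coordinates are resolution-independent, so the same box can be drawn on a thumbnail or a full-resolution scan by swapping in the appropriate page size.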

Ready to parse your first document?

Sign in and get 100 free credits. No credit card required.