AI Document Processing and Document Parser

Parse any document, ready for LLM integration in seconds

99.99% parsing accuracy
100 pages processed in 5 seconds
1,000 max pages per PDF
100+ supported file formats

Get started with flawless document processing and data extraction for generative AI

Drag your documents here

or browse files · Supports PDF, Word, Excel, images, and more

PDF · DOCX · XLSX · PNG · JPG · TIFF · HTML · TXT

Free and unlimited: sign up and start right away

Create a free account and parse PDFs and images with no usage limits. Every format is supported, including scanned documents and handwritten forms.

Try the playground
Sign in with Google or email · Unlimited free use · No credit card required

Sample use case

7-1_merged.pdf
Parsed · PDF Parser
CHAPTER 7
Large Language Models
“How much do we know at any time? Much more, or so I believe, than we know we know.”
Agatha Christie, The Moving Finger
The literature of the fantastic abounds in inanimate objects magically endowed with the gift of speech. From Ovid's statue of Pygmalion to Mary Shelley's story about
Frankenstein, we continually reinvent stories about creating something and then having a chat with it. Legend has it that after finishing his sculpture Moses, Michelangelo thought it so lifelike that he tapped it on the knee and commanded it to speak. Perhaps this shouldn't be surprising. Language is the mark of humanity and sentience. Conversation is the most fundamental arena of language, the first kind of language we learn as children, and the kind we engage in constantly, whether we are teaching or learning, ordering lunch, or talking with our families or friends.
This chapter introduces the Large Language Model, or LLM, a computational agent that can interact conversationally with people. The fact that LLMs are designed for interaction with people has strong implications for their design and use.
Many of these implications already became clear in a computational system from 60 years ago, ELIZA (Weizenbaum, 1966). ELIZA, designed to simulate a Rogerian psychologist, illustrates a number of important issues with chatbots. For example, people became deeply emotionally involved and conducted very personal conversations, even to the extent of asking Weizenbaum to leave the room while they were typing. These issues of emotional engagement and privacy mean we need to think carefully about how we deploy language models and consider their effect on the people who are interacting with them.
In this chapter we begin by introducing the computational principles of LLMs; we'll discuss their implementation in the transformer architecture in the following chapter. The central new idea that makes LLMs possible is the idea of pretraining, so let's begin by thinking about the idea of learning from text, the basic way that LLMs are trained.
We know that fluent speakers of a language bring an enormous amount of knowledge to bear during comprehension and production. This knowledge is embodied in many forms, perhaps most obviously in the vocabulary, the rich representations we have of words and their meanings and usage. This makes the vocabulary a useful lens to explore the acquisition of knowledge from text, by both people and machines.
Estimates of the size of adult vocabularies vary widely both within and across languages. For example, estimates of the vocabulary size of young adult speakers of American English range from 30,000 to 100,000 depending on the resources used to make the estimate and the definition of what it means to know a word. A simple consequence of these facts is that children have to learn about 7 to 10 words a day, every single day, to arrive at observed vocabulary levels by the time they are 20 years of age. And indeed empirical estimates of vocabulary growth in late elementary through high school are consistent with this rate. How do children achieve this rate of vocabulary growth? Research suggests that the bulk of this knowledge acquisition happens as a by-product of reading. Reading is a process of rich contextual processing; we don't learn words one at a time in isolation. In fact, at some points during learning the rate of vocabulary growth exceeds the rate at which new words are appearing to the learner! That suggests that every time we read a word, we are also strengthening our understanding of other words that are associated with it.
Such facts are consistent with the distributional hypothesis of Chapter 5, which proposes that some aspects of meaning can be learned solely from the texts we encounter over our lives, based on the complex association of words with the words they co-occur with (and with the words that those words occur with). The distributional hypothesis suggests both that we can acquire remarkable amounts of knowledge from text, and that this knowledge can be brought to bear long after its initial acquisition. Of course, grounding from real-world interaction or other modalities can help build even more powerful models, but even text alone is remarkably useful.
pretraining
What made the modern NLP revolution possible is that large language models can learn all this knowledge of language, context, and the world simply by being taught to predict the next word, again and again, based on context, in a (very) large corpus of text. In this chapter and the next we formalize this idea that we'll call pretraining: learning knowledge about language and the world from iteratively predicting tokens in vast amounts of text. We call the resulting pretrained models large language models. Large language models exhibit remarkable performance on natural language tasks because of the knowledge they learn in pretraining.
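As a toy illustration of learning from next-word prediction (a simple counting stand-in, not the transformer training loop of real pretraining), even a bigram counter extracts factual associations from raw text:

```python
from collections import Counter, defaultdict

# Tiny corpus echoing the "flowers" example below; tokens are
# whitespace-separated, with "." treated as its own token.
corpus = (
    "roses are flowers . dahlias are flowers . "
    "peonies are flowers . tulips are flowers ."
).split()

# Count bigrams: how often each word follows each context word.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_dist(prev):
    """Normalize counts into a predictive distribution P(next | prev)."""
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()}

print(next_word_dist("are"))  # "flowers" gets probability 1.0
```

The point is only that repeated next-word prediction over text induces knowledge (here, that "are" in this corpus is always followed by "flowers"); real LLMs learn such regularities over vastly larger contexts and corpora.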
What can language models learn from word prediction? Consider the examples below. What kinds of knowledge do you think the model might pick up from learning to predict the word that completes each sentence (the correct answer is shown at the end of each line)? Think about this for each example before you read ahead to the next paragraph:
With roses, dahlias, and peonies, I was surrounded by flowers
The room wasn't just big, it was enormous
The square root of 4 is 2
The author of "A Room of One's Own" is Virginia Woolf
The professor said that he
From the first sentence a model can learn ontological facts like that roses and dahlias and peonies are all kinds of flowers. From the second, a model could learn that “enormous” means something on the same scale as big but further along on the scale. From the third sentence, the system could learn math, while from the fourth sentence it could learn facts about the world and historical authors. Finally, from the last sentence, if a model were exposed to such sentences repeatedly, it might learn to associate professors only with male pronouns, or other kinds of associations that might cause models to act unfairly to different people.
What is a large language model? As we saw back in Chapter 3, a language model is simply a computational system that can predict the next word from previous words. That is, given a context or prefix of words, a language model assigns a probability distribution over the possible next words. Fig. 7.1 sketches this idea.
Of course we've already seen language models! We saw n-gram language models in Chapter 3 and briefly touched on the feedforward network applied to language modeling in Chapter 6. A large language model is just a (much) larger version of these. For example, in Chapter 3 we introduced bigram and trigram language models that can predict words from the previous word or handful of words. By contrast, large language models can predict words given contexts of thousands or even tens of thousands of words!
Figure 7.1 A large language model is a neural network that takes as input a context or prefix, and outputs a distribution over possible next words.
The fundamental intuition of language models is that a model that can predict text (assigning a distribution over following words) can also be used to generate text by sampling from the distribution. Recall from Chapter 3 that sampling means to choose a word from a distribution.
Figure 7.2 Turning a predictive model that gives a probability distribution over next words into a generative model by repeatedly sampling from the distribution. The result is a left-to-right (also called autoregressive) language model. As each token is generated, it gets added onto the context as a prefix for generating the next token.
Fig. 7.2 shows the same example from Fig. 7.1, in which a language model is given a text prefix and generates a possible completion. The model selects the word all, adds that to the context, uses the updated context to get a new predictive distribution, and then selects the from that distribution and generates it, and so on. Notice that the model is conditioning on both the priming context and its own subsequently generated outputs.
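The generate-and-append loop just described can be sketched in a few lines. The probability table here is a hypothetical hand-coded stand-in for a real model's predictive distribution:

```python
import random

# Hypothetical predictive distributions P(next | context), standing in
# for a trained language model.
TABLE = {
    ("the",): {"sun": 0.5, "moon": 0.5},
    ("the", "sun"): {"rises": 1.0},
    ("the", "moon"): {"rises": 1.0},
    ("the", "sun", "rises"): {"<eos>": 1.0},
    ("the", "moon", "rises"): {"<eos>": 1.0},
}

def next_dist(context):
    return TABLE[tuple(context)]

def generate(prefix, max_tokens=10):
    """Autoregressive generation: sample a token, append it to the
    context, and condition the next prediction on the updated context."""
    context = list(prefix)
    for _ in range(max_tokens):
        dist = next_dist(context)
        tokens, probs = zip(*dist.items())
        token = random.choices(tokens, weights=probs)[0]
        if token == "<eos>":
            break
        context.append(token)
    return context

print(generate(["the"]))  # e.g. ['the', 'sun', 'rises']
```

Note that, exactly as in Fig. 7.2, each sampled token becomes part of the prefix for the next prediction, so the model conditions on both the priming context and its own previous outputs.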
Models in this kind of setting, in which we iteratively predict and generate words left-to-right from earlier words, are often called causal or autoregressive language models. (We will introduce alternative non-autoregressive models, like BERT and other masked language models that predict words using information from both the left and the right, in Chapter 9.)
generative AI
This idea of using computational models to generate text, as well as code, speech, and images, constitutes the important new area called generative AI. Applying LLMs to generate text has vastly broadened the scope of NLP, which historically was focused more on algorithms for parsing or understanding text rather than generating it.
In the rest of the chapter, we'll see that almost any NLP task can be modeled as word prediction in a large language model, if we think about it in the right way, and we'll motivate and introduce the idea of prompting language models. We'll introduce specific algorithms for generating text from a language model, like greedy decoding and sampling. We'll introduce the details of pretraining, the way that language models are self-trained by iteratively being taught to guess the next word in the text from the prior words. We'll sketch out the other two stages of language model training: instruction tuning (also called supervised finetuning or SFT), and alignment, concepts that we'll return to in Chapter 10. And we'll see how to evaluate these models. Let's begin, though, by talking about different kinds of language models.
7.1 Three architectures for language models
The architecture we sketched above for a left-to-right or autoregressive language model, which is the language model architecture we will define in this chapter, is actually only one of three common LM architectures.
The three architectures are the encoder, the decoder, and the encoder-decoder. Fig. 7.3 gives a schematic picture of the three.
Figure 7.3 Three architectures for language models: decoders, encoders, and encoder-decoders. The arrows sketch out the information flow in the three architectures. Decoders take tokens as input and generate tokens as output. Encoders take tokens as input and produce an encoding (a vector representation of each token) as output. Encoder-decoders take tokens as input and generate a series of tokens as output.
decoder
The decoder is the architecture we've introduced above. It takes as input a series of tokens, and iteratively generates an output token one at a time. The decoder is the architecture used to create large language models like GPT, Claude, Llama, and Mistral. The information flow in decoders goes left-to-right, meaning that the model
Figure 7.5 Answering a question by computing the probabilities of the tokens after a prefix stating the question; in this example the correct token Charles has the highest probability.
follow instructions. This extra training is called instruction-tuning. In instruction-tuning we take a base language model that has been trained to predict words, and continue training it on a special dataset of instructions together with the appropriate response to each. The dataset has many examples of questions together with their answers, commands with their responses, and other examples of how to carry on a conversation. We'll discuss the details of instruction-tuning in Chapter 10.
prompt
Language models that have been instruction-tuned are very good at following instructions, answering questions, and carrying on a conversation, and can be prompted. A prompt is a text string that a user issues to a language model to get the model to do something useful. In prompting, the user's prompt string is passed to the language model, which iteratively generates tokens conditioned on the prompt. The process of finding effective prompts for a task is known as prompt engineering.
As suggested above when we introduced conditional generation, a prompt can be a question (like “What is a transformer network?”), possibly in a structured format (like “Q: What is a transformer network? A:”). A prompt can also be an instruction (like “Translate the following sentence into Hindi: 'Chop the garlic finely'”).
More explicit prompts that specify the set of possible answers lead to better performance. For example, here is a prompt template to do sentiment analysis that prespecifies the potential answers:
A prompt consisting of a review plus an incomplete statement
Human: Do you think that “input” has negative or positive sentiment?
Choices:
(P) Positive
(N) Negative
Assistant: I believe the best answer is:(
This prompt uses a number of more sophisticated prompting characteristics. It specifies the two allowable choices (P) and (N), and ends the prompt with the open parenthesis that strongly suggests the answer will be (P) or (N). Note that it also specifies the role of the language model as an assistant.
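A minimal sketch of filling such a template programmatically (the template text follows the example above; the call to an actual model is omitted, and the review string is hypothetical):

```python
# Sentiment prompt template: `review` fills the "input" slot, and the
# trailing "(" nudges the model to answer with "P)" or "N)".
TEMPLATE = (
    'Human: Do you think that "{review}" has negative or positive sentiment?\n'
    "Choices:\n"
    "(P) Positive\n"
    "(N) Negative\n"
    "Assistant: I believe the best answer is:("
)

def build_prompt(review):
    return TEMPLATE.format(review=review)

prompt = build_prompt("The pasta was cold and the service was slow.")
print(prompt)
# A completion call would then read the first generated character and
# map "P" -> positive, "N" -> negative.
```

Constraining the answer space this way makes the model's output trivial to parse, which is part of why such explicit templates tend to perform better.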
Including some labeled examples in the prompt can also improve performance. We call such examples demonstrations. The task of prompting with examples is sometimes called few-shot prompting, as contrasted with zero-shot prompting which means instructions that don't include labeled examples. For example Fig. 7.6
common crawl
Web text is usually taken from corpora of automatically-crawled web pages like the common crawl, a series of snapshots of the entire web produced by the non-profit Common Crawl (https://commoncrawl.org/) that each have billions of webpages. Various versions of common crawl data exist, such as the Colossal Clean Crawled Corpus (C4; Raffel et al. 2020), a corpus of 156 billion tokens of English that is filtered in various ways (deduplicated, removing non-natural language like code, and sentences with offensive words from a blocklist). This C4 corpus seems to consist in large part of patent text documents, Wikipedia, and news sites (Dodge et al., 2021).
The Pile
Wikipedia plays a role in lots of language model training, as do corpora of books. The Pile (Gao et al., 2020) is an 825 GB English text corpus that is constructed by publicly released code, containing again a large amount of text scraped from the web as well as books and Wikipedia; Fig. 7.14 shows its composition. Dolma is a larger open corpus of English, created with public tools, containing three trillion tokens, which similarly consists of web text, academic papers, code, books, encyclopedic materials, and social media (Soldaini et al., 2024).
Figure 7.14 The Pile corpus, showing the size of different components, color coded as academic (articles from PubMed and ArXiv, patents from the USPTO), internet (webtext including a subset of the common crawl as well as Wikipedia), prose (a large corpus of books), dialogue (including movie subtitles and chat data), and misc. Figure from Gao et al. (2020).
Filtering for quality and safety Pretraining data drawn from the web is filtered for both quality and safety. Quality filters are classifiers that assign a score to each document. Quality is of course subjective, so different quality filters are trained in different ways, but often to value high-quality reference corpora like Wikipedia, books, and particular websites, and to avoid websites with lots of PII (Personally Identifiable Information) or adult content. Filters also remove boilerplate text which is very frequent on the web. Another kind of quality filtering is deduplication, which can be done at various levels, so as to remove duplicate documents, duplicate web pages, or duplicate text. Quality filtering generally improves language model performance (Longpre et al., 2024b; Llama Team, 2024).
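Document-level deduplication, one of the quality filters just described, can be sketched with exact hashing over a normalized form of each document (production pipelines often use fuzzy methods such as MinHash instead, which this sketch does not implement):

```python
import hashlib

def dedupe(docs):
    """Keep only the first occurrence of each document, comparing
    whitespace-normalized, lowercased content by SHA-256 hash."""
    seen = set()
    kept = []
    for doc in docs:
        normalized = " ".join(doc.split()).lower()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

docs = ["Hello world", "hello   WORLD", "Something else"]
print(dedupe(docs))  # ['Hello world', 'Something else']
```

The same idea applies at the page or passage level by hashing smaller units instead of whole documents.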
Safety filtering is again a subjective decision, and often includes toxicity detection based on running off-the-shelf toxicity classifiers. This can have mixed results. One problem is that current toxicity classifiers mistakenly flag non-toxic data if it
69 elements · 2 tables · 8 images
PDF Parser

Supported Input Formats

📄PDF
📝Word
📊Excel
📋PowerPoint
🖼️JPEG / PNG
🗂️TIFF
🌐HTML
📃TXT
✍️Markdown
📚EPUB

Processing Pipeline

01 Parsing: OCR + layout analysis
02 Chunking: semantic segmentation
03 Embedding: vector embeddings
04 Extraction: structured output
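The four stages above can be sketched as a toy pipeline. Every function body here is a hypothetical stand-in (plain text decoding, fixed-size windows, length-based vectors), not the product's actual OCR, segmentation, or embedding implementation:

```python
def parse(raw_bytes):
    # 01 Parsing: real OCR + layout analysis would go here;
    # this stand-in just decodes plain text.
    return raw_bytes.decode("utf-8")

def chunk(text, size=40):
    # 02 Chunking: real semantic segmentation splits on meaning;
    # this toy version splits on fixed character windows.
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunks):
    # 03 Embedding: stand-in vector = (char length, word count).
    return [(len(c), len(c.split())) for c in chunks]

def extract(chunks, vectors):
    # 04 Extraction: package chunks and vectors as structured output.
    return [{"text": c, "embedding": v} for c, v in zip(chunks, vectors)]

doc = b"Quarterly revenue grew 12% year over year, driven by subscriptions."
chunks = chunk(parse(doc))
result = extract(chunks, embed(chunks))
print(len(result), result[0]["text"][:20])
```

The structured records that come out of the final stage are what downstream RAG or LLM applications would consume.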

Use Cases

An AI document parser and document data extractor for every workflow

From PDF data extraction to résumé and bank statement parsing, handle every document processing task your application needs with a single API.

💰
Finance & Banking
Annual reports, 10-K filings, balance sheets
🧾
Receipts & Invoices
VAT receipts, purchase orders, payment receipts
⚕️
Healthcare
Medical records, lab result reports, prescriptions
⚖️
Legal & Contracts
Contracts, NDAs, regulatory filings

Popular Features

PDF Parser and Data Extraction

Extract data from any PDF: financial reports, research papers, scanned receipts, and more. Export results as JSON, Markdown, CSV, or XML with normalized coordinates for every element. Ideal for building RAG pipelines, LLM applications, and automated document workflows.

PDF parsing · PDF data extraction · PDF to JSON converter · PDF to CSV · PDF parser · PDF extraction
Try this use case for free
output.json
{
"document_type": "financial_report",
"pages": 12,
"confidence": 0.995,
"tables_detected": 3,
"elements": ["Title", "Table", "NarrativeText", ...],
"output_formats": ["markdown", "json", "csv", "xml"]
}
HR & Recruiting

AI Résumé Parser and CV Extraction

Automatically extract candidate information from résumés and CVs in any format, including PDF, Word, and image scans. Name, contact details, skills, work experience, and education fields are returned as structured data. Plug the output straight into your ATS or HR automation pipeline.

Résumé parser · AI résumé parser · Résumé data extraction · CV extraction
Try this use case for free
output.json
{
"name": "Sarah Chen",
"email": "sarah.chen@email.com",
"skills": ["Python", "Machine Learning", "SQL", ...],
"experience_years": 6,
"current_title": "Senior Data Scientist",
"education": ["M.S. Computer Science, MIT"]
}
Finance & Compliance

Bank Statement Parser

Convert bank statements from any financial institution into cleanly structured CSV or JSON. Transaction dates, descriptions, debit and credit amounts, and balances are extracted automatically. Supports multi-page statements, foreign currency accounts, and scanned PDFs.

Bank statement parser · Bank statement to CSV
Try this use case for free
Extracted Table (CSV Preview) · ✓ Parsed
Date | Description | Debit | Credit | Balance
2024-09-01 | STRIPE PAYOUT | | $12,400.00 | $48,320.00
2024-09-03 | AWS INVOICE | $2,340.00 | | $45,980.00
2024-09-05 | PAYROLL — ACME | $18,500.00 | | $27,480.00
2024-09-07 | CUSTOMER REFUND | $450.00 | | $27,030.00
2024-09-10 | VENDOR PAYMENT | $3,200.00 | | $23,830.00
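As a sketch of the downstream "structured CSV" step, rows like those in the preview above can be serialized with Python's csv module (the row values here mirror the first two sample transactions):

```python
import csv
import io

# Extracted statement rows: empty string = no debit or no credit.
rows = [
    ["Date", "Description", "Debit", "Credit", "Balance"],
    ["2024-09-01", "STRIPE PAYOUT", "", "$12,400.00", "$48,320.00"],
    ["2024-09-03", "AWS INVOICE", "$2,340.00", "", "$45,980.00"],
]

# Write to an in-memory buffer; csv handles quoting of values that
# contain commas (like "$12,400.00") automatically.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue())
```

A real integration would write to a file or stream the CSV into a data pipeline instead of an in-memory buffer.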
Enterprise

Document Extraction Software and Automation

Scale document processing and document parsing across your entire organization. Batch-process thousands of documents through the API, trigger webhooks on completion, and integrate with your existing data pipelines. Beyond PDF, Word, Excel, PowerPoint, HTML, and 50+ other formats are supported.

Document extraction · Document parser · Document extraction software · PDF document automation · Document processing
Try this use case for free
output.md · Markdown
## Batch Processing Pipeline
POST /api/xparse/pipeline
✓ 1,240 documents queued
✓ Processing: 12 concurrent workers
✓ Completed: 1,198 / 1,240
PDF · DOCX · XLSX · PPTX · HTML
OCR & Digitization

Handwritten Text Conversion

Digitize handwritten forms, notes, prescriptions, and receipts with high accuracy. Our multi-engine OCR handles documents that mix printed text and handwriting on the same page, and supports both English and multilingual documents.

Handwritten text converter · OCR handwriting · Handwriting recognition
Try this use case for free
output.json
{
"type": "HandwrittenText",
"text": "Patient: John D. / DOB: 1985-03-12",
"confidence": 0.97,
"language": "en",
"mixed_content": true,
"coordinates": [0.08, 0.12, 0.72, 0.18]
}

Ready to parse your first document?

Sign in and get 100 free credits. No credit card required.