Lens

Pipelines

Transform raw data into structured intelligence through modular processing steps.

Overview

In Lens, a **Pipeline** is a series of transformations applied to collection results. Think of it as middleware for research data: each step receives the current set of results, performs one operation, and passes its output to the next step.

yaml
# Example pipeline configuration
pipelines:
  - name: "clean_data"
    steps: ["dedupe", "remove_boilerplate"]
  - name: "synthesize"
    steps: ["summarize", "cluster"]
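The step-to-step handoff described above can be sketched in plain Python. This is a minimal illustration and not Lens's actual implementation: the `STEP_REGISTRY` mapping, the `run_pipeline` helper, the dict-based result shape, and the bodies of both steps are assumptions made for the example.

```python
from typing import Callable, Dict, List

# A step takes the current results and returns the results for the next step.
Step = Callable[[List[dict]], List[dict]]


def dedupe(items: List[dict]) -> List[dict]:
    """Drop results whose URL has already been seen (illustrative only)."""
    seen = set()
    unique = []
    for item in items:
        if item["url"] not in seen:
            seen.add(item["url"])
            unique.append(item)
    return unique


def remove_boilerplate(items: List[dict]) -> List[dict]:
    """Trim surrounding whitespace as a stand-in for real boilerplate removal."""
    return [{**item, "text": item["text"].strip()} for item in items]


# Hypothetical registry mapping the step names used in the YAML to functions.
STEP_REGISTRY: Dict[str, Step] = {
    "dedupe": dedupe,
    "remove_boilerplate": remove_boilerplate,
}


def run_pipeline(step_names: List[str], items: List[dict]) -> List[dict]:
    """Apply each named step in order, feeding its output to the next one."""
    for name in step_names:
        items = STEP_REGISTRY[name](items)
    return items


results = [
    {"url": "https://example.org/a", "text": "  First result.  "},
    {"url": "https://example.org/a", "text": "  First result.  "},
    {"url": "https://example.org/b", "text": "Second result."},
]
cleaned = run_pipeline(["dedupe", "remove_boilerplate"], results)
```

Because every step has the same signature, steps can be reordered or swapped in the configuration without changing the runner.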

Core Pipeline Steps

Deduplication

Removes identical or near-identical results retrieved from different sources (e.g., the same paper on arXiv and Crossref).
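One simple way to catch near-identical records is to compare normalized titles. The sketch below assumes that approach; Lens's built-in step may use a different strategy, and the `Item` class is a hypothetical stand-in for ResearchItem with only the fields the example needs.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Item:
    """Hypothetical stand-in for ResearchItem."""
    title: str
    source: str


def normalize_title(title: str) -> str:
    """Lowercase the title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def dedupe_by_title(items: List[Item]) -> List[Item]:
    """Keep the first item seen for each normalized title."""
    seen: Dict[str, Item] = {}
    for item in items:
        seen.setdefault(normalize_title(item.title), item)
    return list(seen.values())


items = [
    Item("Attention Is All You Need", "arxiv"),
    Item("Attention is all you need.", "crossref"),
    Item("Deep Residual Learning for Image Recognition", "arxiv"),
]
unique = dedupe_by_title(items)
```

Here the arXiv and Crossref records collapse into one entry because their titles differ only in casing and punctuation.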

Clustering

Groups related results into conceptual clusters, allowing the synthesis engine to identify primary themes in the research.
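To make the idea of grouping concrete, here is a deliberately small sketch that clusters titles greedily by word overlap (Jaccard similarity). It is an assumption-laden illustration, not the algorithm Lens uses; a production step would more likely compare embeddings. The function names and the 0.3 threshold are chosen for the example only.

```python
from typing import List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Split a title into a set of lowercase words."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_titles(titles: List[str], threshold: float = 0.3) -> List[List[str]]:
    """Assign each title to the first cluster whose seed title is similar enough."""
    clusters: List[Tuple[Set[str], List[str]]] = []
    for title in titles:
        toks = tokens(title)
        for seed_tokens, members in clusters:
            if jaccard(seed_tokens, toks) >= threshold:
                members.append(title)
                break
        else:
            # No existing cluster matched, so this title seeds a new one.
            clusters.append((toks, [title]))
    return [members for _, members in clusters]


titles = [
    "graph neural networks for chemistry",
    "graph neural networks for molecules",
    "transformer models for translation",
]
groups = cluster_titles(titles)
```

The two graph-network titles share four of six distinct words and land in one cluster, while the translation title shares only "for" and starts its own.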

Summarization

Uses an LLM to generate concise summaries of individual items or of an entire research cluster.

Custom Pipelines

You can build your own pipeline steps by writing a Python function, decorated with @pipeline_step, that accepts a list of ResearchItem objects and returns the list to pass to the next step.

python
from lens.pipelines import pipeline_step

@pipeline_step
def my_custom_filter(items):
    # Keep only items with a high authority score
    return [i for i in items if i.authority_score > 0.8]

Pipeline Performance

While collection is I/O bound, pipelines are often CPU or LLM-inference bound. For large datasets, consider using our built-in async pipeline processors.
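When a step spends most of its time waiting on a remote inference call, running those calls concurrently with a bounded limit is a common pattern. The sketch below shows it with the standard library's `asyncio`; it does not use or represent Lens's built-in async processors, and `fake_summarize` is a hypothetical placeholder for a real LLM request.

```python
import asyncio
from typing import List


async def fake_summarize(text: str) -> str:
    """Placeholder for an LLM call: waits briefly, then truncates the text."""
    await asyncio.sleep(0.01)
    return text[:20]


async def summarize_all(texts: List[str], max_concurrency: int = 4) -> List[str]:
    """Summarize all texts with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(text: str) -> str:
        async with semaphore:
            return await fake_summarize(text)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(worker(t) for t in texts))


texts = [f"Document number {i} with some longer body text" for i in range(10)]
summaries = asyncio.run(summarize_all(texts))
```

The semaphore keeps the number of simultaneous requests within a fixed budget, which helps when the inference backend enforces rate limits. Steps that are genuinely CPU bound will not benefit from `asyncio` alone and usually need process-level parallelism instead.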