Weaviate vdb

2024. 10. 30. 10:37

https://weaviate.io/developers/weaviate/introduction
- We fly...
  - Weaviate is a low-latency Vector Database for different media types. (text, images, etc)
  - It offers Semantic Search, Question-Answer Extraction, Classification, etc.
  - Built from scratch in Go.
  - Combining "vector-search" with "structured-filtering".
  - The fault tolerance of a cloud-native database.
- Features
  - Fast queries
  - Ingest any media type with Weaviate Modules
  - Combine vector and scalar search
  - Real-time and persistent
  - Horizontal Scalability
  - High-Availability
  - Cost-Effectiveness
  - Graph-like connections between objects

https://weaviate.io/developers/weaviate/manage-data

Collection

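# Create the collection (table)
# Assumes weaviate-client v4; {db_client} stands for a connected client instance.
from weaviate.classes.config import (
    Configure, DataType, Property, Tokenization,
    VectorDistances, VectorFilterStrategy)
from weaviate.classes.tenants import Tenant

{db_client}.collections.create(
    "Name",  # collection name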
    vectorizer_config=Configure.Vectorizer.text2vec_openai(
        model='text-embedding-3-large', 
        vectorize_collection_name=True,  # vectorize the collection name.
        dimensions=3072),
    # vectorizer_config=[
    #    Configure.NamedVectors.text2vec_openai(
    #        name="title_country",
    #        source_properties=["title", "country"]),
    #    Specific properties can be set up as a named vector this way!
    # ],
		
    # vector_index_config=Configure.VectorIndex.hnsw(
    #   quantizer=Configure.VectorIndex.Quantizer.bq(),
    #   ef_construction=300,
    #   distance_metric=VectorDistances.COSINE,
    #   filter_strategy=VectorFilterStrategy.SWEEPING),  #HNSW index
    # vector_index_config=Configure.VectorIndex.flat(),  #FLAT index
    vector_index_config=Configure.VectorIndex.dynamic(),  #DYNAMIC index
    
    inverted_index_config=Configure.inverted_index(
        index_null_state=True,
        index_property_length=True,
        index_timestamps=True),
    
    #reranker_config=Configure.Reranker.cohere(),  # Optional
    
    generative_config=Configure.Generative.openai(
        model='gpt-4o-mini'),
    multi_tenancy_config=Configure.multi_tenancy(
        enabled=True,
        auto_tenant_creation=True,
        auto_tenant_activation=True
    ),
    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT,
            vectorize_property_name=True,  # Use "title" as part of the value to vectorize
            tokenization=Tokenization.LOWERCASE,
            index_filterable=True,
            index_searchable=True,
            index_range_filters=False,
        ),
        Property(
            name="body",
            data_type=DataType.TEXT,
            skip_vectorization=True,  # Don't vectorize this property
            tokenization=Tokenization.WHITESPACE
        ),
    ]
)

# Create one tenant per user
{collection}.tenants.create(
    tenants=[Tenant(name="user_id")]
)

inverted_index_config :
- https://weaviate.io/developers/weaviate/config-refs/schema#invertedindexconfig
- index_timestamps :
  - To perform queries that are filtered by timestamps → objects' internal timestamps
  - `creationTimeUnix` and `lastUpdateTimeUnix`
multi_tenancy_config :
- Each tenant is stored on a separate shard.
- If your application serves many different users, multi-tenancy keeps their data private and makes database operations more efficient.
- auto_tenant_creation & auto_tenant_activation :
  - https://weaviate.io/developers/academy/py/multitenancy/setup#-enable-multi-tenancy
property.tokenization :
- WORD : keeps only alphanumeric characters, lowercases them, and splits on whitespace (other characters act as separators).
  - e.g. "Test_domain_weaviate" → 'test', 'domain', 'weaviate'
- KAGOME_KR : specialized for Korean!!
- https://weaviate.io/developers/weaviate/config-refs/schema#tokenization
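
To make the difference between the tokenization options concrete, here is a minimal pure-Python sketch. It only approximates the behaviour described above and does not call Weaviate. The function name `tokenize` is hypothetical, the regex is ASCII-only, and the "field" branch is an assumption based on the linked schema reference.

```python
import re


def tokenize(text: str, method: str) -> list[str]:
    """Approximate four tokenization options in plain Python (illustrative only)."""
    if method == "word":
        # Lowercase, then keep only alphanumeric runs; every other
        # character (including "_") acts as a separator.
        return re.findall(r"[a-z0-9]+", text.lower())
    if method == "lowercase":
        # Lowercase the whole text, then split on whitespace only.
        return text.lower().split()
    if method == "whitespace":
        # Split on whitespace only; case is preserved.
        return text.split()
    if method == "field":
        # Assumption: the whole trimmed value is indexed as a single token.
        return [text.strip()]
    raise ValueError(f"unknown tokenization: {method}")


print(tokenize("Test_domain_weaviate", "word"))
print(tokenize("Test_domain_weaviate", "lowercase"))
```

With WORD the underscore separates tokens, while LOWERCASE and WHITESPACE keep "Test_domain_weaviate" as one token. This matters for the "title" and "body" properties configured above.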

Tenant State
- https://weaviate.io/developers/weaviate/manage-data/multi-tenancy
- https://weaviate.io/developers/weaviate/manage-data/tenant-states
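- Active : the tenant is activated and its data moves to memory or SSD.
- Inactive : the tenant is deactivated and its data moves to SSD.
- Offloaded : the tenant is deactivated and its data moves to S3. (Reactivation incurs latency.)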
...
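
The three states can be summarized as a small lookup table. The sketch below is plain Python rather than the Weaviate client. The `TenantState` enum, its fields, and `reactivation_is_slow` are hypothetical names for illustration, and the "queryable" flag is an assumption that only active tenants serve requests directly.

```python
from enum import Enum


class TenantState(Enum):
    # value = (storage tier, directly queryable?)
    ACTIVE = ("memory/SSD", True)
    INACTIVE = ("SSD", False)
    OFFLOADED = ("S3", False)

    @property
    def tier(self) -> str:
        return self.value[0]

    @property
    def queryable(self) -> bool:
        return self.value[1]


def reactivation_is_slow(state: TenantState) -> bool:
    """Only offloaded tenants have to be fetched back from S3 first."""
    return state is TenantState.OFFLOADED


for state in TenantState:
    print(state.name, state.tier, state.queryable, reactivation_is_slow(state))
```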

https://weaviate.io/developers/weaviate/concepts
- Multi-tenancy
  - Sharding has several benefits
    - Data isolation
    - Fast, efficient querying
    - Easy and robust setup and clean up
  - Each DB server node can hold 50,000 or more active shards.
  - Each tenant has a dedicated, high-performance vector index.
  - Multi-tenancy is especially useful when you want to store data for multiple customers!
  - IDs : the combination Tenant_ID + Object_UUID is unique!
  - Cross-References :
    - multi-tenancy → non-multi-tenancy (O)
    - multi-tenancy → same multi-tenancy (O)
    - non-multi-tenancy → multi-tenancy (X)
    - multi-tenancy → diff multi-tenancy (X)
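
The ID and cross-reference rules above can be written down as two small functions. This is a sketch in plain Python, not Weaviate code. It assumes "same multi-tenancy" means "the same tenant", and uses `None` to mean "the object lives in a non-multi-tenant collection". Both function names are hypothetical.

```python
from typing import Optional


def object_key(tenant_id: str, object_uuid: str) -> tuple[str, str]:
    """An object is identified by tenant ID plus UUID, not by UUID alone."""
    return (tenant_id, object_uuid)


def cross_reference_allowed(source_tenant: Optional[str],
                            target_tenant: Optional[str]) -> bool:
    """Check a cross-reference; None marks a non-multi-tenant object."""
    if source_tenant is None:
        # non-multi-tenancy -> multi-tenancy is rejected;
        # non-multi-tenancy -> non-multi-tenancy is an ordinary reference.
        return target_tenant is None
    # multi-tenancy -> non-multi-tenancy, or -> the same tenant, is fine;
    # multi-tenancy -> a different tenant is rejected.
    return target_tenant is None or target_tenant == source_tenant


# The same UUID under two tenants yields two distinct keys.
print(object_key("tenantA", "uuid-1") != object_key("tenantB", "uuid-1"))
print(cross_reference_allowed("tenantA", "tenantB"))
```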
- Compression
  - BQ :
  - PQ :
  - SQ :
- Indexing
  - Vector indexes (vector-search)
    - HNSW :
      - RAM{Hot} ↔ SSD{Warm} ↔ S3{Cold}
      - ApproximateNearestNeighbor(ANN) search based vector index.
      - scale well with large datasets.
    - Flat :
      - SSD{Warm} ↔ SSD{Warm} ↔ S3{Cold}
      - for brute-force searches.
      - useful for small datasets.
    - Dynamic :
      - RAM{Hot}/SSD{Warm} ↔ SSD{Warm} ↔ S3{Cold}
      - when the dataset is small ←[switch]→ when the dataset is large.
  - Inverted indexes (keyword-search)
    - Configured per Property of the Collection
      - https://weaviate.io/developers/weaviate/concepts/indexing#inverted-index-types-summary
      - https://weaviate.io/developers/weaviate/concepts/indexing#inverted-index-for-timestamps
    - indexSearchable : BM25 | hybrid search. (Can also be used for filtering, but slower than indexFilterable.)
    - indexFilterable : a match-based index for fast filtering by matching criteria.
    - indexRangeFilters : a range-based index for filtering by numerical ranges.
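
To illustrate the Flat and Dynamic entries above, here is a minimal pure-Python sketch of a brute-force search plus the size-based switch. It is not how Weaviate is implemented. The function names are hypothetical, and the threshold of 10,000 objects is an assumption about the default that should be checked against the docs.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def flat_search(query: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Brute force: score every stored vector, return the k nearest IDs."""
    ranked = sorted(vectors, key=lambda oid: cosine_distance(query, vectors[oid]))
    return ranked[:k]


def dynamic_index_type(object_count: int, threshold: int = 10_000) -> str:
    """Small datasets stay on the flat index; larger ones use HNSW.

    The threshold value is an assumed default, not taken from the notes above.
    """
    return "flat" if object_count <= threshold else "hnsw"


vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
print(flat_search([1.0, 0.0], vectors, k=2))
print(dynamic_index_type(100), dynamic_index_type(50_000))
```

The flat search costs one distance computation per stored object, which is why it only suits small datasets. HNSW trades exactness for speed at scale.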
- Vector Indexing (심화)
  - ...
  - ASYNC_INDEXING : https://weaviate.io/developers/weaviate/concepts/vector-index#dynamic-index
  - ...
- Filtering
  - Efficient Pre-Filtered Search
    - Each shard contains an inverted index right next to the HNSW index.
    - This allows for efficient pre-filtering.
  - Filter strategy
    - ACORN :
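      - Especially useful when the filter has low correlation with the query vector.
      - (i.e., when the filter excludes many objects in the region of the graph most similar to the query vector)
      - Performs better on large-scale data.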
    - Sweeping :
      - The existing and current default filter strategy in Weaviate.
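
Pre-filtering means the inverted index first produces an allow list of matching object IDs, and the vector search then considers only those IDs. The sketch below shows that flow in plain Python. It uses a brute-force distance scan in place of HNSW, and all names and sample data are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(objects: dict[str, dict], prop: str) -> dict[str, set[str]]:
    """Map each property value to the set of object IDs that carry it."""
    index: dict[str, set[str]] = defaultdict(set)
    for oid, obj in objects.items():
        index[obj["properties"][prop]].add(oid)
    return index


def squared_distance(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pre_filtered_search(query, objects, index, value, k):
    """Step 1: the inverted index yields the allow list.
    Step 2: the vector search ranks only allow-listed objects."""
    allow_list = index.get(value, set())
    ranked = sorted(
        allow_list,
        key=lambda oid: squared_distance(query, objects[oid]["vector"]),
    )
    return ranked[:k]


objects = {
    "o1": {"properties": {"country": "KR"}, "vector": [0.0, 0.0]},
    "o2": {"properties": {"country": "US"}, "vector": [0.1, 0.0]},
    "o3": {"properties": {"country": "KR"}, "vector": [1.0, 1.0]},
}
index = build_inverted_index(objects, "country")
# "o2" is closest to the query but is excluded by the filter.
print(pre_filtered_search([0.1, 0.0], objects, index, "KR", k=2))
```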
- Reranking
  - ...
...

-End-
