Weaviate vdb
2024. 10. 30. 10:37
- Weaviate is a low-latency vector database that supports different media types (text, images, etc.).
- It offers Semantic Search, Question-Answer Extraction, Classification, etc.
- Built from scratch in Go.
- Combines vector search with structured filtering.
- Offers the fault tolerance of a cloud-native database.
- Features
- Fast queries
- Ingest any media type with Weaviate Modules
- Combine vector and scalar search
- Real-time and persistent
- Horizontal Scalability
- High-Availability
- Cost-Effectiveness
- Graph-like connections between objects
- Collection

    # Weaviate Python client v4 imports
    from weaviate.classes.config import Configure, DataType, Property, Tokenization
    from weaviate.classes.tenants import Tenant
    # Needed only for the commented-out HNSW option below:
    # from weaviate.classes.config import VectorDistances, VectorFilterStrategy

    # `client` is assumed to be an already connected weaviate.WeaviateClient.

    # Create the collection (table)
    collection = client.collections.create(
        "CollectionName",
        vectorizer_config=Configure.Vectorizer.text2vec_openai(
            model="text-embedding-3-large",
            vectorize_collection_name=True,  # vectorize the collection name
            dimensions=3072,
        ),
        # Alternatively, specific properties can be configured as a named vector:
        # vectorizer_config=[
        #     Configure.NamedVectors.text2vec_openai(
        #         name="title_country",
        #         source_properties=["title", "country"]),
        # ],
        # vector_index_config=Configure.VectorIndex.hnsw(  # HNSW index
        #     ef_construction=300,
        #     distance_metric=VectorDistances.COSINE,
        #     filter_strategy=VectorFilterStrategy.SWEEPING),
        # vector_index_config=Configure.VectorIndex.flat(),  # FLAT index
        vector_index_config=Configure.VectorIndex.dynamic(),  # DYNAMIC index
        inverted_index_config=Configure.inverted_index(
            index_null_state=True,
            index_property_length=True,
            index_timestamps=True,
        ),
        # reranker_config=Configure.Reranker.cohere(),  # optional
        generative_config=Configure.Generative.openai(model="gpt-4o-mini"),
        multi_tenancy_config=Configure.multi_tenancy(
            enabled=True,
            auto_tenant_creation=True,
            auto_tenant_activation=True,
        ),
        properties=[
            Property(
                name="title",
                data_type=DataType.TEXT,
                vectorize_property_name=True,  # use "title" as part of the value to vectorize
                tokenization=Tokenization.LOWERCASE,
                index_filterable=True,
                index_searchable=True,
                index_range_filters=False,
            ),
            Property(
                name="body",
                data_type=DataType.TEXT,
                skip_vectorization=True,  # don't vectorize this property
                tokenization=Tokenization.WHITESPACE,
            ),
        ],
    )

    # Create one tenant per user
    collection.tenants.create(
        tenants=[Tenant(name="user_id")]
    )

- inverted_index_config :
- index_timestamps :
- Enables queries that filter on each object's internal timestamps:
- `creationTimeUnix` and `lastUpdateTimeUnix`
- multi_tenancy_config :
- Each tenant is stored on a separate shard.
- If your application serves many different users, multi-tenancy keeps their data private and makes database operations more efficient.
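- auto_tenant_creation & auto_tenant_activation : automatically create a tenant that does not exist yet when an object is inserted for it, and automatically activate an inactive tenant when it is accessed.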
- property.tokenization :
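- WORD : the tokenizer keeps only alphanumeric characters, lowercases them, and splits on whitespace (other characters act as separators).
- e.g. "Test_domain_weaviate" → 'test', 'domain', 'weaviate'
- KAGOME_KR : specialized for Korean.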
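The tokenization options can be approximated in plain Python to see how they differ. This is a minimal sketch, not Weaviate's implementation: the function names are hypothetical, and the `word` behaviour is modelled on the example above, where the underscore acts as a separator.

```python
import re

def tokenize_word(text: str) -> list[str]:
    """Approximates `word`: keep alphanumeric runs only, lowercased."""
    # [^\W_]+ matches runs of letters/digits (underscore excluded).
    return [t.lower() for t in re.findall(r"[^\W_]+", text)]

def tokenize_lowercase(text: str) -> list[str]:
    """Approximates `lowercase`: lowercase, then split on whitespace."""
    return text.lower().split()

def tokenize_whitespace(text: str) -> list[str]:
    """Approximates `whitespace`: split on whitespace, keep case."""
    return text.split()

sample = "Test_domain_weaviate"
print(tokenize_word(sample))        # ['test', 'domain', 'weaviate']
print(tokenize_lowercase(sample))   # ['test_domain_weaviate']
print(tokenize_whitespace(sample))  # ['Test_domain_weaviate']
```

The choice matters for filtering and keyword search: with `word`, a query for "domain" matches the sample, while with `lowercase` or `whitespace` only the full token does.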
- Tenant State
- Active : the tenant is active; its data is held in memory or on SSD.
- Inactive : the tenant is inactive; its data is moved to SSD.
- Offloaded : the tenant is inactive; its data is moved to S3 (reactivation incurs latency).
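The three tenant states can be summarized as a small lookup model. This is an illustrative sketch only: `TenantState` and its properties are hypothetical names, not part of the Weaviate client, and the hot/warm/cold labels are the usual storage-tier shorthand for memory, SSD, and S3.

```python
from enum import Enum

class TenantState(Enum):
    # value = (storage tier, where the data lives according to the notes)
    ACTIVE = ("hot", "memory or SSD")
    INACTIVE = ("warm", "SSD")
    OFFLOADED = ("cold", "S3")

    @property
    def tier(self) -> str:
        return self.value[0]

    @property
    def location(self) -> str:
        return self.value[1]

    @property
    def needs_reactivation(self) -> bool:
        """Anything other than ACTIVE has to be activated before use."""
        return self is not TenantState.ACTIVE

for state in TenantState:
    print(f"{state.name:<9} tier={state.tier:<4} location={state.location}")
```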
- Multi-tenancy
- Sharding has several benefits
- Data isolation
- Fast, efficient querying
- Easy and robust setup and clean up
- Each DB server node can hold 50,000 or more active shards.
- Each tenant has a dedicated, high-performance vector index.
- Multi-tenancy is especially useful when you want to store data for multiple customers.
- IDs : the combination Tenant_ID + Object_UUID is unique.
- Cross-References :
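- multi-tenancy object → non-multi-tenancy object (allowed)
- multi-tenancy object → multi-tenancy object of the same tenant (allowed)
- non-multi-tenancy object → multi-tenancy object (not allowed)
- multi-tenancy object → multi-tenancy object of a different tenant (not allowed)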
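The cross-reference rules reduce to a small predicate. The sketch below is a hypothetical helper, not a Weaviate API; it assumes `None` stands for an object in a non-multi-tenancy collection, and that references between two non-multi-tenancy objects are allowed (a case the list above does not spell out).

```python
from typing import Optional

def cross_reference_allowed(source_tenant: Optional[str],
                            target_tenant: Optional[str]) -> bool:
    """Return True if a reference from source to target is permitted.

    None means the object lives in a non-multi-tenancy collection.
    """
    if source_tenant is None:
        # A non-multi-tenancy object may not point into a tenant.
        return target_tenant is None
    # A tenant object may point to shared (non-tenant) data or to its own tenant.
    return target_tenant is None or target_tenant == source_tenant

print(cross_reference_allowed("tenantA", None))       # True
print(cross_reference_allowed("tenantA", "tenantA"))  # True
print(cross_reference_allowed(None, "tenantA"))       # False
print(cross_reference_allowed("tenantA", "tenantB"))  # False
```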
- Compression
- BQ (binary quantization) :
- PQ (product quantization) :
- SQ (scalar quantization) :
- Indexing
- Vector indexes (vector-search)
- HNSW :
- RAM{Hot} ↔ SSD{Warm} ↔ S3{Cold}
- A vector index based on approximate nearest neighbor (ANN) search.
- Scales well with large datasets.
- Flat :
- SSD{Warm} ↔ SSD{Warm} ↔ S3{Cold}
- For brute-force (exhaustive) searches.
- Useful for small datasets.
- Dynamic :
- RAM{Hot}/SSD{Warm} ↔ SSD{Warm} ↔ S3{Cold}
- Starts as a flat index while the dataset is small and switches to HNSW once the dataset becomes large.
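To make the difference between the index types concrete, here is a minimal sketch of a flat (brute-force) search and of the decision a dynamic index makes. The function names are hypothetical, and the threshold of 10,000 objects is an assumed placeholder for the configurable switch point. HNSW itself is not implemented here.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def flat_search(vectors, query, k=1):
    """Brute force: score every stored vector and return the k closest ids."""
    ranked = sorted(vectors, key=lambda oid: cosine_distance(vectors[oid], query))
    return ranked[:k]

def choose_index(object_count, threshold=10_000):
    """Dynamic index idea: flat while small, HNSW once past the threshold."""
    return "flat" if object_count <= threshold else "hnsw"

vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
print(flat_search(vectors, [1.0, 0.1], k=2))  # ['a', 'c']
print(choose_index(500))     # flat
print(choose_index(50_000))  # hnsw
```

Flat search costs one distance computation per stored object, which is why it suits small datasets, while an ANN index trades a little accuracy for far fewer comparisons.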
- Inverted indexes (keyword-search)
- Configured per property of the collection.
- indexSearchable : for BM25 and hybrid search. (Can also be used for filtering, but is slower than indexFilterable.)
- indexFilterable : a match-based index for fast filtering by matching criteria.
- indexRangeFilters : a range-based index for filtering by numerical ranges.
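The three inverted index kinds can be pictured as three different lookup structures. The sketch below is a simplified stand-in using plain Python containers (the sample documents, the `price` property, and `price_between` are made up for illustration); BM25 scoring is omitted, so the searchable index only does token lookup.

```python
import bisect
from collections import defaultdict

docs = {
    1: {"title": "vector database", "price": 10},
    2: {"title": "graph database", "price": 25},
    3: {"title": "vector search", "price": 40},
}

searchable = defaultdict(set)   # indexSearchable-like: token -> ids
filterable = defaultdict(set)   # indexFilterable-like: exact value -> ids
range_index = []                # indexRangeFilters-like: sorted (value, id)

for doc_id, doc in docs.items():
    for token in doc["title"].lower().split():
        searchable[token].add(doc_id)
    filterable[doc["title"]].add(doc_id)
    bisect.insort(range_index, (doc["price"], doc_id))

def price_between(low, high):
    """Range filter: binary-search both ends of the sorted list."""
    lo = bisect.bisect_left(range_index, (low, float("-inf")))
    hi = bisect.bisect_right(range_index, (high, float("inf")))
    return {doc_id for _, doc_id in range_index[lo:hi]}

print(sorted(searchable["vector"]))          # [1, 3]
print(sorted(filterable["graph database"]))  # [2]
print(sorted(price_between(20, 50)))         # [2, 3]
```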
- Vector Indexing (advanced)
- Filtering
- Efficient Pre-Filtered Search
- Each shard contains an inverted index right next to the HNSW index.
- This allows for efficient pre-filtering.
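Pre-filtering can be sketched in two steps: the inverted index first produces an allow-list of object ids, and the vector search then considers only those ids. In this minimal sketch a brute-force scan stands in for the HNSW traversal, and the sample objects, the `country` property values, and `prefiltered_search` are illustrative assumptions.

```python
import math

objects = {
    1: {"country": "KR", "vector": [1.0, 0.0]},
    2: {"country": "US", "vector": [0.9, 0.1]},
    3: {"country": "KR", "vector": [0.0, 1.0]},
}

# Inverted index kept next to the vectors: property value -> object ids.
inverted = {}
for oid, obj in objects.items():
    inverted.setdefault(obj["country"], set()).add(oid)

def prefiltered_search(query, country, k=1):
    # Step 1: resolve the filter to an allow-list via the inverted index.
    allow_list = inverted.get(country, set())
    # Step 2: rank only the allowed objects by distance to the query.
    ranked = sorted(allow_list,
                    key=lambda oid: math.dist(objects[oid]["vector"], query))
    return ranked[:k]

print(prefiltered_search([1.0, 0.0], "US"))       # [2]
print(prefiltered_search([0.9, 0.1], "KR", k=2))  # [1, 3]
```

Because the filter is applied before ranking, the result always contains up to `k` matching objects, unlike post-filtering, which can discard most of the nearest neighbors.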
- Filter strategy
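- ACORN :
- Especially useful when the filter has low correlation with the query vector
- (i.e., when the filter excludes many objects in the graph region most similar to the query vector).
- More effective on large datasets.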
- Sweeping :
- The existing and current default filter strategy in Weaviate.
- Reranking
