INL/BlackLab
Fork: 52 Star: 91 (更新于 1970-01-01 00:00:00)
license: Apache-2.0
Language: Java .
Linguistic search for large annotated text corpora, based on Apache Lucene
最后发布版本: v3.0.1 ( 2022-10-06 19:11:25)
What is BlackLab?
BlackLab is a corpus retrieval engine built on top of Apache Lucene. It allows fast, complex searches with accurate hit highlighting on large, tagged and annotated, bodies of text. It was developed at the Dutch Language Institute (INT) to provide a fast and feature-rich search interface on our contemporary and historical text corpora.
In addition to the Java library (BlackLab Core), there is also a web service (BlackLab Server), so you can access it from any programming language.
BlackLab is licensed under the Apache License 2.0.
To learn how to index and search your data, see the official project site.
To learn about BlackLab development, see the dev docs.
Branches
The default branch, dev, corresponds to the "bleeding edge" in-development version. You can certainly run it (we do), but if you need maximum stability, it might be better to stay on a stable release instead.
The branch that corresponds to BlackLab's latest release is called main.
There are additional branches related to in-development features. These are intended to be short-lived and will be merged into dev.
Compatibility: Java, Lucene
This version of BlackLab required Java 11 or higher. It has been tested up to Java 17.
This version uses Lucene 8. This unfortunately means that corpora created with older BlackLab versions (up to 2.3) cannot be read and will need to be re-indexed. If this is a problem for you, you can stick with the 2.3 version for now. We would like to eventually provide a conversion tool, but there is no date planned for this.
Roadmap
There is a high-level roadmap page on the documentation site. There are also BlackLab Archives of Relevant Knowledge (BARKs) that go into more detail.
For the next major version (4.0), we are focused on integrating BlackLab with Solr, with the goal of enabling distributed search. We will use this to scale our largest corpus to many billions of tokens. Status and plans for this can be found in the above-mentioned BARKs and in more technical detail here.
Development workflow
We strive towards practicing Continuous Delivery.
Our intention is to:
- continuously improve both unit and integration tests (during development and whenever bugs are discovered)
- avoid long-lived feature branches but frequently merge to the dev branch
- create meaningful commits that fix a bug or add (part of) a feature
- use temporary feature flags to prevent issues with unfinished code
- deploy to our servers frequently
Code style
Configurations for IDE code formatters can be found in the build-tools/
directory:
- IntelliJ IDEA: formatter-intellij.xml
- Eclipse: formatter-eclipse.xml
Building the site
The BlackLab end-user documentation site can be built locally if you want:
# Contains the configurations for various checking plugins shared by multiple modules
cd build-tools
mvn install
# Build the actual site (result will be in core/target/site)
cd ..
mvn site
Using BlackLab with Docker
An alpha version of the Docker setup is provided on Docker Hub. For each upcoming release, we will publish a corresponding Docker image.
A Docker version supporting BuildKit is required (18.09 or higher), as well as Docker Compose version 1.27.1 or higher.
See the Docker README for more details.
Indexing with Docker
We assume here that you are familiar with the BlackLab indexing process; see indexing with BlackLab to learn more.
The easiest is to use the index-corpus.sh
Bash script in the root of the repository. It will download Docker image and run IndexTool in a container, using bind mounts for the input data and writing the indexed corpus. Run the script without arguments for documentation.
Alternatively, you can use Docker Compose to run the indexer. This will create your index on a named volume defined by the Compose file.
Create a file named test.env
with your indexing configuration:
BLACKLAB_FORMATS_DIR=/path/to/my/formats
INDEX_NAME=my-index
INDEX_FORMAT=my-file-format
INDEX_INPUT_DIR=/path/to/my/input-files
JAVA_OPTS=-Xmx10G
To index your data:
docker-compose --env-file test.env run --rm indexer
Now start the server:
docker-compose up -d
Your index should now be accessible at http://localhost:8080/blacklab-server/my-index.
Special thanks
- ej-technologies for the JProfiler Java profiler
- Matthijs Brouwer, developer of Mtas, which we used for reference while developing the custom Lucene Codec and integrating with Solr.
- Everyone who contributed to the project. BlackLab wouldn't be where it is today without all of you.
最近版本更新:(数据更新于 1970-01-01 00:00:00)
2022-10-06 19:11:25 v3.0.1
2022-07-21 16:10:44 v3.0.0
2022-07-13 17:45:06 v2.3.1
2022-04-13 16:18:13 v2.3.0
2022-02-08 22:05:11 v2.2.0
2021-12-16 17:24:18 v2.1.1
2020-06-22 22:04:47 v2.1.0
2020-01-10 20:09:34 v2.0.0
2018-12-07 21:30:32 v2.0.0-RC1
2018-11-12 19:46:39 v1.7.3
主题(topics):
corpus
INL/BlackLab同语言 Java最近更新仓库
2025-01-17 17:46:39 apache/incubator-seata
2025-01-17 07:01:22 Stirling-Tools/Stirling-PDF
2025-01-16 16:51:02 spring-projects/spring-framework
2025-01-15 10:07:02 iBotPeaches/Apktool
2025-01-14 00:10:36 PBH-BTN/PeerBanHelper
2025-01-13 23:06:50 baomidou/mybatis-plus