Software update: Xapian / Omega 1.0.4

Xapian is an open source information retrieval library written in C++ that can be used as the engine behind a search engine. It comprises its own database format, APIs for modifying and searching these databases, tools for checking the databases, and bindings for other languages such as Java, Ruby, PHP and Python. One application that can be used on top of Xapian is Omega, a search engine for searching Xapian databases. Omega also ships with several tools that can be used to populate databases with data. Because the development of Omega is closely tied to that of Xapian itself, the developers release new versions of both programs simultaneously, under the same version number.

The development team of The Xapian Project has released version 1.0.4 of Xapian and Omega. The lists of changes for the various components are as follows:

Xapian-core 1.0.4

API:
  • Query:
    • Add OP_SCALE_WEIGHT operator (and a corresponding constructor which takes a single subquery and a parameter of type "double"). This multiplies the weights from the subquery by the parameter, allowing adjustment of the importance of parts of the query tree.
    • Deprecate the essentially useless constructor Query(Query::op, Query).
  • QueryParser:
    • A field prefix can now be set to expand to more than one term prefix. Similarly, multiple term prefixes can now be applied by default. This is done by calling QueryParser::add_boolean_prefix() or QueryParser::add_prefix() more than once with the same field name but a different term prefix (previously subsequent calls with the same field name had no effect).
    • Trying to set the same field as probabilistic and boolean now throws InvalidOperationError.
    • Fix parsing of `term1 term2', broken by changes in 1.0.2.
    • Drop special treatment for unmatched ')' at the start of the query, as it seems rather arbitrary and not particularly useful and was causing us to parse `( -term' incorrectly.
    • The QueryParser now generates pure boolean Query objects for strings such as `' by applying OP_SCALE_WEIGHT with a factor of 0.0.
    • Fix handling of `"quoted phrase" +term' and `"quoted phrase" -term'.
    • Fix handling of ` -term'.
    • Fix problem with spelling correction of hyphenated terms (or other terms joined with phrase generators): the position of the start of the term wasn't being reset for the second term in the generated phrase, resulting in out of bounds errors when substituting the new value in the corrected query string.
    • The parser stack is now a std::vector<> rather than a fixed size, so it will typically use less memory, and can't hit the fixed limit.
    • Fix handling of STEM_ALL and update the documentation comment for QueryParser::set_stemming_strategy() to explain how it works clearly.
  • PostingIterator: positionlist_begin() and get_wdf() should now always throw InvalidOperationError where they aren't meaningful (before in some cases UnimplementedError was thrown).
testsuite:
  • Add tests for new features.
  • Add another valgrind suppression for a slightly different error from zlib in Ubuntu gutsy.
  • Remove quartztest's test_postlist1 and test_postlist2, replacing the coverage lost by extending and adding tests which work with other backends as well.
  • If a test throws a subclass of std::exception, the test harness now reports the class name and the extra information returned by std::exception's what() method.
matcher:
  • Several performance improvements have been made, mainly to the handling of OP_AND and related operations (OP_FILTER, OP_NEAR, and OP_PHRASE). In combination, these are likely to speed up searching significantly for most users (in tests on real-world data we've seen savings of 15-55% in search times). These improvements are:
    • OP_AND of 3 or more sub-queries is now processed more efficiently.
    • Sub-queries from adjacent OP_AND, OP_FILTER, OP_NEAR, and OP_PHRASE are now combined into a single multi-way OP_AND operation, and the filters which implement the near/phrase restrictions are hoisted above this so they need to check fewer documents (bug#23).
    • If an OP_OR or OP_AND_MAYBE decays to OP_AND, we now ensure that the less frequent sub-query is on the left, which OP_AND is optimised to expect.
  • When the Enquire::get_mset() parameter checkatleast is set, and we're sorting by relevance with forward ordering by docid, and the query is pure boolean, the matcher was deciding it was done before the checkatleast requirement was satisfied. Then the adjustments made to the estimated and max statistics based on checkatleast meant the results claimed there were exactly msize results. This bug has now been fixed.
  • Queries involving a ValueRangePostList filter now run around 3.5 times faster (bug#164).
  • The calculations behind MSet::get_matches_estimated() were always rounding down fractions, but now round to the nearest integer. Due to cumulative rounding, this could mean that the estimate is now a few documents higher in some cases (and hopefully a better estimate).
  • Implement explicit swap() methods for internal classes MSetItem and ESetItem which should make the final sort of the MSet and ESet a little more efficient.
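
The benefit of driving a multi-way AND from the least frequent sub-query can be illustrated with a toy intersection of sorted posting lists. This is a minimal sketch in plain Python, not Xapian's matcher: the function name multi_and and the in-memory lists of document ids are assumptions made for illustration only.

```python
from bisect import bisect_left


def multi_and(posting_lists):
    """Intersect sorted lists of document ids, rarest list first (toy model)."""
    if not posting_lists:
        return []
    # Drive the intersection from the shortest (least frequent) list, so
    # that fewer candidate documents have to be checked against the others.
    ordered = sorted(posting_lists, key=len)
    candidates, rest = ordered[0], ordered[1:]
    result = []
    for docid in candidates:
        for plist in rest:
            # Binary search stands in for a posting list "skip to" operation.
            i = bisect_left(plist, docid)
            if i == len(plist) or plist[i] != docid:
                break  # docid is missing from this list, so reject it
        else:
            result.append(docid)  # docid was found in every list
    return result


common = list(range(1, 1001))      # a frequent term: documents 1..1000
medium = list(range(1, 2000, 3))   # documents 1, 4, 7, ...
rare = [7, 250, 999, 1500]         # an infrequent term

matches = multi_and([common, medium, rare])
print(matches)
```

Only the four documents of the rare term are ever tested here, regardless of how long the other lists are, which is the effect the ordering aims for.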
flint backend:
  • Fixed a bug introduced in 1.0.3 - trying to open a flint database for reading no longer fails if it isn't writable.
  • We no longer use member function pointers in the Btree implementation which seems to speed up searching a little.
remote backend:
  • The remote protocol minor version has been increased (to accommodate OP_SCALE_WEIGHT). If you are upgrading a live system which uses the remote backend, upgrade the servers before the clients.
build system:
  • Added macro machinery to allow branch prediction hints to be specified and used by compilers which support this (current GCC and Intel C++).
  • In a developer build, if rst2html isn't found, look for it under an alternative name, as some Linux distros install it with an extension.
documentation:
  • In the API documentation, explicitly note that Database::get_metadata() returns an empty string when the backend doesn't support user-specified metadata, and that WritableDatabase::set_metadata() throws UnimplementedError in this case. Also describe the current behaviour with multidatabases.
  • README: Remove the ancient history lesson - this material is better left to the history page on the website.
  • deprecation.html:
    • Deprecate the non-pythonic iterators in favour of the pythonic ones.
    • Move "Stem::stem_word(word)" in the bindings to the right section (it was done in 1.0.0, as already indicated).
    • Improve formatting.
  • When running rst2html, using "--verbose" was causing "info" messages to be included in the HTML output, so drop this option and really fix this issue (which was thought to have been fixed by changes in 1.0.3).
  • install.html: Reworked - this document now concentrates on giving a brief overview of building which should be suitable for most common cases, and defers to the INSTALL document in each tarball for more details.
  • PLATFORMS: Update from tinderbox and buildbot.
  • remote.html: xapian-tcpsrv has been able to handle concurrent read access since 0.3.1 (7 years ago) so update the very out-of-date information here. Also, note that some newer features aren't supported by the remote backend yet.
  • HACKING: Note specifically that std::list::size() is O(n) for GCC.
  • intro_ir.html: Add link to the forthcoming book "Introduction to Information Retrieval", which can be read online.
  • scalability.html: Update size of gmane.
  • quartzdesign.html: Note that Quartz is now deprecated.
debug code:
  • The debug assertion code has been rewritten from scratch to be cleaner and pull in fewer other headers.

Omega 1.0.4

  • If an OmegaScript template specifies the same field name as both a boolean and a probabilistic term prefix, then previously the boolean setting would be ignored (e.g. $setmap{prefix,foo,A}$setmap{boolprefix,foo,H}). Now this generates an error. If you set prefixes in your templates, you may wish to check them over before upgrading.
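
The two prefix rules in this release (one field may expand to several term prefixes, but may not be both boolean and probabilistic) can be modelled with a small registry. This is a hedged sketch in plain Python rather than Omega's or Xapian's code: the PrefixMap class is hypothetical, its method names merely echo the QueryParser calls mentioned in the changelog, ValueError stands in for the library's error, and building a term by simple concatenation is a simplifying assumption.

```python
class PrefixMap:
    """Toy registry mapping a field name to one or more term prefixes."""

    def __init__(self):
        self._prefixes = {}  # field name -> list of term prefixes
        self._kind = {}      # field name -> "boolean" or "probabilistic"

    def _add(self, field, prefix, kind):
        existing = self._kind.setdefault(field, kind)
        if existing != kind:
            # A field may not be both boolean and probabilistic.
            raise ValueError(
                f"field {field!r} is already registered as {existing}"
            )
        prefixes = self._prefixes.setdefault(field, [])
        if prefix not in prefixes:
            prefixes.append(prefix)  # repeated calls add further prefixes

    def add_prefix(self, field, prefix):
        self._add(field, prefix, "probabilistic")

    def add_boolean_prefix(self, field, prefix):
        self._add(field, prefix, "boolean")

    def expand(self, field, term):
        """Return the term under every prefix registered for the field."""
        return [p + term for p in self._prefixes.get(field, [])]


pm = PrefixMap()
pm.add_prefix("foo", "A")
pm.add_prefix("foo", "XFOO")          # second prefix for the same field
expanded = pm.expand("foo", "bar")
print(expanded)

try:
    pm.add_boolean_prefix("foo", "H")  # conflicts with the earlier calls
except ValueError as exc:
    print("rejected:", exc)
```

A template that sets both kinds of prefix for the same field corresponds to the rejected call at the end.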

Xapian-bindings 1.0.4

  • Wrap new OP_SCALE_WEIGHT query operator, and corresponding Query constructor. Add feature tests for all languages.
  • The "bindings.html" file documenting each of the bindings has been renamed to "index.html".
  • Fix the PHP bindings to work with autoconf < 2.60, which fixes RPM builds for older distros.
  • Fix warnings when compiling with GCC 4.2.
  • Update to newer SWIG SVN snapshot to fix memory leaks in wrapped constructors and methods/functions which return a wrapped class.
  • For PHP4, wrap Xapian::sortable_serialise() as xapian_sortable_serialise() and Xapian::sortable_unserialise() as xapian_sortable_unserialise().
  • Document how non-class functions are wrapped.
  • Fix wrapping of NumberValueRangeProcessor for PHP4.
  • smoketest.php: Split the regression test for bug#193 into separate versions for PHP4 and PHP5 as the previous version only worked for PHP5.
  • python/docs/index.html: Promote the Pythonic iterators, and deprecate the non-pythonic iterators. Make it clearer that the "sequence API" is deprecated.
  • Add a test of a custom ValueRangeProcessor (i.e. one written in Python).
  • Update the examples to use the new-style attributes to access MSet item values rather than the old-style MSET_* constants.
  • Document MSET_DOCUMENT.
  • smoketest.rb: Rename test of metadata access methods which had been named the same as the matchdecider test due to a copy-and-paste error.
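
The effect of the newly wrapped OP_SCALE_WEIGHT operator, which multiplies the weights of a subquery by a factor, can be modelled without the library. The sketch below is plain Python and does not use the Xapian bindings: the helper names scale_weight and combine_or, the example weights, and the OR-like merge that simply sums weights are all assumptions chosen for illustration.

```python
def scale_weight(weights, factor):
    """Return a copy of a {docid: weight} mapping with each weight scaled."""
    return {docid: w * factor for docid, w in weights.items()}


def combine_or(*parts):
    """Sum per-document weights across sub-queries (a simplified OR)."""
    total = {}
    for part in parts:
        for docid, w in part.items():
            total[docid] = total.get(docid, 0.0) + w
    return total


def ranking(weights):
    """Document ids ordered from highest to lowest weight."""
    return sorted(weights, key=weights.get, reverse=True)


title_hits = {1: 0.4, 2: 0.25}   # hypothetical weights from a "title" subquery
body_hits = {2: 0.5, 3: 1.0}     # hypothetical weights from a "body" subquery

plain = ranking(combine_or(title_hits, body_hits))
boosted = ranking(combine_or(scale_weight(title_hits, 4.0), body_hits))
print(plain, boosted)

# A factor of 0.0 keeps the matching documents but removes their weight,
# which is how a weighted subquery can act as a pure boolean filter.
filter_only = scale_weight(title_hits, 0.0)
print(filter_only)
```

Scaling the title subquery changes which document ranks first, while a factor of 0.0 leaves the set of matching documents unchanged.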
The following files are available for download:
* Xapian 1.0.4
* Omega 1.0.4
* Xapian bindings 1.0.4

Version number: 1.0.4
Release status: Final
Operating systems: Windows 9x, Windows NT, Windows 2000, Linux, BSD, Windows XP, macOS, Solaris, UNIX, Windows Server 2003, Windows Vista
Website: The Xapian Project
License type: GPL

By Japke Rosink


01-11-2007 • 12:49

Source: The Xapian Project

Comments (3)

It is also used on GoT itself, and this release apparently also fixes a serious PHP memory leak.
A wonderful indexer! It holds its own against many commercial packages. A nice initiative, just like Lucene, and highly usable!

Even though apparently not many people here are interested in it, it is still nice that it gets a spot in the "meuk" section here too.

A fast and reliable system! I wouldn't use anything else.
