
Telegram channel bdscience - Big Data Science

4168

Big Data Science channel gathers together all interesting facts about Data Science. For cooperation: a.chernobrovov@gmail.com 💼 — https://t.me/bds_job — channel about Data Science jobs and career 💻 — https://t.me/bdscience_ru — Big Data Science [RU]


Big Data Science

⚡️The largest collection of datasets of ~ 1 million pairs of problems and solutions for mathematical competitions

NuminaMath is a collection of datasets with about 1 million problem-solution pairs covering a wide range of mathematical problems.

🔎Chain of Thought (CoT): about 860 thousand problems whose solutions are written out as chain-of-thought reasoning.

🛠 Tool-Integrated Reasoning (TIR): 73K synthetic solutions derived from GPT-4 with code execution feedback, which break complex problems into simpler subproblems that can be solved using Python.

According to the researchers, models trained on NuminaMath achieve best-in-class performance among open-weight models and approach or surpass proprietary models on math competition benchmarks.


Big Data Science

💡 Large video dataset with long duration and structured annotations

Tencent's MiraData is an off-the-shelf dataset with a total video duration of 16 thousand hours, designed to train text-to-video generation models. It includes long videos (average 72.1 seconds) with high motion intensity and detailed structured annotations (average 318 words per video).

To evaluate the quality of the dataset, the authors also created MiraBench, a benchmark of 17 metrics that assess temporal consistency, motion in the frame, video quality, and other parameters. According to its results, MiraData outperforms other well-known open datasets, which mostly consist of short videos of inconsistent quality with brief descriptions.


Big Data Science

🔎Lakehouse architecture: advantages and disadvantages

Lakehouse architecture is designed to provide more flexible and efficient data processing, including data storage, processing and analytics. It is a hybrid approach that combines elements of a traditional Data Warehouse and a Data Lake.

Lakehouse advantages:

1. Data unification: Lakehouse architecture allows you to store structured and unstructured data in one place. This simplifies data access and analysis, eliminating the need for separate systems for each type of data.
2. Cost-effectiveness: By using low-cost storage such as cloud object storage, Lakehouse architecture can be more cost-effective than traditional data warehouses.
3. Flexibility and Scalability: Lakehouse supports scalability, making it easy to increase data storage and processing power as needed. This is especially important for companies working with large volumes of data and requiring high performance.
4. Compatibility with modern analytical tools: Many modern analytical tools and platforms, such as Apache Spark, Delta Lake and others, integrate with the Lakehouse architecture, providing high performance and reliability of data analysis.

Lakehouse disadvantages:

1. Implementation Difficulty: Implementing Lakehouse architecture can require significant effort and expense in planning, designing, and configuring the system. This may include training staff and adapting existing processes and tools.
2. Data Quality Management: Merging data from different sources can lead to data quality issues, especially if there are no rigorous data cleaning and validation processes in place.
3. Security and Privacy: Consolidating large amounts of data in one place increases the risks associated with data security and privacy. Additional measures are required to protect data from unauthorized access and leaks.
4. Potential Data Access Latency: In some cases, the Lakehouse architecture may experience latency in data access, especially when processing large volumes of unstructured data.

Thus, Lakehouse architecture offers many benefits such as data unification, cost efficiency and flexibility, making it attractive to many organizations. However, its implementation is associated with certain challenges, including complexity of integration, data quality management and security issues.


Big Data Science

⚡️Tool to significantly enhance the database

WrenAI is an open-source tool that makes your existing database RAG-ready.

It lets you convert text to SQL, explore data in the database without writing SQL, and more.

🖥 GitHub
🟡 Documentation


Big Data Science

💻High-performance distributed database

YugabyteDB is a high-performance distributed SQL database that aims to support all PostgreSQL features.

YugabyteDB is well suited for cloud-based OLTP applications (i.e. real-time and business-critical ones) that need absolute data correctness together with scalability or high fault tolerance.

🖥 GitHub
🟡 Documentation

Creating a local YugabyteDB cluster with Docker:

docker run -d --name yugabyte -p7000:7000 -p9000:9000 -p15433:15433 -p5433:5433 -p9042:9042 \
yugabytedb/yugabyte:2.21.1.0-b271 bin/yugabyted start \
--background=false


Big Data Science

🌎TOP DS-events all over the world in July
Jul 9 - The Martech Summit - Hong Kong, China - https://themartechsummit.com/hongkong
Jul 9-11 - DATA 2024 – Dijon, France - https://data.scitevents.org/
Jul 9-11 - Transform 2024 - San Francisco, USA - https://transform24.venturebeat.com/
Jul 11-12 - DataConnect Conference – Ohio, United States - https://www.dataconnectconf.com/
Jul 17 - Data & Analytics Live - Online - https://data-analytics-live.coriniumintelligence.com/
Jul 23 - CDAO Indonesia - Indonesia - https://cdao-id.coriniumintelligence.com/
Jul 26 - The MachineCon 2024 - New York, USA - https://machinecon.aimresearch.co/
Jul 29-30 - Gartner Data Analytics Summit - Sydney, Australia - https://www.gartner.com/en/conferences/apac/data-analytics-australia


Big Data Science

⚔️🔎ACID in Kafka vs ACID in Airflow when processing Big data: advantages and disadvantages

When considering two popular data science tools such as Apache Kafka and Apache Airflow, it is important to understand how they deal with ACID principles (Atomicity, Consistency, Isolation, Durability). These principles are critical to ensuring reliable and predictable data processing.

Benefits of Kafka ACID:
1. Durability: Kafka persists messages to disk and can replicate them across brokers, which keeps data safe even in the event of a system failure.
2. Consistency: When configured correctly, Kafka ensures that all consumers of a partition receive the same data in the same order.
3. Isolation: Messages in Kafka are divided into topics and partitions, which helps isolate data processing between different streams.

Disadvantages of Kafka ACID:
1. Atomicity: Kafka does not always guarantee atomicity at the message level. Without additional mechanisms such as Kafka Transactions, messages may in some cases be duplicated or lost.
2. Complexity of Configuration: Achieving ACID properties in Kafka requires complex configuration and management, including replication and transaction configuration.
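
The duplicate-delivery problem from point 1 is often handled on the consumer side by making processing idempotent. Below is a minimal sketch in plain Python with no Kafka client involved; the `Message` type, its `message_id` field and the `IdempotentProcessor` class are hypothetical names used only for illustration, and the sketch assumes every message carries a unique ID.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    # Hypothetical message shape: a unique ID plus a numeric payload.
    message_id: str
    amount: int


@dataclass
class IdempotentProcessor:
    # IDs that were already applied; a real system would persist this set.
    seen_ids: set = field(default_factory=set)
    total: int = 0

    def process(self, message: Message) -> bool:
        """Apply the message once; return False if it was a duplicate."""
        if message.message_id in self.seen_ids:
            return False
        self.seen_ids.add(message.message_id)
        self.total += message.amount
        return True


# Simulated at-least-once delivery: message "b" arrives twice.
delivered = [Message("a", 10), Message("b", 5), Message("b", 5), Message("c", 1)]
processor = IdempotentProcessor()
results = [processor.process(m) for m in delivered]
print(results, processor.total)
```

The second copy of "b" is skipped, so the running total counts each message once even though delivery repeated it.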

Advantages of Airflow ACID:
1. Atomicity: Airflow provides atomicity at the task level. If a task fails, the entire DAG (Directed Acyclic Graph) can be re-run or restored from the point of failure.
2. Consistency: Airflow maintains a strict sequence of tasks, ensuring a consistent state of data.
3. Dependency Management: Airflow allows you to manage dependencies between tasks, making it easier to ensure data isolation and consistency.
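
The idea of running tasks strictly in dependency order can be sketched without Airflow itself. The example below uses only Python's standard `graphlib` module (Python 3.9+); it is not Airflow's API, and the task names are made up for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# static_order() yields a task only after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

A scheduler built on this ordering never starts a task before its upstream tasks have finished, which is the property the list above attributes to a DAG.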

Disadvantages of Airflow ACID:
1. Performance: Unlike Kafka, Airflow is not designed for real-time data processing. Its main purpose is to manage long-term and complex work processes.
2. Durability: Although Airflow maintains the state of tasks and DAGs, it relies on external data stores (such as databases) for long-term data storage, which may require additional effort to ensure durability.

Thus, Apache Kafka is better suited for real-time data processing with high performance and durability, but may require complex tuning to achieve atomicity and consistency. Apache Airflow, in turn, is great at managing and orchestrating complex workflows, providing atomicity and consistency at the task level, but is not designed for real-time streaming data processing.


Big Data Science

⚡️💡Open-source data container orchestration system for running AI systems

dstack is an open-source container orchestration engine designed for AI workloads in any cloud or data center.

Cloud providers supported by this technology include AWS, GCP, Azure, OCI, Lambda, TensorDock, Vast.ai, RunPod, and CUDO.

If you have standard AWS, GCP, Azure or OCI credentials on your device, the dstack server will pick them up automatically.

🖥GitHub
🟡 Documentation


Big Data Science

📊Dataset of characters, from real to fictional

Character Codex is a dataset that contains data on 15,939 characters from a wide variety of sources, from anime to historical figures, scholars, and popular characters, both fictional and non-fictional!

Potential uses include generating synthetic data, analyzing RPG data, and more.
You can use a Python script to load a dataset:
from datasets import load_dataset
dataset = load_dataset("NousResearch/CharacterCodex")


Big Data Science

💡🔎What is NoORM: advantages and disadvantages
NoORM (No Object-Relational Mapping) is an approach to working with databases that rejects the use of traditional ORM (Object-Relational Mapping) frameworks. Instead, developers interact directly with the database using native SQL queries or other specialized data manipulation techniques.
Advantages of NoORM:
1. Query Optimization: Because developers write SQL queries by hand, they can optimize them down to the last detail, often resulting in significant performance improvements over ORM-generated queries.
2. Minimize overhead: Using an ORM adds additional layers of abstraction that can slow down operations. NoORM eliminates these layers, which can also improve performance.
3. Support for complex data structures: NoORM allows you to work with non-standard data structures and relationships that may be difficult to implement through ORM.
4. Process Understanding: Developers have a thorough understanding of how data is accessed and modified, making debugging and optimization easier.
Disadvantages of NoORM:
1. Code Maintenance: Changing the database schema can require updating a lot of code, making the system difficult to maintain and develop.
2. Reduced portability: Code written for one DBMS may require significant changes to work with another DBMS, which reduces the portability of the application.
3. Repetitive code: Without an ORM, developers may find themselves writing the same type of database operations over and over again, which increases code size and reduces readability.
4. Risk of SQL Injection: When writing manual SQL queries, there is a higher risk of errors leading to vulnerabilities such as SQL injection. Developers must be especially careful about validating and escaping input data.

Thus, NoORM is a powerful approach for those who want complete control over database interactions and optimize the performance of their applications. However, it requires a greater level of knowledge and care on the part of developers.
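
A minimal sketch of the NoORM style, including the usual defense against the SQL injection risk listed above: hand-written SQL with bound parameters instead of string formatting. It uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for any DBMS; the `users` table and the `find_user_ids` helper are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a real DBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])


def find_user_ids(connection: sqlite3.Connection, name: str) -> list:
    """Look up users by name with a bound parameter, not string formatting."""
    rows = connection.execute("SELECT id FROM users WHERE name = ?", (name,))
    return [row[0] for row in rows]


# A classic injection payload is treated as a literal string and matches nothing.
print(find_user_ids(conn, "alice"))
print(find_user_ids(conn, "' OR '1'='1"))
```

Because the driver passes the value separately from the SQL text, the input never gets a chance to change the structure of the query.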


Big Data Science

📊💡Dataset of interactions with ChatGPT
WildChat is a dataset of 1 million real user interactions with ChatGPT, characterized by a wide range of languages and a variety of prompts.
It was collected by offering everyone free access to ChatGPT and GPT-4 in exchange for permission to collect their chat histories.
Using this dataset, the developers created the Llama-2-based bot WildLlama-7b-user-assistant, which is capable of predicting both the user's prompts and the responses that ChatGPT might choose.
The following script can be used to load the dataset:
from datasets import load_dataset
dataset = load_dataset("allenai/WildChat-1M")


Big Data Science

💡🔎📉Adversarial verification: advantages and disadvantages
Adversarial verification (AV), more commonly known as adversarial validation, is a technique for checking how much the test data differs from the training data. It is especially useful in machine learning tasks where prediction quality suffers because the training and test samples come from different distributions. Let's look at the main advantages and disadvantages of this approach.
Advantages of adversarial verification:
1. Detection of data inconsistencies: AV helps identify whether training and test data have very different distributions. This may signal serious problems with how well the model generalizes.
2. Improving the quality of models: By eliminating differences between training and test data, the quality of predictive models on the test sample can be significantly improved.
3. Optimization of data selection: AV helps choose training and validation sets that better resemble the test data, which helps avoid overfitting and improves the overall quality of the model.
4. Identification of data leaks: AV helps identify cases where information from the test sample "leaks" into the training sample, which can lead to biased model results.
Disadvantages of adversarial verification:
1. Increased computational cost: Performing AV requires training additional models (usually a classifier), which increases the computational cost and time required for data analysis.
2. Difficulty in Implementation: Setting up AV can require significant knowledge and experience in machine learning, which can be challenging for beginners.
3. Risk of overfitting: Using AV too often to adjust the data can lead to models that overfit the available samples and generalize worse.
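
The core check can be sketched as follows: label rows from one sample 0 and rows from the other sample 1, then measure how well they can be told apart. A score near 0.5 means the samples look alike; a score near 1.0 signals a distribution shift. To stay dependency-free, this sketch replaces the usual trained classifier with a simplified stand-in, namely a per-feature rank AUC in which each feature's raw value is the score. The data is synthetic and the feature names are hypothetical.

```python
import random


def auc(negatives, positives):
    """Chance that a random positive value exceeds a random negative one (ties count half)."""
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


rng = random.Random(0)  # fixed seed so the run is repeatable
size = 300
# Synthetic features: "stable" is drawn identically in both samples, "shifted" is not.
train = {
    "stable": [rng.gauss(0, 1) for _ in range(size)],
    "shifted": [rng.gauss(0, 1) for _ in range(size)],
}
test = {
    "stable": [rng.gauss(0, 1) for _ in range(size)],
    "shifted": [rng.gauss(2, 1) for _ in range(size)],
}

# Treat training rows as class 0 and test rows as class 1, then measure
# how well each feature alone separates the two classes.
separability = {name: auc(train[name], test[name]) for name in train}
for name, value in separability.items():
    print(f"{name}: AUC = {value:.3f}")
```

In practice a real classifier over all features would replace the per-feature score, at the extra computational cost noted in the list above.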


Big Data Science

🔎💡Useful GitHub repositories for experienced data engineers and beginners
Awesome Data Engineering - Contains a list of tools, frameworks and libraries for data engineering, making it a great starting point for those looking to dive into the field.
Data Engineering Zoomcamp is a comprehensive course that provides hands-on experience in data engineering.
The Data Engineering Cookbook is a collection of articles and tutorials covering various aspects of data engineering, including data ingestion, data processing, and data warehousing.
Awesome Open Source Data Engineering is a list of open source data engineering tools that will be a goldmine for anyone who wants to contribute or use them to create real data engineering projects. It contains a wealth of information about open source tools and frameworks, making it an excellent resource for those who want to explore alternative data engineering solutions.
Data Engineer Handbook is a comprehensive collection of resources covering all aspects of data engineering. It includes tutorials, articles and books on all topics related to data engineering. Whether you're looking for a quick reference guide or in-depth knowledge, this reference book has something for every level of data engineer.
The Data Engineering Wiki is a community-created wiki that provides a comprehensive resource for learning data engineering. This repository covers a wide range of topics including data pipelines, data warehouses, and data modeling.
Data Engineering Practice - Offers a practical approach to learning data engineering. It features practical projects and exercises that will help you apply your knowledge and skills to real-life scenarios. By completing these projects, you will gain hands-on experience and build a portfolio that demonstrates your data engineering capabilities.


Big Data Science

💡 Large Feedback Dataset
The RLAIF-V-Dataset is a large multimodal feedback dataset, built using open-source models to provide high-quality feedback.
It comes from RLAIF-V, a novel method that uses open-source MLLMs to give high-quality feedback on deconfounded model responses. By training on this data, models can reach a higher level of trustworthiness than existing open-source models.
Load a dataset using a Python script as follows:
from datasets import load_dataset
dataset = load_dataset("HaoyeZhang/RLAIF-V-Dataset")


Big Data Science

⚔️💡MySQL vs PostgreSQL in Data Mining: advantages and disadvantages
When it comes to choosing a database for data mining tasks, the two most popular solutions are MySQL and PostgreSQL. Both of these DBMSs have their strengths and weaknesses, and the choice between them depends on the specific requirements of the project. Let's look at the main advantages and disadvantages of each of them in the context of data mining.
MySQL Advantages:
1. High Performance: MySQL is known for its fast performance on simple read and write operations, making it suitable for high-load applications.
2. Widespread: MySQL has a large community and lots of documentation, making it easy to find solutions and get help.
3. Integration with web technologies: MySQL is often used in web development and has good compatibility with common web stacks.
Disadvantages of MySQL:
1. Limited Analytics Capabilities: MySQL is generally less efficient at running complex analytical queries, and versions before 8.0 lack advanced features such as window functions and CTEs that are important for data mining.
2. Less support for JSON and NoSQL: Although MySQL has JSON support, it is not as developed as in PostgreSQL.
3. Limited transaction capabilities: MySQL is often considered weaker than PostgreSQL in complex transaction management, and its ACID guarantees depend on the storage engine (InnoDB is ACID-compliant, MyISAM is not).
Benefits of PostgreSQL:
1. Powerful analytical capabilities: PostgreSQL supports complex analytical queries, window functions and CTE (Common Table Expressions), making it an excellent choice for data mining.
2. JSON and NoSQL support: PostgreSQL has advanced JSON support and can be used as a hybrid DBMS, making it easier to work with semi-structured data.
3. Extensibility and Compatibility: PostgreSQL is easily extensible with plugins and supports many SQL standards, making it very flexible.
4. Reliability and ACID compliance: PostgreSQL provides a high level of data reliability and full compliance with ACID transactions, which is important for mission-critical data mining applications.
Disadvantages of PostgreSQL:
1. Complexity of setup and administration: PostgreSQL requires more in-depth setup and administration, which can be more difficult for beginners.
2. Performance on simple tasks: In some cases, PostgreSQL may be inferior to MySQL in terms of speed for performing simple operations.
3. Resource Intensive: PostgreSQL may require more resources to achieve high performance, especially in complex scenarios.
Thus, the choice between MySQL and PostgreSQL for data mining tasks depends on the specific needs of the project. If you need a simple and fast database system for basic operations and integration with web applications, MySQL may be the best choice. If the project requires complex data analysis, flexibility and reliability, PostgreSQL will be a more suitable option.
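
To make the window functions and CTEs mentioned above concrete, here is a small sketch of a per-region running total. It runs on SQLite (3.25 or newer) through Python's built-in `sqlite3` module purely as a convenient stand-in; the query itself is standard SQL of the kind PostgreSQL accepts. The `sales` table and its values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", 1, 100), ("north", 2, 150), ("south", 1, 80), ("south", 2, 40)],
)

query = """
WITH monthly AS (              -- CTE: a named intermediate result
    SELECT region, month, SUM(amount) AS total
    FROM sales
    GROUP BY region, month
)
SELECT region, month, total,
       SUM(total) OVER (PARTITION BY region ORDER BY month) AS running_total
FROM monthly
ORDER BY region, month
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The window function keeps every monthly row and adds the cumulative sum next to it, which a plain GROUP BY cannot do in a single pass.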


Big Data Science

😎Graph database implemented on the Apache TinkerPop3 framework

HugeGraph is an open-source graph database implemented on the Apache TinkerPop3 framework and fully compatible with the Gremlin query language.

HugeGraph supports the import of over 10 billion vertices and edges and can process queries very quickly (at the millisecond level).

Typical HugeGraph application scenarios include exploring relationships between objects, association analysis, path finding, feature extraction, data clustering, community detection, and graph construction.

Quick start with Docker:

docker run -itd --name=graph -p 8080:8080 hugegraph/hugegraph
# docker exec -it graph bash


Big Data Science

⚡️🔎Fully Synthetic Dataset

A huge dataset consisting entirely of synthetic data has appeared on Hugging Face.

The LLM (in this case GPT-4o + VLLM) generates answers while adopting a different persona each time: for example, a chemist or a musician.

Synthetic data can sometimes help a lot (especially when the task is abstract and there is no structured information), but it is still treated with caution: it may be insufficiently realistic or diverse, and it can harbor hallucinations. It is still unclear whether "synthetics" can ever be used freely, but this is being actively worked on.


Big Data Science

💡Another small selection of AI tools for Big Data analytics

KNIME Analytics Platform is a free, open-source platform that allows users to stay at the forefront of data science. It has 300+ connectors to various data sources and integrates with all popular machine learning libraries.

Polymer - artificial intelligence for transforming data into an optimized, flexible and powerful database. All a user needs to do is upload their spreadsheet to the platform to instantly transform it into an optimized database that can then be mined for insights.

IBM Cognos Analytics is a componentized online business intelligence (BI) service that provides access to a wide range of functions for creating business reports, data analysis, event monitoring and metrics to develop effective business decisions.

Akkio is a business intelligence and forecasting tool that allows users to analyze their data and predict potential outcomes. The AI tool allows users to upload their dataset and select the variable they want to predict, which helps Akkio build a neural network around that variable. Like many other tools, Akkio requires no prior programming experience.

Monkeylearn - uses AI data analytics capabilities to help users visualize and reorganize their data. It can also be used to set up text classifiers and text extractors, which help automatically sort data according to topic or intent, and extract product characteristics or user data.


Big Data Science

⚡️💡💻 MySQL 9.0.0 has been released

Oracle recently released MySQL DBMS 9.0.0. The developers of the project have prepared and made publicly available MySQL Community Server 9.0.0 builds for major Linux, FreeBSD, macOS and Windows distributions.

In 2023, the company announced a change to the MySQL release model. Developers began releasing two types of MySQL branches: Innovation (new features, frequent updates, three months of support) and LTS (with extended support time and unchanged behavior).

As the developers note, the MySQL 9.0 project is assigned to the Innovation branch, which will also include the next major releases of MySQL 9.1 and 9.2.

Distributions based on Innovation branches are recommended for those users who want to get access to new functionality earlier. They are published every 3 months and are supported only until the next major release is published (for example, after the 9.1 branch is released, support for the 9.0 branch will be discontinued).


Big Data Science

🎼Datasets and projects for music generation and analysis tasks

MAESTRO (MIDI and Audio Edited for Synchronous Tracks and Organization) contains over 200 hours of annotated recordings from international piano competitions held over the past ten years.

NSynth - the dataset consists of 305,979 musical notes and includes recordings of 1006 different instruments, such as flute, guitar, piano and organ. The dataset is annotated by instrument type (acoustic, electronic or synthetic) and other sound parameters.

Lakh MIDI v0.1 - There are 176,581 MIDI files in the dataset, of which 45,129 are associated with samples from the Million Song Dataset. This dataset is designed to simplify the search for music information based on text and audio content on a large scale.

Music21 - contains musical performances from 21 categories and is aimed at solving research problems (for example, finding an answer to the question: "Which group used these chords for the first time?").


Big Data Science

⚡️Hyperconverged cloud open-source database

MatrixOne is a hyperconverged cloud distributed database whose architecture decouples storage, compute and transactions and consolidates them into a single HSTAP data engine.
This engine allows a single database system to handle a variety of business workloads such as OLTP, OLAP, and stream computing.

MatrixOne supports deployment and use in public and private clouds, providing compatibility with a variety of infrastructures.

🖥 GitHub
🟡 Documentation


Big Data Science

📊A huge dataset of images and their captions

Pixel Prose is a dataset that contains over 16 million diverse images from three different web databases (commonPool, CC12M, RedCaps) with captions created using Google Gemini 1.0 Pro Vision.

The following Python script can be used to load a dataset using the API:
from datasets import load_dataset
# for downloading the whole data
ds = load_dataset("tomg-group-umd/pixelprose")


Big Data Science

📊💡Dataset for video analysis

CinePile is a question and answer based video understanding dataset. It was created using advanced large language models (LLMs). The dataset includes approximately 300,000 data points for training and 5,000 data points for testing.

Each row in the dataset consists of a question (dtype: string), five answer choices (dtype: list), and an answer_key (dtype: string). The auxiliary columns store the movie title, movie genre, video clip titles, etc.
To load a dataset via Python script, you can use the following command:
from datasets import load_dataset
dataset = load_dataset("tomg-group-umd/cinepile")


Big Data Science

💡🔎Interesting and useful repository
Jailbreak - a repository that contains a dataset of 15,140 ChatGPT prompts from Reddit, Discord, websites, and open source datasets (including 1,405 jailbreak prompts).
According to the developers, they collected 15,140 messages from these four platforms between December 2022 and December 2023.


Big Data Science

💡🔎Platform Extension Framework (PXF): advantages and disadvantages
The Platform Extension Framework (PXF) is a powerful tool provided by many modern platforms to extend their functionality. PXF allows developers to create plugins and add-ons that integrate into the core platform, providing system flexibility and extensibility.
Its advantages include:
1. Time-tested solution based on open source code with the possibility of modification to suit your needs
2. Modularity: PXF allows functionality to be divided into independent modules. This makes it easier to develop, test, and maintain code.
3. Extensibility: With PXF, you can easily add new capabilities or integrate with external services and tools, allowing the platform to evolve with your business needs.
4. Faster development: PXF-enabled platforms often provide ready-made tools and APIs that speed up the development process and make it easier to implement new features.
5. A set of connectors to popular data sources available out of the box (Hadoop stack, data sources available via JDBC, cloud storage).
But there are also a number of disadvantages:
1. The need to support a separate solution based on your own stack.
2. Allocation of resources, as a rule, on the same servers where the DBMS itself is deployed.
3. Multiple transformations and transfer of the same data on the way from representation in the DBMS to the types that PXF itself operates on.
4. Security: Since extensions may have access to sensitive data and platform functions, it is important to ensure their security and prevent possible vulnerabilities.
5. Compatibility: Platform updates may introduce compatibility issues with existing extensions, requiring additional testing and adaptation.

The Platform Extension Framework provides powerful capabilities for extending and adapting platforms, allowing developers to create custom solutions and improve system functionality. However, it is important to consider the potential challenges and risks associated with integrating and supporting extensions in order to maximize the potential of PXF.


Big Data Science

💡🔎Not very well known, but very useful ETL services
Astera Centerprise is an enterprise-grade, ready-to-use ETL solution that offers data integration and transformation capabilities for raw data of any complexity and size in a variety of formats: from complex hierarchical files and unstructured documents to industry formats such as EDI, and even legacy data such as COBOL.
Talend is an open source software platform that offers data integration and management solutions. Talend specializes in big data integration. This tool provides features such as cloud, big data, enterprise application integration, data quality, and master data management. It also provides a single repository for storing and reusing metadata.
Skyvia is a web service for cloud data integration and backup. It offers ETL tools to integrate cloud CRM with other data sources and allows users to control all their business data. Data can be viewed and manipulated using SQL. Skyvia provides easy data integration without programming skills.
Pentaho is a business intelligence tool that provides clients with a wide range of business intelligence solutions. It is capable of reporting, data analysis, data integration, data extraction, etc. Pentaho also offers a complete set of BI features that can improve business productivity and efficiency.
Hevo Data is an ETL platform that supports data integration, movement, and processing. It supports a wide range of data sources and offers real-time data replication. This tool facilitates the extraction, transformation and loading of data to the designated target destinations.
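
For readers new to the term, the extract-transform-load cycle that all of these services automate can be sketched in a few lines of standard-library Python. This is only an illustration of the three steps, not how any of the listed products work; the CSV content, the `customers` table and the cleaning rules are made up.

```python
import csv
import io
import sqlite3

# Extract: a small in-memory CSV stands in for a real source system.
raw_csv = "name,signup_date,revenue\n Alice ,2024-06-01,100.50\nBOB,2024-06-02,\n"
records = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: trim and normalize names, turn empty revenue into 0.0.
cleaned = [
    (r["name"].strip().title(), r["signup_date"], float(r["revenue"] or 0))
    for r in records
]

# Load: write the cleaned rows into a target table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, signup_date TEXT, revenue REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", cleaned)
loaded = conn.execute("SELECT name, revenue FROM customers ORDER BY name").fetchall()
print(loaded)
```

Dedicated ETL tools add what this sketch leaves out: connectors to many sources, scheduling, monitoring and error handling.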


Big Data Science

🌎TOP DS-events all over the world in June
Jun 2-4 - AI Con USA 2024 - Las Vegas, USA - https://aiconusa.techwell.com/
Jun 3-4 - Institute for Data Science and Artificial Intelligence Conference 2024 - Manchester, UK - https://oxfordroadcorridor.com/events/institute-for-data-science-and-artificial-intelligence-conference-2024/
Jun 5 - Digital Transformation Summit - Riyadh, Saudi Arabia - https://digitransformationsummit.com/ksa/
Jun 5-6 - AI & Big Data Expo North America 2024 - Santa Clara, USA - https://www.ai-expo.net/northamerica/
Jun 5-6 - Big Data and Analytics Summit - Ontario, Canada - https://www.bigdatasummitcanada.com/
Jun 12-13 - The AI Summit - London, United Kingdom - https://london.theaisummit.com/
Jun 17-19 - Data Science & Statistics - Amsterdam, Netherlands - https://datascience.thepeopleevents.com/
Jun 18 - The Martech Summit - Jakarta, Indonesia - https://themartechsummit.com/jakarta
Jun 20 - Data Architecture Melbourne - Melbourne, Australia - https://data-architecture.coriniumintelligence.com/
Jun 25 - Data Architecture Sydney - Sydney, Australia - https://data-architecture-syd.coriniumintelligence.com/
Jun 25-28 - MLCon Munich - Munich, Germany / Online - https://mlconference.ai/munich/
Jun 26-28 - International Conference on Distributed Computing and Artificial Intelligence (DCAI) - Salamanca, Spain - https://www.dcai-conference.net/


Big Data Science

📊🔎A small selection of not very popular but useful libraries for data analysis
PySheets - provides a spreadsheet user interface for Python. Use Pandas, create charts, import Excel sheets, analyze data and create reports.
py2wasm - converts Python programs into WebAssembly; according to its developers, the result runs about 3x faster than a Python interpreter compiled to WebAssembly.
databonsai - a Python library that uses LLMs for data cleaning tasks such as categorization, transformation, and extraction.


Big Data Science

💡🔎An assistant for interacting with any kind of data
Verba is a fully customizable personal assistant for querying and interacting with your data, locally or deployed via the cloud.

It can also answer questions related to your documents and retrieve information from existing knowledge bases.

Verba perfectly combines state-of-the-art RAG technology with Weaviate's context-aware database.


Big Data Science

💡😎Basics of working with Data-Mining: process, tools and techniques
Data mining is the process of analyzing large datasets to identify patterns, correlations and anomalies. It uses a variety of statistical analysis and machine learning techniques to extract meaningful information and insights from data. Companies can use these insights to make informed decisions, predict trends, and improve business strategies.
Data mining techniques include:
Decision trees - at the end of each branch there is a prediction or decision. In classification tasks, these endpoints separate data into categories
Detection of anomalies - anomalies can arise from fluctuations in measurements or be indicators of experimental error; in some cases they may indicate an important discovery or a new trend
Software for working with data mining is divided into:
1. Visualization tools:
Grafana - suitable for analytics and real-time monitoring
Google Charts is a web-based solution for creating interactive charts
2. Data mining platforms:
KNIME is an analytical platform that allows you to download data from various sources, transform data and load it into various databases
RapidMiner is a multi-user software platform that is an integrated environment for processing data in large information arrays, machine learning, text analytics and building predictive models, as well as for solving other Data Mining problems.
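
As a small illustration of the anomaly detection technique mentioned above, the sketch below flags values that lie far from the mean, measured in standard deviations (a z-score rule). This is only one simple approach among many; the `find_anomalies` helper, the threshold of 2.0 and the sample readings are assumptions made for the example.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the threshold."""
    center = mean(values)
    spread = stdev(values)
    return [v for v in values if abs(v - center) / spread > threshold]


# Hypothetical sensor readings with one obvious outlier.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 25.0, 10.0, 9.7, 10.3]
anomalies = find_anomalies(readings)
print(anomalies)
```

Whether a flagged value is a measurement error or a genuinely important signal still has to be decided by someone who knows the data.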
