Skip to content

[Feature #21943] Add StringScanner#integer_at to extract capture group as Integer directly#192

Closed
jinroq wants to merge 7 commits intoruby:masterfrom
jinroq:master
Closed

[Feature #21943] Add StringScanner#integer_at to extract capture group as Integer directly#192
jinroq wants to merge 7 commits intoruby:masterfrom
jinroq:master

Conversation

@jinroq
Copy link

@jinroq jinroq commented Mar 5, 2026

This avoids intermediate String allocation when only the integer value
of a captured group is needed (e.g. date parsing). The method converts
the matched byte range to an Integer at the C level using rb_cstr2inum,
which is semantically equivalent to `scanner[n].to_i` but without
allocating a temporary String object.
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use better name than get_int?
We already have #get_byte that returns the next byte not the data from the matched group.

ptr = S_PBEG(p) + beg;

{
char *buffer = RB_ALLOCV_N(char, buffer_v, len + 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it OK that we allocate C string?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou

Thank you for your feedback!

The original implementation did allocate a C string via RB_ALLOCV_N, which undermines the purpose of avoiding intermediate allocations. While RB_ALLOCV_N uses alloca (stack allocation) for small sizes and is much cheaper than a Ruby String, it is still an unnecessary allocation for the common case.

Updated in 4600b0a. For values up to 18 digits on 64-bit (9 on 32-bit), digits are now parsed directly from the source string's byte buffer with no allocation at all — neither Ruby String nor C string. The RB_ALLOCV_N + rb_cstr2inum path is only used as a fallback for bignum-range values (more than 18 digits).

In the primary use case of date component parsing (1-4 digit values), this means zero allocation of any kind.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use rb_int_parse_cstr.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nobu

b12e653 fixed it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, forgotten that is declared just in an internal header.
Maybe we should move them to a public header?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nobu
If we've had no issues with an internal header up until now, this change alone isn't enough to justify making it a public header.

@kou
That's my opinion, but what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need to use rb_int_parse_cstr() in out side of Ruby (strscan), we should make it public.

Parse digits directly from the source string's byte buffer for values
that fit in a long (up to 18 digits on 64-bit, 9 on 32-bit), avoiding
the RB_ALLOCV_N + rb_cstr2inum overhead entirely. Fall back to
rb_cstr2inum with a temporary buffer only for bignum-range values.

This eliminates all allocation (both Ruby String and C buffer) in the
primary use case of date component parsing (1-4 digit values).
@jinroq
Copy link
Author

jinroq commented Mar 5, 2026

@kou

Can we use better name than get_int?
We already have #get_byte that returns the next byte not the data from the matched group.

Here are some alternatives to get_int:

  1. integer_at(n)
    • Symmetric with values_at. A singular form of integers_at, proposed during a Ruby Office Hour discussion.
  2. capture_integer(n)
    • Clearly retrieves from capture. Also symmetric with scan_integer (scan advances, capture retrieves from the previous match).
  3. matched_integer(n)
    • Consistent with matched / matched_size

I recommend integer_at.

@jinroq jinroq requested a review from kou March 5, 2026 07:00
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Let's use integer_at(index).

Alternative idea: scanner[index, :integer]

Comment on lines +1911 to +1928
{
long j = 0;
int negative = 0;
long digit_count;

if (ptr[0] == '-') { negative = 1; j = 1; }
else if (ptr[0] == '+') { j = 1; }

digit_count = len - j;

if (digit_count <= GET_INT_MAX_DIGITS) {
long result = 0;
for (; j < len; j++) {
result = result * 10 + (ptr[j] - '0');
}
return LONG2NUM(negative ? -result : result);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does happen when matched string contains non numbers such as .?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bug because it returns an unexpected value. 82b93ad fixes it.

jinroq added 2 commits March 5, 2026 19:13
The name "get_int" conflicts with the existing "get_byte" convention,
which advances the scanner position. "integer_at" better reflects that
this method reads from a capture group (like "values_at") rather than
advancing the scanner.
Previously, non-digit characters in the captured substring produced
silently wrong results (e.g. "1.5" returned 85). Now both the fast
path and bignum path validate that all characters after an optional
sign are digits, raising ArgumentError otherwise.
@jinroq
Copy link
Author

jinroq commented Mar 5, 2026

@kou

OK. Let's use integer_at(index).

Thanks, I renamed it to integer_at in 46a2e1d.

On a different note, I was trying to rebase but it seems I ended up implementing it on the master branch. Can I remake the PR or should I go ahead with it?

Replace the hand-rolled fast path / slow path integer parsing with a
single call to rb_int_parse_cstr, which accepts a length parameter and
parses directly from the source buffer without NUL-terminated copy.
This removes the INTEGER_AT_MAX_DIGITS constant, the manual digit
validation loops, and the RB_ALLOCV_N buffer allocation.
@jinroq jinroq changed the title [Feature #21943] Add StringScanner#get_int to extract capture group as Integer directly [Feature #21943] Add StringScanner#integer_at to extract capture group as Integer directly Mar 5, 2026
end = adjust_register_position(p, p->regs.end[i]);
len = end - beg;

if (len <= 0) return INT2FIX(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this integer_at parses more strict than Integer(string) that accepts "1_234".
So I think empty string case should also raise error like Integer("") raises ArgumentError.

In any case, I think it's worth adding empty string matched test case

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tompng

Thank you for your feedback. 0948c72 fixed it.

jinroq added 2 commits March 5, 2026 23:14
Consistent with Integer("") raising ArgumentError, empty captures now also raise instead of silently returning 0.
Cover sign-only captures (+/-), signed numbers (-42/+42),
leading zeros (007), and full match containing non-digit
separators (e.g. "2024-06-15" at index 0).
@kou
Copy link
Member

kou commented Mar 6, 2026

On a different note, I was trying to rebase but it seems I ended up implementing it on the master branch. Can I remake the PR or should I go ahead with it?

You can close this PR and re-open a new PR with a topic branch.

Comment on lines +1877 to +1883
/* rb_int_parse_cstr is declared in internal/bignum.h which is not
* available to extensions. Declare it here since the symbol is
* exported from libruby. */
VALUE rb_int_parse_cstr(const char *str, ssize_t len, char **endp,
size_t *ndigits, int base, int flags);
#define RB_INT_PARSE_SIGN 0x01

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this from between the call-seq comment and strscan_integer_at() definition for RDoc?

@kou
Copy link
Member

kou commented Mar 6, 2026

BTW, could you enable GitHub Actions on your fork to run CI in your side too.

@jinroq jinroq closed this Mar 6, 2026
@jinroq jinroq reopened this Mar 6, 2026
@jinroq jinroq closed this Mar 6, 2026
eregon added a commit to truffleruby/truffleruby-gem-tracker that referenced this pull request Mar 7, 2026
* The problem is otherwise it would include PR workflows from a fork branch named "master",
  like ruby/strscan#192.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants